SLIP: Self-supervision Meets Language-Image Pre-training

Norman Mu, Alexander Kirillov, David Wagner, Saining Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning with Vision Transformers. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. After pre-training, we thoroughly evaluate representation quality and compare performance to both CLIP and self-supervised learning under three distinct settings: zero-shot transfer, linear classification, and end-to-end finetuning. Across ImageNet and a battery of additional datasets, we find that SLIP improves accuracy by a large margin. We validate our results further with experiments on different model sizes, training schedules, and pre-training datasets. Our findings show that SLIP enjoys the best of both worlds: better performance than self-supervision (+8.1% linear accuracy) and language supervision (+5.2% zero-shot accuracy). Our code is available at:
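As an illustration of the multi-task idea described in the abstract (and not the authors' released code), the sketch below adds a language-image contrastive term to a self-supervised term computed on two augmented views of each image. It assumes both terms are symmetric InfoNCE-style losses over already-computed embeddings. The names `info_nce`, `slip_style_loss`, `ssl_scale`, and `temperature` are hypothetical, and the two-view term is a simplified stand-in for whichever self-supervised objective is actually used.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_rows(logits, targets):
    """Mean cross-entropy of each logits row against its target index."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[t]
    return total / len(logits)


def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE loss where a[i] and b[i] form the positive pair."""
    a = [_normalize(v) for v in a]
    b = [_normalize(v) for v in b]
    # Cosine similarity between every a[i] and every b[j], scaled by temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    targets = list(range(len(a)))
    return 0.5 * (_cross_entropy_rows(logits, targets)
                  + _cross_entropy_rows(logits_t, targets))


def slip_style_loss(image_emb, text_emb, view1_emb, view2_emb, ssl_scale=1.0):
    """Hypothetical combined objective: language-image term plus scaled SSL term."""
    clip_loss = info_nce(image_emb, text_emb)   # image <-> caption alignment
    ssl_loss = info_nce(view1_emb, view2_emb)   # agreement between two augmented views
    return clip_loss + ssl_scale * ssl_loss
```

With `ssl_scale=0.0` the function reduces to the language-image term alone, which makes it straightforward to compare the combined objective against a CLIP-only baseline in this toy setting.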

Original language: English (US)
Title of host publication: Computer Vision – ECCV 2022 - 17th European Conference, 2022, Proceedings
Editors: Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Publisher: Springer Science and Business Media Deutschland GmbH
Number of pages: 16
ISBN (Print): 9783031198083
State: Published - 2022
Event: 17th European Conference on Computer Vision, ECCV 2022 - Tel Aviv, Israel
Duration: Oct 23 2022 to Oct 27 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13686 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 17th European Conference on Computer Vision, ECCV 2022
City: Tel Aviv

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science

