
SLIP: Self-supervision Meets Language-Image Pre-training

  • Conference paper
  • Part of the book: Computer Vision – ECCV 2022 (ECCV 2022)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13686)

Abstract

Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning with Vision Transformers. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. After pre-training, we thoroughly evaluate representation quality and compare performance to both CLIP and self-supervised learning under three distinct settings: zero-shot transfer, linear classification, and end-to-end finetuning. Across ImageNet and a battery of additional datasets, we find that SLIP improves accuracy by a large margin. We validate our results further with experiments on different model sizes, training schedules, and pre-training datasets. Our findings show that SLIP enjoys the best of both worlds: better performance than self-supervision (+8.1% linear accuracy) and language supervision (+5.2% zero-shot accuracy). Our code is available at: github.com/facebookresearch/SLIP.
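
To make the multi-task setup concrete, here is a minimal sketch of how such a combined objective could be written in PyTorch. It assumes a CLIP-style symmetric InfoNCE loss over matched image-text pairs and a SimCLR-style NT-Xent loss over two augmented views of each image; the function names, temperature values, and the `ssl_scale` weight are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of a SLIP-style multi-task objective (illustrative, not the
# authors' released code). `image_emb`/`text_emb` are projected embeddings of a
# batch of matched image-text pairs; `view1_emb`/`view2_emb` are projected
# embeddings of two augmentations of the same images; `ssl_scale` is an assumed
# weighting hyperparameter.
import torch
import torch.nn.functional as F


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image-text pairs in a batch."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def simclr_loss(view1_emb, view2_emb, temperature=0.1):
    """SimCLR-style NT-Xent loss between two augmented views of each image."""
    n = view1_emb.size(0)
    z = F.normalize(torch.cat([view1_emb, view2_emb], dim=0), dim=-1)
    sim = z @ z.t() / temperature
    # Mask out self-similarity so each view's only positive is its other view.
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


def slip_loss(image_emb, text_emb, view1_emb, view2_emb, ssl_scale=1.0):
    """Multi-task objective: language-image loss plus self-supervised loss."""
    return clip_loss(image_emb, text_emb) + ssl_scale * simclr_loss(view1_emb, view2_emb)
```

In SLIP-style training the image encoder is shared across both objectives; the sketch leaves the encoders and projection heads abstract, since how they are shared or separated is a design choice of the full method.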


Notes

  1. The initial 7×7 conv is replaced by three 3×3 convs; global average pooling is replaced by a self-attention pooling layer with 14M parameters (a rough sketch of such a pooling layer appears after these notes).

  2. This model achieves 88.4% top-1 accuracy on ImageNet.
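
To illustrate the replacement described in Note 1, here is a rough PyTorch sketch of a self-attention pooling head applied to a convolutional feature map. The class name, head count, and positional-embedding scheme are assumptions for illustration, not the exact 14M-parameter layer referenced in the note.

```python
# Rough sketch of a self-attention pooling head that could replace global
# average pooling on a convolutional feature map (class name, head count, and
# positional-embedding scheme are assumptions for illustration).
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    def __init__(self, spatial_size: int, embed_dim: int, num_heads: int = 8):
        super().__init__()
        # One learned positional embedding per spatial location, plus one for
        # the pooled "query" token.
        self.pos_embed = nn.Parameter(
            torch.randn(spatial_size ** 2 + 1, embed_dim) * embed_dim ** -0.5)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) feature map from the backbone.
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                     # (b, h*w, c)
        query = tokens.mean(dim=1, keepdim=True)                  # mean token
        tokens = torch.cat([query, tokens], dim=1) + self.pos_embed
        # The mean token attends over itself and all spatial positions; its
        # attention output is the pooled image representation.
        pooled, _ = self.attn(tokens[:, :1], tokens, tokens)
        return pooled.squeeze(1)                                  # (b, c)
```

The parameter count of such a layer depends on the embedding width and projection sizes; the sketch only conveys the structure, with a mean token acting as the query that attends over all spatial positions.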



Acknowledgements

This work was supported by BAIR, the Berkeley Deep Drive (BDD) project, and gifts from Meta and Open Philanthropy.

Author information


Corresponding author

Correspondence to Norman Mu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 12508 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Mu, N., Kirillov, A., Wagner, D., Xie, S. (2022). SLIP: Self-supervision Meets Language-Image Pre-training. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13686. Springer, Cham. https://doi.org/10.1007/978-3-031-19809-0_30


  • DOI: https://doi.org/10.1007/978-3-031-19809-0_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19808-3

  • Online ISBN: 978-3-031-19809-0

  • eBook Packages: Computer Science, Computer Science (R0)
