
Learning Visual Representations with Caption Annotations

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12353)

Abstract

Pretraining general-purpose visual features has become a crucial part of tackling many computer vision tasks. While one can learn such features on the extensively annotated ImageNet dataset, recent approaches have looked at ways to allow for noisy, fewer, or even no annotations to perform such pretraining. Starting from the observation that captioned images are easily crawlable, we argue that this overlooked source of information can be exploited to supervise the training of visual representations. To do so, motivated by recent progress in language models, we introduce image-conditioned masked language modeling (ICMLM) – a proxy task to learn visual representations over image-caption pairs. ICMLM consists of predicting masked words in captions by relying on visual cues. To tackle this task, we propose hybrid models, with dedicated visual and textual encoders, and we show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks. Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations. Project website: https://europe.naverlabs.com/ICMLM.
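To make the proxy task concrete, the following PyTorch sketch pairs a convolutional visual encoder with a small Transformer text encoder and predicts masked caption tokens from fused visual and textual features. This is a minimal illustration of the general idea under assumed design choices, not the authors' implementation: the module names, dimensions, and the simple additive fusion are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class ImageConditionedMLM(nn.Module):
    """Toy hybrid model for image-conditioned masked language modeling (illustrative only)."""

    def __init__(self, vocab_size, hidden_dim=512, max_len=32):
        super().__init__()
        # Visual encoder: ResNet-50 backbone, pooled features condition the word prediction.
        backbone = resnet50(weights=None)
        self.visual_encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.visual_proj = nn.Linear(2048, hidden_dim)

        # Textual encoder: token + position embeddings followed by a small Transformer encoder.
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        self.pos_emb = nn.Embedding(max_len, hidden_dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)

        # Prediction head over the vocabulary, applied to the fused features.
        self.mlm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, token_ids):
        b, t = token_ids.shape
        # Encode the image and broadcast its feature to every token position.
        v = self.visual_encoder(images).flatten(1)           # (B, 2048)
        v = self.visual_proj(v).unsqueeze(1)                 # (B, 1, D)

        # Encode the (partially masked) caption.
        pos = torch.arange(t, device=token_ids.device).unsqueeze(0)
        x = self.text_encoder(self.token_emb(token_ids) + self.pos_emb(pos))  # (B, T, D)

        # Fuse visual cues with token features (additive fusion here) and predict words.
        return self.mlm_head(x + v)                          # (B, T, vocab_size)


# Toy usage: compute the masked-word loss only on masked positions.
model = ImageConditionedMLM(vocab_size=30522)
images = torch.randn(2, 3, 224, 224)
token_ids = torch.randint(0, 30522, (2, 16))     # captions where some positions hold a [MASK] id
targets = token_ids.clone()
mask = torch.rand(2, 16) < 0.15                  # positions whose original words must be recovered
targets[~mask] = -100                            # ignored by the cross-entropy loss
logits = model(images, token_ids)
loss = nn.functional.cross_entropy(
    logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100
)
```

In this sketch the visual feature is simply added to every token representation before the prediction head; the paper explores dedicated fusion modules, so treat the fusion step here only as a placeholder for where visual cues enter the masked-word prediction.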

Supplementary material

504445_1_En_10_MOESM1_ESM.pdf – Supplementary material 1 (PDF, 6.1 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. NAVER LABS Europe, Meylan, France
