Abstract
Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable of finer-grained alignment, but required dense annotations that were not scalable. We propose a single-stream architecture that aligns images and language at multiple levels: global, fine-grained patch-token, and conceptual/semantic, using two novel tasks: symmetric cross-modality reconstruction (XMM) and pseudo-labeled keyword prediction (PSL). In XMM, we mask input tokens from one modality and use cross-modal information to reconstruct the masked tokens, thus improving fine-grained alignment between the two modalities. In PSL, we use attention to select keywords in a caption, use a momentum encoder to recommend other important keywords that are missing from the caption but represented in the image, and then train the visual encoder to predict the presence of those keywords, helping it learn semantic concepts that are essential for grounding a textual token to an image region. We demonstrate competitive performance and improved data efficiency on image-text retrieval, grounding, and visual question answering/reasoning against larger models and models trained on more data. Code and models are available at zaidkhan.me/SIMLA.
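As a rough illustration of the two objectives, the sketch below shows one plausible way to compute the text-side XMM reconstruction loss and the PSL multi-label keyword loss in a PyTorch-style setup. This is a hedged sketch, not the released SIMLA implementation: all function names, tensor shapes, and the -100 ignore-index convention are assumptions for illustration only.

```python
# Minimal sketch of the two pretraining objectives described in the abstract,
# assuming a PyTorch-style setup. All names here are illustrative placeholders,
# not the released SIMLA API.
import torch
import torch.nn.functional as F

def xmm_text_loss(fused_token_logits, masked_token_targets):
    """Text side of symmetric cross-modality reconstruction (XMM): predict
    masked caption tokens from the single-stream (fused image-text) encoder."""
    # fused_token_logits:   (batch, seq_len, vocab) from a language-modeling head
    # masked_token_targets: (batch, seq_len) original token ids at masked
    #                       positions, -100 elsewhere so only masked tokens count
    return F.cross_entropy(
        fused_token_logits.flatten(0, 1),   # (batch * seq_len, vocab)
        masked_token_targets.flatten(),     # (batch * seq_len,)
        ignore_index=-100,
    )

def psl_loss(image_keyword_logits, caption_keywords, momentum_keywords):
    """Pseudo-labeled keyword prediction (PSL): the visual encoder predicts a
    multi-hot set of keywords, combining keywords selected from the caption
    with extra keywords proposed by a momentum encoder."""
    # image_keyword_logits: (batch, keyword_vocab) from a head on the visual encoder
    # caption_keywords, momentum_keywords: (batch, keyword_vocab) multi-hot floats
    targets = torch.clamp(caption_keywords + momentum_keywords, max=1.0)
    return F.binary_cross_entropy_with_logits(image_keyword_logits, targets)
```

In the symmetric formulation, an analogous image-side XMM term would reconstruct masked image patches from the caption; that branch, the attention-based keyword selection, and the momentum-encoder update are omitted here for brevity.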
Keywords
- Vision-language modeling
- Cross-modality learning
Cite this paper
Khan, Z., Vijay Kumar, B.G., Yu, X., Schulter, S., Chandraker, M., Fu, Y. (2022). Single-Stream Multi-level Alignment for Vision-Language Pretraining. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_42