Single-Stream Multi-level Alignment for Vision-Language Pretraining

Part of the Lecture Notes in Computer Science book series (LNCS, volume 13696)


Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but it ignores fine-grained alignment because its dual-stream architecture aligns image and text representations only at a global level. Earlier supervised, non-contrastive methods were capable of finer-grained alignment, but they required dense annotations that do not scale. We propose a single-stream architecture that aligns images and language at multiple levels (global, fine-grained patch-token, and conceptual/semantic) using two novel tasks: symmetric cross-modality reconstruction (XMM) and pseudo-labeled keyword prediction (PSL). In XMM, we mask input tokens from one modality and use cross-modal information to reconstruct the masked tokens, thus improving fine-grained alignment between the two modalities. In PSL, we use attention to select keywords in a caption, use a momentum encoder to recommend other important keywords that are missing from the caption but represented in the image, and then train the visual encoder to predict the presence of those keywords, helping it learn semantic concepts that are essential for grounding a textual token to an image region. We demonstrate competitive performance and improved data efficiency on image-text retrieval, grounding, and visual question answering/reasoning against larger models and models trained on more data. Code and models available at
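
The PSL target construction described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the top-k selection rule, the score threshold, and the toy vocabulary are hypothetical, and the attention weights and momentum-encoder scores are passed in as plain numbers rather than computed by a model. It is not the authors' implementation.

```python
from typing import Dict, List, Sequence


def select_caption_keywords(
    tokens: Sequence[str], attention: Sequence[float], k: int
) -> List[str]:
    """Keep the k caption tokens with the highest attention weight, in caption order."""
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:k])]


def recommend_keywords(momentum_scores: Dict[str, float], threshold: float) -> List[str]:
    """Keywords that a (hypothetical) momentum encoder scores at or above the threshold."""
    return [word for word, score in momentum_scores.items() if score >= threshold]


def build_pseudo_labels(
    vocab: Sequence[str], caption_keywords: Sequence[str], recommended: Sequence[str]
) -> List[float]:
    """Multi-hot target over the keyword vocabulary: 1.0 where a keyword should be predicted."""
    positives = set(caption_keywords) | set(recommended)
    return [1.0 if word in positives else 0.0 for word in vocab]


# Toy inputs; all numbers are made up for illustration.
caption = ["a", "dog", "on", "the", "beach"]
attention = [0.05, 0.50, 0.05, 0.05, 0.35]
momentum_scores = {"dog": 0.9, "sand": 0.7, "car": 0.1}
vocab = ["beach", "car", "dog", "sand"]

caption_keywords = select_caption_keywords(caption, attention, k=2)
recommended = recommend_keywords(momentum_scores, threshold=0.5)
targets = build_pseudo_labels(vocab, caption_keywords, recommended)
```

In this toy example "sand" does not appear in the caption but still becomes a positive target, which mirrors the idea of recommending keywords that are represented in the image but missing from the text. The loss used to train the visual encoder against such targets is not specified here.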


  • Vision-language modeling
  • Cross-modality learning


  1. We depict the masking as a boolean operation for notational simplicity. The implementation follows the masking strategy of BEiT [4] for the image input I and that of BERT [9] for the text input T.
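
As a rough illustration of masking viewed as a boolean operation, the sketch below builds the two symmetric XMM inputs: one with masked image tokens and intact text, and one with intact image tokens and masked text. It is a simplification under stated assumptions: positions are sampled uniformly at random, whereas the note says the implementation follows the BEiT and BERT strategies, and the names `MASK`, `sample_mask`, `apply_mask`, and `xmm_views` as well as the masking ratio are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

MASK = "[MASK]"  # placeholder symbol for a masked position (hypothetical)


def sample_mask(length: int, ratio: float, rng: random.Random) -> List[bool]:
    """Boolean mask with roughly `ratio` of positions set to True (at least one)."""
    if length == 0:
        return []
    n_masked = min(length, max(1, round(length * ratio)))
    chosen = set(rng.sample(range(length), n_masked))
    return [i in chosen for i in range(length)]


def apply_mask(tokens: Sequence[str], mask: Sequence[bool]) -> List[str]:
    """Replace every token whose mask entry is True with the MASK symbol."""
    return [MASK if m else t for t, m in zip(tokens, mask)]


def xmm_views(
    image_tokens: Sequence[str],
    text_tokens: Sequence[str],
    ratio: float,
    rng: random.Random,
) -> Tuple[Tuple[List[str], List[str]], Tuple[List[str], List[str]]]:
    """Return (masked image, intact text) and (intact image, masked text)."""
    image_mask = sample_mask(len(image_tokens), ratio, rng)
    text_mask = sample_mask(len(text_tokens), ratio, rng)
    masked_image_view = (apply_mask(image_tokens, image_mask), list(text_tokens))
    masked_text_view = (list(image_tokens), apply_mask(text_tokens, text_mask))
    return masked_image_view, masked_text_view


# Toy inputs: strings stand in for patch and word embeddings.
patches = ["p0", "p1", "p2", "p3"]
words = ["a", "dog", "on", "the", "beach"]
view_masked_image, view_masked_text = xmm_views(patches, words, 0.4, random.Random(0))
```

In each view the unmasked modality is passed through unchanged, so a model reconstructing the masked tokens has the full other modality available as context.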


  1. Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. ArXiv arXiv:2204.14198 (2022)

  2. Alayrac, J., et al.: Self-supervised multimodal versatile networks. CoRR arXiv:2006.16228 (2020)

  3. Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., Tran, D.: Self-supervised learning by cross-modal audio-video clustering. CoRR arXiv:1911.12667 (2019)

  4. Bao, H., Dong, L., Wei, F.: Beit: BERT pre-training of image transformers. CoRR arXiv:2106.08254 (2021)

  5. Bommasani, R., et al.: On the opportunities and risks of foundation models. ArXiv arXiv:2108.07258 (2021)

  6. Chen, Y.-C., et al.: UNITER: UNiversal image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020).

  7. Cho, J., Lei, J., Tan, H., Bansal, M.: Unifying vision-and-language tasks via text generation. arXiv preprint arXiv:2102.02779 (2021)

  8. Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: practical automated data augmentation with a reduced search space. In: CVPR Workshops, pp. 702–703 (2020)

  9. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) NAACL, pp. 4171–4186 (2019)

  10. Do, V., Camburu, O.M., Akata, Z., Lukasiewicz, T.: e-snli-ve-2.0: corrected visual-textual entailment with natural language explanations. ArXiv arXiv:2004.03744 (2020)

  11. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: ICLR (2021)

  12. Gan, Z., Chen, Y., Li, L., Zhu, C., Cheng, Y., Liu, J.: Large-scale adversarial training for vision-and-language representation learning. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) NeurIPS (2020)

  13. Gao, D., et al.: Fashionbert: text and image matching with adaptive loss for cross-modal retrieval. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (2020)

  14. Goel, S., Bansal, H., Bhatia, S.K., Rossi, R.A., Vinay, V., Grover, A.: Cyclip: cyclic contrastive language-image pretraining. ArXiv arXiv:2205.14459 (2022)

  15. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR (2020)

  16. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

  17. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918 (2021)

  18. Karpathy, A., Li, F.: Deep visual-semantic alignments for generating image descriptions. In: CVPR, pp. 3128–3137 (2015)

  19. Kazakos, E., Nagrani, A., Zisserman, A., Damen, D.: Epic-fusion: audio-visual temporal binding for egocentric action recognition. CoRR arXiv:1908.08498 (2019)

  20. Kim, J., Jun, J., Zhang, B.: Bilinear attention networks. In: Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) NIPS, pp. 1571–1581 (2018)

  21. Kim, W., Son, B., Kim, I.: Vilt: vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334 (2021)

  22. Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV 123(1), 32–73 (2017)

  23. Li, J., Li, D., Xiong, C., Hoi, S.C.H.: Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML (2022)

  24. Li, J., Selvaraju, R.R., Gotmare, A.D., Joty, S., Xiong, C., Hoi, S.: Align before fuse: vision and language representation learning with momentum distillation. In: NeurIPS (2021)

  25. Li, L.H., Yatskar, M., Yin, D., Hsieh, C., Chang, K.: Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

  26. Li, W., et al.: Unimo: towards unified-modal understanding and generation via cross-modal contrastive learning. ArXiv arXiv:2012.15409 (2021)

  27. Li, W., et al.: UNIMO-2: end-to-end unified vision-language grounded learning. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 3187–3201. Association for Computational Linguistics, Dublin, Ireland (2022).

  28. Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020).

  29. Li, Y., et al.: Supervision exists everywhere: a data efficient contrastive language-image pre-training paradigm. In: International Conference on Learning Representations (2022).

  30. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014).

  31. Liu, X., Li, L., Wang, S., Zha, Z., Meng, D., Huang, Q.: Adaptive reconstruction network for weakly supervised referring expression grounding. In: ICCV, pp. 2611–2620 (2019)

  32. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  33. Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) NeurIPS, pp. 13–23 (2019)

  34. Lu, J., Goswami, V., Rohrbach, M., Parikh, D., Lee, S.: 12-in-1: multi-task vision and language representation learning. In: CVPR, pp. 10434–10443 (2020)

  35. Mu, N., Kirillov, A., Wagner, D.A., Xie, S.: Slip: self-supervision meets language-image pre-training. ArXiv arXiv:2112.12750 (2021)

  36. Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C., Sun, C.: Attention bottlenecks for multimodal fusion. CoRR arXiv:2107.00135 (2021)

  37. van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. CoRR arXiv:1807.03748 (2018)

  38. van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning. CoRR arXiv:1711.00937 (2017)

  39. Ordonez, V., Kulkarni, G., Berg, T.L.: Im2text: describing images using 1 million captioned photographs. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F.C.N., Weinberger, K.Q. (eds.) NIPS, pp. 1143–1151 (2011)

  40. Owens, A., Efros, A.A.: Audio-visual scene analysis with self-supervised multisensory features. CoRR arXiv:1804.03641 (2018)

  41. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: ICCV, pp. 2641–2649 (2015)

  42. Radford, A., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)

  43. Rostamzadeh, N., et al.: Fashion-gen: the generative fashion dataset and challenge. ArXiv arXiv:1806.08317 (2018)

  44. Selvaraju, R.R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., Batra, D.: Grad-cam: visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vision 128, 336–359 (2017)

  45. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Gurevych, I., Miyao, Y. (eds.) ACL, pp. 2556–2565 (2018)

  46. Singh, A., et al.: Flava: a foundational language and vision alignment model. ArXiv arXiv:2112.04482 (2021)

  47. Su, W., et al.: Vl-bert: pre-training of generic visual-linguistic representations. In: ICLR (2020)

  48. Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) EMNLP, pp. 5099–5110 (2019)

  49. Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. CoRR arXiv:1906.05849 (2019)

  50. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) NIPS, pp. 5998–6008 (2017)

  51. Wang, P., et al.: Ofa: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. CoRR arXiv:2202.03052 (2022)

  52. Wang, Z., Yu, J., Yu, A.W., Dai, Z., Tsvetkov, Y., Cao, Y.: Simvlm: simple visual language model pretraining with weak supervision. ArXiv arXiv:2108.10904 (2021)

  53. Xiao, F., Lee, Y.J., Grauman, K., Malik, J., Feichtenhofer, C.: Audiovisual slowfast networks for video recognition. CoRR arXiv:2001.08740 (2020)

  54. Xie, Y., et al.: Visual clues: bridging vision and language foundations for image paragraph captioning. ArXiv arXiv:2206.01843 (2022)

  55. Yang, J., et al.: Vision-language pre-training with triple contrastive learning. In: CVPR 2022 (2022).

  56. Yao, L., et al.: Filip: fine-grained interactive language-image pre-training. ArXiv arXiv:2111.07783 (2021)

  57. Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: Coca: contrastive captioners are image-text foundation models. ArXiv arXiv:2205.01917 (2022)

  58. Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. ArXiv arXiv:1608.00272 (2016)

  59. Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 69–85. Springer, Cham (2016).

  60. Yuan, L., et al.: Florence: a new foundation model for computer vision. ArXiv arXiv:2111.11432 (2021)

  61. Zeng, A., et al.: Socratic models: composing zero-shot multimodal reasoning with language. ArXiv arXiv:2204.00598 (2022)

  62. Zhai, X., et al.: Lit: Zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18123–18133 (June 2022)

  63. Zhang, Z., Zhao, Z., Lin, Z., Zhu, J., He, X.: Counterfactual contrastive learning for weakly-supervised vision-language grounding. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) NeurIPS (2020)

  64. Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., Kong, T.: ibot: image Bert pre-training with online tokenizer. In: International Conference on Learning Representations (ICLR) (2022)

Author information

Corresponding author

Correspondence to Zaid Khan.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Khan, Z., Vijay Kumar, B.G., Yu, X., Schulter, S., Chandraker, M., Fu, Y. (2022). Single-Stream Multi-level Alignment for Vision-Language Pretraining. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20058-8

  • Online ISBN: 978-3-031-20059-5

  • eBook Packages: Computer Science, Computer Science (R0)