Contrastive Learning for Weakly Supervised Phrase Grounding

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12348)


Phrase grounding, the problem of associating image regions with caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on the mutual information between images and caption words. Given pairs of images and captions, we maximize the compatibility of the attention-weighted regions with the words in the corresponding caption, relative to non-corresponding image-caption pairs. A key idea is to construct effective negative captions for learning through language-model-guided word substitutions. Training with our negatives yields a \(\sim 10\%\) absolute gain in accuracy over randomly sampled negatives from the training data. Our weakly supervised phrase grounding model, trained on COCO-Captions, shows a gain of \(5.7\%\), achieving \(76.7\%\) accuracy on the Flickr30K Entities benchmark. Our code and project material will be available at
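The lower bound referenced in the abstract is the standard InfoNCE bound (van den Oord et al.); the notation below is chosen here for illustration, with \(f\) denoting the compatibility score between a caption and its attention-weighted image regions:

```latex
I(X;Y) \;\ge\; \log N \;-\; \mathcal{L}_{\mathrm{InfoNCE}},
\qquad
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\,\mathbb{E}\!\left[\log
      \frac{e^{f(x,\,y)}}{\sum_{j=1}^{N} e^{f(x,\,y_j)}}\right],
```

where the sum runs over the one matching caption \(y\) and \(N-1\) negative captions \(y_j\). Minimizing the contrastive loss therefore tightens a lower bound on the mutual information between images and caption words.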


Keywords: Mutual information · InfoNCE · Grounding · Attention
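The batch-contrastive objective described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function and variable names are chosen here for exposition, and the paper's language-model-guided negative captions are replaced by simple in-batch negatives for brevity.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def softmax(x, axis=-1):
    return np.exp(log_softmax(x, axis=axis))

def word_region_infonce(regions, words, tau=0.07):
    """Contrastive word-region grounding loss (illustrative).

    regions: (B, R, D) region features per image (e.g. detector outputs)
    words:   (B, T, D) word features per caption (e.g. from a language model)
    Every caption is scored against every image in the batch; the matching
    image is the positive, the remaining B-1 images act as negatives.
    """
    B = regions.shape[0]
    # sim[i, j, t, r] = <word t of caption i, region r of image j>
    sim = np.einsum('itd,jrd->ijtr', words, regions)
    # Attention over regions for each word, then attention-weighted score.
    attn = softmax(sim / tau, axis=-1)
    word_scores = (attn * sim).sum(axis=-1)   # (B, B, T) per-word compatibility
    pair_scores = word_scores.mean(axis=-1)   # (B, B) caption-image compatibility
    # InfoNCE: cross-entropy with the matching pairs on the diagonal.
    logp = log_softmax(pair_scores / tau, axis=1)
    return -logp[np.arange(B), np.arange(B)].mean()
```

Because the loss only supervises the caption-image match, the word-region attention map is learned without any region-level annotation; at test time, a word is grounded by taking the region it attends to most strongly.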



This work was done partly at NVIDIA and is partly supported by ONR MURI Award N00014-16-1-2007.

Supplementary material

  - Supplementary material 1 (PDF, 183 KB)
  - Supplementary material 2 (PDF, 19.6 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Illinois Urbana-Champaign, Champaign, USA
  2. Bar Ilan University, Ramat Gan, Israel
  3. NVIDIA, Santa Clara, USA
