
Contrastive Learning for Weakly Supervised Phrase Grounding

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12348)

Abstract

Phrase grounding, the problem of associating image regions with caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on the mutual information between images and caption words. Given pairs of images and captions, we maximize the compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language model guided word substitutions. Training with our negatives yields a \(\sim 10\%\) absolute gain in accuracy over randomly sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of \(5.7\%\) to achieve \(76.7\%\) accuracy on the Flickr30K Entities benchmark. Our code and project material will be available at http://tanmaygupta.info/info-ground.
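
The abstract describes the training objective at a high level: attention-weighted region features are scored against caption words, and matching image-caption pairs are contrasted against non-matching ones under an InfoNCE-style lower bound on mutual information. Below is a minimal sketch of that idea, assuming PyTorch; the tensor names (region_feats, word_feats) and the dot-product scoring are illustrative assumptions, not the authors' code, and only in-batch image negatives are shown rather than the language-model-substituted negative captions the paper emphasizes.

```python
# Sketch of an InfoNCE-style word-region contrastive objective (illustrative only).
import torch
import torch.nn.functional as F

def attended_region_features(region_feats, word_feats):
    """Attention-weighted region feature for each caption word.

    region_feats: (B, R, D) image region features (e.g. detector outputs)
    word_feats:   (B, W, D) contextualized word features (e.g. from a language model)
    returns:      (B, W, D) one attended region vector per word
    """
    attn = torch.softmax(word_feats @ region_feats.transpose(1, 2), dim=-1)  # (B, W, R)
    return attn @ region_feats  # (B, W, D)

def info_nce_loss(region_feats, word_feats):
    """Contrast each caption's words with attended regions from every image in the
    batch; non-corresponding image-caption pairs serve as negatives."""
    B, W, D = word_feats.shape
    # Attend each caption to every image in the batch: (B_cap, B_img, W, D).
    attn_all = torch.stack(
        [attended_region_features(region_feats, word_feats[i].expand(B, W, D))
         for i in range(B)],
        dim=0,
    )
    # Compatibility score per (caption, image) pair: sum of word-region dot products.
    scores = (attn_all * word_feats.unsqueeze(1)).sum(dim=(-1, -2))  # (B_cap, B_img)
    # The matching image is the positive class for each caption.
    labels = torch.arange(B, device=scores.device)
    return F.cross_entropy(scores, labels)
```

In this sketch the negatives are simply the other images in the batch; the paper's reported gains come from additionally contrasting against negative captions produced by language-model-guided word substitutions, which would enter the same softmax as extra rows of scores.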

Keywords

Mutual information · InfoNCE · Grounding · Attention

Notes

Acknowledgement

This work was done in part at NVIDIA and is supported in part by ONR MURI Award N00014-16-1-2007.

Supplementary material

Supplementary material 1: 504435_1_En_44_MOESM1_ESM.pdf (PDF, 183 KB)
Supplementary material 2: 504435_1_En_44_MOESM2_ESM.pdf (PDF, 19.6 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Illinois Urbana-Champaign, Champaign, USA
  2. Bar-Ilan University, Ramat Gan, Israel
  3. NVIDIA, Santa Clara, USA
