ITContrast: contrastive learning with hard negative synthesis for image-text matching

  • Original article
  • Published in The Visual Computer (2024)

A Correction to this article was published on 11 March 2024

This article has been updated

Abstract

Image-text matching aims to bridge vision and language by matching instances of one modality with corresponding instances of the other. Recent years have seen considerable progress in this area through local alignment between image regions and sentence words. However, open questions remain about how to learn modality-invariant feature embeddings and how to effectively exploit hard negatives in the training set to infer more accurate matching scores. In this paper, we introduce Image-Text Modality Contrastive Learning (ITContrast) for image-text matching. Our method addresses these challenges by leveraging a pre-trained vision-language model, OSCAR, which is first fine-tuned to obtain visual and textual features. We also introduce a hard negative synthesis module that exploits the difficulty of negative samples: it profiles the negatives within a mini-batch and generates representative embeddings that reflect their hardness relative to the anchor sample. A novel cost function integrates the information from positives, negatives and synthesized hard negatives. Extensive experiments on the MS COCO and Flickr30K datasets demonstrate that our approach is effective for image-text matching.
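To make the training objective described in the abstract concrete, the following is a minimal sketch (not the authors' released implementation) of a mini-batch image-text contrastive loss in which, for each anchor, the in-batch negatives are also fused into one synthesized hard-negative embedding, weighted by their similarity to the anchor, and appended to the candidate set alongside the positive and the ordinary negatives. The function names, the temperature tau and the similarity-weighted fusion rule are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F


def synthesize_hard_negative(anchor, negatives, tau=0.07):
    # Weight each in-batch negative by its similarity to the anchor and fuse
    # them into a single representative hard-negative embedding (assumption:
    # similarity-weighted averaging; the paper's exact synthesis rule may differ).
    sims = negatives @ anchor                    # (N,) cosine similarities (inputs are L2-normalized)
    weights = torch.softmax(sims / tau, dim=0)   # harder negatives receive larger weights
    return F.normalize(weights @ negatives, dim=0)


def itcontrast_loss(img_emb, txt_emb, tau=0.07):
    # Symmetric image-to-text and text-to-image contrastive loss over a
    # mini-batch of paired embeddings, with one synthesized hard negative
    # appended to each anchor's candidate set.
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    losses = []
    for i in range(img_emb.size(0)):
        for anchor, candidates in ((img_emb[i], txt_emb), (txt_emb[i], img_emb)):
            pos = candidates[i]
            negs = torch.cat([candidates[:i], candidates[i + 1:]], dim=0)
            hard = synthesize_hard_negative(anchor, negs, tau)
            logits = torch.cat([negs @ anchor,
                                (hard @ anchor).unsqueeze(0),
                                (pos @ anchor).unsqueeze(0)]) / tau
            target = torch.tensor([logits.numel() - 1])  # the positive logit is last
            losses.append(F.cross_entropy(logits.unsqueeze(0), target))
    return torch.stack(losses).mean()


# Toy usage: a mini-batch of 8 paired image/text features of dimension 256.
loss = itcontrast_loss(torch.randn(8, 256), torch.randn(8, 256))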

Data availability

The authors confirm that the code, network weights and datasets supporting the results of this study can be found in the article.

Change history

11 March 2024: A Correction to this article was published to correct the affiliation of the third author.

References

  1. Ghosh, M., Roy, S.S., Mukherjee, H., Obaidullah, S.M., Santosh, K., Roy, K.: Understanding movie poster: transfer-deep learning approach for graphic-rich text recognition. Vis. Comput., 1–20 (2022)

  2. Macedo, D.V., Rodrigues, M.A.F.: Real-time dynamic reflections for realistic rendering of 3d scenes. Vis. Comput. 34, 337–346 (2018)

  3. Junkert, F., Eberts, M., Ulges, A., Schwanecke, U.: Cross-modal image-graphics retrieval by neural transfer learning. In: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pp. 330–337 (2017)

  4. Wei, X., Zhang, T., Li, Y., Zhang, Y., Wu, F.: Multi-modality cross attention network for image and sentence matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10941–10950 (2020)

  5. Liu, C., Mao, Z., Zhang, T., Xie, H., Wang, B., Zhang, Y.: Graph structured network for image-text matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10921–10930 (2020)

  6. Diao, H., Zhang, Y., Ma, L., Lu, H.: Similarity reasoning and filtration for image-text matching. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 1218–1226 (2021)

  7. Jiang, T., Zhang, Z., Yang, Y.: Modeling coverage with semantic embedding for image caption generation. Vis. Comput. 35, 1655–1665 (2019)

  8. Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., Gao, J.: Unified vision-language pre-training for image captioning and VQA. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 13041–13049 (2020)

  9. Sun, B., Wu, Y., Zhao, Y., Hao, Z., Yu, L., He, J.: Cross-language multimodal scene semantic guidance and leap sampling for video captioning. Vis. Comput., 1–17 (2022)

  10. Guo, Z., Han, D.: Multi-modal co-attention relation networks for visual question answering. Vis. Comput. 39(11), 5783–95 (2022)

  11. Yan, F., Silamu, W., Li, Y., Chai, Y.: SPCA-Net: a based on spatial position relationship co-attention network for visual question answering. Vis. Comput. 38(9–10), 3097–3108 (2022)

  12. Zhang, Y., Lu, H.: Deep cross-modal projection learning for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 686–701 (2018)

  13. Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612 (2017)

  14. Lee, K.-H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 201–216 (2018)

  15. Chen, T., Luo, J.: Expressing objects just like words: Recurrent visual embedding for image-text matching. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 10583–10590 (2020)

  16. Li, K., Zhang, Y., Li, K., Li, Y., Fu, Y.: Visual semantic reasoning for image-text matching. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4654–4662 (2019)

  17. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 1597–1607 (2020)

  18. Li, X., Yin, X., Li, C., Zhang, P., et al.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 121–137 (2020)

  19. Feng, Z., Zeng, Z., Guo, C., Li, Z.: Exploiting visual semantic reasoning for video-text retrieval. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1005–1011 (2021)

  20. Wehrmann, J., Kolling, C., Barros, R.C.: Adaptive cross-modal embeddings for image-text alignment. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 12313–12320 (2020)

  21. Liu, C., Mao, Z., Liu, A.A., Zhang, T., Wang, B., Zhang, Y.: Focus your attention: A bidirectional focal attention network for image-text matching. In: Proceedings of the ACM International Conference on Multimedia (ACM MM), pp. 3–11 (2019)

  22. Pan, Z., Wu, F., Zhang, B.: Fine-grained image-text matching by cross-modal hard aligning network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19275–19284 (2023)

  23. Chen, C., Wang, D., Song, B., Tan, H.: Inter-intra modal representation augmentation with DCT-transformer adversarial network for image-text matching. IEEE Transactions on Multimedia, 1–13 (2023)

  24. Xu, X., Wang, T., Yang, Y., Zuo, L., Shen, F., Shen, H.T.: Cross-modal attention with semantic consistence for image-text matching. IEEE Trans. Neural Netw. Learn. Syst. 31(12), 5412–5425 (2020)

  25. Zhang, K., Mao, Z., Wang, Q., Zhang, Y.: Negative-aware attention framework for image-text matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15661–15670 (2022)

  26. Wang, S., Wang, R., Yao, Z., Shan, S., Chen, X.: Cross-modal scene graph matching for relationship-aware image-text retrieval. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1508–1517 (2020)

  27. Wang, H., Zhang, Y., Ji, Z., Pang, Y., Ma, L.: Consensus-aware visual-semantic embedding for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 18–34 (2020)

  28. Zhang, H., Mao, Z., Zhang, K., Zhang, Y.: Show your faith: Cross-modal confidence-aware network for image-text matching. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 36, pp. 3262–3270 (2022)

  29. Chen, T., Deng, J., Luo, J.: Adaptive offline quintuplet loss for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 549–565 (2020)

  30. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738 (2020)

  31. Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., Krishnan, D.: Supervised contrastive learning. Adv. Neural Inform. Process. Syst. NeurIPS 33, 18661–73 (2020)

  32. Li, W., Gao, C., Niu, G., Xiao, X., Liu, H., Liu, J., Wu, H., Wang, H.: UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Vol. 1: Long Papers), pp. 2592–2607 (2021)

  33. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML), pp. 8748–8763 (2021). PMLR

  34. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pp. 4171–4186 (2019)

  35. Gordo, A., Larlus, D.: Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6589–6598 (2017)

  36. Kim, W., Son, B., Kim, I.: ViLT: Vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning (ICML), pp. 5583–5594 (2021). PMLR

  37. Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural. Inf. Process. Syst. 34, 9694–9705 (2021)

  38. Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., Gao, J.: VinVL: Revisiting visual representations in vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5579–5588 (2021)

  39. Qiu, R., Cai, Z., Chang, Z., Liu, S., Tu, G.: A two-stage image process for water level recognition via dual-attention cornernet and ctransformer. Vis. Comput. 39(7), 2933–2952 (2023)

  40. Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S., Anderson, P.: Nocaps: Novel object captioning at scale. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 8948–8957 (2019)

  41. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755 (2014). Springer

  42. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)

  43. Hudson, D.A., Manning, C.D.: GQA: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6700–6709 (2019)

  44. Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: Visual commonsense reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6720–6731 (2019)

  45. Desai, K., Johnson, J.: VirTex: Learning visual representations from textual annotations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11162–11173 (2021)

  46. Sariyildiz, M.B., Perez, J., Larlus, D.: Learning visual representations with caption annotations. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 153–170 (2020). Springer

  47. Wang, L., Li, Y., Huang, J., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 394–407 (2018)

  48. Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., Langlotz, C.P.: Contrastive learning of medical visual representations from paired images and text. In: Machine Learning for Healthcare Conference, pp. 2–25 (2022). PMLR

  49. Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916 (2021). PMLR

  50. Liu, S., Fan, H., Qian, S., Chen, Y., Ding, W., Wang, Z.: HiT: Hierarchical transformer with momentum contrast for video-text retrieval. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 11915–11925 (2021)

  51. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. Adv. Neural Inform. Process. Syst. NeurIPS 28, 91–99 (2015)

  52. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)

  53. Zhang, Q., Lei, Z., Zhang, Z., Li, S.Z.: Context-aware attention network for image-text retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3536–3545 (2020)

  54. Chen, H., Ding, G., Liu, X., Lin, Z., Liu, J., Han, J.: IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12655–12663 (2020)

  55. Wei, J., Yang, Y., Xu, X., Zhu, X., Shen, H.T.: Universal weighting metric learning for cross-modal retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 44(10), 6534–45 (2021)

  56. Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D.: Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 11336–11344 (2020)

  57. Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, Z., et al.: UNITER: Universal image-text representation learning. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 104–120 (2020)

  58. Chen, J., Hu, H., Wu, H., Jiang, Y., Wang, C.: Learning the best pooling strategy for visual semantic embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15789–15798 (2021)

  59. Miech, A., Alayrac, J.B., Laptev, I., Sivic, J., Zisserman, A.: Thinking fast and slow: Efficient text-to-visual retrieval with transformers. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9826–9836 (2021)

  60. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)

  61. Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., Van Der Maaten, L.: Exploring the limits of weakly supervised pretraining. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196 (2018)

Acknowledgements

This work is partially supported by the XJTLU AI University Research Centre, Jiangsu Province Engineering Research Centre of Data Science and Cognitive Computation at XJTLU and SIP AI innovation platform (YZCXPT2022103); National Key Research and Development Project of China Grant (2021ZD0110505); Jiangsu Science and Technology Programme (BE2020006-4); Natural Science Foundation of Zhejiang Province (LY23F020014); the Key Technology R&D Program of Ningbo (2019B10128, 2023Z069); and the Gusu Innovation and Entrepreneurship Leading Talents Programme (ZXL2023176).

Author information

Corresponding author

Correspondence to Fangyu Wu.

Ethics declarations

Conflict of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: the affiliation of the third author was not correct.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wu, F., Wang, Q., Wang, Z. et al. ITContrast: contrastive learning with hard negative synthesis for image-text matching. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03274-w
