ITContrast: contrastive learning with hard negative synthesis for image-text matching

  • Original article
  • Published in The Visual Computer (2024)

A Correction to this article was published on 11 March 2024

This article has been updated

Abstract

Image-text matching aims to bridge vision and language by matching instances of one modality with corresponding instances of the other. Recent years have seen considerable progress in this area through local alignment between image regions and sentence words. However, open questions remain about how to learn modality-invariant feature embeddings and how to effectively exploit hard negatives in the training set to infer more accurate matching scores. In this paper, we introduce Image-Text Modality Contrastive Learning (ITContrast) for image-text matching. Our method addresses these challenges by leveraging a pre-trained vision-language model, OSCAR, which is first fine-tuned to obtain visual and textual features. We also introduce a hard negative synthesis module that exploits the difficulty of negative samples: it profiles the negatives within a mini-batch and generates representative embeddings that reflect their hardness relative to the anchor sample. A novel cost function integrates the information from positives, negatives and synthesized hard negatives. Extensive experiments on the MS COCO and Flickr30K datasets demonstrate that our approach is effective for image-text matching.
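To make the training objective described in the abstract concrete, the following is a minimal sketch (not the authors' released implementation) of a mini-batch image-text contrastive loss in which, for each anchor, the in-batch negatives are also fused into one synthesized hard-negative embedding, weighted by their similarity to the anchor, and appended to the candidate set alongside the positive and the ordinary negatives. The function names, the temperature tau and the similarity-weighted fusion rule are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F


def synthesize_hard_negative(anchor, negatives, tau=0.07):
    # Weight each in-batch negative by its similarity to the anchor and fuse
    # them into a single representative hard-negative embedding (assumption:
    # similarity-weighted averaging; the paper's exact synthesis rule may differ).
    sims = negatives @ anchor                    # (N,) cosine similarities (inputs are L2-normalized)
    weights = torch.softmax(sims / tau, dim=0)   # harder negatives receive larger weights
    return F.normalize(weights @ negatives, dim=0)


def itcontrast_loss(img_emb, txt_emb, tau=0.07):
    # Symmetric image-to-text and text-to-image contrastive loss over a
    # mini-batch of paired embeddings, with one synthesized hard negative
    # appended to each anchor's candidate set.
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    losses = []
    for i in range(img_emb.size(0)):
        for anchor, candidates in ((img_emb[i], txt_emb), (txt_emb[i], img_emb)):
            pos = candidates[i]
            negs = torch.cat([candidates[:i], candidates[i + 1:]], dim=0)
            hard = synthesize_hard_negative(anchor, negs, tau)
            logits = torch.cat([negs @ anchor,
                                (hard @ anchor).unsqueeze(0),
                                (pos @ anchor).unsqueeze(0)]) / tau
            target = torch.tensor([logits.numel() - 1])  # the positive logit is last
            losses.append(F.cross_entropy(logits.unsqueeze(0), target))
    return torch.stack(losses).mean()


# Toy usage: a mini-batch of 8 paired image/text features of dimension 256.
loss = itcontrast_loss(torch.randn(8, 256), torch.randn(8, 256))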

Data availability

The authors confirm that the code, network weights and datasets supporting the results of this study can be found in the article.

Change history

11 March 2024: A Correction to this article was published to correct the affiliation of the third author.

References

  1. Ghosh, M., Roy, S.S., Mukherjee, H., Obaidullah, S.M., Santosh, K., Roy, K.: Understanding movie poster: transfer-deep learning approach for graphic-rich text recognition. Vis. Comput., 1–20 (2022)

  2. Macedo, D.V., Rodrigues, M.A.F.: Real-time dynamic reflections for realistic rendering of 3d scenes. Vis. Comput. 34, 337–346 (2018)

  3. Junkert, F., Eberts, M., Ulges, A., Schwanecke, U.: Cross-modal image-graphics retrieval by neural transfer learning. In: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pp. 330–337 (2017)

  4. Wei, X., Zhang, T., Li, Y., Zhang, Y., Wu, F.: Multi-modality cross attention network for image and sentence matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10941–10950 (2020)

  5. Liu, C., Mao, Z., Zhang, T., Xie, H., Wang, B., Zhang, Y.: Graph structured network for image-text matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10921–10930 (2020)

  6. Diao, H., Zhang, Y., Ma, L., Lu, H.: Similarity reasoning and filtration for image-text matching. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 1218–1226 (2021)

  7. Jiang, T., Zhang, Z., Yang, Y.: Modeling coverage with semantic embedding for image caption generation. Vis. Comput. 35, 1655–1665 (2019)

  8. Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., Gao, J.: Unified vision-language pre-training for image captioning and VQA. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 13041–13049 (2020)

  9. Sun, B., Wu, Y., Zhao, Y., Hao, Z., Yu, L., He, J.: Cross-language multimodal scene semantic guidance and leap sampling for video captioning. Vis. Comput., 1–17 (2022)

  10. Guo, Z., Han, D.: Multi-modal co-attention relation networks for visual question answering. Vis. Comput. 39(11), 5783–95 (2022)

  11. Yan, F., Silamu, W., Li, Y., Chai, Y.: SPCA-Net: a based on spatial position relationship co-attention network for visual question answering. Vis. Comput. 38(9–10), 3097–3108 (2022)

  12. Zhang, Y., Lu, H.: Deep cross-modal projection learning for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 686–701 (2018)

  13. Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612 (2017)

  14. Lee, K.-H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 201–216 (2018)

  15. Chen, T., Luo, J.: Expressing objects just like words: Recurrent visual embedding for image-text matching. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 10583–10590 (2020)

  16. Li, K., Zhang, Y., Li, K., Li, Y., Fu, Y.: Visual semantic reasoning for image-text matching. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4654–4662 (2019)

  17. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 1597–1607 (2020)

  18. Li, X., Yin, X., Li, C., Zhang, P., et al.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 121–137 (2020)

  19. Feng, Z., Zeng, Z., Guo, C., Li, Z.: Exploiting visual semantic reasoning for video-text retrieval. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1005–1011 (2021)

  20. Wehrmann, J., Kolling, C., Barros, R.C.: Adaptive cross-modal embeddings for image-text alignment. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 12313–12320 (2020)

  21. Liu, C., Mao, Z., Liu, A.A., Zhang, T., Wang, B., Zhang, Y.: Focus your attention: A bidirectional focal attention network for image-text matching. In: Proceedings of the ACM International Conference on Multimedia (ACM MM), pp. 3–11 (2019)

  22. Pan, Z., Wu, F., Zhang, B.: Fine-grained image-text matching by cross-modal hard aligning network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19275–19284 (2023)

  23. Chen, C., Wang, D., Song, B., Tan, H.: Inter-intra modal representation augmentation with DCT-transformer adversarial network for image-text matching. IEEE Transactions on Multimedia, 1–13 (2023)

  24. Xu, X., Wang, T., Yang, Y., Zuo, L., Shen, F., Shen, H.T.: Cross-modal attention with semantic consistence for image-text matching. IEEE Trans. Neural Netw. Learn. Syst. 31(12), 5412–5425 (2020)

  25. Zhang, K., Mao, Z., Wang, Q., Zhang, Y.: Negative-aware attention framework for image-text matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15661–15670 (2022)

  26. Wang, S., Wang, R., Yao, Z., Shan, S., Chen, X.: Cross-modal scene graph matching for relationship-aware image-text retrieval. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1508–1517 (2020)

  27. Wang, H., Zhang, Y., Ji, Z., Pang, Y., Ma, L.: Consensus-aware visual-semantic embedding for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 18–34 (2020)

  28. Zhang, H., Mao, Z., Zhang, K., Zhang, Y.: Show your faith: Cross-modal confidence-aware network for image-text matching. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 36, pp. 3262–3270 (2022)

  29. Chen, T., Deng, J., Luo, J.: Adaptive offline quintuplet loss for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 549–565 (2020)

  30. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738 (2020)

  31. Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., Krishnan, D.: Supervised contrastive learning. Adv. Neural Inform. Process. Syst. NeurIPS 33, 18661–73 (2020)

  32. Li, W., Gao, C., Niu, G., Xiao, X., Liu, H., Liu, J., Wu, H., Wang, H.: UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Vol. 1: Long Papers), pp. 2592–2607 (2021)

  33. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML), pp. 8748–8763 (2021). PMLR

  34. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pp. 4171–4186 (2019)

  35. Gordo, A., Larlus, D.: Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6589–6598 (2017)

  36. Kim, W., Son, B., Kim, I.: ViLT: Vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning (ICML), pp. 5583–5594 (2021). PMLR

  37. Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural. Inf. Process. Syst. 34, 9694–9705 (2021)

  38. Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., Gao, J.: VinVL: Revisiting visual representations in vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5579–5588 (2021)

  39. Qiu, R., Cai, Z., Chang, Z., Liu, S., Tu, G.: A two-stage image process for water level recognition via dual-attention cornernet and ctransformer. Vis. Comput. 39(7), 2933–2952 (2023)

  40. Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S., Anderson, P.: Nocaps: Novel object captioning at scale. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 8948–8957 (2019)

  41. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755 (2014). Springer

  42. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)

  43. Hudson, D.A., Manning, C.D.: GQA: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6700–6709 (2019)

  44. Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: Visual commonsense reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6720–6731 (2019)

  45. Desai, K., Johnson, J.: VirTex: Learning visual representations from textual annotations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11162–11173 (2021)

  46. Sariyildiz, M.B., Perez, J., Larlus, D.: Learning visual representations with caption annotations. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 153–170 (2020). Springer

  47. Wang, L., Li, Y., Huang, J., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 394–407 (2018)

  48. Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., Langlotz, C.P.: Contrastive learning of medical visual representations from paired images and text. In: Machine Learning for Healthcare Conference, pp. 2–25 (2022). PMLR

  49. Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916 (2021). PMLR

  50. Liu, S., Fan, H., Qian, S., Chen, Y., Ding, W., Wang, Z.: HiT: Hierarchical transformer with momentum contrast for video-text retrieval. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 11915–11925 (2021)

  51. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. Adv. Neural Inform. Process. Syst. NeurIPS 28, 91–99 (2015)

  52. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)

  53. Zhang, Q., Lei, Z., Zhang, Z., Li, S.Z.: Context-aware attention network for image-text retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3536–3545 (2020)

  54. Chen, H., Ding, G., Liu, X., Lin, Z., Liu, J., Han, J.: IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12655–12663 (2020)

  55. Wei, J., Yang, Y., Xu, X., Zhu, X., Shen, H.T.: Universal weighting metric learning for cross-modal retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 44(10), 6534–45 (2021)

  56. Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D.: Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 11336–11344 (2020)

  57. Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, Z., et al.: UNITER: Universal image-text representation learning. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 104–120 (2020)

  58. Chen, J., Hu, H., Wu, H., Jiang, Y., Wang, C.: Learning the best pooling strategy for visual semantic embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15789–15798 (2021)

  59. Miech, A., Alayrac, J.B., Laptev, I., Sivic, J., Zisserman, A.: Thinking fast and slow: Efficient text-to-visual retrieval with transformers. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9826–9836 (2021)

  60. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)

  61. Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., Van Der Maaten, L.: Exploring the limits of weakly supervised pretraining. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196 (2018)

Acknowledgements

This work is partially supported by the XJTLU AI University Research Centre, Jiangsu Province Engineering Research Centre of Data Science and Cognitive Computation at XJTLU and SIP AI innovation platform (YZCXPT2022103); National Key Research and Development Project of China Grant (2021ZD0110505); Jiangsu Science and Technology Programme (BE2020006-4); Natural Science Foundation of Zhejiang Province (LY23F020014); the Key Technology R&D Program of Ningbo (2019B10128, 2023Z069); and the Gusu Innovation and Entrepreneurship Leading Talents Programme (ZXL2023176).

Author information

Corresponding author

Correspondence to Fangyu Wu.

Ethics declarations

Conflict of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: the affiliation of the third author was not correct.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wu, F., Wang, Q., Wang, Z. et al. ITContrast: contrastive learning with hard negative synthesis for image-text matching. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03274-w
