
Improving text-image cross-modal retrieval with contrastive loss

  • Special Issue Paper
  • Published in: Multimedia Systems

Abstract

The text-image retrieval task has attracted extensive attention in recent years. Because images and texts follow different feature distributions, the performance of this task suffers from a large modality discrepancy. Most retrieval methods map images and texts into a common embedding space and measure their similarities there. However, a dataset may contain multiple texts that describe the same image, and previous approaches rarely consider these texts jointly when computing similarities in the common space. In this paper, we propose a framework that improves text-image cross-modal retrieval with a contrastive loss by taking the multiple texts of each image into account. Using the aggregated text features, our approach better aligns each image with the center of its corresponding texts. Results on the Flickr30K dataset achieve competitive performance, validating the effectiveness of the proposed framework.
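To make the core idea concrete, the snippet below is a minimal PyTorch-style sketch, not the authors' implementation: it mean-pools the several caption embeddings of each image into a "text center" and applies a symmetric InfoNCE-style contrastive loss between image embeddings and their text centers. The mean pooling, the temperature value, and all function names are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def text_center_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Contrastive loss between each image and the center of its texts.

    image_emb: (B, D)    one embedding per image
    text_emb:  (B, K, D) K caption embeddings per image
    The mean-pooled "text center" and the temperature are illustrative
    assumptions, not the paper's exact formulation.
    """
    # Pool the K caption embeddings of each image into a single text center.
    text_center = text_emb.mean(dim=1)                    # (B, D)

    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_center = F.normalize(text_center, dim=-1)

    # Pairwise similarities between all images and all text centers.
    logits = image_emb @ text_center.t() / temperature    # (B, B)

    # Matched image / text-center pairs lie on the diagonal.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric loss over image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```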




Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (61902277, U21B2024), the China Postdoctoral Science Foundation (2021T140511, 2020M680884), the Funding Project of the State Key Laboratory of Communication Content Cognition (Grant No. A02106), and the Open Funding Project of the State Key Laboratory of Communication Content Cognition (Grant No. 20K04).

Author information

Corresponding author

Correspondence to Dan Song.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, C., Yang, Y., Guo, J. et al. Improving text-image cross-modal retrieval with contrastive loss. Multimedia Systems 29, 569–575 (2023). https://doi.org/10.1007/s00530-022-00962-2


  • DOI: https://doi.org/10.1007/s00530-022-00962-2
