
Multi-level network based on transformer encoder for fine-grained image–text matching

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Image–text matching is important for understanding both vision and language. Existing methods utilize the cross-attention mechanism to explore deep semantic information. However, the majority of these methods need to perform two types of alignment, which is extremely time-consuming. In addition, current methods do not consider the digital information within the image or text, which may reduce retrieval performance. In this paper, we propose a multi-level network based on the transformer encoder for fine-grained image–text matching. First, we use the transformer encoder to extract intra-modality relations within the image and text and perform the alignment through an efficient aggregation method, which makes the alignment more efficient and fully exploits the intra-modality information. Second, we capture the discriminative digital information within the image and text to make the representations more distinguishable. Finally, we utilize the global information of the image and text as complementary information to enhance the representations. According to our experimental results, significant improvements in retrieval performance and runtime can be achieved compared with state-of-the-art algorithms. The source code is available at https://github.com/CQULab/MNTE.
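Since the abstract only sketches the architecture, the following is a minimal illustrative sketch, in PyTorch (the framework named in the Notes), of the general pattern it describes: a transformer encoder models intra-modality relations among image regions or words, and a simple aggregation step produces a single embedding per modality that can be aligned with a ranking loss. The class names, dimensions, mean pooling, and the hard-negative triplet loss used below are assumptions for illustration, not the authors' implementation; the actual method is in the linked repository.

# Hypothetical sketch of transformer-encoder-based image-text matching.
# Names and hyperparameters are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Projects fragment features (regions or words), models intra-modality
    relations with transformer encoder layers, and aggregates them into a
    single normalized embedding (mean pooling as one simple aggregation)."""
    def __init__(self, in_dim, embed_dim=1024, n_layers=2, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                # x: (batch, num_fragments, in_dim)
        h = self.encoder(self.proj(x))   # intra-modality relations
        g = h.mean(dim=1)                # aggregate fragments into one vector
        return F.normalize(g, dim=-1)    # unit-norm global embedding

def triplet_loss_hard(img, txt, margin=0.2):
    """VSE++-style hinge loss with hardest in-batch negatives, a common
    objective in this line of work (assumed here, not stated in the abstract)."""
    sim = img @ txt.t()                  # cosine similarities (inputs normalized)
    pos = sim.diag().unsqueeze(1)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_txt = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_img = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_txt.max(1)[0].mean() + cost_img.max(0)[0].mean()

# Example: 36 region features (2048-d, e.g. from a region detector) and
# 20 word vectors (300-d) for a batch of 8 image-sentence pairs.
image_enc = ModalityEncoder(in_dim=2048)
text_enc = ModalityEncoder(in_dim=300)
regions = torch.randn(8, 36, 2048)
words = torch.randn(8, 20, 300)
loss = triplet_loss_hard(image_enc(regions), text_enc(words))
loss.backward()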


Data availability

The datasets used during the current study were derived from the following public domain resources: http://shannon.cs.illinois.edu/DenotationGraph/ and https://cocodataset.org/.

Notes

  1. Our proposed method is implemented in the PyTorch framework and runs on an NVIDIA GeForce GTX 1080Ti GPU.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 62262006), the National Natural Science Foundation of China awarded to Mingliang Zhou (No. 62176027), Zhejiang Lab (No. 2021KE0AB01), the Open Fund of the Key Laboratory of Monitoring, Evaluation and Early Warning of Territorial Spatial Planning Implementation, Ministry of Natural Resources (No. LMEE-KF2021008), the Technology Innovation and Application Development Key Project of Chongqing (No. cstc2021jscx-gksbX0058), and the Guangxi Key Laboratory of Trusted Software (No. kx202006).

Author information


Corresponding authors

Correspondence to Yong Feng or Mingliang Zhou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yang, L., Feng, Y., Zhou, M. et al. Multi-level network based on transformer encoder for fine-grained image–text matching. Multimedia Systems 29, 1981–1994 (2023). https://doi.org/10.1007/s00530-023-01079-w
