
Multi-level network based on transformer encoder for fine-grained image–text matching

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Image–text matching is important for understanding both vision and language. Existing methods utilize the cross-attention mechanism to explore deep semantic information. However, the majority of these methods need to perform two types of alignment, which is extremely time-consuming. In addition, current methods do not consider the digital information within the image or text, which may reduce retrieval performance. In this paper, we propose a multi-level network based on the transformer encoder for fine-grained image–text matching. First, we use the transformer encoder to extract intra-modality relations within the image and text and perform the alignment through an efficient aggregation method, which makes the alignment more efficient and fully exploits the intra-modality information. Second, we capture the discriminative digital information within the image and text to make the representations more distinguishable. Finally, we utilize the global information of the image and text as complementary information to enhance the representations. According to our experimental results, significant improvements in retrieval performance and runtime can be achieved compared with state-of-the-art algorithms. The source code is available at https://github.com/CQULab/MNTE.
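Since the abstract only sketches the architecture, the following is a minimal illustrative sketch, in PyTorch (the framework named in the Notes), of the general pattern it describes: a transformer encoder models intra-modality relations among image regions or words, and a simple aggregation step produces a single embedding per modality that can be aligned with a ranking loss. The class names, dimensions, mean pooling, and the hard-negative triplet loss used below are assumptions for illustration, not the authors' implementation; the actual method is in the linked repository.

# Hypothetical sketch of transformer-encoder-based image-text matching.
# Names and hyperparameters are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Projects fragment features (regions or words), models intra-modality
    relations with transformer encoder layers, and aggregates them into a
    single normalized embedding (mean pooling as one simple aggregation)."""
    def __init__(self, in_dim, embed_dim=1024, n_layers=2, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                # x: (batch, num_fragments, in_dim)
        h = self.encoder(self.proj(x))   # intra-modality relations
        g = h.mean(dim=1)                # aggregate fragments into one vector
        return F.normalize(g, dim=-1)    # unit-norm global embedding

def triplet_loss_hard(img, txt, margin=0.2):
    """VSE++-style hinge loss with hardest in-batch negatives, a common
    objective in this line of work (assumed here, not stated in the abstract)."""
    sim = img @ txt.t()                  # cosine similarities (inputs normalized)
    pos = sim.diag().unsqueeze(1)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_txt = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_img = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_txt.max(1)[0].mean() + cost_img.max(0)[0].mean()

# Example: 36 region features (2048-d, e.g. from a region detector) and
# 20 word vectors (300-d) for a batch of 8 image-sentence pairs.
image_enc = ModalityEncoder(in_dim=2048)
text_enc = ModalityEncoder(in_dim=300)
regions = torch.randn(8, 36, 2048)
words = torch.randn(8, 20, 300)
loss = triplet_loss_hard(image_enc(regions), text_enc(words))
loss.backward()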


Data availability

The datasets used during the current study were derived from the following public domain resources: http://shannon.cs.illinois.edu/DenotationGraph/ and https://cocodataset.org/.

Notes

  1. Our proposed method is implemented in the PyTorch framework and runs on an NVIDIA GeForce GTX 1080Ti GPU.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 62262006), the National Natural Science Foundation of China awarded to Mingliang Zhou (No. 62176027), Zhejiang Lab (No. 2021KE0AB01), the Open Fund of the Key Laboratory of Monitoring, Evaluation and Early Warning of Territorial Spatial Planning Implementation, Ministry of Natural Resources (No. LMEE-KF2021008), the Technology Innovation and Application Development Key Project of Chongqing (No. cstc2021jscx-gksbX0058), and the Guangxi Key Laboratory of Trusted Software (No. kx202006).

Author information


Corresponding authors

Correspondence to Yong Feng or Mingliang Zhou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yang, L., Feng, Y., Zhou, M. et al. Multi-level network based on transformer encoder for fine-grained image–text matching. Multimedia Systems 29, 1981–1994 (2023). https://doi.org/10.1007/s00530-023-01079-w
