Abstract
Image–text retrieval is a challenging task because it requires thorough multimodal understanding and precise discovery of inter-modality relationships. However, most previous approaches resort to global image–text alignment and neglect fine-grained correspondence. Although some works explore local region–word alignment, they usually suffer from a heavy computational burden. In this paper, we propose a prototype local–global alignment (PLGA) network for image–text retrieval that jointly performs fine-grained local alignment and high-level global alignment. Specifically, our PLGA contains two key components: a prototype-based local alignment module and a multi-scale global alignment module. The former enables efficient fine-grained local matching by combining region–prototype alignment and word–prototype alignment, and the latter helps perceive hierarchical global semantics by exploring multi-scale global correlations between the image and text. Overall, the local and global alignment modules mutually boost each other's performance within the unified model. Quantitative and qualitative experimental results on the Flickr30K and MS-COCO benchmarks demonstrate that our proposed approach performs favorably against state-of-the-art methods.
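The efficiency argument behind prototype-based local alignment can be illustrated with a minimal sketch: rather than scoring all R × W region–word pairs directly, each modality is first aligned to a small set of K shared prototypes, and the pair is scored by comparing the two prototype-assignment profiles. The function names, the soft-assignment-then-mean pooling, and the profile-similarity scoring below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize vectors to unit length for cosine similarity
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prototype_alignment_score(regions, words, prototypes):
    """Score an image-text pair through a shared prototype space.

    regions:    (R, D) region features
    words:      (W, D) word features
    prototypes: (K, D) learned prototype vectors

    Instead of computing all R*W region-word similarities, each
    modality is aligned to the K prototypes (cost R*K + W*K), and
    the two resulting prototype profiles are compared.
    """
    regions, words, prototypes = map(l2norm, (regions, words, prototypes))
    reg_sim = regions @ prototypes.T       # (R, K) region-prototype cosine sims
    word_sim = words @ prototypes.T        # (W, K) word-prototype cosine sims
    # Soft-assign each region/word to prototypes, then pool to one
    # K-dim profile per modality
    reg_profile = softmax(reg_sim, axis=1).mean(axis=0)    # (K,)
    word_profile = softmax(word_sim, axis=1).mean(axis=0)  # (K,)
    # Pair score: cosine similarity of the two prototype profiles
    return float(l2norm(reg_profile) @ l2norm(word_profile))
```

With K much smaller than the typical number of region–word pairs, this routes the local matching through the prototype space at O(RK + WK) instead of O(RW), which is the kind of saving the prototype-based local alignment module is designed to provide.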
Data availability
The images and the data supporting Figs. 1, 2, 3, 4, 5, 6, 10 and 11, and Tables 2, 3, 4, 5 and 6 are publicly available at: https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48. The images and the data supporting Figs. 7, 8 and 9 and Table 1 are publicly available at: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00166/43313/From-image-descriptions-to-visual-denotations-New.
References
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 6077–6086
Chen H, Ding G, Liu X, Lin Z, Liu J, Han J (2020) Imram: iterative matching with recurrent attention memory for cross-modal image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 12655–12663
Chen J, Hu H, Wu H, Jiang Y, Wang C (2021) Learning the best pooling strategy for visual semantic embedding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 15789–15798
Chen T, Luo J (2020) Expressing objects just like words: recurrent visual embedding for image-text matching. Proc Assoc Adv Artif Intell (AAAI) 34:10583–10590
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL), pp 4171–4186
Diao H, Zhang Y, Ma L, Lu H (2021) Similarity reasoning and filtration for image-text matching. In: Proceedings of the association for the advance of artificial intelligence (AAAI)
Faghri F, Fleet DJ, Kiros JR, Fidler S (2017) VSE++: improving visual-semantic embeddings with hard negatives. arXiv:1707.05612
Ge Y, Zhu F, Chen D, Zhao R et al (2020) Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. In: Proceedings of the conference and workshop on neural information processing systems (NIPS) 33:11309–11321
Jocher G et al (2021) YOLOv5. https://github.com/ultralytics/yolov5
Gu J, Cai J, Joty SR, Niu L, Wang G (2018) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 7181–7189
He X, Deng L, Chou W (2008) Discriminative learning in sequential pattern recognition. IEEE Signal Process Mag (SPM) 25(5):14–36
Hu P, Peng X, Zhu H, Zhen L, Lin J (2021) Learning cross-modal retrieval with noisy labels. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5403–5413
Huang Y, Wang W, Wang L (2017) Instance-aware image and sentence matching with selective multimodal lstm. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 2310–2318
Huang Y, Wu Q, Song C, Wang L (2018) Learning semantic concepts and order for image and sentence matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 6163–6171
Ji Z, Wang H, Han J, Pang Y (2019) Saliency-guided attention network for image-sentence matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5754–5763
Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)
Karpathy A, Joulin A, Fei-Fei L (2014) Deep fragment embeddings for bidirectional image sentence mapping. arXiv:1406.5679
Klein B, Lev G, Sadeh G, Wolf L (2015) Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 4437–4446
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma DA et al (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis (IJCV) 123(1):32–73
Lee K-H, Chen X, Hua G, Hu H, He X (2018) Stacked cross attention for image-text matching. In: Proceedings of the European conference on computer vision (ECCV), pp 201–216
Li J, Zhou P, Xiong C, Hoi SCH (2020) Prototypical contrastive learning of unsupervised representations. arXiv:2005.04966
Li K, Zhang Y, Li K, Li Y, Fu Y (2019) Visual semantic reasoning for image-text matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 4654–4662
Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: Proceedings of the European conference on computer vision (ECCV), pp 740–755. Springer
Liu C, Mao Z, Zhang T, Xie H, Wang B, Zhang Y (2020) Graph structured network for image-text matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 10921–10930
Liu Y, Guo Y, Bakker EM, Lew MS (2017) Learning a recurrent residual fusion network for multimodal matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 4107–4116
Nam H, Ha J-W, Kim J (2017) Dual attention networks for multimodal reasoning and matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 299–307
Niu Z, Zhou M, Wang L, Gao X, Hua G (2017) Hierarchical multimodal lstm for dense visual-semantic embedding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1881–1889
Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic Differentiation in PyTorch. In: Proceedings of the Conference and Workshop on Neural Information Processing Systems (NIPS). https://openreview.net/forum?id=BJJsrmfCZ
Peng Y, Qi J (2019) CM-GANs: Cross-modal generative adversarial networks for common representation learning. ACM Trans Multimed Comput Commun Appl (TOMM) 15(1):1–24
Peng Y, Huang X, Qi J (2016) Cross-media shared representation by hierarchical learning with multiple deep networks. In: International joint conferences on artificial intelligence organization (IJCAI), pp 3846–3853
Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In: Proceedings of the conference and workshop on neural information processing systems (NIPS) 28:91–99
Salvador A, Gundogdu E, Bazzani L, Donoser M (2021) Revamping cross-modal recipe retrieval with hierarchical transformers and self-supervised learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 15475–15484
Sarafianos N, Xu X, Kakadiaris IA (2019) Adversarial representation learning for text-to-image matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5814–5824
Snell J, Swersky K, Zemel RS (2017) Prototypical networks for few-shot learning. arXiv:1703.05175
Toyama J, Misono M, Suzuki M, Nakayama K, Matsuo Y (2016) Neural machine translation with latent semantic of image and text. arXiv:1611.08459
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN , Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Proceedings of the conference and workshop on neural information processing systems (NIPS), vol 30
Wang H, Zhang Y, Ji Z, Pang Y, Ma L (2020) Consensus-aware visual-semantic embedding for image-text matching. In: Proceedings of the European conference on computer vision (ECCV), pp 18–34. Springer
Wang L, Li Y, Lazebnik S (2016) Learning deep structure-preserving image-text embeddings. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5005–5013
Wang L, Li Y, Huang J, Lazebnik S (2018) Learning two-branch neural networks for image-text matching tasks. IEEE Trans Pattern Anal Mach Intell (TPAMI) 41(2):394–407
Wang S, Wang R, Yao Z, Shan S, Chen X (2020) Cross-modal scene graph matching for relationship-aware image-text retrieval. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision (WACV), pp 1508–1517
Wang X, Liu Z, Yu SX (2021) Unsupervised feature learning by cross-level instance-group discrimination. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 12586–12595
Wang Z, Fang Z, Wang J, Yang Y (2020) Vitaa: visual-textual attributes alignment in person search by natural language. In: Proceedings of the European conference on computer vision (ECCV), pp 402–420. Springer
Wehrmann J, Kolling C, Barros RC (2020) Adaptive cross-modal embeddings for image-text alignment. In: Proceedings of the association for the advance of artificial intelligence (AAAI), vol 34, pp 12313–12320
Wei X, Zhang T, Li Y, Zhang Y, Feng W (2020) Multi-modality cross attention network for image and sentence matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 10941–10950
Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3441–3450
Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguist (TACL) 2:67–78
Zhang Q, Lei Z, Zhang Z, Li SZ (2020) Context-aware attention network for image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3536–3545
Zhang X, Ge Y, Qiao Y, Li H (2021) Refining pseudo labels with clustering consensus over generations for unsupervised object re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3436–3445
Zheng Z, Zheng L, Garrett M, Yang Y, Xu M, Shen Y-D (2020) Dual-path convolutional image-text embeddings with instance loss. ACM Trans Multimed Comput Commun Appl (TOMM) 16(2):1–23
Acknowledgements
This work was supported by the National Key Research and Development Program of China (No. 2018AAA0102200), the National Natural Science Foundation of China (Nos. 62036012, 61720106006, 62002355, 61721004, 61832002, 62072455, 62102415, 62106262 and U1836220), and the Beijing Natural Science Foundation (No. L201001).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Meng, L., Zhang, F., Zhang, X. et al. Prototype local–global alignment network for image–text retrieval. Int J Multimed Info Retr 11, 525–538 (2022). https://doi.org/10.1007/s13735-022-00258-1