Abstract
Image–text retrieval is a challenging task because it requires thorough multimodal understanding and precise discovery of inter-modality relationships. However, most previous approaches resort to global image–text alignment and neglect fine-grained correspondence. Although some works explore local region–word alignment, they usually suffer from a heavy computational burden. In this paper, we propose a prototype local–global alignment (PLGA) network for image–text retrieval that jointly performs fine-grained local alignment and high-level global alignment. Specifically, our PLGA contains two key components: a prototype-based local alignment module and a multi-scale global alignment module. The former enables efficient fine-grained local matching by combining region–prototype alignment and word–prototype alignment, and the latter helps perceive hierarchical global semantics by exploring multi-scale global correlations between the image and text. Overall, the local and global alignment modules mutually boost each other's performance within the unified model. Quantitative and qualitative experimental results on the Flickr30K and MS-COCO benchmarks demonstrate that our proposed approach performs favorably against state-of-the-art methods.
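The efficiency argument behind prototype-based local alignment can be illustrated with a minimal sketch: rather than scoring all R × W region–word pairs directly, each modality is first aligned to a small set of K shared prototypes, and the pair is scored by comparing the two prototype-assignment profiles. The function names, the soft-assignment-then-mean pooling, and the profile-similarity scoring below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize vectors to unit length for cosine similarity
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prototype_alignment_score(regions, words, prototypes):
    """Score an image-text pair through a shared prototype space.

    regions:    (R, D) region features
    words:      (W, D) word features
    prototypes: (K, D) learned prototype vectors

    Instead of computing all R*W region-word similarities, each
    modality is aligned to the K prototypes (cost R*K + W*K), and
    the two resulting prototype profiles are compared.
    """
    regions, words, prototypes = map(l2norm, (regions, words, prototypes))
    reg_sim = regions @ prototypes.T       # (R, K) region-prototype cosine sims
    word_sim = words @ prototypes.T        # (W, K) word-prototype cosine sims
    # Soft-assign each region/word to prototypes, then pool to one
    # K-dim profile per modality
    reg_profile = softmax(reg_sim, axis=1).mean(axis=0)    # (K,)
    word_profile = softmax(word_sim, axis=1).mean(axis=0)  # (K,)
    # Pair score: cosine similarity of the two prototype profiles
    return float(l2norm(reg_profile) @ l2norm(word_profile))
```

With K much smaller than the typical number of region–word pairs, this routes the local matching through the prototype space at O(RK + WK) instead of O(RW), which is the kind of saving the prototype-based local alignment module is designed to provide.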
Data availability
The images and the data supporting Figs. 1, 2, 3, 4, 5, 6, 10 and 11, and Tables 2, 3, 4, 5 and 6 are publicly available at: https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48. The images and the data supporting Figs. 7, 8 and 9 and Table 1 are publicly available at: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00166/43313/From-image-descriptions-to-visual-denotations-New.
References
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 6077–6086
Chen H, Ding G, Liu X, Lin Z, Liu J, Han J (2020) Imram: iterative matching with recurrent attention memory for cross-modal image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 12655–12663
Chen J, Hu H, Wu H, Jiang Y, Wang C (2021) Learning the best pooling strategy for visual semantic embedding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 15789–15798
Chen T, Luo J (2020) Expressing objects just like words: recurrent visual embedding for image-text matching. Proc Assoc Adv Artif Intell (AAAI) 34:10583–10590
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL), pp 4171–4186
Diao H, Zhang Y, Ma L, Lu H (2021) Similarity reasoning and filtration for image-text matching. In: Proceedings of the association for the advance of artificial intelligence (AAAI)
Faghri F, Fleet DJ, Kiros JR, Fidler S (2017) VSE++: improving visual-semantic embeddings with hard negatives. arXiv:1707.05612
Ge Y, Zhu F, Chen D, Zhao R et al (2020) Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. In: Proceedings of the conference and workshop on neural information processing systems (NIPS) 33:11309–11321
Jocher G et al (2021) YOLOv5. https://github.com/ultralytics/yolov5
Gu J, Cai J, Joty SR, Niu L, Wang G (2018) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 7181–7189
He X, Deng L, Chou W (2008) Discriminative learning in sequential pattern recognition. IEEE Signal Process Mag (SPM) 25(5):14–36
Hu P, Peng X, Zhu H, Zhen L, Lin J (2021) Learning cross-modal retrieval with noisy labels. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5403–5413
Huang Y, Wang W, Wang L (2017) Instance-aware image and sentence matching with selective multimodal lstm. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 2310–2318
Huang Y, Wu Q, Song C, Wang L (2018) Learning semantic concepts and order for image and sentence matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 6163–6171
Ji Z, Wang H, Han J, Pang Y (2019) Saliency-guided attention network for image-sentence matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5754–5763
Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)
Karpathy A, Joulin A, Fei-Fei L (2014) Deep fragment embeddings for bidirectional image sentence mapping. arXiv:1406.5679
Klein B, Lev G, Sadeh G, Wolf L (2015) Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 4437–4446
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma DA et al (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis (IJCV) 123(1):32–73
Lee K-H, Chen X, Hua G, Hu H, He X (2018) Stacked cross attention for image-text matching. In: Proceedings of the European conference on computer vision (ECCV), pp 201–216
Li J, Zhou P, Xiong C, Hoi SCH (2020) Prototypical contrastive learning of unsupervised representations. arXiv:2005.04966
Li K, Zhang Y, Li K, Li Y, Fu Y (2019) Visual semantic reasoning for image-text matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 4654–4662
Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: Proceedings of the European conference on computer vision (ECCV), pp 740–755. Springer
Liu C, Mao Z, Zhang T, Xie H, Wang B, Zhang Y (2020) Graph structured network for image-text matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 10921–10930
Liu Y, Guo Y, Bakker EM, Lew MS (2017) Learning a recurrent residual fusion network for multimodal matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 4107–4116
Nam H, Ha J-W, Kim J (2017) Dual attention networks for multimodal reasoning and matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 299–307
Niu Z, Zhou M, Wang L, Gao X, Hua G (2017) Hierarchical multimodal lstm for dense visual-semantic embedding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1881–1889
Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic Differentiation in PyTorch. In: Proceedings of the Conference and Workshop on Neural Information Processing Systems (NIPS). https://openreview.net/forum?id=BJJsrmfCZ
Peng Y, Qi J (2019) CM-GANs: Cross-modal generative adversarial networks for common representation learning. ACM Trans Multimed Comput Commun Appl (TOMM) 15(1):1–24
Peng Y, Huang X, Qi J (2016) Cross-media shared representation by hierarchical learning with multiple deep networks. In: International joint conferences on artificial intelligence organization (IJCAI), pp 3846–3853
Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In: Proceedings of the conference and workshop on neural information processing systems (NIPS) 28:91–99
Salvador A, Gundogdu E, Bazzani L, Donoser M (2021) Revamping cross-modal recipe retrieval with hierarchical transformers and self-supervised learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 15475–15484
Sarafianos N, Xu X, Kakadiaris IA (2019) Adversarial representation learning for text-to-image matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5814–5824
Snell J, Swersky K, Zemel RS (2017) Prototypical networks for few-shot learning. arXiv:1703.05175
Toyama J, Misono M, Suzuki M, Nakayama K, Matsuo Y (2016) Neural machine translation with latent semantic of image and text. arXiv:1611.08459
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN , Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Proceedings of the conference and workshop on neural information processing systems (NIPS), vol 30
Wang H, Zhang Y, Ji Z, Pang Y, Ma L (2020) Consensus-aware visual-semantic embedding for image-text matching. In: Proceedings of the European conference on computer vision (ECCV), pp 18–34. Springer
Wang L, Li Y, Lazebnik S (2016) Learning deep structure-preserving image-text embeddings. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5005–5013
Wang L, Li Y, Huang J, Lazebnik S (2018) Learning two-branch neural networks for image-text matching tasks. IEEE Trans Pattern Anal Mach Intell (TPAMI) 41(2):394–407
Wang S, Wang R, Yao Z, Shan S, Chen X (2020) Cross-modal scene graph matching for relationship-aware image-text retrieval. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision (WACV), pp 1508–1517
Wang X, Liu Z, Yu SX (2021) Unsupervised feature learning by cross-level instance-group discrimination. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 12586–12595
Wang Z, Fang Z, Wang J, Yang Y (2020) Vitaa: visual-textual attributes alignment in person search by natural language. In: Proceedings of the European conference on computer vision (ECCV), pp 402–420. Springer
Wehrmann J, Kolling C, Barros RC (2020) Adaptive cross-modal embeddings for image-text alignment. In: Proceedings of the association for the advance of artificial intelligence (AAAI), vol 34, pp 12313–12320
Wei X, Zhang T, Li Y, Zhang Y, Feng W (2020) Multi-modality cross attention network for image and sentence matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 10941–10950
Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3441–3450
Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguist (TACL) 2:67–78
Zhang Q, Lei Z, Zhang Z, Li SZ (2020) Context-aware attention network for image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3536–3545
Zhang X, Ge Y, Qiao Y, Li H (2021) Refining pseudo labels with clustering consensus over generations for unsupervised object re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3436–3445
Zheng Z, Zheng L, Garrett M, Yang Y, Xu M, Shen Y-D (2020) Dual-path convolutional image-text embeddings with instance loss. ACM Trans Multimed Comput Commun Appl (TOMM) 16(2):1–23
Acknowledgements
This work was supported by the National Key Research and Development Program of China (No. 2018AAA0102200), the National Natural Science Foundation of China (Nos. 62036012, 61720106006, 62002355, 61721004, 61832002, 62072455, 62102415, 62106262 and U1836220), and the Beijing Natural Science Foundation (No. L201001).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Meng, L., Zhang, F., Zhang, X. et al. Prototype local–global alignment network for image–text retrieval. Int J Multimed Info Retr 11, 525–538 (2022). https://doi.org/10.1007/s13735-022-00258-1