
Prototype local–global alignment network for image–text retrieval

  • Regular Paper

International Journal of Multimedia Information Retrieval

Abstract

Image–text retrieval is a challenging task because it requires thorough multimodal understanding and precise discovery of inter-modality relationships. Most previous approaches resort to global image–text alignment and neglect fine-grained correspondence; works that do explore local region–word alignment usually suffer from a heavy computational burden. In this paper, we propose a prototype local–global alignment (PLGA) network for image–text retrieval that jointly performs fine-grained local alignment and high-level global alignment. Specifically, PLGA contains two key components: a prototype-based local alignment module and a multi-scale global alignment module. The former enables efficient fine-grained local matching by combining region–prototype alignment and word–prototype alignment, and the latter perceives hierarchical global semantics by exploring multi-scale global correlations between the image and text. Within the unified model, the local and global alignment modules boost each other's performance. Quantitative and qualitative experimental results on the Flickr30K and MS-COCO benchmarks demonstrate that our proposed approach performs favorably against state-of-the-art methods.
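To make the prototype idea in the abstract concrete, the sketch below shows one plausible reading of prototype-based local alignment: region and word features are each soft-assigned to a shared set of learnable prototypes, and similarity is computed in prototype space, so the per-pair cost scales with the number of prototypes rather than with every region–word combination. This is a minimal PyTorch-style illustration under stated assumptions, not the paper's implementation; the class and parameter names (PrototypeLocalAlignment, n_prototypes, and so on) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeLocalAlignment(nn.Module):
    """Hypothetical sketch: match local features through a shared prototype set.

    Regions and words are each pooled into K prototype descriptors, so the
    matching cost is O((R + W) * K) instead of the O(R * W) of dense
    region-word cross-attention.
    """

    def __init__(self, dim: int = 1024, n_prototypes: int = 64):
        super().__init__()
        # Learnable prototypes shared by both modalities.
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim))

    def pool(self, feats: torch.Tensor) -> torch.Tensor:
        """Soft-assign local features (N, dim) to prototypes; return (K, dim)."""
        protos = F.normalize(self.prototypes, dim=-1)
        feats = F.normalize(feats, dim=-1)
        sim = feats @ protos.t()                  # (N, K) cosine similarities
        weights = sim.softmax(dim=0)              # each prototype's attention over the N features
        return F.normalize(weights.t() @ feats, dim=-1)  # prototype-pooled descriptors

    def forward(self, regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        """regions: (R, dim) region features; words: (W, dim) word features.
        Returns a scalar local similarity score for the image-text pair."""
        img = self.pool(regions)                  # (K, dim) region-prototype alignment
        txt = self.pool(words)                    # (K, dim) word-prototype alignment
        return (img * txt).sum(dim=-1).mean()     # mean per-prototype agreement

# Example usage with hypothetical shapes: 36 detected regions, 12 words.
model = PrototypeLocalAlignment(dim=1024, n_prototypes=64)
score = model(torch.randn(36, 1024), torch.randn(12, 1024))
```

In this reading, the prototypes act as a small shared semantic vocabulary: because both modalities are compared through the same K anchors, fine-grained correspondence is preserved without computing an R-by-W attention map for every candidate pair at retrieval time.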


Data availability

The images and the data supporting Figs. 1, 2, 3, 4, 5, 6, 10 and 11 and Tables 2, 3, 4, 5 and 6 (the MS-COCO dataset) are publicly available at: https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48. The images and the data supporting Figs. 7, 8 and 9 and Table 1 (the Flickr30K dataset) are publicly available at: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00166/43313/From-image-descriptions-to-visual-denotations-New.


Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2018AAA0102200), the National Natural Science Foundation of China (Nos. 62036012, 61720106006, 62002355, 61721004, 61832002, 62072455, 62102415, 62106262 and U1836220), and the Beijing Natural Science Foundation (L201001).

Author information

Corresponding author

Correspondence to Feifei Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Meng, L., Zhang, F., Zhang, X. et al. Prototype local–global alignment network for image–text retrieval. Int J Multimed Info Retr 11, 525–538 (2022). https://doi.org/10.1007/s13735-022-00258-1
