Dual visual align-cross attention-based image captioning transformer

  • Published in: Multimedia Tools and Applications

Abstract

Region-based features widely used in image captioning are typically extracted with object detectors such as Faster R-CNN. However, this approach captures only region-level information and ignores the holistic global context of the entire image. This limitation hinders complex multi-modal reasoning in image captioning and leads to missing contextual information, inaccurate object detection, and high computational cost. To address these limitations and build on the success of transformer-based architectures in image captioning, a transformer-based neural architecture called DVAT (Dual Visual Attention-based Image Captioning Transformer) is proposed. DVAT combines two kinds of visual features to generate more accurate captions. It splits region features into a semi-region self-attention branch, which computes hidden features of the image, and a semi-region convolutional branch, which captures background and contextual information; this design enlarges the receptive field of the grid features while accelerating computation. Moreover, DVAT introduces aligned-cross attention between region features and grid features to better integrate the two visual streams. This design and fusion of dual visual features yield notable performance gains. Experimental results on image captioning benchmarks show that DVAT outperforms previous methods in both inference accuracy and speed, and extensive experiments on the MS COCO dataset further confirm that DVAT surpasses many state-of-the-art techniques.
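The abstract describes two mechanisms: splitting region features into a self-attention branch and a convolutional branch, and aligning region features with grid features through cross attention. As an illustration only, the PyTorch sketch below shows one way such a dual-visual block could be wired; the module names (SemiRegionBlock, AlignedCrossAttention), dimensions, and layer choices are assumptions made for the sketch, not the authors' implementation.

```python
# Minimal sketch of the dual-visual idea described in the abstract.
# All names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class SemiRegionBlock(nn.Module):
    """Split region features into two halves: one half passes through
    self-attention (hidden-feature path), the other through a 1-D
    convolution (background/context path), then recombine."""

    def __init__(self, d_model=512, n_heads=8, kernel_size=3):
        super().__init__()
        self.half = d_model // 2
        self.self_attn = nn.MultiheadAttention(self.half, n_heads // 2, batch_first=True)
        self.conv = nn.Conv1d(self.half, self.half, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, regions):                            # regions: (B, N, d_model)
        a, c = regions.split(self.half, dim=-1)             # channel-wise split
        a, _ = self.self_attn(a, a, a)                       # semi-region self-attention
        c = self.conv(c.transpose(1, 2)).transpose(1, 2)     # semi-region convolution
        return self.norm(torch.cat([a, c], dim=-1) + regions)


class AlignedCrossAttention(nn.Module):
    """Let region features attend to grid features so the two visual
    streams are aligned before caption decoding (hedged reading of the
    'aligned-cross attention' in the abstract)."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, regions, grids):                       # grids: (B, M, d_model)
        fused, _ = self.cross_attn(regions, grids, grids)     # queries = regions
        return self.norm(fused + regions)


if __name__ == "__main__":
    regions = torch.randn(2, 36, 512)   # e.g. 36 Faster R-CNN region features
    grids = torch.randn(2, 49, 512)     # e.g. 7x7 grid features from a CNN backbone
    x = SemiRegionBlock()(regions)
    out = AlignedCrossAttention()(x, grids)
    print(out.shape)                    # torch.Size([2, 36, 512])
```

In this sketch the channel-wise split halves the per-layer attention cost, which is consistent with the abstract's claim of faster computation, but the exact fusion used in DVAT may differ from the one shown here.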

Data availability

The data used in this study are available from the MS-COCO repository [13, 24].

References

  1. Zhou L, Palangi H, Zhang L, Corso J, Gao J (2020) Unified vision-language pre-training for image captioning and VQA. Proc AAAI Conf Artif Intell 34(7):13041–13049

  2. Hu X, Gan Z, Wang J, Yang Z, Liu Z, Lu Y, Wang L (2022) Scaling up vision-language pre-training for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, pp 17980–17989

  3. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, p 30

  4. Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  5. Luo Y, Ji J, Sun X, Cao L, Wu Y, Huang F, Lin C-W, Ji R (2021) Dual-level collaborative transformer for image captioning. Proc AAAI Conf Artif Intell 35(3):2286–2293

  6. Zhang X, Sun X, Luo Y, Ji J, Zhou Y, Wu Y, Huang F, Ji R (2021) Rstnet: Captioning with adaptive attention on visual and non-visual words. Proc IEEE Conf Comput Vis Pattern Recognit 2021:15465–15474

  7. Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning. Proc IEEE Conf Comput Vis Pattern Recognit 2020:10578–10587

  8. Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. Proc IEEE Int Conf Comput Vis Pattern Recog 2018:6077–6086

  9. Lu J, Xiong C, Parikh D, Socher R (2017) Knowing when to look: Adaptive attention via a visual sentinel for image captioning. Proc IEEE Int Conf Comput Vis Pattern Recog 2017:375–383

  10. Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. Proc IEEE Int Conf Comput Vis Pattern Recog 2017:7008–7024

  11. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. Int Conf Mach Learning, PMLR 2015:2048–2057

  12. Zhu X, Su W, Lu L, Li B, Wang X, Dai J (2020) Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159

  13. Xian T, Li Z, Zhang C, Ma H (2022) Dual Global Enhanced Transformer for image captioning. Neural Netw 148:129–141. https://doi.org/10.1016/j.neunet.2022.01.011

  14. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13. Springer International Publishing, pp 740–755

  15. Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. Proc IEEE Conf Comput Vis Pattern Recognit 2015:3128–3137

  16. Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption generator. Proc IEEE Conf Comput Vis Pattern Recognit 2015:3156–3164

  17. Ji J, Ma Y, Sun X, Zhou Y, Wu Y, Ji R (2022) Knowing what to learn: a metric-oriented focal mechanism for image captioning. IEEE Trans Image Process 31:4321–4335

  18. Ma Y, Ji J, Sun X, Zhou Y, Wu Y, Huang F, Ji R (2022) Knowing what it is: semantic-enhanced dual attention transformer. IEEE Trans Multimed 3723–3736

  19. Yao T, Pan Y, Li Y, Qiu Z, Mei T (2017) Boosting image captioning with attributes. Proc IEEE Int Conf Comput Vis 2017:4894–4902

  20. Zohourianshahzadi Z, Kalita JK (2022) Neural attention for image captioning: review of outstanding methods. Artif Intell Rev 55(5):3833–3862

  21. Herdade S, Kappeler A, Boakye K, Soares J (2019) Image captioning: transforming objects into words. Adv Neural Inf Process Syst 32

  22. Ren S, He K, Girshick R, Sun J (2016) Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149

  23. Wu L, Xu M, Sang L, Yao T, Mei T (2021) Noise augmented double-stream graph convolutional networks for image captioning. IEEE Trans Circuits Syst Video Technol 31(8):3118–3127

  24. Pan Y, Yao T, Li Y, Mei T (2020) X-linear attention networks for image captioning. Proc IEEE Conf Comput Vis Pattern Recognit 2020:10971–10980

  25. Wang C, Shen Y, Ji L (2022) Geometry attention transformer with position-aware LSTMs for image captioning. Expert Syst Appl 201:117174. https://doi.org/10.1016/j.eswa.2022.117174

  26. Wang Y, Xu J, Sun Y (2022) A visual persistence model for image captioning. Neurocomputing 468:48–59. https://doi.org/10.1016/j.neucom.2021.10.014

  27. Zhang Z, Wu Q, Wang Y, Chen F (2021) Exploring pairwise relationships adaptively from linguistic context in image captioning. IEEE Trans Multimed 24:3101–3113

  28. Yang X, Liu Y, Wang X (2022) Reformer: the relational transformer for image captioning. In: Proceedings of the 30th ACM International Conference on Multimedia. ACM, pp 5398–5406

  29. Huang L, Wang W, Chen J, Wei XY (2019) Attention on attention for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision. IEEE, pp 4634–4643

  30. Shao Z, Han J, Marnerides D, Debattista K (2022) Region-object relation-aware dense captioning via transformer. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2022.3152990

  31. Shao Z, Han J, Debattista K, Pang Y (2023) Textual context-aware dense captioning with diverse words. IEEE Trans Multimed 25:8753–8766

  32. Chen C, Han J, Debattista K (2024) Virtual category learning: a semi-supervised learning method for dense prediction with extremely limited labels. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2024.3367416

  33. Li Y, Pan Y, Yao T, Mei T (2022) Comprehending and ordering semantics for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, pp 17990–17999

  34. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, pp 770–778

  35. Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma DA et al (2017) Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV 123(1):32–73

  36. Papineni K, Roukos S, Ward T, Zhu WJ (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318

  37. Banerjee S, Lavie A (2005) METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp 65–72

  38. Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81

  39. Vedantam R, Lawrence Zitnick C, Parikh D (2015) Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, pp 4566–4575

  40. Anderson P, Fernando B, Johnson M, Gould S (2016) Spice: Semantic propositional image caption evaluation. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. Springer International Publishing, pp 382–398

  41. Yang X, Tang K, Zhang H, Cai J (2019) Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10685–10694

  42. Yao T, Pan Y, Li Y, Mei T (2018) Exploring visual relationship for image captioning. In: Proceedings of the European conference on computer vision (ECCV), pp 684–699

  43. Ma Y, Ji J, Sun X, Zhou Y, Ji R (2023) Towards local visual modeling for image captioning. Pattern Recogn 138:109420

  44. Jiang W, Ma L, Jiang YG, Liu W, Zhang T (2018) Recurrent fusion network for image captioning. In: Proceedings of the European conference on computer vision (ECCV), pp 499–515

  45. Zha Z-J, Liu D, Zhang H, Zhang Y, Wu F (2022) Context-aware visual policy network for fine-grained image captioning. IEEE Trans Pattern Anal Mach Intell 44(2):710–722

  46. Ji J, Luo Y, Sun X, Chen F, Luo G, Wu Y, Gao Y, Ji R (2021) Improving image captioning by leveraging intra- and inter-layer global representation in transformer network. Proc AAAI Conf Artif Intell 35(2):1655–1663

Acknowledgements

This research is supported by the National Science Foundation of China (No. 61976109) and the Liaoning Revitalization Talents Program (No. XLYC2006005); fund receiver: Dr. Y. Ren. It is also supported by the Scientific Research Project of Liaoning Province (No. LJKZ0986 and LJKZ0963), the Key R&D Projects of the Liaoning Provincial Department of Science and Technology, and the Liaoning Provincial Key Laboratory Special Fund; fund receiver: Dr. B. Fu. The research is further supported by the University of Economics Ho Chi Minh City (UEH), Vietnam; fund receiver: Dr. Dang N.H. Thanh.

Author information

Corresponding authors

Correspondence to Bo Fu or Dang N. H. Thanh.

Ethics declarations

Ethics statement

The authors declare that they have no conflicts of interest. The research does not involve human or animal participants.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ren, Y., Zhang, J., Xu, W. et al. Dual visual align-cross attention-based image captioning transformer. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19315-4
