Dual visual align-cross attention-based image captioning transformer

  • Published in: Multimedia Tools and Applications

Abstract

Region-based features widely used in image captioning are typically extracted with object detectors such as Faster R-CNN. However, this approach captures only region-level information and ignores the holistic global context of the entire image. This limitation hinders complex multi-modal reasoning in image captioning and leads to missing contextual information, inaccurate object detection, and high computational cost. To address these limitations and build on the success of transformer-based architectures in image captioning, a transformer-based neural architecture called DVAT (Dual Visual Attention-based Image Captioning Transformer) is proposed. DVAT combines two kinds of visual features to generate more accurate captions. It splits region features into a semi-region self-attention branch, which computes hidden features of the image, and a semi-region convolutional branch, which captures background and contextual information; this design enlarges the receptive field of the grid features while accelerating computation. Moreover, DVAT introduces aligned-cross attention between region features and grid features to better integrate the two visual streams. This design and fusion of dual visual features yield notable performance gains. Experimental results on image captioning benchmarks show that DVAT outperforms previous methods in both inference accuracy and speed, and extensive experiments on the MS COCO dataset further confirm that DVAT surpasses many state-of-the-art techniques.
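The abstract describes two mechanisms: splitting region features into a self-attention branch and a convolutional branch, and aligning region features with grid features through cross attention. As an illustration only, the PyTorch sketch below shows one way such a dual-visual block could be wired; the module names (SemiRegionBlock, AlignedCrossAttention), dimensions, and layer choices are assumptions made for the sketch, not the authors' implementation.

```python
# Minimal sketch of the dual-visual idea described in the abstract.
# All names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class SemiRegionBlock(nn.Module):
    """Split region features into two halves: one half passes through
    self-attention (hidden-feature path), the other through a 1-D
    convolution (background/context path), then recombine."""

    def __init__(self, d_model=512, n_heads=8, kernel_size=3):
        super().__init__()
        self.half = d_model // 2
        self.self_attn = nn.MultiheadAttention(self.half, n_heads // 2, batch_first=True)
        self.conv = nn.Conv1d(self.half, self.half, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, regions):                            # regions: (B, N, d_model)
        a, c = regions.split(self.half, dim=-1)             # channel-wise split
        a, _ = self.self_attn(a, a, a)                       # semi-region self-attention
        c = self.conv(c.transpose(1, 2)).transpose(1, 2)     # semi-region convolution
        return self.norm(torch.cat([a, c], dim=-1) + regions)


class AlignedCrossAttention(nn.Module):
    """Let region features attend to grid features so the two visual
    streams are aligned before caption decoding (hedged reading of the
    'aligned-cross attention' in the abstract)."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, regions, grids):                       # grids: (B, M, d_model)
        fused, _ = self.cross_attn(regions, grids, grids)     # queries = regions
        return self.norm(fused + regions)


if __name__ == "__main__":
    regions = torch.randn(2, 36, 512)   # e.g. 36 Faster R-CNN region features
    grids = torch.randn(2, 49, 512)     # e.g. 7x7 grid features from a CNN backbone
    x = SemiRegionBlock()(regions)
    out = AlignedCrossAttention()(x, grids)
    print(out.shape)                    # torch.Size([2, 36, 512])
```

In this sketch the channel-wise split halves the per-layer attention cost, which is consistent with the abstract's claim of faster computation, but the exact fusion used in DVAT may differ from the one shown here.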

Data availability

The data used in this study are available from the MS-COCO repository [13, 24].

References

  1. Zhou L, Palangi H, Zhang L, Corso J, Gao J (2020) Unified vision-language pre-training for image captioning and VQA. Proc AAAI Conf Artif Intell 34(7):13041–13049

  2. Hu X, Gan Z, Wang J, Yang Z, Liu Z, Lu Y, Wang L (2022) Scaling up vision-language pre-training for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, pp 17980–17989

  3. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, p 30

  4. Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  5. Luo Y, Ji J, Sun X, Cao L, Wu Y, Huang F, Lin C-W, Ji R (2021) Dual-level collaborative transformer for image captioning. Proc AAAI Conf Artif Intell 35(3):2286–2293

  6. Zhang X, Sun X, Luo Y, Ji J, Zhou Y, Wu Y, Huang F, Ji R (2021) Rstnet: Captioning with adaptive attention on visual and non-visual words. Proc IEEE Conf Comput Vis Pattern Recognit 2021:15465–15474

  7. Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning. Proc IEEE Conf Comput Vis Pattern Recognit 2020:10578–10587

  8. Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. Proc IEEE Int Conf Comput Vis Pattern Recog 2018:6077–6086

  9. Lu J, Xiong C, Parikh D, Socher R (2017) Knowing when to look: Adaptive attention via a visual sentinel for image captioning. Proc IEEE Int Conf Comput Vis Pattern Recog 2017:375–383

  10. Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. Proc IEEE Int Conf Comput Vis Pattern Recog 2017:7008–7024

  11. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. Int Conf Mach Learning, PMLR 2015:2048–2057

  12. Zhu X, Su W, Lu L, Li B, Wang X, Dai J (2020) Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159

  13. Xian T, Li Z, Zhang C, Ma H (2022) Dual Global Enhanced Transformer for image captioning. Neural Netw 148:129–141. https://doi.org/10.1016/j.neunet.2022.01.011

  14. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13. Springer International Publishing, pp 740–755

  15. Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. Proc IEEE Conf Comput Vis Pattern Recognit 2015:3128–3137

  16. Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption generator. Proc IEEE Conf Comput Vis Pattern Recognit 2015:3156–3164

  17. Ji J, Ma Y, Sun X, Zhou Y, Wu Y, Ji R (2022) Knowing what to learn: a metric-oriented focal mechanism for image captioning. IEEE Trans Image Process 31:4321–4335

  18. Ma Y, Ji J, Sun X, Zhou Y, Wu Y, Huang F, Ji R (2022) Knowing what it is: semantic-enhanced dual attention transformer. IEEE Trans Multimed 3723–3736

  19. Yao T, Pan Y, Li Y, Qiu Z, Mei T (2017) Boosting image captioning with attributes. Proc IEEE Int Conf Comput Vis 2017:4894–4902

  20. Zohourianshahzadi Z, Kalita JK (2022) Neural attention for image captioning: review of outstanding methods. Artif Intell Rev 55(5):3833–3862

  21. Herdade S, Kappeler A, Boakye K, Soares J (2019) Image captioning: transforming objects into words. Adv Neural Inf Process Syst 32

  22. Ren S, He K, Girshick R, Sun J (2016) Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149

  23. Wu L, Xu M, Sang L, Yao T, Mei T (2021) Noise augmented double-stream graph convolutional networks for image captioning. IEEE Trans Circuits Syst Video Technol 31(8):3118–3127

  24. Pan Y, Yao T, Li Y, Mei T (2020) X-linear attention networks for image captioning. Proc IEEE Conf Comput Vis Pattern Recognit 2020:10971–10980

  25. Wang C, Shen Y, Ji L (2022) Geometry attention transformer with position-aware LSTMs for image captioning. Expert Syst Appl 201:117174. https://doi.org/10.1016/j.eswa.2022.117174

  26. Wang Y, Xu J, Sun Y (2022) A visual persistence model for image captioning. Neurocomputing 468:48–59. https://doi.org/10.1016/j.neucom.2021.10.014

  27. Zhang Z, Wu Q, Wang Y, Chen F (2021) Exploring pairwise relationships adaptively from linguistic context in image captioning. IEEE Trans Multimed 24:3101–3113

  28. Yang X, Liu Y, Wang X (2022) Reformer: the relational transformer for image captioning. In: Proceedings of the 30th ACM International Conference on Multimedia. ACM, pp 5398–5406

  29. Huang L, Wang W, Chen J, Wei XY (2019) Attention on attention for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision. IEEE, pp 4634–4643

  30. Shao Z, Han J, Marnerides D, Debattista K (2022) Region-object relation-aware dense captioning via transformer. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2022.3152990

  31. Shao Z, Han J, Debattista K, Pang Y (2023) Textual context-aware dense captioning with diverse words. IEEE Trans Multimed 25:8753–8766

  32. Chen C, Han J, Debattista K (2024) Virtual category learning: a semi-supervised learning method for dense prediction with extremely limited labels. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2024.3367416

  33. Li Y, Pan Y, Yao T, Mei T (2022) Comprehending and ordering semantics for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, pp 17990–17999

  34. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, pp 770–778

  35. Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma DA et al (2017) Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV 123(1):32–73

  36. Papineni K, Roukos S, Ward T, Zhu WJ (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318

  37. Banerjee S, Lavie A (2005) METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp 65–72

  38. Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81

  39. Vedantam R, Lawrence Zitnick C, Parikh D (2015) Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, pp 4566–4575

  40. Anderson P, Fernando B, Johnson M, Gould S (2016) Spice: Semantic propositional image caption evaluation. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. Springer International Publishing, pp 382–398

  41. Yang X, Tang K, Zhang H, Cai J (2019) Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10685–10694

  42. Yao T, Pan Y, Li Y, Mei T (2018) Exploring visual relationship for image captioning. In: Proceedings of the European conference on computer vision (ECCV), pp 684–699

  43. Ma Y, Ji J, Sun X, Zhou Y, Ji R (2023) Towards local visual modeling for image captioning. Pattern Recogn 138:109420

  44. Jiang W, Ma L, Jiang YG, Liu W, Zhang T (2018) Recurrent fusion network for image captioning. In: Proceedings of the European conference on computer vision (ECCV), pp 499–515

  45. Zha Z-J, Liu D, Zhang H, Zhang Y, Wu F (2022) Context-aware visual policy network for fine-grained image captioning. IEEE Trans Pattern Anal Mach Intell 44(2):710–722

  46. Ji J, Luo Y, Sun X, Chen F, Luo G, Wu Y, Gao Y, Ji R (2021) Improving image captioning by leveraging intra- and inter-layer global representation in transformer network. Proc AAAI Conf Artif Intell 35(2):1655–1663

Acknowledgements

This research is supported by the National Science Foundation of China (No. 61976109) and the Liaoning Revitalization Talents Program (No. XLYC2006005); fund receiver: Dr. Y. Ren. It is also supported by the Scientific Research Project of Liaoning Province (No. LJKZ0986 and LJKZ0963), the Key R&D Projects of the Liaoning Provincial Department of Science and Technology, and the Liaoning Provincial Key Laboratory Special Fund; fund receiver: Dr. B. Fu. The research is further supported by the University of Economics Ho Chi Minh City (UEH), Vietnam; fund receiver: Dr. Dang N.H. Thanh.

Author information

Corresponding authors

Correspondence to Bo Fu or Dang N. H. Thanh.

Ethics declarations

Ethics statement

The authors declare that they have no conflicts of interest. The research does not involve human or animal participants.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ren, Y., Zhang, J., Xu, W. et al. Dual visual align-cross attention-based image captioning transformer. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19315-4
