Abstract
Region-based features, widely used in image captioning, are typically extracted with object detectors such as Faster R-CNN. However, this approach captures only region-level information and ignores the holistic, global context of the entire image. This limitation hinders complex multi-modal reasoning in image captioning and leads to missing contextual information, inaccurate object detection, and high computational cost. To address these issues and build on the success of transformer-based architectures in image captioning, a transformer-based model called DVAT (Dual Visual Attention-based Image Captioning Transformer) is proposed. DVAT combines two kinds of visual features to generate more accurate captions. It splits region features into a semi-region self-attention branch, which computes the hidden features of the image, and a semi-region convolutional branch, which captures background and contextual information; this design enlarges the receptive field of the grid features while accelerating computation. Moreover, DVAT introduces aligned-cross attention between region features and grid features to integrate the dual visual features more effectively. This design and fusion of dual visual features yield notable performance gains. Experimental results on multiple image captioning benchmarks show that DVAT outperforms previous methods in both inference accuracy and speed, and extensive experiments on the MS COCO dataset further confirm that DVAT surpasses many state-of-the-art techniques.
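The aligned-cross attention described in the abstract can be pictured as region features querying grid features inside a standard transformer attention block. Below is a minimal, illustrative PyTorch sketch of that idea only; it is not the authors' implementation, and the module name AlignedCrossAttention, the feature dimensions, and the head count are assumptions made for the example.

# Minimal sketch of the dual-feature fusion idea from the abstract:
# region features attend to grid features via cross-attention.
# NOT the authors' implementation; names and sizes are assumptions.
import torch
import torch.nn as nn

class AlignedCrossAttention(nn.Module):
    """Cross-attention in which region features query grid features."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, region_feats: torch.Tensor, grid_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, N_regions, d_model), e.g. from an object detector
        # grid_feats:   (B, H*W, d_model), e.g. from a CNN/ViT backbone
        attended, _ = self.cross_attn(query=region_feats, key=grid_feats, value=grid_feats)
        # Residual connection plus normalization, as in a standard transformer block.
        return self.norm(region_feats + attended)

if __name__ == "__main__":
    fuse = AlignedCrossAttention(d_model=512, n_heads=8)
    regions = torch.randn(2, 36, 512)   # e.g., 36 detected regions per image
    grid = torch.randn(2, 49, 512)      # e.g., a 7x7 feature grid
    print(fuse(regions, grid).shape)    # torch.Size([2, 36, 512])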
Acknowledgements
This research is supported by the National Science Foundation of China (No. 61976109) and the Liaoning Revitalization Talents Program (No. XLYC2006005); fund receiver: Dr. Y. Ren. It is also supported by the Scientific Research Project of Liaoning Province (No. LJKZ0986 and No. LJKZ0963), Key R&D Projects of the Liaoning Provincial Department of Science and Technology, and the Liaoning Provincial Key Laboratory Special Fund; fund receiver: Dr. B. Fu. The research is further supported by the University of Economics Ho Chi Minh City (UEH), Vietnam; fund receiver: Dr. Dang N.H. Thanh.
Ethics declarations
Ethics statement
The authors declare that they have no conflicts of interest. This research does not involve human or animal participants.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ren, Y., Zhang, J., Xu, W. et al. Dual visual align-cross attention-based image captioning transformer. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19315-4