Abstract
Attention mechanisms and grid features are widely used in current visual language tasks like image captioning. The attention scores are the key factor to the success of the attention mechanism. However, the connection between attention scores in different layers is not strong enough since Transformer is a hierarchical structure. Additionally, geometric information is inevitably lost when grid features are flattened to be fed into a transformer model. Therefore, bias scores about geometric position information should be added to the attention scores. Considering that there are three different kinds of attention modules in the transformer architecture, we build three independent paths (residual attention paths, RAPs) to propagate the attention scores from the previous layer as a prior for attention computation. This operation is like a residual connection between attention scores, which can enhance the connection and make each attention layer obtain a global comprehension. Then, we replace the traditional attention module with a novel residual attention with relative position module in the encoder to incorporate relative position scores with attention scores. Residual attention may increase the internal covariate shifts. To optimize the data distribution, we introduce residual attention with layer normalization on query vectors module in the decoder. Finally, we build our Residual Attention Transformer with three RAPs (Tri-RAT) for the image captioning task. The proposed model achieves competitive performance on the MSCOCO benchmark with all the state-of-the-art models. We gain 135.8\(\%\) CIDEr on MS COCO “Karpathy” offline test split and 135.3\(\%\) CIDEr on the online testing server.
Similar content being viewed by others
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Anderson P, Fernando B, Johnson M et al (2016) Spice: semantic propositional image caption evaluation. In: European conference on computer vision. Springer, pp 382–398
Anderson P, He X, Buehler C et al (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6077–6086
Banerjee S, Lavie A (2005) Meteor: an automatic metric for mt evaluation with improved correlation with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp 65–72
Cornia M, Stefanini M, Baraldi L et al (2020) Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10,578–10,587
Guo L, Liu J, Zhu X et al (2020) Normalized and geometry-aware self-attention network for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10,327–10,336
Gupta A, Verma Y, Jawahar C (2012) Choosing linguistics over vision to describe images. In: Proceedings of the AAAI conference on artificial intelligence, pp 606–612
He K, Zhang X, Ren S et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
He R, Ravula A, Kanagal B et al (2020) Realformer: transformer likes residual attention. arXiv:2012.11747
Herdade S, Kappeler A, Boakye K et al (2019) Image captioning: transforming objects into words. Adv Neural Inf Process Syst 32:11137–11147
Huang L, Wang W, Chen J et al (2019) Attention on attention for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4634–4643
Ji J, Luo Y, Sun X et al (2021) Improving image captioning by leveraging intra-and inter-layer global representation in transformer network. In: Proceedings of the AAAI conference on artificial intelligence, pp 1655–1663
Jiang H, Misra I, Rohrbach M et al (2020) In defense of grid features for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10,267–10,276
Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3128–3137
Li G, Zhu L, Liu P et al (2019) Entangled transformer for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8928–8937
Lin C-Y (2004) ROUGE: A Package for Automatic Evaluation of Summaries. In: Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, pp 74–81. https://aclanthology.org/W04-1013
Lin TY, Maire M, Belongie S et al (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, pp 740–755
Liu Z, Hu H, Lin Y et al (2021) Swin transformer v2: scaling up capacity and resolution. arXiv:2111.09883
Lu J, Xiong C, Parikh D et al (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 375–383
Luo Y, Ji J, Sun X et al (2021) Dual-level collaborative transformer for image captioning. arXiv:2101.06462
Mao J, Xu W, Yang Y,et al (2014) Explain images with multimodal recurrent neural networks. arXiv:1410.1090
Mitchell M, Dodge J, Goyal A et al (2012) Midge: generating image descriptions from computer vision detections. In: Proceedings of the 13th conference of the European chapter of the association for computational linguistics, pp 747–756
Ordonez V, Kulkarni G, Berg T (2011) Im2text: describing images using 1 million captioned photographs. Adv Neural Inf Process Syst 24:1143–1151
Pan Y, Yao T, Li Y et al (2020) X-linear attention networks for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10,971–10,980
Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311–318
Rennie SJ, Marcheret E, Mroueh Y et al (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7008–7024
Ushiku Y, Yamaguchi M, Mukuta Y et al (2015) Common subspace for model and similarity: phrase learning for caption generation from images. In: Proceedings of the IEEE international conference on computer vision, pp 2668–2676
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30:5998–6008
Vedantam R, Lawrence Zitnick C, Parikh D (2015) Cider: consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575
Vinyals O, Toshev A, Bengio S et al (2015) Show and tell: a neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164
Xu K, Ba J, Kiros R et al (2015) Show, attend and tell: neural image caption generation with visual attention. In: International conference on machine learning, PMLR, pp 2048–2057
Yao T, Pan Y, Li Y et al (2018) Exploring visual relationship for image captioning. In: Proceedings of the European conference on computer vision (ECCV), pp 684–699
Ying C, Ke G, He D et al (2021) Lazyformer: self attention with lazy update. arXiv:2102.12702
Zhang X, Sun X, Luo Y et al (2021) Rstnet: captioning with adaptive attention on visual and non-visual words. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15,465–15,474
Acknowledgements
This work is supported partially by the Chongqing Post-graduate Joint Training Base Project: Computer Technology Professional, Chongqing Normal University and Chongqing Century Keyi Technology Co., Ltd. It is also supported by PhD Start-up Fund/Talent Introduction Project of Chongqing Normal University, Grant No.21XLB03.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, Y., An, Y., Hu, J. et al. Tri-RAT: optimizing the attention scores for image captioning. Int J Multimed Info Retr 11, 705–715 (2022). https://doi.org/10.1007/s13735-022-00260-7
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13735-022-00260-7