Tri-RAT: optimizing the attention scores for image captioning

Yang, You; An, Yongzhi; Hu, Juntao; Pan, Longyue

doi:10.1007/s13735-022-00260-7

Regular Paper
Published: 06 October 2022

International Journal of Multimedia Information Retrieval, volume 11, pages 705–715 (2022)

You Yang¹,², Yongzhi An¹ (ORCID: orcid.org/0000-0002-7619-672X), Juntao Hu¹ & Longyue Pan¹

Abstract

Attention mechanisms and grid features are widely used in current vision-language tasks such as image captioning. Attention scores are the key factor in the success of the attention mechanism. However, because the Transformer is a hierarchical structure, the attention scores in different layers are only weakly connected. In addition, geometric information is inevitably lost when grid features are flattened before being fed into a Transformer model, so bias scores that encode geometric position information should be added to the attention scores. Since the Transformer architecture contains three different kinds of attention modules, we build three independent paths (residual attention paths, RAPs) that propagate the attention scores from the previous layer as a prior for the attention computation. This operation acts like a residual connection between attention scores: it strengthens the connection between layers and gives each attention layer a global comprehension. We then replace the traditional attention module in the encoder with a novel residual attention with relative position module, which combines relative position scores with the attention scores. Because residual attention may increase internal covariate shift, we introduce a residual attention with layer normalization on query vectors module in the decoder to optimize the data distribution. Finally, we build our Residual Attention Transformer with three RAPs (Tri-RAT) for the image captioning task. The proposed model is competitive with state-of-the-art models on the MS COCO benchmark, reaching 135.8% CIDEr on the MS COCO “Karpathy” offline test split and 135.3% CIDEr on the online test server.
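
The abstract does not give the equations, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a single attention head, assumes that the raw (pre-softmax) scores of the previous layer are simply added to the current scores, and uses a hand-set distance penalty as a stand-in for the learned relative position scores. The names `residual_attention`, `prev_scores`, `pos_bias`, and `norm_query` are hypothetical, and the layer normalization omits the learned scale and shift.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-5):
    # Layer normalization without learned scale/shift (an assumption).
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def residual_attention(q, k, v, prev_scores=None, pos_bias=None, norm_query=False):
    """Single-head attention that returns (output, raw_scores).

    prev_scores: raw scores handed over from the previous layer (the prior).
    pos_bias:    additive bias per query/key pair (stand-in for relative position).
    norm_query:  if True, layer-normalize each query vector first.
    """
    if norm_query:
        q = [layer_norm(row) for row in q]
    d = len(q[0])
    # Scaled dot-product scores.
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
              for qi in q]
    if pos_bias is not None:
        scores = [[s + b for s, b in zip(sr, br)]
                  for sr, br in zip(scores, pos_bias)]
    if prev_scores is not None:
        # Residual connection between attention scores of adjacent layers.
        scores = [[s + p for s, p in zip(sr, pr)]
                  for sr, pr in zip(scores, prev_scores)]
    weights = [softmax(r) for r in scores]
    out = [[sum(w * vj[c] for w, vj in zip(wr, v)) for c in range(len(v[0]))]
           for wr in weights]
    return out, scores

# Two stacked self-attention layers over three toy "grid" features.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical bias: penalize attention between distant positions.
bias = [[-0.1 * abs(i - j) for j in range(3)] for i in range(3)]
h1, s1 = residual_attention(x, x, x, pos_bias=bias)
h2, s2 = residual_attention(h1, h1, h1, prev_scores=s1, pos_bias=bias)
```

In this sketch the second layer receives `s1` as a prior, so its softmax is taken over the sum of its own scores and those of the first layer. Passing `norm_query=True` would correspond loosely to the decoder variant described above.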

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

This work is partially supported by the Chongqing Post-graduate Joint Training Base Project: Computer Technology Professional, Chongqing Normal University and Chongqing Century Keyi Technology Co., Ltd. It is also supported by the PhD Start-up Fund/Talent Introduction Project of Chongqing Normal University, Grant No. 21XLB03.

Author information

Authors and Affiliations

School of Computer and Information Science, Chongqing Normal University, Chongqing, 401331, China
You Yang, Yongzhi An, Juntao Hu & Longyue Pan
National Center for Applied Mathematics in Chongqing, Chongqing, 401331, China
You Yang

Corresponding author

Correspondence to Yongzhi An.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, Y., An, Y., Hu, J. et al. Tri-RAT: optimizing the attention scores for image captioning. Int J Multimed Info Retr 11, 705–715 (2022). https://doi.org/10.1007/s13735-022-00260-7

Received: 20 July 2022
Revised: 01 September 2022
Accepted: 12 September 2022
Published: 06 October 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s13735-022-00260-7
