
Tri-RAT: optimizing the attention scores for image captioning

  • Regular Paper
  • International Journal of Multimedia Information Retrieval

Abstract

Attention mechanisms and grid features are widely used in current vision-language tasks such as image captioning. Attention scores are the key to the success of the attention mechanism. However, because the Transformer is a hierarchical structure, the connection between the attention scores of different layers is weak. In addition, geometric information is inevitably lost when grid features are flattened and fed into a Transformer, so bias scores encoding geometric position information should be added to the attention scores. Since there are three kinds of attention modules in the Transformer architecture, we build three independent residual attention paths (RAPs) that propagate the attention scores from the previous layer as a prior for the attention computation. This operation acts like a residual connection between attention scores: it strengthens the connection across layers and gives each attention layer a global view. We then replace the standard attention module in the encoder with a residual attention with relative position module, which incorporates relative position scores into the attention scores. Because residual attention may increase internal covariate shift, we introduce a residual attention with layer normalization on query vectors module in the decoder to normalize the data distribution. Finally, we build our Residual Attention Transformer with three RAPs (Tri-RAT) for the image captioning task. The proposed model achieves performance competitive with state-of-the-art models on the MS COCO benchmark, reaching 135.8% CIDEr on the MS COCO "Karpathy" offline test split and 135.3% CIDEr on the online test server.


Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work is partially supported by the Chongqing Post-graduate Joint Training Base Project: Computer Technology Professional, Chongqing Normal University and Chongqing Century Keyi Technology Co., Ltd. It is also supported by the PhD Start-up Fund/Talent Introduction Project of Chongqing Normal University, Grant No. 21XLB03.

Author information


Corresponding author

Correspondence to Yongzhi An.

Ethics declarations

Conflict of interest

The authors have no competing interests that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yang, Y., An, Y., Hu, J. et al. Tri-RAT: optimizing the attention scores for image captioning. Int J Multimed Info Retr 11, 705–715 (2022). https://doi.org/10.1007/s13735-022-00260-7
