Abstract
Recent progress in image captioning has largely come from improvements to the Transformer architecture. As a key component, dot-product self-attention updates the representation of each feature vector in the visual encoder and guides the caption decoding process. However, the pairwise interaction in dot-product self-attention learns attention weights only at the instance or local level, making it difficult for the attention module to obtain global feature representations. Furthermore, self-attention is typically implemented in a multi-head fashion in which each head is computed independently, so the model cannot exploit the complementary information contained in different heads. In this paper, we propose Hadamard Product Perceptron Attention (HPPA) for image captioning, which introduces a more global form of feature interaction and incorporates interaction among attention heads when computing attention results. Feature interaction based on the Hadamard product can integrate multimodal features more effectively than the dot product and provides richer feature representations. HPPA therefore first uses the Hadamard product to fuse the input features, then generates a set of attention memory vectors containing global interaction features, and finally computes the attention weights dynamically from these vectors. When combined with the multi-head mechanism, HPPA can exploit the complementary information across heads. We further integrate HPPA into the Transformer encoder to form a Hadamard Product Perceptron Transformer (HPPT), which serves as a feature-enhancement encoder. Moreover, HPPA and HPPT can be readily plugged into existing attention- or Transformer-based models. Extensive experiments on the MSCOCO and Flickr30k datasets demonstrate the effectiveness and generalizability of our proposal.
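The abstract outlines three steps (Hadamard-product fusion of the inputs, generation of global attention memory vectors, and dynamic computation of the attention weights from those vectors) without giving the equations. The following is a minimal PyTorch sketch of one plausible reading of those steps; the module layout, the memory-vector count `num_mem`, the mean pooling, and the two-layer perceptron `to_mem` are illustrative assumptions, not the authors' published design.

```python
import torch
import torch.nn as nn


class HPPA(nn.Module):
    """Illustrative sketch of Hadamard Product Perceptron Attention."""

    def __init__(self, d_model: int, num_heads: int = 8, num_mem: int = 16):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k, self.num_mem = num_heads, d_model // num_heads, num_mem
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # Perceptron that turns globally pooled fused features into
        # num_mem memory vectors per head; because the pooled input
        # spans all heads, the memory lets heads exchange information.
        self.to_mem = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, num_mem * d_model),
        )
        self.out = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        b, nq, d = q.shape
        nk = k.size(1)
        Q = self.w_q(q).view(b, nq, self.h, self.d_k).transpose(1, 2)  # (b, h, nq, dk)
        K = self.w_k(k).view(b, nk, self.h, self.d_k).transpose(1, 2)  # (b, h, nk, dk)
        V = self.w_v(v).view(b, nk, self.h, self.d_k).transpose(1, 2)  # (b, h, nk, dk)

        # Step 1: Hadamard-product fusion of every query/key pair.
        fused = Q.unsqueeze(3) * K.unsqueeze(2)        # (b, h, nq, nk, dk)

        # Step 2: attention memory vectors from the globally pooled
        # fused features (pooled over all query/key pairs and heads).
        pooled = fused.mean(dim=(2, 3)).reshape(b, d)
        mem = self.to_mem(pooled).view(b, self.num_mem, self.h, self.d_k)
        mem = mem.permute(0, 2, 1, 3)                  # (b, h, num_mem, dk)

        # Step 3: attention weights scored dynamically against the memory.
        logits = torch.einsum("bhqkd,bhmd->bhqk", fused, mem) / self.d_k ** 0.5
        attn = logits.softmax(dim=-1)                  # normalize over keys
        out = (attn @ V).transpose(1, 2).reshape(b, nq, d)
        return self.out(out)


# Usage: self-attention over region features in a visual encoder.
hppa = HPPA(d_model=512, num_heads=8)
regions = torch.randn(2, 36, 512)           # e.g., 36 region features per image
enhanced = hppa(regions, regions, regions)  # (2, 36, 512)
```

In this reading, the pooled fusion spans all heads before the perceptron, so each head's memory vectors carry information from every other head, which matches the head-interaction property the abstract emphasizes.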
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant 62076262, Grant 61673402, Grant 60802069 and Grant 61273270.
About this article
Cite this article
Jiang, W., Hu, H. Hadamard Product Perceptron Attention for Image Captioning. Neural Process Lett 55, 2707–2724 (2023). https://doi.org/10.1007/s11063-022-10980-w