Abstract
Image captioning is a challenging problem in image understanding, where most models are trained with a framework that combines a deep convolutional neural network with a recurrent neural network. However, the features extracted by the convolutional network capture only the salient regions and fail to cover the finer details of the image. Moreover, the vanishing-gradient problem of recurrent networks causes earlier information to be lost as the time step grows. In this paper, Cooperative Self-Attention (CSA) is proposed to address these problems. Compared with existing methods, our model enhances the image representation by fusing additional attribute information from object detection. A sub-module named Inter-Attribute, which models the interactions among objects, is proposed to strengthen the context of the entities. Exploiting the advantages of self-attention, and unlike previous methods that predict the next word from a single prior word and a hidden state, our model attends over all of the words generated so far to resolve long-term dependencies. Compared with published state-of-the-art methods, CSA demonstrates outstanding performance.
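The decoding idea described above (attending over every word generated so far, rather than conditioning only on the last word and a hidden state) can be sketched with plain scaled dot-product self-attention. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, random weights, and function names are assumptions for demonstration only.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over all generated word embeddings.

    X: (t, d) matrix stacking the embeddings of the t words produced so far.
    Each output row mixes information from *every* position, so the prediction
    at step t is not limited to the immediately preceding word.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (t, t) pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V                                # context-aware representations

# Toy example: 5 generated words, embedding size 8 (illustrative values).
rng = np.random.default_rng(0)
t, d = 5, 8
X = rng.normal(size=(t, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)   # shape (5, 8)
```

Because the attention weights span all positions at once, the representation of the latest word can draw directly on the first word generated, which is the property the abstract appeals to for long-term dependencies.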
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Zhao, D., Yang, R., Wang, Z. et al. A cooperative approach based on self-attention with interactive attribute for image caption. Multimed Tools Appl 82, 1223–1236 (2023). https://doi.org/10.1007/s11042-022-13279-z