
Sentimental Visual Captioning using Multimodal Transformer

Published in: International Journal of Computer Vision

Abstract

We propose a new task called sentimental visual captioning, which generates captions with the inherent sentiment reflected by the input image or video. Compared with the stylized visual captioning task, which requires a predefined style independent of the image or video, our new task automatically analyzes the inherent sentiment tendency of the visual content. With this in mind, we propose a multimodal Transformer model, named Senti-Transformer, for sentimental visual captioning, which integrates both content and sentiment information from multiple modalities and incorporates prior sentimental knowledge to generate sentimental sentences. Specifically, we extract prior knowledge from a sentimental corpus to obtain sentimental textual information and design a multi-head Transformer encoder to encode multimodal features. We then decompose the attention layer in the middle of the Transformer decoder to focus on the important features of each modality, and the attended features are integrated through an intra- and inter-modality fusion mechanism to generate sentimental sentences. To effectively train the proposed model using the external sentimental corpus as well as the paired images or videos and factual sentences in existing captioning datasets, we propose a two-stage training strategy that first learns to incorporate sentimental elements into the sentences via a regularization term, and then learns to generate fluent and relevant sentences with the inherent sentimental styles via reinforcement learning with a sentimental reward. Extensive experiments on both image and video datasets demonstrate the effectiveness and superiority of our Senti-Transformer for sentimental visual captioning. Source code is available at https://github.com/ezeli/InSentiCap_ext.
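
To make the fusion mechanism described above concrete, here is a minimal PyTorch-style sketch of a decoder attention layer decomposed per modality with intra- and inter-modality fusion. It is not the authors' implementation (see the linked repository for that); the module names, the sigmoid gating, and all dimensions are illustrative assumptions.

```python
# Hypothetical sketch of a decoder attention layer that is decomposed per
# modality and then fused; NOT the authors' released implementation.
# Module names, the sigmoid gating, and all dimensions are illustrative.
import torch
import torch.nn as nn


class DecomposedMultimodalAttention(nn.Module):
    """Attend to each modality separately, then fuse the attended features."""

    def __init__(self, d_model: int, n_heads: int, n_modalities: int):
        super().__init__()
        # One cross-attention block per modality
        # (e.g. visual features and sentimental textual features).
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_modalities)]
        )
        # Intra-modality fusion: a gate decides how much each attended
        # feature contributes, conditioned on the current word states.
        self.gates = nn.ModuleList(
            [nn.Linear(2 * d_model, 1) for _ in range(n_modalities)]
        )
        # Inter-modality fusion: combine the gated features into one context.
        self.fuse = nn.Linear(n_modalities * d_model, d_model)

    def forward(self, word_states, modality_feats):
        # word_states:    (B, T, d_model) hidden states of the partial caption
        # modality_feats: list of (B, N_m, d_model) encoded features per modality
        attended = []
        for attn, gate, feats in zip(self.cross_attn, self.gates, modality_feats):
            ctx, _ = attn(word_states, feats, feats)       # attend within one modality
            g = torch.sigmoid(gate(torch.cat([word_states, ctx], dim=-1)))
            attended.append(g * ctx)                       # intra-modality gating
        return self.fuse(torch.cat(attended, dim=-1))      # inter-modality fusion


# Toy usage with two modalities (visual regions and sentimental words).
if __name__ == "__main__":
    B, T, d = 2, 5, 512
    layer = DecomposedMultimodalAttention(d_model=d, n_heads=8, n_modalities=2)
    words = torch.randn(B, T, d)
    visual = torch.randn(B, 36, d)       # e.g. 36 region features
    senti_words = torch.randn(B, 10, d)  # e.g. 10 sentimental word embeddings
    print(layer(words, [visual, senti_words]).shape)  # torch.Size([2, 5, 512])
```

In this sketch, intra-modality fusion is a learned gate on each modality's attended context, and inter-modality fusion is a linear projection of the concatenated contexts; the paper's exact fusion operators may differ.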




Acknowledgements

This work was supported in part by the Natural Science Foundation of China (NSFC) under Grant No. 62072041.

Author information

Corresponding author

Correspondence to Xinxiao Wu.

Additional information

Communicated by Shin’ichi Satoh.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, X., Li, T. Sentimental Visual Captioning using Multimodal Transformer. Int J Comput Vis 131, 1073–1090 (2023). https://doi.org/10.1007/s11263-023-01752-7
