Joint Visual-Textual Sentiment Analysis Based on Cross-Modality Attention Mechanism

  • Xuelin Zhu
  • Biwei Cao
  • Shuai Xu
  • Bo Liu
  • Jiuxin CaoEmail author
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11295)


Recently, many researchers have focused on the joint visual-textual sentiment analysis since it can better extract user sentiments toward events or topics. In this paper, we propose that visual and textual information should differ in their contribution to sentiment analysis. Our model learns a robust joint visual-textual representation by incorporating a cross-modality attention mechanism and semantic embedding learning based on bidirectional recurrent neural network. Experimental results show that our model outperforms existing the state-of-the-art models in sentiment analysis under real datasets. In addition, we also investigate different proposed model’s variants and analyze the effects of semantic embedding learning and cross-modality attention mechanism in order to provide deeper insight on how these two techniques help the learning of joint visual-textual sentiment classifier.


Sentiment analysis Cross-modality analysis Recurrent neural network Attention mechanism 



This work is supported by National Natural Science Foundation of China under Grants, No. 61772133, No. 61472081, No. 61402104, No. 61370207, No. 61370208, No. 61300024, No. 61320106007, Key Laboratory of Computer Network Technology of Jiangsu Province, Jiangsu Provincial Key Laboratory of Network and Information Security under Grants No. BM2003201, and Key Laboratory of Computer Network and Information Integration of Ministry of Education of China under Grants No. 93K-9.


  1. 1.
    Wang, L., Li, Y., Huang, J., et al.: Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Mach. Intell. (2018)Google Scholar
  2. 2.
    You, Q., Jin, H., Luo, J.: Visual sentiment analysis by attending on local image regions. In: AAAI, pp. 231–237 (2017)Google Scholar
  3. 3.
    Vinyals, O., Toshev, A., Bengio, S., et al.: Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 652–663 (2017)CrossRefGoogle Scholar
  4. 4.
    Chen, X., Wang, Y., Liu, Q.: Visual and textual sentiment analysis using deep fusion convolutional neural networks. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 1557–1561. IEEE (2017)Google Scholar
  5. 5.
    You, Q., Cao, L., Jin, H., et al.: Robust visual-textual sentiment analysis: when attention meets tree-structured recursive neural networks. In: Proceedings of the 2016 ACM on Multimedia Conference, pp. 1008–1017. ACM (2016)Google Scholar
  6. 6.
    You, Q., Luo, J., Jin, H., et al.: Cross-modality consistent regression for joint visual textual sentiment analysis of social multimedia. In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 13–22. ACM (2016)Google Scholar
  7. 7.
    Katsurai, M., Satoh, S.: Image sentiment analysis using latent correlations among visual, textual, and sentiment views. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2837–2841. IEEE (2016)Google Scholar
  8. 8.
    Cao, D., Ji, R., Lin, D., et al.: A cross-media public sentiment analysis system for microblog. Multimedia Syst. 22(4), 479–486 (2016)CrossRefGoogle Scholar
  9. 9.
    Yang, Z., Yang, D., Dyer, C., et al.: Hierarchical attention networks for document classification. In: 2016 Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489 (2016)Google Scholar
  10. 10.
    Szegedy, C., Vanhoucke, V., Ioffe, S., et al.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)Google Scholar
  11. 11.
    Pang, L., Zhu, S., Ngo, C.W.: Deep multimodal learning for affective analysis and retrieval. IEEE Trans. Multimedia 17(11), 2008–2020 (2015)CrossRefGoogle Scholar
  12. 12.
    Xu, K., Ba, J., Kiros, R., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)Google Scholar
  13. 13.
    Ma, L., Lu, Z., Shang, L., et al.: Multimodal convolutional neural networks for matching image and sentence. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2623–2631 (2015)Google Scholar
  14. 14.
    Campos, V., Salvador, A., Giro-i-Nieto, X., et al.: Diving deep into sentiment: understanding fine-tuned CNNs for visual sentiment prediction. In: Proceedings of the 1st International Workshop on Affect and Sentiment in Multimedia, pp. 57–62. ACM (2015)Google Scholar
  15. 15.
    Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456 (2015)Google Scholar
  16. 16.
    Wang, M., Cao, D., Li, L., et al.: Microblog sentiment analysis based on cross-media bag-of-words model. In: Proceedings of International Conference on Internet Multi-media Computing and Service, p. 76. ACM (2014)Google Scholar
  17. 17.
    Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate arXiv preprint arXiv:1409.0473 (2014)
  18. 18.
    Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)Google Scholar
  19. 19.
    You, Q., Luo, J.: Towards social imagematics: sentiment analysis in social multimedia. In: Proceedings of the Thirteenth International Workshop on Multimedia Data Mining, p. 3. ACM (2013)Google Scholar
  20. 20.
    Chen, T., Lu, D., Kan, M.Y., et al.: Understanding and classifying image tweets. In: Proceedings of the 21st ACM International Conference on Multimedia, pp. 781–784. ACM (2013)Google Scholar
  21. 21.
    Socher, R., Perelygin, A., Wu, J., et al.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642 (2013)Google Scholar
  22. 22.
    Borth, D., Ji, R., Chen, T., et al.: Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM International Conference on Multimedia, pp. 223–232. ACM (2013)Google Scholar
  23. 23.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing systems, pp. 1097–1105 (2012)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Xuelin Zhu
    • 1
  • Biwei Cao
    • 2
  • Shuai Xu
    • 1
  • Bo Liu
    • 1
  • Jiuxin Cao
    • 3
    Email author
  1. 1.School of Computer Science and Engineering, Southeast UniversityNanjingChina
  2. 2.ANU College of Engineering and Computer ScienceAustralian National UniversityCanberraAustralia
  3. 3.School of Cyber Science and Engineering, Southeast UniversityNanjingChina

Personalised recommendations