
A graph convolution-based heterogeneous fusion network for multimodal sentiment analysis


Abstract

Multimodal sentiment analysis leverages multiple modalities, including text, audio, and video, to determine human sentiment tendencies, and is important for applications such as intention understanding and opinion analysis. However, it faces two critical challenges: how to effectively extract and integrate information from the various modalities, which is essential for narrowing the heterogeneity gap among them; and how to overcome information forgetting when modelling long sequences, which causes significant information loss and adversely affects modality fusion. To address these issues, this paper proposes a multimodal heterogeneity fusion network based on graph convolutional neural networks (HFNGC). A shared convolutional aggregation mechanism is used to bridge the semantic gap among modalities and reduce the noise caused by modality heterogeneity. In addition, the model applies Dynamic Routing to convert modality features into graph structures. By learning semantic information in the graph representation space, the model improves its ability to capture long-range dependencies. Furthermore, the model integrates complementary information among modalities and explores intra- and inter-modal interactions during the fusion stage. To validate the effectiveness of our model, we conduct experiments on two benchmark datasets. The results demonstrate that our method outperforms existing methods, exhibiting strong generalisation capability and high competitiveness.
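
To make the pipeline sketched in the abstract more concrete, the snippet below is a minimal PyTorch sketch of the three stages it names: shared convolutional aggregation, dynamic routing into graph nodes, and graph-convolutional fusion. It is not the authors' HFNGC implementation; the module names, feature dimensions, similarity-based adjacency, and the generic capsule-style routing are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedConvAggregator(nn.Module):
    """Projects each modality into a shared space with a 1D convolution
    (illustrative stand-in for a shared convolutional aggregation step)."""
    def __init__(self, in_dims, d_model, kernel_size=3):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv1d(d, d_model, kernel_size, padding=kernel_size // 2) for d in in_dims
        )

    def forward(self, feats):  # list of (batch, seq_len, in_dim) tensors
        return [p(x.transpose(1, 2)).transpose(1, 2) for p, x in zip(self.proj, feats)]


def dynamic_routing(seq, num_nodes=8, iters=3):
    """Capsule-style dynamic routing that condenses a frame sequence
    (batch, seq_len, d) into a small set of graph nodes (batch, num_nodes, d)."""
    b, t, d = seq.shape
    logits = seq.new_zeros(b, t, num_nodes)               # routing logits
    for _ in range(iters):
        coupling = F.softmax(logits, dim=-1)              # how strongly each frame feeds each node
        nodes = torch.einsum('btn,btd->bnd', coupling, seq)
        nodes = F.normalize(nodes, dim=-1)                # squash-like normalisation
        logits = logits + torch.einsum('btd,bnd->btn', seq, nodes)
    return nodes


class GraphConvFusion(nn.Module):
    """One graph-convolution layer over the concatenated modality nodes, using a
    similarity-based adjacency so intra- and inter-modal edges coexist."""
    def __init__(self, d_model, num_outputs=1):
        super().__init__()
        self.gc = nn.Linear(d_model, d_model)
        self.head = nn.Linear(d_model, num_outputs)

    def forward(self, nodes):                             # (batch, n_nodes, d)
        adj = F.softmax(nodes @ nodes.transpose(1, 2), dim=-1)
        h = F.relu(self.gc(adj @ nodes))                  # A * X * W
        return self.head(h.mean(dim=1))                   # read-out -> sentiment score


if __name__ == "__main__":
    text = torch.randn(4, 50, 768)    # toy text features
    audio = torch.randn(4, 375, 74)   # toy audio features
    video = torch.randn(4, 500, 35)   # toy video features

    agg = SharedConvAggregator([768, 74, 35], d_model=64)
    shared = agg([text, audio, video])
    nodes = torch.cat([dynamic_routing(s) for s in shared], dim=1)
    print(GraphConvFusion(64)(nodes).shape)               # torch.Size([4, 1])

The actual paper defines the routing, adjacency construction, and read-out differently; this sketch only mirrors the data flow (modalities, shared space, graph nodes, fused prediction).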


Availability of data and materials

In this work, we use two publicly available datasets, CMU-MOSI and CMU-MOSEI, both of which are available at https://github.com/A2Zadeh/CMU-MultimodalSDK.
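
As a convenience, the lines below sketch one way to fetch the data with the CMU-MultimodalSDK linked above. This is a hedged example, not part of the paper: the recipe attributes (mmdatasdk.cmu_mosi.highlevel, mmdatasdk.cmu_mosi.labels) follow the SDK's README and may differ across SDK versions.

# Minimal sketch: download CMU-MOSI feature files with the CMU-MultimodalSDK.
# Assumes the SDK is installed from the repository linked above; the recipe
# attributes follow its README and may change between SDK versions.
from mmsdk import mmdatasdk

# Each recipe maps a computational-sequence name to a downloadable .csd file.
recipe = dict(mmdatasdk.cmu_mosi.highlevel, **mmdatasdk.cmu_mosi.labels)
dataset = mmdatasdk.mmdataset(recipe, "cmumosi/")     # downloads into ./cmumosi/

print(list(dataset.computational_sequences.keys()))   # available modalities and labels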


Acknowledgements

The authors gratefully acknowledge funding from the Open Project Program of Shanghai Key Laboratory of Data Science (No. 2020090600004), as well as the resources and technical support provided by the High Performance Computing Center of Shanghai University and the Shanghai Engineering Research Center of Intelligent Computing System (No. 19DZ2252600).

Funding

This study was supported by the Open Project Program of Shanghai Key Laboratory of Data Science (No. 2020090600004), the High Performance Computing Center of Shanghai University, and the Shanghai Engineering Research Center of Intelligent Computing System (No. 19DZ2252600).

Author information

Authors and Affiliations

Authors

Contributions

Tong Zhao: Conceptualization of this study, Methodology, Software, Writing - Original Draft. Junjie Peng: Conceptualization of this study, Writing - Review & Editing, Supervision. Yansong Huang: Formal Analysis, Visualization. Lan Wang: Validation, Investigation. Huiran Zhang: Conceptualization of this study, Resources. Zesu Cai: Conceptualization of this study, Writing - Review & Editing.

Corresponding author

Correspondence to Junjie Peng.

Ethics declarations

Ethics approval

This article is original and has not been submitted to more than one journal for simultaneous consideration.

Consent to participate

All authors approved this article before submission, including the names and order of authors.

Consent for publication

All authors agreed with the content and gave explicit consent to submit it for publication.

Competing interests

The authors declare that they have no conflict of interest related to this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhao, T., Peng, J., Huang, Y. et al. A graph convolution-based heterogeneous fusion network for multimodal sentiment analysis. Appl Intell 53, 30455–30468 (2023). https://doi.org/10.1007/s10489-023-05151-w

