
CSAT-FTCN: A Fuzzy-Oriented Model with Contextual Self-attention Network for Multimodal Emotion Recognition

Published in: Cognitive Computation

Abstract

Multimodal emotion analysis has attracted wide interest because of its many applications, such as question-answering systems. In real-world scenarios, however, people usually hold mixed or partial emotions toward the objects they evaluate. In this paper, we introduce a fuzzy temporal convolutional network based on contextual self-attention (CSAT-FTCN) to address these challenges. The model employs membership functions to represent various fuzzy emotions, allowing emotional states to be understood at a finer granularity. Moreover, the CSAT-FTCN captures the dependencies of a target utterance both on its own internal key information and on external contextual information. For multimodal data, we further introduce an attention fusion (ATF) mechanism to capture the dependency relationships between different modalities. Experimental results show that the CSAT-FTCN outperforms state-of-the-art models on the tested datasets, offering a novel approach to multimodal emotion analysis.
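The abstract names a membership function for modelling fuzzy emotions and an attention fusion (ATF) mechanism across modalities, but gives no implementation details. Below is a minimal PyTorch sketch of how such components are commonly realised; the Gaussian membership form, the feature dimensions, and the pattern of text queries attending over audio/visual features are illustrative assumptions and do not reproduce the authors' actual architecture.

```python
# Hedged sketch: Gaussian fuzzification of emotion features plus a simple
# cross-modal attention fusion. All sizes and design choices are assumptions;
# the paper's CSAT-FTCN is not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianFuzzifier(nn.Module):
    """Maps each feature vector to membership degrees over K fuzzy emotion sets."""

    def __init__(self, dim: int, n_sets: int = 4):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_sets, dim))  # learnable set centres
        self.log_sigma = nn.Parameter(torch.zeros(n_sets))     # learnable set widths

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) -> memberships in (0, 1]: (batch, n_sets)
        dist = torch.cdist(x, self.centers)
        return torch.exp(-(dist ** 2) / (2 * self.log_sigma.exp() ** 2))


class AttentionFusion(nn.Module):
    """Illustrative fusion: text features query audio/visual features via attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)

    def forward(self, text: torch.Tensor, others: torch.Tensor) -> torch.Tensor:
        # text: (batch, dim); others: (batch, n_modalities, dim)
        q = self.q(text).unsqueeze(1)                           # (batch, 1, dim)
        k, v = self.kv(others).chunk(2, dim=-1)                 # (batch, M, dim) each
        scores = q @ k.transpose(1, 2) / k.size(-1) ** 0.5      # scaled dot product
        fused = F.softmax(scores, dim=-1) @ v                   # (batch, 1, dim)
        return fused.squeeze(1) + text                          # residual fusion


if __name__ == "__main__":
    text = torch.randn(8, 128)        # dummy per-utterance text features
    av = torch.randn(8, 2, 128)       # dummy audio and visual features
    fused = AttentionFusion(128)(text, av)
    memberships = GaussianFuzzifier(128)(fused)
    print(fused.shape, memberships.shape)  # torch.Size([8, 128]) torch.Size([8, 4])
```

In the full model described by the abstract, such fused utterance representations would additionally pass through the temporal convolution and contextual self-attention stages before emotion classification.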


Data Availability

Data sharing is not applicable to this article, as no new datasets were generated during the current study. The datasets analyzed in this paper are publicly available.


Acknowledgements

The authors would like to thank all reviewers for their constructive and helpful reviews.

Funding

This research was funded by the National Natural Science Foundation of China (62106136, 6190223), the Natural Science Foundation of Guangdong Province (2019A1515010943), the Basic and Applied Basic Research of Colleges and Universities in Guangdong Province (Special Projects in Artificial Intelligence) (2019KZDZX1030), the 2020 Li Ka Shing Foundation Cross-Disciplinary Research Grant (2020LKSFG04D), the Science and Technology Major Project of Guangdong Province (STKJ2021005, STKJ202209002), and the Opening Project of the Guangdong Province Key Laboratory of Information Security Technology (2020B1212060078).

Author information


Corresponding authors

Correspondence to Runguo Wei or Geng Tu.

Ethics declarations

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Informed consent was obtained from all individual participants included in the study.

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection on A Decade of Sentic Computing

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jiang, D., Liu, H., Wei, R. et al. CSAT-FTCN: A Fuzzy-Oriented Model with Contextual Self-attention Network for Multimodal Emotion Recognition. Cogn Comput 15, 1082–1091 (2023). https://doi.org/10.1007/s12559-023-10119-6
