Enhancing Emotion Recognition in Conversation Through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

  • Conference paper
Advanced Intelligent Computing Technology and Applications (ICIC 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14877)


Abstract

The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information. Previous ERC methods relied on simple connections for cross-modal fusion and ignored the information differences between modalities, so the model could not focus on modality-specific emotional information; at the same time, the shared information between modalities was left unprocessed, leading to redundant emotional information. To overcome these limitations, we propose a cross-modal fusion emotion prediction network based on vector connections. The network comprises two stages: a multi-modal feature fusion stage based on connection vectors and an emotion classification stage based on the fused features. Furthermore, we design a supervised inter-class contrastive learning module based on emotion labels. Experimental results confirm the effectiveness of the proposed method, which demonstrates excellent performance on the IEMOCAP and MELD datasets.
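To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of concatenation-style cross-modal fusion followed by a label-supervised (inter-class) contrastive loss over the fused utterance features. The module names, feature dimensions, class count, temperature, and loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: concatenation-based cross-modal fusion + supervised contrastive loss.
# Dimensions, class count, temperature, and loss weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConcatFusionClassifier(nn.Module):
    """Fuse text and audio utterance features by concatenation, then classify emotion."""

    def __init__(self, text_dim=768, audio_dim=128, hidden_dim=256, num_classes=6):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden_dim),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text_feat, audio_feat):
        # Vector connection (concatenation) of the two modalities, then a shared projection.
        fused = self.fuse(torch.cat([text_feat, audio_feat], dim=-1))
        return fused, self.classifier(fused)


def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Pull together fused features that share an emotion label, push apart the rest."""
    features = F.normalize(features, dim=-1)
    sim = features @ features.T / temperature                    # pairwise similarities
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    self_mask = torch.eye(len(labels), device=labels.device)
    pos_mask = pos_mask - self_mask                              # exclude self-pairs
    logits = sim - 1e9 * self_mask                               # mask out self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_per_anchor = pos_mask.sum(1).clamp(min=1)
    return -(pos_mask * log_prob).sum(1).div(pos_per_anchor).mean()


if __name__ == "__main__":
    model = ConcatFusionClassifier()
    text = torch.randn(8, 768)        # e.g. utterance embeddings from a text encoder
    audio = torch.randn(8, 128)       # e.g. acoustic utterance features
    labels = torch.randint(0, 6, (8,))
    fused, logits = model(text, audio)
    loss = F.cross_entropy(logits, labels) + 0.1 * supervised_contrastive_loss(fused, labels)
    print(loss.item())
```

In this sketch the contrastive term is simply added to the cross-entropy objective, so utterances sharing an emotion label are drawn together in the fused space while other classes are pushed apart, mirroring the inter-class separation the proposed module is meant to enforce.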

H. Shi and X. Zhang—Equal Contributions.



Acknowledgments

This work is supported by the Key Research and Development Program of Guangdong Province under Grant No. 2021B0101400003.

Author information

Corresponding author

Correspondence to Jianzong Wang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Shi, H. et al. (2024). Enhancing Emotion Recognition in Conversation Through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning. In: Huang, D.-S., Si, Z., Zhang, Q. (eds.) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14877. Springer, Singapore. https://doi.org/10.1007/978-981-97-5669-8_32

Download citation

  • DOI: https://doi.org/10.1007/978-981-97-5669-8_32

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5668-1

  • Online ISBN: 978-981-97-5669-8

  • eBook Packages: Computer Science, Computer Science (R0)
