Abstract
The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information. Previous ERC methods relied on simple connections for cross-modal fusion and ignored the information differences between modalities, so the model could not focus on modality-specific emotional information; at the same time, the information shared between modalities was not processed when generating emotion predictions, leading to an information redundancy problem. To overcome these limitations, we propose a cross-modal fusion emotion prediction network based on vector connections. The network comprises two stages: a multi-modal feature fusion stage based on connection vectors and an emotion classification stage based on the fused features. Furthermore, we design a supervised inter-class contrastive learning module based on emotion labels. Experimental results confirm the effectiveness of the proposed method, which achieves excellent performance on the IEMOCAP and MELD datasets.
H. Shi and X. Zhang—Equal Contributions.
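The abstract only names the two ingredients at a high level. For intuition, the following minimal PyTorch sketch illustrates what they could look like: a connection-vector style gated fusion of text and audio utterance features, and a supervised inter-class contrastive loss over emotion labels in the spirit of Khosla et al. (2020). All module names, dimensions, the gating formulation, and the loss details here are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ConnectionVectorFusion(nn.Module):
    # Hypothetical fusion block: instead of plain concatenation, a learned
    # connection (gate) vector decides how much each modality contributes.
    def __init__(self, text_dim: int, audio_dim: int, fused_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.audio_proj = nn.Linear(audio_dim, fused_dim)
        self.gate = nn.Linear(2 * fused_dim, fused_dim)

    def forward(self, text_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        t = self.text_proj(text_feat)
        a = self.audio_proj(audio_feat)
        # Gate computed from both projections; output is a gated mixture,
        # so modality-specific information is weighted rather than merely stacked.
        g = torch.sigmoid(self.gate(torch.cat([t, a], dim=-1)))
        return g * t + (1.0 - g) * a


def supervised_contrastive_loss(features: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    # Supervised contrastive objective: utterances sharing an emotion label
    # are pulled together, utterances with different labels are pushed apart.
    features = F.normalize(features, dim=-1)
    sim = features @ features.T / temperature                 # pairwise similarities
    mask = labels.unsqueeze(0).eq(labels.unsqueeze(1))        # positives share a label
    self_mask = torch.eye(labels.size(0), dtype=torch.bool, device=labels.device)
    mask = mask & ~self_mask                                  # exclude self-pairs
    sim = sim.masked_fill(self_mask, float('-inf'))           # drop self from denominator
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_per_anchor = mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~mask, 0.0)).sum(dim=1) / pos_per_anchor
    return loss[mask.any(dim=1)].mean()                       # average over anchors with positives

In a full model of this kind, the fused representation would typically feed both the emotion classifier (a cross-entropy term) and this contrastive term, combined as a weighted sum of the two losses.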
Acknowledgments
This work was supported by the Key Research and Development Program of Guangdong Province under Grant No. 2021B0101400003.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Shi, H. et al. (2024). Enhancing Emotion Recognition in Conversation Through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning. In: Huang, DS., Si, Z., Zhang, Q. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14877. Springer, Singapore. https://doi.org/10.1007/978-981-97-5669-8_32
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5668-1
Online ISBN: 978-981-97-5669-8