Multi-modal Speech Emotion Recognition: Improving Accuracy Through Fusion of VGGish and BERT Features with Multi-head Attention

  • Conference paper
  • Industrial Networks and Intelligent Systems (INISCOM 2023)

Abstract

Recent research has shown that multi-modal learning is an effective way to improve classification performance by combining several forms of input, notably in speech emotion recognition (SER) tasks. However, differences between the modalities can degrade SER performance. To address this problem, this paper proposes a novel multi-modal SER approach called 3M-SER. 3M-SER leverages multi-head attention to fuse information from multiple feature embeddings, including audio and text features. It builds on the SERVER approach but adds a fusion module that improves the integration of text and audio features, leading to better classification performance. To further strengthen the correlation between the modalities, a LayerNorm is applied to the audio features prior to fusion. Our approach achieved an unweighted accuracy (UA) of 79.96% and a weighted accuracy (WA) of 80.66% on the IEMOCAP benchmark dataset, indicating that it outperforms SERVER and recent methods with similar architectures. The results also highlight the effectiveness of incorporating an extra fusion module in multi-modal learning.
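The abstract describes the fusion step only at a high level, so the sketch below is an illustrative PyTorch reconstruction rather than the authors' released code. It assumes 128-dimensional VGGish frame embeddings, 768-dimensional BERT token embeddings, and arbitrary choices for the shared embedding size, number of attention heads, pooling strategy, and a four-class IEMOCAP label set; only the overall pattern (LayerNorm on the audio features, then multi-head attention fusing the two modalities before classification) follows the abstract.

```python
# Minimal sketch of the fusion idea described in the abstract (not the authors'
# exact implementation). Dimensions, head count, pooling, and the classifier
# head are assumptions made for illustration.
import torch
import torch.nn as nn


class MultiModalFusionSER(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, fused_dim=256,
                 num_heads=4, num_classes=4):
        super().__init__()
        # LayerNorm applied to audio features prior to fusion, as in the abstract.
        self.audio_norm = nn.LayerNorm(audio_dim)
        # Project both modalities into a shared space (assumed design choice).
        self.audio_proj = nn.Linear(audio_dim, fused_dim)
        self.text_proj = nn.Linear(text_dim, fused_dim)
        # Multi-head attention fuses the two embedding sequences:
        # text tokens attend over audio frames (one plausible arrangement).
        self.fusion_attn = nn.MultiheadAttention(fused_dim, num_heads,
                                                 batch_first=True)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, audio_frames, 128) VGGish embeddings
        # text_feats:  (batch, text_tokens, 768) BERT embeddings
        audio = self.audio_proj(self.audio_norm(audio_feats))
        text = self.text_proj(text_feats)
        fused, _ = self.fusion_attn(query=text, key=audio, value=audio)
        # Mean-pool the fused sequence and classify into emotion categories.
        return self.classifier(fused.mean(dim=1))


if __name__ == "__main__":
    model = MultiModalFusionSER()
    audio = torch.randn(2, 10, 128)   # dummy VGGish frame embeddings
    text = torch.randn(2, 20, 768)    # dummy BERT token embeddings
    print(model(audio, text).shape)   # torch.Size([2, 4])
```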

References

  1. Liu, D., Chen, L., Wang, Z., Diao, G.: Speech expression multimodal emotion recognition based on deep belief network. J. Grid Comput. 19(2), 22 (2021)

  2. Pham, N.T., Dang, D.N.M., Nguyen, S.D.: A method upon deep learning for speech emotion recognition. J. Adv. Eng. Comput. 4(4), 273–285 (2020)

  3. Bao, F., Neumann, M., Vu, N.T.: CycleGAN-based emotion style transfer as data augmentation for speech emotion recognition. In: Kubin, G., Kacic, Z. (eds.) Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15–19 September 2019, pp. 2828–2832. ISCA (2019)

  4. Pham, N.T., et al.: Speech emotion recognition: a brief review of multi-modal multi-task learning approaches. In: AETA 2022-Recent Advances in Electrical Engineering and Related Sciences: Theory and Application. Springer, Cham (2022)

  5. Pham, N.T., Dang, D.N.M., Pham, B.N.H., Nguyen, S.D.: SERVER: multi-modal speech emotion recognition using transformer-based and vision-based embeddings. In: ICIIT 2023: 8th International Conference on Intelligent Information Technology, Da Nang, Vietnam, 24–26 February 2023. ACM (2023)

  6. Hershey, S., et al.: CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, 5–9 March 2017, pp. 131–135. IEEE (2017)

  7. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019)

  8. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 2017, Red Hook, NY, USA, pp. 6000–6010. Curran Associates Inc. (2017)

  9. Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008)

  10. Gemmeke, J.F., et al.: Audio set: an ontology and human-labeled dataset for audio events. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, 5–9 March 2017, pp. 776–780. IEEE (2017)

  11. Lee, S., Han, D.K., Ko, H.: Multimodal emotion recognition fusion analysis adapting BERT with heterogeneous feature unification. IEEE Access 9, 94557–94572 (2021)

  12. Lee, Y., Yoon, S., Jung, K.: Multimodal speech emotion recognition using cross attention with aligned audio and text. In: Meng, H., Xu, B., Zheng, T.F. (eds.) Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020, pp. 2717–2721. ISCA (2020)

  13. Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: 2018 IEEE Spoken Language Technology Workshop, SLT 2018, Athens, Greece, 18–21 December 2018, pp. 112–118. IEEE (2018)

  14. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)

  15. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library (2019). https://pytorch.org/

  16. Tseng, S.-Y., Narayanan, S., Georgiou, P.G.: Multimodal embeddings from language models for emotion recognition in the wild. IEEE Signal Process. Lett. 28, 608–612 (2021)

  17. Sun, L., Liu, B., Tao, J., Lian, Z.: Multimodal cross- and self-attention network for speech emotion recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, 6–11 June 2021, pp. 4275–4279. IEEE (2021)

  18. Pham, N.T., et al.: Hybrid data augmentation and deep attention-based dilated convolutional-recurrent neural networks for speech emotion recognition. Expert Syst. Appl. 120608 (2023)

Author information

Corresponding author

Correspondence to Duc Ngoc Minh Dang.

Copyright information

© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Tran, P.N., Vu, T.D.T., Dang, D.N.M., Pham, N.T., Tran, A.K. (2023). Multi-modal Speech Emotion Recognition: Improving Accuracy Through Fusion of VGGish and BERT Features with Multi-head Attention. In: Vo, N.S., Tran, H.A. (eds.) Industrial Networks and Intelligent Systems. INISCOM 2023. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 531. Springer, Cham. https://doi.org/10.1007/978-3-031-47359-3_11

  • DOI: https://doi.org/10.1007/978-3-031-47359-3_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47358-6

  • Online ISBN: 978-3-031-47359-3

  • eBook Packages: Computer Science, Computer Science (R0)
