
Speech emotion recognition based on multi‐feature and multi‐lingual fusion

  • Special issue 1193: Intelligent Processing of Multimedia Signals
  • Journal: Multimedia Tools and Applications

Abstract

A speech emotion recognition algorithm based on multi-feature and multi-lingual fusion is proposed to address the low recognition accuracy caused by the lack of large speech datasets and the low robustness of acoustic features in speech emotion recognition. First, handcrafted and deep automatic features are extracted from existing Chinese and English emotional speech data. Then, the various features of each language are fused. Finally, the fused features of the different languages are fused again and used to train a classification model. Comparing the fused features with the unfused ones, the results show that the fused features significantly improve the accuracy of the speech emotion recognition algorithm. The proposed solution is evaluated on two Chinese corpora and two English corpora, and is shown to provide more accurate predictions than the original solution. The results indicate that the multi-feature and multi-lingual fusion algorithm can significantly improve speech emotion recognition accuracy when the dataset is small.
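To make the pipeline described in the abstract concrete, the sketch below illustrates the general fusion idea under stated assumptions; it is not the authors' implementation. Handcrafted features are stood in for by utterance-level MFCC statistics computed with librosa, the deep automatic features come from a hypothetical extract_deep_features function (for example, an embedding from a pretrained network), the two feature types are fused by concatenation, and the Chinese and English corpora are pooled into one training set for a linear SVM.

```python
# Minimal sketch of multi-feature and multi-lingual fusion.
# Assumptions (not from the paper): MFCC statistics stand in for the
# handcrafted features, extract_deep_features() is a user-supplied
# deep feature extractor, and a linear SVM is the classifier.
import numpy as np
import librosa
from sklearn.svm import SVC


def handcrafted_features(wav_path, n_mfcc=13):
    """Utterance-level handcrafted descriptor: mean and std of MFCCs."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def fuse_features(wav_path, extract_deep_features):
    """Multi-feature fusion: concatenate handcrafted and deep features."""
    hc = handcrafted_features(wav_path)
    deep = extract_deep_features(wav_path)  # hypothetical deep extractor
    return np.concatenate([hc, deep])


def train_multilingual(chinese_set, english_set, extract_deep_features):
    """Multi-lingual fusion: pool fused features from both languages
    into a single training set and fit one classifier."""
    X, y = [], []
    for path, label in chinese_set + english_set:
        X.append(fuse_features(path, extract_deep_features))
        y.append(label)
    clf = SVC(kernel="linear")
    clf.fit(np.array(X), np.array(y))
    return clf
```

The key design choice mirrored here is that fusion happens twice: once across feature types within a language, and once across languages by training a single model on the pooled fused features.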



Acknowledgements

For advice and discussions, we thank Heyan Huang, Professor at the School of Computer Science, Beijing Institute of Technology. We also thank the anonymous reviewers for their valuable work.

Author information


Corresponding authors

Correspondence to Ying Ren or Na Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, C., Ren, Y., Zhang, N. et al. Speech emotion recognition based on multi‐feature and multi‐lingual fusion. Multimed Tools Appl 81, 4897–4907 (2022). https://doi.org/10.1007/s11042-021-10553-4

