
Speech emotion recognition based on multi‐feature and multi‐lingual fusion

  • Special issue 1193: Intelligent Processing of Multimedia Signals
  • Journal: Multimedia Tools and Applications

Abstract

A speech emotion recognition algorithm based on multi-feature and multi-lingual fusion is proposed to address the low recognition accuracy caused by the lack of large speech datasets and the low robustness of acoustic features in speech emotion recognition. First, handcrafted and deep automatic features are extracted from existing Chinese and English emotional speech data. Then, the various features of each language are fused. Finally, the fused features of the different languages are fused again and used to train a classification model. Comparing the fused features with the unfused ones, the results show that the fused features significantly improve the accuracy of the speech emotion recognition algorithm. The proposed solution is evaluated on two Chinese corpora and two English corpora, and is shown to provide more accurate predictions than the original solution. The results indicate that the multi-feature and multi-lingual fusion algorithm can significantly improve speech emotion recognition accuracy when the dataset is small.
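To make the pipeline described in the abstract concrete, the sketch below illustrates the general fusion idea under stated assumptions; it is not the authors' implementation. Handcrafted features are stood in for by utterance-level MFCC statistics computed with librosa, the deep automatic features come from a hypothetical extract_deep_features function (for example, an embedding from a pretrained network), the two feature types are fused by concatenation, and the Chinese and English corpora are pooled into one training set for a linear SVM.

```python
# Minimal sketch of multi-feature and multi-lingual fusion.
# Assumptions (not from the paper): MFCC statistics stand in for the
# handcrafted features, extract_deep_features() is a user-supplied
# deep feature extractor, and a linear SVM is the classifier.
import numpy as np
import librosa
from sklearn.svm import SVC


def handcrafted_features(wav_path, n_mfcc=13):
    """Utterance-level handcrafted descriptor: mean and std of MFCCs."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def fuse_features(wav_path, extract_deep_features):
    """Multi-feature fusion: concatenate handcrafted and deep features."""
    hc = handcrafted_features(wav_path)
    deep = extract_deep_features(wav_path)  # hypothetical deep extractor
    return np.concatenate([hc, deep])


def train_multilingual(chinese_set, english_set, extract_deep_features):
    """Multi-lingual fusion: pool fused features from both languages
    into a single training set and fit one classifier."""
    X, y = [], []
    for path, label in chinese_set + english_set:
        X.append(fuse_features(path, extract_deep_features))
        y.append(label)
    clf = SVC(kernel="linear")
    clf.fit(np.array(X), np.array(y))
    return clf
```

The key design choice mirrored here is that fusion happens twice: once across feature types within a language, and once across languages by training a single model on the pooled fused features.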



Acknowledgements

For advice and discussions, we thank Heyan Huang, Professor at the School of Computer Science, Beijing Institute of Technology. We also thank the anonymous reviewers for their valuable work.

Author information


Corresponding authors

Correspondence to Ying Ren or Na Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, C., Ren, Y., Zhang, N. et al. Speech emotion recognition based on multi‐feature and multi‐lingual fusion. Multimed Tools Appl 81, 4897–4907 (2022). https://doi.org/10.1007/s11042-021-10553-4

