Transferability Evaluation of Speech Emotion Recognition Between Different Languages

Part of the Lecture Notes on Data Engineering and Communications Technologies book series (LNDECT, volume 134)

Abstract

Advances in automatic speech recognition have significantly accelerated the automation of contact centers, creating a need for robust Speech Emotion Recognition (SER) as an integral part of measuring the customer net promoter score. However, training a model for a specific language requires a dataset labeled with emotions for that language, which is a significant limitation: emotion detection datasets cover only a few languages, such as English, German, Mandarin, and Indian languages. Our results show a clear difference between predicting two emotions and predicting four, which leads us to narrow datasets down to particular practical use cases rather than train a model on an entire dataset. We also found that good transferability from a source language to a target language does not imply the same quality of transfer in the reverse direction; hence, engineers cannot expect symmetric transferability between languages. Chinese datasets are the hardest to transfer to other languages. Transferability from English datasets is also among the lowest; hence, for a production environment, engineers cannot rely on a model trained on English for their own language. We conducted more than 140 experiments across seven languages to evaluate the transferability of speech emotion recognition models trained on different languages and to provide a clear framework for choosing a starting dataset that yields good accuracy in practical implementations. The novelty of this study lies in the fact that models for different languages have not previously been compared with one another.
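As a concrete illustration of the evaluation protocol summarized above, the following is a minimal sketch of how a cross-lingual transferability matrix can be computed: train a classifier on each source-language corpus and score it on every target-language corpus. Everything in the sketch is an assumption for illustration, not the paper's implementation: the load_corpus stub returns synthetic embeddings and labels in place of the real emotion corpora, the language list and embedding size are illustrative, and logistic regression stands in for whatever model the experiments actually use.

```python
# Sketch of a cross-lingual SER transferability matrix (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
LANGUAGES = ["en", "de", "zh", "fa", "el", "et", "fr"]  # hypothetical set of seven
EMB_DIM, N_UTT, N_EMOTIONS = 192, 200, 4  # e.g. a four-emotion setup

def load_corpus(lang):
    """Placeholder standing in for a real per-language emotion corpus.

    In a real pipeline this would return speaker-independent splits of
    fixed-size utterance embeddings from a pretrained encoder, plus
    emotion labels; here it returns synthetic data so the sketch runs.
    """
    X = rng.normal(size=(N_UTT, EMB_DIM))
    y = rng.integers(0, N_EMOTIONS, size=N_UTT)
    return X, y

corpora = {lang: load_corpus(lang) for lang in LANGUAGES}

# Train on each source language, then evaluate on every target language.
transfer = np.zeros((len(LANGUAGES), len(LANGUAGES)))
for i, src in enumerate(LANGUAGES):
    X_train, y_train = corpora[src]
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    for j, tgt in enumerate(LANGUAGES):
        X_test, y_test = corpora[tgt]
        transfer[i, j] = accuracy_score(y_test, clf.predict(X_test))

print(np.round(transfer, 2))
```

Off-diagonal entries quantify cross-lingual transfer; comparing transfer[i, j] against transfer[j, i] makes the asymmetry discussed in the abstract directly visible.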

Keywords

  • Speech emotion recognition
  • Sentiment analysis
  • Emotion detection
  • Engagement analysis


Author information

Correspondence to Volodymyr Sokolov.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Iosifov, I., Iosifova, O., Romanovskyi, O., Sokolov, V., Sukailo, I. (2022). Transferability Evaluation of Speech Emotion Recognition Between Different Languages. In: Hu, Z., Dychka, I., Petoukhov, S., He, M. (eds) Advances in Computer Science for Engineering and Education. ICCSEEA 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 134. Springer, Cham. https://doi.org/10.1007/978-3-031-04812-8_35