
Optimal trained ensemble of classification model for speech emotion recognition: Considering cross-lingual and multi-lingual scenarios

Published in: Multimedia Tools and Applications

Abstract

Speech plays a significant role in conveying emotional information, and speech emotion recognition (SER) has emerged as a crucial component of human–computer interfaces, with demanding real-time and accuracy requirements. This paper proposes a novel Improved Coot Optimization-based Ensemble Classification (ICO-EC) model for SER that follows three stages: preprocessing, feature extraction, and classification. In the preprocessing stage, the class imbalance problem is resolved using Improved SMOTE-ENC. In the feature extraction stage, IMFCC-based, chroma-based, ZCR-based, and spectral roll-off-based features are extracted. In the final classification stage, an ensemble model combines three classifiers: Deep Maxout, LSTM, and ICNN. The training process is made optimal via Improved Coot Optimization (ICO), which tunes the ensemble's weights. Finally, the performance of the developed model is validated against conventional methods on four different databases. In the cross-lingual setting, the proposed model achieves accuracies of 92.76% for Hindi, 92.95% for Kannada, 93.85% for Telugu, and 95.97% for Urdu. On the Hindi dataset, the ICO-EC model exceeded 93% accuracy, outperforming the other models.
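The paper does not publish code, but the feature extraction stage lends itself to a brief illustration. The sketch below shows baseline versions of the four feature families named in the abstract, computed with the open-source librosa library. It is an approximation under stated assumptions, not the authors' implementation: the paper's improved IMFCC variant is replaced by standard MFCCs, and the mean-pooling convention and parameter choices (sampling rate, number of coefficients) are illustrative defaults.

```python
# Illustrative sketch only: baseline versions of the four feature families
# used in the ICO-EC pipeline, extracted with librosa. The paper's improved
# IMFCC variant and the ICO-weighted ensemble are NOT reproduced here.
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, n_mfcc=13):
    """Return one fixed-length feature vector per utterance by
    mean-pooling frame-level features over time (a common convention)."""
    y, sr = librosa.load(wav_path, sr=sr)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # cepstral features
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # chroma features
    zcr = librosa.feature.zero_crossing_rate(y)               # ZCR per frame
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)    # spectral roll-off

    # Average each feature matrix across frames, then concatenate
    # into a single utterance-level vector.
    return np.concatenate([
        mfcc.mean(axis=1),
        chroma.mean(axis=1),
        zcr.mean(axis=1),
        rolloff.mean(axis=1),
    ])
```

In the full ICO-EC pipeline, vectors like these (with IMFCC in place of plain MFCC) would be fed to the Deep Maxout, LSTM, and ICNN classifiers, whose ensemble combination weights are tuned by Improved Coot Optimization.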


Data availability

The results were generated using speech emotion datasets for Hindi, Urdu, Kannada, and Telugu.

Abbreviations

SER:

Speech Emotion Recognition

ASR:

Automatic Speech Recognition

HMM:

Hidden Markov Models

DTW:

Dynamic Time Warping

MFCC:

Mel-frequency Cepstral Coefficients

NN:

Neural Network

ML:

Machine Learning

Taylor-DBN:

Taylor series-based Deep Belief Network

MKMFCC:

Multiple Kernel Mel Frequency Cepstral Coefficients

ECSO:

Enhanced Cat Swarm Optimization

OBL:

Opposition-Based Learning

DL:

Deep Learning

BDBN:

Bimodal Deep Belief Network

CNN:

Convolutional Neural Network

MEDC:

Mel Energy Spectrum Dynamic Coefficients

SVM:

Support Vector Machine

RNN:

Recurrent Neural Network


Author information


Corresponding author

Correspondence to Rupali Ramdas Kawade.

Ethics declarations

Informed consent

Not applicable.

Ethical approval

Not applicable.

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Kawade, R.R., Jagtap, S.K. Optimal trained ensemble of classification model for speech emotion recognition: Considering cross-lingual and multi-lingual scenarios. Multimed Tools Appl 83, 54331–54365 (2024). https://doi.org/10.1007/s11042-023-17097-9

