
Speech emotion recognition and classification using hybrid deep CNN and BiLSTM model

Published in: Multimedia Tools and Applications

Abstract

Accurate emotion detection from speech utterances has recently been a challenging and active area of research. Speech emotion recognition (SER) systems play an essential role in human-machine interaction, virtual reality, emergency services, and many other real-time systems. It is an open-ended problem, as subjects from different regions and linguistic backgrounds convey emotions quite differently. Conventional approaches used low-level prosodic features of audio samples, such as energy and pitch, for classification, but these were neither accurate enough nor well generalized. With recent advances in computer vision and neural networks, high-level features can be extracted and more accurate recognition achieved. This study proposes a hybrid deep CNN + Bi-LSTM framework for speech emotion recognition and the classification of seven different emotions. Paralinguistic log Mel-frequency spectral coefficients (MFSC) are used as features to train the proposed architecture. The proposed hybrid model is validated on the TESS and SAVEE datasets. Experimental results indicate a classification accuracy of 96.36%. The proposed model is compared with existing models, demonstrating the superiority of the proposed hybrid deep CNN and Bi-LSTM model.
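The article itself includes no code. Purely as an illustration, the sketch below shows the kind of pipeline the abstract describes: log Mel-spectrogram (MFSC) extraction with librosa, followed by a small CNN front end and a BiLSTM classifier in PyTorch. Every specific choice here (16 kHz sampling, 64 Mel bands, the layer widths, the placeholder file name utterance.wav) is an assumption made for the sketch, not the authors' published configuration.

```python
# Minimal MFSC + CNN + BiLSTM sketch (illustrative only; not the authors' code).
import librosa
import torch
import torch.nn as nn

def extract_mfsc(path, sr=16000, n_mels=64):
    """Log Mel-frequency spectral coefficients (MFSC) for one utterance."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)           # shape: (n_mels, frames)

class CNNBiLSTM(nn.Module):
    """CNN feature extractor followed by a BiLSTM over the time axis."""
    def __init__(self, n_mels=64, n_classes=7):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(input_size=64 * (n_mels // 4), hidden_size=128,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, x):                      # x: (batch, 1, n_mels, frames)
        f = self.cnn(x)                        # (batch, 64, n_mels/4, frames/4)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (batch, time, features)
        out, _ = self.lstm(f)                  # read the sequence both ways
        return self.fc(out[:, -1])             # logits for 7 emotion classes

# Example: score one (placeholder) utterance.
mfsc = extract_mfsc("utterance.wav")           # path is a placeholder
x = torch.tensor(mfsc, dtype=torch.float32)[None, None]
logits = CNNBiLSTM()(x)
```

The point of such a structure is that the CNN reduces each spectrogram to a sequence of frame-level feature vectors, which the bidirectional LSTM then reads in both time directions before a final linear layer scores the seven emotion classes.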



Data availability

The data will be made available upon request.

Code availability

The code will be made available upon request.

Notes

  1. TESS dataset: https://tspace.library.utoronto.ca/handle/1807/24487

  2. SAVEE dataset: http://kahlan.eps.surrey.ac.uk/savee/Download.html
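As a practical note, emotion labels for both corpora are usually recovered from the file names. The snippet below sketches this for the commonly distributed versions; the naming conventions (TESS file names ending in the emotion word, SAVEE file names built from an emotion code plus a take number) are assumptions about those distributions, not something specified in this article.

```python
# Hypothetical label extraction for TESS and SAVEE files; naming conventions
# are assumed from the commonly distributed versions of the corpora.
from pathlib import Path

# SAVEE emotion codes (a01.wav, sa12.wav, ...) as usually documented.
SAVEE_CODES = {"a": "anger", "d": "disgust", "f": "fear", "h": "happiness",
               "n": "neutral", "sa": "sadness", "su": "surprise"}

def tess_label(path):
    # TESS files are typically named like OAF_back_angry.wav;
    # note that "ps" denotes pleasant surprise.
    return Path(path).stem.split("_")[-1].lower()

def savee_label(path):
    # SAVEE files are typically named like sa01.wav or DC_sa01.wav.
    code = "".join(ch for ch in Path(path).stem.split("_")[-1] if ch.isalpha())
    return SAVEE_CODES.get(code.lower())
```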


Author information


Contributions

Prakasam P and Sureshkumar T R devised the main conceptual ideas and proofread the work. Swami Mishra and Nehal Bhatnagar worked on almost all the technical details, including devising the model, data collection, and experimentation. Prakasam P computed the evaluation metrics to validate the proposed model. Sureshkumar T R, Swami Mishra, and Nehal Bhatnagar prepared the manuscript, which was verified by Prakasam P.

Corresponding author

Correspondence to Prakasam P.

Ethics declarations

Conflicts of interest

The authors declare that there is no conflict of interest in this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Mishra, S., Bhatnagar, N., P, P. et al. Speech emotion recognition and classification using hybrid deep CNN and BiLSTM model. Multimed Tools Appl 83, 37603–37620 (2024). https://doi.org/10.1007/s11042-023-16849-x

