Abstract
Accurate emotion detection from speech utterances has recently been a challenging and active research area. Speech emotion recognition (SER) systems play an essential role in human–machine interaction, virtual reality, emergency services, and many other real-time systems. SER remains an open-ended problem because subjects from different regions and linguistic backgrounds convey emotions quite differently. Conventional approaches used low-level periodic features of audio samples, such as energy and pitch, for classification, but these were neither accurate enough nor well generalized. With recent advancements in computer vision and neural networks, high-level features can be extracted and more accurate recognition achieved. This study proposes an ensemble deep CNN + Bi-LSTM framework for speech emotion recognition and the classification of seven different emotions. Paralinguistic log Mel-frequency spectral coefficients (MFSC) are used as features to train the proposed architecture. The proposed hybrid model is validated on the TESS and SAVEE datasets. Experimental results indicate a classification accuracy of 96.36%. The proposed model is compared with existing models, demonstrating the superiority of the hybrid deep CNN and Bi-LSTM model.
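The log-MFSC features mentioned in the abstract are log-scaled triangular Mel filterbank energies computed from a short-time power spectrum. As an illustration only, the following NumPy sketch shows one common way such features can be computed; the frame length, hop size, and number of Mel bands here are placeholder choices, not the parameters used in the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Illustrative log-MFSC extraction (assumed parameters, not the paper's)."""
    # Frame the signal with a Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Short-time power spectrum
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Triangular Mel filterbank spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    # Log of filterbank energies (small floor avoids log(0))
    return np.log(power @ fbank.T + 1e-10)  # shape: (n_frames, n_mels)

# Demo on one second of a synthetic 440 Hz tone
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
mfsc = log_mel_spectrogram(sig)
print(mfsc.shape)  # (61, 40): 61 frames x 40 Mel bands
```

The resulting time-by-frequency matrix is the kind of 2D input that a CNN front end can consume, with a Bi-LSTM then modeling the temporal sequence of frame-level features.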
Data availability
The data will be made available upon reasonable request.
Code availability
The code will be made available upon reasonable request.
Author information
Contributions
Prakasam P and Sureshkumar T R devised and proofread the main conceptual ideas. Swami Mishra and Nehal Bhatnagar worked on most of the technical details, including devising the model, data collection, and experimentation. Prakasam P performed the evaluation to validate the proposed model. Sureshkumar T R, Swami Mishra, and Nehal Bhatnagar prepared the manuscript, which was verified by Prakasam P.
Ethics declarations
Conflicts of interest
The authors declare that there is no conflict of interest in this research work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mishra, S., Bhatnagar, N., P, P. et al. Speech emotion recognition and classification using hybrid deep CNN and BiLSTM model. Multimed Tools Appl 83, 37603–37620 (2024). https://doi.org/10.1007/s11042-023-16849-x