Abstract
Accurate emotion detection from speech utterances has recently been a challenging and active research area. Speech emotion recognition (SER) systems play an essential role in human–machine interaction, virtual reality, emergency services, and many other real-time systems. SER remains an open-ended problem because subjects from different regions and linguistic backgrounds convey emotions quite differently. Conventional approaches used low-level periodic features of audio samples, such as energy and pitch, for classification, but these were neither accurate enough nor well generalized. With recent advancements in computer vision and neural networks, high-level features can be extracted and more accurate recognition achieved. This study proposes an ensemble deep CNN + Bi-LSTM framework for speech emotion recognition and the classification of seven different emotions. Paralinguistic log Mel-frequency spectral coefficients (MFSC) are used as features to train the proposed architecture. The proposed hybrid model is validated on the TESS and SAVEE datasets. Experimental results indicate a classification accuracy of 96.36%. The proposed model is compared with existing models, demonstrating the superiority of the hybrid deep CNN and Bi-LSTM model.
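The log-MFSC features mentioned in the abstract are log-scaled triangular Mel filterbank energies computed from a short-time power spectrum. As an illustration only, the following NumPy sketch shows one common way such features can be computed; the frame length, hop size, and number of Mel bands here are placeholder choices, not the parameters used in the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Illustrative log-MFSC extraction (assumed parameters, not the paper's)."""
    # Frame the signal with a Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Short-time power spectrum
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Triangular Mel filterbank spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    # Log of filterbank energies (small floor avoids log(0))
    return np.log(power @ fbank.T + 1e-10)  # shape: (n_frames, n_mels)

# Demo on one second of a synthetic 440 Hz tone
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
mfsc = log_mel_spectrogram(sig)
print(mfsc.shape)  # (61, 40): 61 frames x 40 Mel bands
```

The resulting time-by-frequency matrix is the kind of 2D input that a CNN front end can consume, with a Bi-LSTM then modeling the temporal sequence of frame-level features.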
Data availability
The data will be made available upon reasonable request.
Code availability
The code will be made available upon reasonable request.
Author information
Contributions
Prakasam P and Sureshkumar T R devised and proofread the main conceptual ideas. Swami Mishra and Nehal Bhatnagar worked on most of the technical details, including devising the model, data collection, and experimentation. Prakasam P performed the evaluation to validate the proposed model. Sureshkumar T R, Swami Mishra, and Nehal Bhatnagar prepared the manuscript, which was verified by Prakasam P.
Ethics declarations
Conflicts of interest
The authors declare that there is no conflict of interest in this research work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mishra, S., Bhatnagar, N., P, P. et al. Speech emotion recognition and classification using hybrid deep CNN and BiLSTM model. Multimed Tools Appl 83, 37603–37620 (2024). https://doi.org/10.1007/s11042-023-16849-x