Abstract
The textual and display-based control paradigm in human–computer interaction (HCI) has given way to more natural control modalities such as voice and gesture. Speech, in particular, carries a great deal of information, revealing the speaker's inner state and intention. While word-level analysis makes it possible to understand the speaker's request, other aspects of speech reveal the speaker's attitude, goal, and motivation. Recognizing emotion from speech has therefore become crucial for modern human–computer interface systems. Numerous techniques for sound analysis have been developed in the past. This work aims to detect human emotions from voice snippets; for this, the open-source English-language Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Hindi-language IITKGP-SEHSC dataset are used. RAVDESS contains over 2000 voice samples recorded by 24 actors, covering eight emotions: anger, fear, neutral, calmness, happiness, sadness, disgust, and surprise. The proposed model uses an Adam-optimized deep learning model together with MFCC, chroma, and Mel-band spectral energy (MBSE) features to classify and recognize eight different human vocal emotions. A multilayer perceptron (MLP) classifier is used for classification. The efficiency of the proposed model was compared with other state-of-the-art approaches and the outcomes were assessed. Using the proposed model structure on the RAVDESS and IITKGP-SEHSC datasets, overall accuracies of 85.19% and 80%, respectively, were achieved.
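The classification stage described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature vectors here are synthetic stand-ins (in the paper each sample would be a vector of MFCC, chroma, and MBSE values extracted from an audio clip, e.g. with librosa), and the layer sizes and sample counts are assumptions chosen only to make the toy example learnable.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-clip features: e.g. 40 MFCCs + 12 chroma
# bins + 128 Mel-band spectral energies = 180 values per sample.
n_samples, n_features, n_emotions = 400, 180, 8
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, n_emotions, size=n_samples)
# Shift class means apart so the toy problem is actually learnable.
X += y[:, None] * 0.5

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0)

# MLP classifier trained with the Adam optimizer, mirroring the
# paper's general setup (hidden layer size is an assumption).
clf = MLPClassifier(hidden_layer_sizes=(300,), solver="adam",
                    max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```

On real data, `X` would be built by averaging frame-level MFCC, chroma, and Mel-spectrogram values per clip, and `y` would hold the eight emotion labels listed in the abstract.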
Data availability
The experiment uses RAVDESS, an open-source speech emotion dataset. This dataset can be shared upon request.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest. No funding was received to conduct this experiment and study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Khurana, S., Dev, A. & Bansal, P. ADAM optimised human speech emotion recogniser based on statistical information distribution of chroma, MFCC, and MBSE features. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19321-6