
ADAM optimised human speech emotion recogniser based on statistical information distribution of chroma, MFCC, and MBSE features

Published in: Multimedia Tools and Applications

Abstract

The textual or display-based control paradigm in human–computer interaction (HCI) has given way to more natural control modalities such as voice and gesture. Speech, in particular, carries a great deal of information, revealing the speaker's inner state and intention. While word-level analysis makes it possible to understand the speaker's request, other aspects of speech reveal the speaker's attitude, goal, and motivation. Recognizing emotions from speech has therefore become crucial for modern human–computer interface systems, and numerous sound-analysis techniques have been developed for this purpose. This work aims to detect human emotions from short voice snippets; for this, the open-source English-language Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Hindi-language IITKGP-SEHSC dataset are used. RAVDESS contains over 2000 voice samples recorded by 24 actors and covers eight emotions: anger, fear, neutral, calmness, happiness, sadness, disgust, and surprise. The proposed model combines an ADAM-optimized deep learning model with MFCC, chroma, and Mel band spectral energy (MBSE) features to classify and recognize these eight vocal emotions; a multilayer perceptron (MLP) classifier is used for classification. The efficiency of the proposed model was compared to other state-of-the-art approaches, and the outcomes were assessed. On the RAVDESS and IITKGP-SEHSC datasets, the proposed model achieves overall accuracies of 85.19% and 80%, respectively.
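
The pipeline described above can be illustrated with a minimal sketch, assuming librosa for feature extraction and scikit-learn's MLPClassifier trained with the Adam solver. This is not the authors' released code: the file lists, layer sizes, and the mean-based summary used here in place of the paper's statistical information distribution are illustrative assumptions only.

# Minimal sketch (not the authors' code) of an MFCC + chroma + MBSE pipeline
# feeding an Adam-optimised MLP classifier. File lists, layer sizes, and the
# mean-based summary of each per-frame feature are illustrative assumptions.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def extract_features(wav_path, sr=22050, n_mfcc=40):
    """Return one feature vector: per-frame MFCC, chroma, and Mel band
    spectral energies, each summarised by its mean over time."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # (n_mfcc, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)              # (12, frames)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # Mel band spectral energies
    return np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1), mel.mean(axis=1)])

def train_emotion_recogniser(wav_files, emotion_labels):
    """Train an Adam-optimised MLP on the extracted features and report
    held-out accuracy. wav_files / emotion_labels are hypothetical inputs
    that would come from RAVDESS or IITKGP-SEHSC metadata."""
    X = np.array([extract_features(f) for f in wav_files])
    y = np.array(emotion_labels)  # eight emotion classes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(256, 128), solver="adam",
                        alpha=1e-4, batch_size=64, max_iter=500, random_state=0)
    clf.fit(X_train, y_train)
    return clf, accuracy_score(y_test, clf.predict(X_test))

Mean-pooling each feature over time is only a simple stand-in for the statistical summary used in the paper; the overall flow, however, matches the described design of hand-crafted spectral features feeding an Adam-trained MLP.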

Data availability

The experiments use RAVDESS, an open-source emotional speech dataset. The dataset can be shared on request.

Author information

Corresponding author

Correspondence to Surbhi Khurana.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest. No funding was received to conduct this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Khurana, S., Dev, A. & Bansal, P. ADAM optimised human speech emotion recogniser based on statistical information distribution of chroma, MFCC, and MBSE features. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19321-6
