Machine learning techniques for speech emotion recognition using paralinguistic acoustic features

International Journal of Speech Technology

Abstract

Speech emotion recognition is one of the fastest growing areas of interest in affective computing. Emotion detection aids human–computer interaction and finds application in a wide range of sectors, from healthcare to retail to education. The present work provides a speech emotion recognition framework that is both reliable and efficient enough for real-time environments. Speech emotion recognition can draw on linguistic as well as paralinguistic aspects of speech; this work focuses on the latter, using non-lexical, paralinguistic attributes of speech such as pitch, intensity and mel-frequency cepstral coefficients to train supervised machine learning models for emotion recognition. A combination of prosodic and spectral features is used for experimental analysis, and classification is performed with Gaussian Naïve Bayes, Random Forest, k-Nearest Neighbours, Support Vector Machine and Multilayer Perceptron classifiers. These models were chosen for the speed with which they can be trained, which makes them suitable for real-time applications. Comparative analysis reveals SVM and MLP to be the best performers, with accuracies of 77.86% and 79.62% respectively. The performance of these classifiers is compared with benchmark results in the literature, and a significant improvement over state-of-the-art models is reported. The observations and findings of this work can inform the design of real-time emotion recognition frameworks and of applications and technologies for various domains.
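As a rough illustration of the pipeline described above, the sketch below extracts prosodic (pitch, intensity) and spectral (MFCC) summary statistics per utterance with librosa and trains SVM and MLP classifiers with scikit-learn. The particular feature statistics, hyper-parameters and the `extract_features`/`train_and_compare` helpers are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a paralinguistic speech-emotion-recognition pipeline:
# prosodic (pitch, intensity) and spectral (MFCC) features per utterance,
# fed to fast-to-train classifiers (SVM, MLP). Illustrative only.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score


def extract_features(path, sr=22050, n_mfcc=13):
    """Summarise one utterance as a fixed-length feature vector."""
    y, sr = librosa.load(path, sr=sr)
    # Spectral features: MFCCs, summarised by mean and std over frames.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Prosodic features: fundamental frequency (pitch) and RMS intensity.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                            fmax=librosa.note_to_hz('C7'))
    f0 = f0[~np.isnan(f0)]  # drop unvoiced frames
    rms = librosa.feature.rms(y=y)
    return np.hstack([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [f0.mean() if f0.size else 0.0, f0.std() if f0.size else 0.0],
        [rms.mean(), rms.std()],
    ])


def train_and_compare(wav_paths, labels):
    """wav_paths and labels (emotion classes, e.g. from an emotional speech
    corpus) are assumed to be supplied by the caller."""
    X = np.vstack([extract_features(p) for p in wav_paths])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=0)
    models = {
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10)),
        "MLP": make_pipeline(StandardScaler(),
                             MLPClassifier(hidden_layer_sizes=(128, 64),
                                           max_iter=500, random_state=0)),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, accuracy_score(y_te, model.predict(X_te)))
```

The same feature matrix could equally be passed to Gaussian Naïve Bayes, Random Forest or k-NN classifiers to reproduce the kind of comparison reported in the article.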


Author information

Corresponding author

Correspondence to Jabez Christopher.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jha, T., Kavya, R., Christopher, J. et al. Machine learning techniques for speech emotion recognition using paralinguistic acoustic features. Int J Speech Technol 25, 707–725 (2022). https://doi.org/10.1007/s10772-022-09985-6
