Machine learning techniques for speech emotion recognition using paralinguistic acoustic features

International Journal of Speech Technology

Abstract

Speech emotion recognition is one of the fastest growing areas of interest in affective computing. Emotion detection aids human–computer interaction and finds application in a wide range of sectors, from healthcare to retail to education. The present work provides a speech emotion recognition framework that is both reliable and efficient enough for real-time environments. Speech emotion recognition can draw on linguistic as well as paralinguistic aspects of speech; this work focuses on the latter, using non-lexical, paralinguistic attributes of speech such as pitch, intensity and mel-frequency cepstral coefficients to train supervised machine learning models for emotion recognition. A combination of prosodic and spectral features is used for experimental analysis, and classification is performed with Gaussian Naïve Bayes, Random Forest, k-Nearest Neighbours, Support Vector Machine and Multilayer Perceptron classifiers. These models were chosen for the speed with which they can be trained, which makes them suitable for real-time applications. Comparative analysis reveals SVM and MLP to be the best performers, with accuracies of 77.86% and 79.62% respectively. The performance of these classifiers is compared with benchmark results in the literature, and a significant improvement over state-of-the-art models is reported. The observations and findings of this work can inform the design of real-time emotion recognition frameworks and of applications and technologies for various domains.
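As a rough illustration of the pipeline described above, the sketch below extracts prosodic (pitch, intensity) and spectral (MFCC) summary statistics per utterance with librosa and trains SVM and MLP classifiers with scikit-learn. The particular feature statistics, hyper-parameters and the `extract_features`/`train_and_compare` helpers are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a paralinguistic speech-emotion-recognition pipeline:
# prosodic (pitch, intensity) and spectral (MFCC) features per utterance,
# fed to fast-to-train classifiers (SVM, MLP). Illustrative only.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score


def extract_features(path, sr=22050, n_mfcc=13):
    """Summarise one utterance as a fixed-length feature vector."""
    y, sr = librosa.load(path, sr=sr)
    # Spectral features: MFCCs, summarised by mean and std over frames.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Prosodic features: fundamental frequency (pitch) and RMS intensity.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                            fmax=librosa.note_to_hz('C7'))
    f0 = f0[~np.isnan(f0)]  # drop unvoiced frames
    rms = librosa.feature.rms(y=y)
    return np.hstack([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [f0.mean() if f0.size else 0.0, f0.std() if f0.size else 0.0],
        [rms.mean(), rms.std()],
    ])


def train_and_compare(wav_paths, labels):
    """wav_paths and labels (emotion classes, e.g. from an emotional speech
    corpus) are assumed to be supplied by the caller."""
    X = np.vstack([extract_features(p) for p in wav_paths])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=0)
    models = {
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10)),
        "MLP": make_pipeline(StandardScaler(),
                             MLPClassifier(hidden_layer_sizes=(128, 64),
                                           max_iter=500, random_state=0)),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, accuracy_score(y_te, model.predict(X_te)))
```

The same feature matrix could equally be passed to Gaussian Naïve Bayes, Random Forest or k-NN classifiers to reproduce the kind of comparison reported in the article.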


Author information

Corresponding author

Correspondence to Jabez Christopher.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jha, T., Kavya, R., Christopher, J. et al. Machine learning techniques for speech emotion recognition using paralinguistic acoustic features. Int J Speech Technol 25, 707–725 (2022). https://doi.org/10.1007/s10772-022-09985-6
