Speech Emotion Recognition Integrating Paralinguistic Features and Auto-encoders in a Deep Learning Model

  • Rubén D. Fonnegra
  • Gloria M. Díaz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10901)


Emotions play an extremely important role in human decisions and in interactions with both other humans and machines. This fact has promoted the development of methods that aim to recognize emotions from different physiological signals. In particular, emotion recognition from speech signals remains a research challenge because of the large voice variability between subjects. In this work, paralinguistic features and deep learning models are used to perform speech emotion classification. A set of 1582 INTERSPEECH 2010 features is first extracted from the speech signals and then fed to a deep convolutional stacked auto-encoder network that transforms those features into a higher-level representation. Then, a multilayer perceptron is trained to classify the utterances into one of six emotions: anger, fear, disgust, happiness, surprise and sadness. The size of the auto-encoders was evaluated for four different architectures in terms of performance, computational cost and execution time, in order to select the most suitable model configuration. The proposed approach was evaluated in two stages. First, a 5-fold cross-validation strategy was performed using \(70\%\) of the samples. Then, the best network architecture was used to evaluate the classification on a validation set composed of the remaining \(30\%\) of the samples. Results report an overall accuracy of 91.4% in the 5-fold testing stage and 61.1% in the validation set.
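The two-stage evaluation protocol described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' code: the function name `holdout_then_kfold`, the sample count of 1000, the random seed, and the unstratified shuffling are all assumptions made for the example. The abstract states only the \(70\%\)/\(30\%\) proportions and the use of 5 folds.

```python
import random


def holdout_then_kfold(n_samples, dev_fraction=0.7, k=5, seed=0):
    """Split sample indices into k cross-validation splits plus a held-out set.

    Hypothetical helper: a fraction `dev_fraction` of the shuffled indices is
    used for k-fold cross-validation, and the remainder is held out.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_dev = round(n_samples * dev_fraction)
    dev, holdout = indices[:n_dev], indices[n_dev:]

    # Partition the development indices into k nearly equal folds.
    folds = [dev[i::k] for i in range(k)]

    # Each fold serves once as the test fold; the other k - 1 form the training set.
    splits = []
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, folds[i]))
    return splits, holdout


# Assumed sample count, chosen only to make the proportions easy to read.
splits, holdout = holdout_then_kfold(n_samples=1000)
for fold_id, (train, test) in enumerate(splits):
    print(f"fold {fold_id}: {len(train)} train, {len(test)} test")
print(f"held-out validation set: {len(holdout)} samples")
```

In a setting like the one the abstract describes, each candidate auto-encoder architecture would be trained and scored on the five splits, and only the selected architecture would then be scored on the held-out indices, which are never used during model selection.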


Keywords: Speech emotion recognition, Deep learning, Paralinguistic features, Auto-encoders



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Instituto Tecnológico Metropolitano, Medellín, Colombia
