Combined Feature Representation for Emotion Classification from Russian Speech

  • Oxana Verkholyak
  • Alexey Karpov
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 789)


Acoustic features for emotion classification can be extracted at different levels. Frame-level features provide low-level descriptors that preserve the temporal structure of the utterance. Utterance-level features, on the other hand, are functionals applied to the low-level descriptors and carry important information about the speaker's emotional state. Utterance-level features are particularly useful for determining emotion intensity; however, they lose information about temporal changes in the signal. Another drawback is that the number of feature vectors is often insufficient for complex classification tasks. One way to overcome these problems is to combine frame-level and utterance-level features and thus take advantage of both. This paper proposes to obtain a low-level feature representation by feeding frame-level descriptor sequences to a Long Short-Term Memory (LSTM) network, combine the output with a Principal Component Analysis (PCA) representation of the utterance-level features, and make the final prediction with a logistic regression classifier.
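The contrast between the two feature levels, and the idea of concatenating them, can be illustrated with a minimal standard-library Python sketch. It is not the authors' implementation: the function names, the toy descriptor values, the choice of functionals (mean, standard deviation, minimum, maximum), and the hidden size are all assumptions made for illustration. The LSTM weights are random and untrained, and the PCA and logistic regression stages of the proposed pipeline are omitted.

```python
import math
import random
import statistics


def functionals(frames):
    """Utterance-level vector: mean, std, min and max of each descriptor over time."""
    feats = []
    for track in zip(*frames):  # one track per low-level descriptor
        feats += [statistics.fmean(track), statistics.pstdev(track),
                  min(track), max(track)]
    return feats


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm(n_in, n_hidden, seed=0):
    """Random (untrained) weights for the four LSTM gates: input, forget, output, candidate."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + n_hidden + 1)]
             for _ in range(n_hidden)]
            for _ in range(4)]


def lstm_last_state(frames, weights):
    """Run a single-layer LSTM over the frame sequence and return the final hidden state."""
    n_hidden = len(weights[0])
    h = [0.0] * n_hidden  # hidden state
    c = [0.0] * n_hidden  # cell state
    for x in frames:
        z = list(x) + h + [1.0]  # current frame, previous hidden state, bias input
        pre = [[sum(w * v for w, v in zip(row, z)) for row in gate]
               for gate in weights]
        i = [sigmoid(a) for a in pre[0]]
        f = [sigmoid(a) for a in pre[1]]
        o = [sigmoid(a) for a in pre[2]]
        g = [math.tanh(a) for a in pre[3]]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h


def combined_representation(frames, weights):
    """Concatenate the sequence-level (LSTM) and utterance-level (functional) vectors."""
    return lstm_last_state(frames, weights) + functionals(frames)


# Hypothetical toy utterance: 5 frames x 2 normalized descriptors, rising over time.
rising = [[0.1, 0.2], [0.2, 0.3], [0.4, 0.5], [0.7, 0.6], [0.9, 0.8]]
falling = rising[::-1]  # same frames in reverse temporal order

weights = make_lstm(n_in=2, n_hidden=3)

print("functionals (rising): ", functionals(rising))
print("functionals (falling):", functionals(falling))
print("LSTM state (rising):  ", lstm_last_state(rising, weights))
print("LSTM state (falling): ", lstm_last_state(falling, weights))
print("combined length:      ", len(combined_representation(rising, weights)))
```

Reversing the frame order leaves the functionals unchanged, because they summarize each descriptor without regard to time, whereas the recurrent state depends on the order of the frames. In a full system along the lines of the abstract, the recurrent network would be trained, the functional vector would be reduced with PCA, and the concatenated vector would be passed to a logistic regression classifier.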


Emotion classification, Long Short-Term Memory, Logistic regression, Principal Component Analysis



This work was financially supported by the Ministry of Education and Science of the Russian Federation (contract 14.575.21.0132, ID RFMEFI57517X0132), as well as by the Council for grants of the President of the Russian Federation (project № MD–254.2017.8) and by the RFBR (project № 16-3760100).



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. ITMO University, St. Petersburg, Russia
  2. SPIIRAS Institute, St. Petersburg, Russia
