Fisher Kernels on Phase-Based Features for Speech Emotion Recognition

Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 427)


Incorporating affect information into a spoken dialogue system can increase user-friendliness and make the interaction experience more natural. This can be achieved by speech emotion recognition, where the features used are typically dominated by spectral amplitude information, while the phase spectrum is ignored. In this chapter, we propose to use phase-based features to build such an emotion recognition system. To exploit these features, we employ Fisher kernels. This technique encodes the phase-based features by their deviation from a generative Gaussian mixture model. The resulting representation is used to train a classification model with a linear kernel classifier. Experimental results on the GeWEC database, which includes both ‘normal’ and whispered phonation, demonstrate the effectiveness of our method.
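The Fisher-kernel encoding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it fits a diagonal-covariance Gaussian mixture model to pooled frame-level features, encodes each variable-length utterance as the normalized gradient of the GMM log-likelihood with respect to the component means (a common simplification of the full Fisher vector), and trains a linear classifier on the result. The toy data, feature dimensionality, and all parameter choices are assumptions for demonstration only; the chapter's actual features are phase-based (e.g. modified group delay) descriptors.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def fisher_vector(frames, gmm):
    """Encode a variable-length frame sequence (T, D) as a fixed-length
    Fisher vector: gradients of the GMM log-likelihood w.r.t. the
    component means, with power and L2 normalization."""
    T, _ = frames.shape
    gamma = gmm.predict_proba(frames)                # responsibilities (T, K)
    mu = gmm.means_                                  # (K, D)
    sigma = np.sqrt(gmm.covariances_)                # (K, D), diagonal GMM
    w = gmm.weights_                                 # (K,)
    diff = (frames[:, None, :] - mu[None, :, :]) / sigma[None, :, :]
    grad_mu = (gamma[:, :, None] * diff).sum(axis=0) # (K, D)
    grad_mu /= T * np.sqrt(w)[:, None]               # Fisher normalization
    fv = grad_mu.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))           # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)         # L2 normalization

# Toy stand-in for frame-level phase features of two emotion classes.
rng = np.random.default_rng(0)
utterances, labels = [], []
for label, shift in [(0, 0.0), (1, 2.0)]:
    for _ in range(20):
        T = int(rng.integers(30, 60))                # variable utterance length
        utterances.append(rng.normal(shift, 1.0, size=(T, 4)))
        labels.append(label)

# Generative model on pooled frames, then one Fisher vector per utterance.
gmm = GaussianMixture(n_components=3, covariance_type="diag",
                      random_state=0).fit(np.vstack(utterances))
X = np.array([fisher_vector(u, gmm) for u in utterances])

# Linear-kernel classifier on the fixed-length encodings.
clf = LinearSVC(C=1.0).fit(X, labels)
print(clf.score(X, labels))
```

Because the Fisher vector is a fixed-length, approximately linearized representation, an efficient linear classifier (e.g. LIBLINEAR-style SVM) suffices downstream, which is the motivation for pairing Fisher kernels with linear models in the chapter.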


Speech emotion recognition · Phase-based features · Fisher kernels · Modified group delay features



This work has been partially supported by the BMBF IKT2020-Grant under grant agreement No. 16SV7213 (EmotAsS) and the European Community's Seventh Framework Programme through the ERC Starting Grant No. 338164 (iHEARu).



Copyright information

© Springer Science+Business Media Singapore 2017

Authors and Affiliations

  1. Chair of Complex & Intelligent Systems, University of Passau, Passau, Germany
  2. Machine Intelligence & Signal Processing Group, MMK, Technische Universität München, Munich, Germany
  3. Swiss Center for Affective Sciences, University of Geneva, Geneva, Switzerland
  4. Department of Computing, Imperial College London, London, UK
