Multi-stage classification of emotional speech motivated by a dimensional emotion model

  • Zhongzhe Xiao
  • Emmanuel Dellandrea
  • Weibei Dou
  • Liming Chen


This paper deals with speech emotion analysis within the context of increasing awareness of the wide application potential of affective computing. Unlike most works in the literature, which mainly rely on classical frequency- and energy-based features along with a single global classifier for emotion recognition, we propose new harmonic and Zipf-based features for better speech emotion characterization in the valence dimension, together with a multi-stage classification scheme driven by a dimensional emotion model for better emotional class discrimination. Evaluated on the Berlin dataset with 68 features and six emotion states, our approach proves effective, achieving a 68.60% classification rate, which rises to 71.52% when a gender classification is applied first. On the DES dataset with five emotion states, our approach achieves an 81% recognition rate, whereas the best performance reported in the literature, to our knowledge, is 76.15% on the same dataset.
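The multi-stage scheme described above can be sketched as a simple hierarchical dispatch: a gender classifier first routes each utterance to a gender-specific emotion classifier. The feature names, thresholds, and decision rules below are illustrative placeholders, not the paper's actual harmonic and Zipf features or trained models.

```python
# Hypothetical sketch of a multi-stage (gender-first) emotion classifier.
# All feature names and thresholds are assumptions for illustration only.

def classify_gender(features):
    # Placeholder cue: mean pitch is a common gender discriminator in speech.
    return "female" if features["mean_pitch_hz"] > 165.0 else "male"

def classify_emotion(features, gender):
    # Placeholder per-gender rules; in the paper these stages would be
    # trained classifiers over harmonic and Zipf features, organized
    # along the arousal and valence axes of a dimensional emotion model.
    arousal_high = features["energy"] > 0.5
    valence_positive = features["harmonic_ratio"] > 0.5
    if arousal_high and valence_positive:
        return "happiness"
    if arousal_high:
        return "anger"
    if valence_positive:
        return "neutral"
    return "sadness"

def multi_stage_classify(features):
    # Stage 1: gender; Stage 2: gender-conditioned emotion decision.
    gender = classify_gender(features)
    return gender, classify_emotion(features, gender)
```

The point of the hierarchy is that each later stage only has to separate a smaller, more homogeneous set of classes, which is the intuition behind the reported gain from adding the gender stage.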


Keywords: Emotional speech · Harmonic feature · Zipf feature · Dimensional emotion model · Multi-stage classification



This work was supported by a scholarship awarded by the French government from 2004 to 2007, and partly by the PRA project Apollo (number SI04-02) and a PICS grant from CNRS (number 3597).



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Zhongzhe Xiao (1)
  • Emmanuel Dellandrea (1)
  • Weibei Dou (2)
  • Liming Chen (1)

  1. LIRIS Laboratory, UMR 5205, CNRS, Université de Lyon, Ecole Centrale de Lyon, Ecully Cedex, France
  2. Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing, People's Republic of China
