On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues

  • Florian Eyben
  • Martin Wöllmer
  • Alex Graves
  • Björn Schuller
  • Ellen Douglas-Cowie
  • Roddy Cowie
Original Paper


For many applications of emotion recognition, such as virtual agents, the system must select responses while the user is speaking. This requires reliable on-line recognition of the user’s affect. However most emotion recognition systems are based on turnwise processing. We present a novel approach to on-line emotion recognition from speech using Long Short-Term Memory Recurrent Neural Networks. Emotion is recognised frame-wise in a two-dimensional valence-activation continuum. In contrast to current state-of-the-art approaches, recognition is performed on low-level signal frames, similar to those used for speech recognition. No statistical functionals are applied to low-level feature contours. Framing at a higher level is therefore unnecessary and regression outputs can be produced in real-time for every low-level input frame. We also investigate the benefits of including linguistic features on the signal frame level obtained by a keyword spotter.

Continuous emotion recognition Recurrent neural nets Long short-term memory Affective databases 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Batliner A, Steidl S, Nöth E (2008) Releasing a thoroughly annotated and processed spontaneous emotional database: the FAU Aibo Emotion Corpus. In: Deviller  L, Martin JC, Cowie R, Douglas-Cowie E, Batliner A (eds) Proc. of a satellite workshop of LREC 2008 on corpora for research on emotion and affect, pp 28–31. Marrakesh Google Scholar
  2. 2.
    Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166 CrossRefGoogle Scholar
  3. 3.
    Burkhardt F, Paeschke A, Rolfes M, Sendlmeier W, Weiss B (2005) A database of German emotional speech. In: Proc. of interspeech, pp 1517–1520. Lisbon, Portugal Google Scholar
  4. 4.
    Caridakis G, Malatesta L, Kessous L, Amir N, Raouzaiou A, Karpouzis K (2006) Modeling naturalistic affective states via facial and vocal expressions recognition. In: Proc. of the 8th international conference on multimodal interfaces, pp 146–154. Banff, Alberta, Canada, Google Scholar
  5. 5.
    Castellano G, Kessous L, Caridakis G (2008) Emotion recognition through multiple modalities: face, body gesture, speech. In: Peter C, Beale R (eds) Affect and emotion in human-computer interaction. Springer, Berlin, pp 92–103 CrossRefGoogle Scholar
  6. 6.
    Cowie R, Douglas-Cowie E, Savvidou S, McMahon E, Sawey M, Schröder M (2000) Feeltrace: an instrument for recording perceived emotion in real time. In: Proceedings of the ISCA workshop on speech and emotion, pp 19–24 Google Scholar
  7. 7.
    Douglas-Cowie E, Cowie R, Sneddon I, Cox C, Lowry O, McRorie M, Martin JC, Devillers L, Abrilian S, Batliner A, Amir N, Karpouzis K (2007) The HUMAINE database. In: Proc. of ACII, pp 488–500 Google Scholar
  8. 8.
    Eyben F, Wöllmer M, Schuller B (2009) openEAR—introducing the Munich Open-source Emotion and Affect Recognition Toolkit. In: Proc. of ACII, pp 576–581. Amsterdam, The Netherlands Google Scholar
  9. 9.
    Fernandez S, Graves A, Schmidhuber J (2007) An application of recurrent neural networks to discriminative keyword spotting. In: Proc. of ICANN, pp 220–229. Porto, Portugal Google Scholar
  10. 10.
    Fernandez S, Graves A, Schmidhuber J (2008) Phoneme recognition in TIMIT with BLSTM-CTC. Tech. rep., IDSIA Google Scholar
  11. 11.
    Graves A (2008) Supervised sequence labelling with recurrent neural networks. Ph.D. thesis, Technische Universität München Google Scholar
  12. 12.
    Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 18(5–6):602–610 CrossRefGoogle Scholar
  13. 13.
    Graves A, Fernandez S, Schmidhuber J (2005) Bidirectional LSTM networks for improved phoneme classification and recognition. In: Proceedings of ICANN, vol 18. Warsaw, Poland, pp 602–610 Google Scholar
  14. 14.
    Graves A, Fernandez S, Liwicki M, Bunke H, Schmidhuber J (2008) Unconstrained online handwriting recognition with recurrent neural networks. Adv Neural Inf Process Syst Google Scholar
  15. 15.
    Grimm M, Kroschel K, Narayanan S (2007) Support vector regression for automatic recognition of spontaneous emotions in speech. In: Proc. of ICASSP, pp 1085–1088 Google Scholar
  16. 16.
    Grimm M, Kroschel K, Narayanan S (2008) The vera am mittag german audio-visual emotional speech database. In: Proc. of ICME, pp 865–868. Hannover, Germany Google Scholar
  17. 17.
    Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780 CrossRefGoogle Scholar
  18. 18.
    Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proc. of ECML, pp 137–142. Chemniz, Germany Google Scholar
  19. 19.
    Lang KJ, Waibel AH, Hinton GE (1990) A time-delay neural network architecture for isolated word recognition. Neural Netw 3(1):23–43 CrossRefGoogle Scholar
  20. 20.
    Lin T, Horne BG, Tino P, Giles CL (1996) Learning long-term dependencies in NARX recurrent neural networks. IEEE Trans Neural Netw 7(6):1329–1338 CrossRefGoogle Scholar
  21. 21.
    Liwicki M, Graves A, Fernandez S, Bunke H, Schmidhuber J (2007) A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks. In: Proc. of ICDAR, pp 367–371. Curitiba, Brazil Google Scholar
  22. 22.
    Peters C, O’Sullivan C (2002) Synthetic vision and memory for autonomous virtual humans. Comput Graph Forum 21(4):743–753 CrossRefGoogle Scholar
  23. 23.
    Riedmiller M, Braun H (1993) A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In: IEEE international conference on neural networks, pp 586–591 Google Scholar
  24. 24.
    Schaefer AM, Udluft S, Zimmermann HG (2008) Learning long-term dependencies with recurrent neural networks. Neurocomputing 71(13–15):2481–2488 CrossRefGoogle Scholar
  25. 25.
    Schmidhuber J (1992) Learning complex extended sequences using the principle of history compression. Neural Comput 4(2):234–242 CrossRefGoogle Scholar
  26. 26.
    Schröder M, Devillers L, Karpouzis K, Martin JC, Pelachaud C, Peter C, Pirker H, Schuller B, Tao J, Wilson I (2007) What should a generic emotion markup language be able to represent? In: Paiva A, Prada R, Picard RW (eds) Affective computing and intelligent interaction. Springer, Berlin, pp 440–451 CrossRefGoogle Scholar
  27. 27.
    Schröder M, Cowie R, Heylen D, Pantic M, Pelachaud C, Schuller B (2008) Towards responsive sensitive artificial listeners. In: Proc. of 4th intern. workshop on human-computer conversation. Bellagio, Italy Google Scholar
  28. 28.
    Schuller B, Rigoll G (2006) Timing levels in segment-based speech emotion recognition. In: Proc. of interspeech, pp 1818–1821. Pittsburgh, PA, USA Google Scholar
  29. 29.
    Schuller B, Rigoll G, Lang M (2003) Hidden Markov model-based speech emotion recognition. In: Proc. of ICASSP, pp 1–4. Hong Kong, China Google Scholar
  30. 30.
    Schuller B, Reiter S, Rigoll G (2006) Evolutionary feature generation in speech emotion recognition. In: Proc. of ICME, pp 5–8. Toronto, Canada Google Scholar
  31. 31.
    Schuller B, Vlasenko B, Minguez R, Rigoll G, Wendemuth A (2007) Comparing one and two-stage acoustic modeling in the recognition of emotion in speech. In: Proc. of ASRU, pp 596–600. Kyoto, Japan Google Scholar
  32. 32.
    Schuller B, Wimmer M, Mösenlechner L, Kern C, Arsic D, Rigoll G (2008) Brute-forcing hierarchical functionals for paralinguistics: A waste of feature space? In: Proc. of ICASSP, pp 4501–4504. Las Vegas, Nevada, USA Google Scholar
  33. 33.
    Schuller B, Müller R, Eyben F, Gast J, Hörnler B, Wöllmer M, Rigoll G, Höthker A, Konosu H (2009) Being bored? Recognising natural interest by extensive audiovisual integration for real-life application. Image Vis Comput J 27(12):1760–1774. Special issue on visual and multimodal analysis of human spontaneous behavior CrossRefGoogle Scholar
  34. 34.
    Schuller B, Steidl S, Batliner A (2009) The Interspeech 2009 emotion challenge. In: Proc. of interspeech, pp 312–315. Brighton, UK Google Scholar
  35. 35.
    Schuller B, Vlasenko B, Eyben F, Rigoll G, Wendemuth A (2009) Acoustic emotion recognition: A benchmark comparison of performances. In: Proc. of ASRU 2009. Merano, Italy Google Scholar
  36. 36.
    Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Proc 45:2673–2681 CrossRefGoogle Scholar
  37. 37.
    Seppi D, Batliner A, Schuller B, Steidl S, Vogt T, Wagner J, Devillers L, Vidrascu L, Amir N, Aharonson V (2008) Patterns, prototypes, performance: classifying emotional user states. In: Proc. of interspeech, pp 601–604. Brisbane, Australia Google Scholar
  38. 38.
    Steidl S (2009) Automatic classification of emotion-related user states in spontaneous children’s speech. Logos, Berlin Google Scholar
  39. 39.
    Steininger S, Schiel F, Dioubina O, Raubold S (2002) Development of user-state conventions for the multimodal corpus in smartkom. In: Workshop on multimodal resources and multimodal systems evaluation, pp 33–37. Las Palmas Google Scholar
  40. 40.
    Streit M, Batliner A, Portele T (2006) Emotions analysis and emotion-handling subdialogues. In: Wahlster W (ed) SmartKom: foundations of multimodal dialogue systems. Springer, Berlin, pp 317–332 CrossRefGoogle Scholar
  41. 41.
    Vlasenko B, Schuller B, Wendemuth A, Rigoll G (2007) Frame vs. turn-level: Emotion recognition from speech considering static and dynamic processing. In: Paiva A (ed) Proc. of ACII, pp 139–147. Lisbon, Portugal Google Scholar
  42. 42.
    Werbos P (1990) Backpropagation through time: What it does and how to do it. Proc IEEE 78:1550–1560 CrossRefGoogle Scholar
  43. 43.
    Witten IH, Frank E (2005) Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco, zbMATHGoogle Scholar
  44. 44.
    Wöllmer M, Eyben F, Reiter S, Schuller B, Cox C, Douglas-Cowie E, Cowie R (2008) Abandoning emotion classes—towards continuous emotion recognition with modelling of long-range dependencies. In: Proc. of interspeech, pp 597–600. Brisbane, Australia Google Scholar
  45. 45.
    Wöllmer M, Al-Hames M, Eyben F, Schuller B, Rigoll G (2009) A multidimensional dynamic time warping algorithm for efficient multimodal fusion of asynchronous data streams. Neurocomputing 73:366–380 CrossRefGoogle Scholar
  46. 46.
    Wöllmer M, Eyben F, Keshet J, Graves A, Schuller B, Rigoll G (2009) Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional LSTM networks. In: Proc. of ICASSP, pp 3949–3952. Taipei, Taiwan Google Scholar
  47. 47.
    Wöllmer M, Eyben F, Schuller B, Douglas-Cowie E, Cowie R (2009) Data-driven clustering in emotional space for affect recognition using discriminatively trained LSTM networks. In: Proc. of interspeech, pp 1595–1598. Brighton, UK Google Scholar
  48. 48.
    Wöllmer M, Eyben F, Schuller B, Rigoll G (2009) Robust vocabulary independent keyword spotting with graphical models. In: Proc. of ASRU 2009. Merano, Italy Google Scholar
  49. 49.
    Wöllmer M, Eyben F, Schuller B, Sun Y, Moosmayr T, Nguyen-Thien N (2009) Robust in-car spelling recognition—a tandem BLSTM-HMM approach. In: Proc. of interspeech, pp 2507–2510. Brighton, UK Google Scholar
  50. 50.
    Zeng Z, Pantic M, Roisman GI, Huang TS (2009) A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Trans Pattern Anal Mach Intell 31(1):39–58 CrossRefGoogle Scholar

Copyright information

© OpenInterface Association 2009

Authors and Affiliations

  • Florian Eyben
    • 1
  • Martin Wöllmer
    • 1
  • Alex Graves
    • 2
  • Björn Schuller
    • 1
  • Ellen Douglas-Cowie
    • 3
  • Roddy Cowie
    • 3
  1. 1.Institute for Human-Machine CommunicationTechnische Universität MünchenMunichGermany
  2. 2.Institute for Computer Science VITechnische Universität MünchenMunichGermany
  3. 3.School of PsychologyQueen’s UniversityBelfastUK

Personalised recommendations