Companion Technology, pp. 409–428
Emotion Recognition from Speech
Abstract
Spoken language is one of the main interaction patterns in human-human as well as in natural, companion-like human-machine interactions. Speech conveys content, but also emotions and interaction patterns that determine the nature and quality of the user’s relationship to their counterpart. Hence, we consider emotion recognition from speech in the wider context of its application in Companion-systems. This requires a dedicated annotation process to label emotions and to describe their temporal evolution, so that a system’s reaction can be properly regulated and controlled. This problem is particular to naturalistic interactions, where the emotional labels are no longer given a priori. It calls for generating a reliable ground truth and measuring its reliability, where the measurement is closely related to the use of appropriate emotional features and classification techniques. Further, acted and naturalistic spoken data have to be available in operational form (corpora) for the development of emotion classification; we address the difficulties arising from the variety of these data sources. Speaker clustering and speaker adaptation will also improve the emotional modeling. Additionally, combining the acoustic affective evaluation with the interpretation of non-verbal interaction patterns will lead to a better understanding of, and reaction to, user-specific emotional behavior.
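To make the notion of measuring a reliable ground truth more concrete, the following minimal sketch computes Fleiss' kappa, one widely used chance-corrected agreement coefficient for several annotators and nominal labels. It is an illustration only: the abstract does not state which reliability measure is used, and the function name `fleiss_kappa`, the label names, and the toy annotations are assumptions introduced here.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for nominal labels (illustrative helper, not from the chapter).

    ratings: one list per annotated item (e.g. an utterance), holding the
    label chosen by each annotator. Every item needs the same number of
    annotators, and at least two.
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("no items given")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Agreement expected by chance from the overall label distribution.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotation: three utterances, two annotators each.
toy = [["positive", "positive"],
       ["negative", "negative"],
       ["positive", "negative"]]
kappa = fleiss_kappa(toy)
print(round(kappa, 3))
```

Values near 1 indicate agreement well above chance, values near 0 indicate agreement no better than chance. For naturalistic emotion annotation, where labels are not given a priori, such a coefficient is one way to quantify how far the labels can be trusted as ground truth.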
Notes
Acknowledgements
This work was done within the Transregional Collaborative Research Centre SFB/TRR 62 “Companion-Technology for Cognitive Technical Systems” funded by the German Research Foundation (DFG).