Multimodal Affect Recognition in the Context of Human-Computer Interaction for Companion-Systems

  • Friedhelm Schwenker
  • Ronald Böck
  • Martin Schels
  • Sascha Meudt
  • Ingo Siegert
  • Michael Glodek
  • Markus Kächele
  • Miriam Schmidt-Wack
  • Patrick Thiam
  • Andreas Wendemuth
  • Gerald Krell
Part of the Cognitive Technologies book series (COGTECH)


Humans generally interact with each other using multiple modalities. The main channels are speech, facial expressions, and gestures. Bio-physiological data such as biopotentials can also convey valuable information that helps to interpret the communication. A Companion-System can use these modalities to make human-computer interaction (HCI) more efficient. To do so, the technical system must analyze and combine the multiple sources. So far, however, only a few published studies deal with the fusion of three or more such modalities. This chapter addresses the processing steps needed to develop a multimodal system that applies fusion approaches.

Two tools, ATLAS and ikannotate, are presented; both are designed for pre-analyzing multimodal data streams and labeling their relevant parts. ATLAS displays raw data, extracted features, and even the outputs of pre-trained classifier modules. It also integrates annotation, transcription, and an active learning module. The ikannotate tool can be used directly for transcription and for guided, step-wise emotional annotation of multimodal data. It includes the three most widely used annotation paradigms: basic emotions, the Geneva emotion wheel, and self-assessment manikins (SAMs). Annotators using ikannotate can also assign an uncertainty to each sample.

The classifier architecture must realize a fusion system that combines the multiple modalities. A large number of machine learning approaches were evaluated, including data-, feature-, score-, and decision-level fusion schemes as well as temporal fusion architectures and partially supervised learning.
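To make the difference between two of the fusion levels named above concrete, the following minimal sketch contrasts score-level fusion (combine per-class scores, then decide) with decision-level fusion (decide per modality, then vote). It is an illustration under stated assumptions, not the architecture evaluated in the chapter: the function names, the modality names, the class labels and all score values are hypothetical.

```python
from collections import Counter


def score_level_fusion(scores_per_modality, weights=None):
    """Combine per-class scores by a weighted average, then pick the best class."""
    n = len(scores_per_modality)
    if weights is None:
        weights = [1.0 / n] * n  # assumption: equal trust in every modality
    classes = list(scores_per_modality[0].keys())
    fused = {
        c: sum(w * s[c] for w, s in zip(weights, scores_per_modality))
        for c in classes
    }
    return max(fused, key=fused.get), fused


def decision_level_fusion(scores_per_modality):
    """Let each modality decide on its own top class, then take a majority vote."""
    votes = [max(s, key=s.get) for s in scores_per_modality]
    return Counter(votes).most_common(1)[0][0], votes


# Hypothetical classifier outputs for one time window (values are made up).
audio = {"positive": 0.55, "negative": 0.45}
video = {"positive": 0.60, "negative": 0.40}
physio = {"positive": 0.05, "negative": 0.95}
modalities = [audio, video, physio]

score_label, fused_scores = score_level_fusion(modalities)
vote_label, votes = decision_level_fusion(modalities)

print("score-level:", score_label, fused_scores)
print("decision-level:", vote_label, votes)
```

With these made-up values the two schemes disagree: the single, very confident physiological score dominates the averaged scores, whereas the majority vote follows the two weakly confident audio and video decisions. This is one reason why the choice of fusion level matters when modalities differ in reliability.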

The proposed methods are evaluated either on multimodal benchmark corpora or on the datasets of the Transregional Collaborative Research Centre SFB/TRR 62, namely the Last Minute Corpus and the EmoRec Dataset. We also present results achieved in international challenges.



We thank our highly regarded late colleague and friend, Prof. Dr. Bernd Michaelis, who contributed to the SFB on various topics and provided well-informed suggestions. This work was done within the Transregional Collaborative Research Centre SFB/TRR 62 “Companion-Technology for Cognitive Technical Systems”, funded by the German Research Foundation (DFG).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Friedhelm Schwenker (1), corresponding author
  • Ronald Böck (2)
  • Martin Schels (1)
  • Sascha Meudt (1)
  • Ingo Siegert (2)
  • Michael Glodek (1)
  • Markus Kächele (1)
  • Miriam Schmidt-Wack (1)
  • Patrick Thiam (1)
  • Andreas Wendemuth (2, 3)
  • Gerald Krell (4)

  1. Institute for Neural Information Processing, University of Ulm, Ulm, Germany
  2. Cognitive Systems Group, Institute for Information Technology and Communications, Otto von Guericke University, Magdeburg, Germany
  3. Center for Behavioral Brain Sciences, Magdeburg, Germany
  4. Technical Computer Science Group, Institute for Information Technology and Communications, Otto von Guericke University, Magdeburg, Germany
