Fusion of Audio- and Visual Cues for Real-Life Emotional Human Robot Interaction

  • Ahmad Rabie
  • Uwe Handmann
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6835)

Abstract

Recognition of emotions from multimodal cues is of fundamental interest for the design of many adaptive interfaces in human-machine interaction (HMI) in general and human-robot interaction (HRI) in particular. It provides a means to incorporate non-verbal feedback in the course of interaction. Humans express their emotional and affective state rather unconsciously, exploiting their different natural communication modalities such as body language, facial expression and prosodic intonation. In order to achieve applicability in realistic HRI settings, we develop person-independent affective models. In this paper, we present a study on multimodal recognition of emotions from such auditory and visual cues for interaction interfaces. We recognize six classes of basic emotions plus a neutral class for talking persons. The focus lies on the simultaneous online visual and acoustic analysis of speaking faces. A probabilistic decision-level fusion scheme based on Bayesian networks is applied to exploit the complementary information of the acoustic and the visual cues. We compare the performance of our state-of-the-art recognition systems for the separate modalities with the improved results obtained after applying our fusion scheme, both on the DaFEx database and on real-life data captured directly from the robot. We furthermore discuss the results with regard to the theoretical background and future applications.
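As a rough illustration of the decision-level fusion idea described in the abstract (the paper's actual Bayesian network structure and parameters are not reproduced here), the sketch below combines per-modality class posteriors for the six basic emotions plus neutral under a simple conditional-independence assumption. The `fuse_posteriors` helper and the example scores are hypothetical and serve only to show how complementary unimodal decisions can be merged into one class estimate.

```python
# Minimal sketch of decision-level audio-visual emotion fusion.
# Assumption: each unimodal classifier outputs a posterior over the
# 7 classes (6 basic emotions + neutral). The Bayes-style product rule
# below is an illustration, not the paper's exact fusion network.

EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "neutral"]

def fuse_posteriors(p_audio, p_visual, prior=None):
    """Fuse audio and visual class posteriors at decision level.

    p_audio, p_visual: dicts mapping emotion -> P(emotion | modality cue).
    prior: dict mapping emotion -> P(emotion); uniform if None.
    Returns the fused, normalized posterior as a dict.
    """
    if prior is None:
        prior = {e: 1.0 / len(EMOTIONS) for e in EMOTIONS}
    # Conditional-independence assumption:
    # P(e | audio, visual) is proportional to P(e|audio) * P(e|visual) / P(e)
    fused = {e: p_audio[e] * p_visual[e] / prior[e] for e in EMOTIONS}
    total = sum(fused.values())
    return {e: p / total for e, p in fused.items()}

if __name__ == "__main__":
    # Hypothetical outputs of the two unimodal classifiers for one utterance.
    audio = {"anger": 0.30, "disgust": 0.05, "fear": 0.05, "happiness": 0.10,
             "sadness": 0.25, "surprise": 0.05, "neutral": 0.20}
    visual = {"anger": 0.40, "disgust": 0.10, "fear": 0.05, "happiness": 0.05,
              "sadness": 0.15, "surprise": 0.05, "neutral": 0.20}
    fused = fuse_posteriors(audio, visual)
    print(max(fused, key=fused.get), fused)
```

In this toy example the two modalities individually lean toward different degrees of anger versus sadness; the product rule reinforces the class on which both cues agree, which is the intended benefit of fusing complementary acoustic and visual evidence.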


Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Ahmad Rabie 1
  • Uwe Handmann 1
  1. Institute of Informatics, University of Applied Sciences (HRW), Mülheim & Bottrop, Germany