Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis

  • Original Paper
  • Published in: Journal on Multimodal User Interfaces

Abstract

This paper presents a study on multimodal automatic emotion recognition during speech-based interaction. A database was constructed of people pronouncing a sentence in a scenario where they interacted with an agent using speech. Ten people, with gender equally represented and several native languages including French, German, Greek and Italian, each pronounced a sentence corresponding to a command while making eight different emotional expressions. Facial expression, gesture and acoustic analysis of speech were used to extract features relevant to emotion. A system based on a Bayesian classifier was used for the automatic classification of unimodal, bimodal and multimodal data. After classifying each modality automatically, the modalities were combined in a multimodal approach: fusion at the feature level (concatenating the features before running the classifier) was compared with fusion at the results level (combining the outputs of the classifier for each modality). Fusing the multimodal data produced a large increase in recognition rates over the unimodal systems: the multimodal approach improved the recognition rate by more than 10% compared with the most successful unimodal system. Bimodal emotion recognition based on all combinations of the modalities (i.e., ‘face-gesture’, ‘face-speech’ and ‘gesture-speech’) was also investigated. The results show that the best pairing is ‘gesture-speech’; using all three modalities yielded a further 3.3% classification improvement over the best bimodal result.
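The fusion comparison in the abstract can be made concrete with a small sketch. The code below is a minimal illustration, not the authors' implementation: the synthetic feature arrays and their dimensions, the sample count, the use of scikit-learn's GaussianNB as the Bayesian classifier, and the product rule for combining per-modality posteriors are all assumptions made for the example.

```python
# Minimal sketch of feature-level vs. results-level fusion with a
# Bayesian classifier. All data here are random placeholders, so the
# printed accuracies hover around chance (1/8); the point is only the
# structure of the two fusion schemes.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_samples, n_classes = 240, 8               # hypothetical corpus size, 8 emotions
y = rng.integers(0, n_classes, size=n_samples)
face = rng.normal(size=(n_samples, 10))      # hypothetical per-modality features
gesture = rng.normal(size=(n_samples, 6))
speech = rng.normal(size=(n_samples, 12))

# Feature-level fusion: concatenate all modality features and train
# one classifier on the combined vector.
fused = np.hstack([face, gesture, speech])
pred_feature = cross_val_predict(GaussianNB(), fused, y, cv=10)

# Results-level fusion: one classifier per modality; combine the
# per-modality class posteriors with a product rule, then take argmax.
posteriors = [
    cross_val_predict(GaussianNB(), X, y, cv=10, method="predict_proba")
    for X in (face, gesture, speech)
]
pred_results = np.argmax(np.prod(posteriors, axis=0), axis=1)

print("feature-level accuracy:", np.mean(pred_feature == y))
print("results-level accuracy:", np.mean(pred_results == y))
```

Note that results-level fusion also admits other combination rules (e.g. majority voting or weighted sums of posteriors), which is one reason the two schemes can rank differently on real data.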

Author information


Corresponding author

Correspondence to Loic Kessous.

About this article

Cite this article

Kessous, L., Castellano, G. & Caridakis, G. Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis. J Multimodal User Interfaces 3, 33–48 (2010). https://doi.org/10.1007/s12193-009-0025-5
