Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis

  • Loic Kessous
  • Ginevra Castellano
  • George Caridakis
Original Paper


In this paper, a study on multimodal automatic emotion recognition during a speech-based interaction is presented. A database was constructed of people pronouncing a sentence in a scenario where they interacted with an agent using speech. Ten subjects pronounced a sentence corresponding to a command while making eight different emotional expressions. Gender was equally represented, and the subjects spoke several different native languages, including French, German, Greek and Italian. Facial expressions, body gestures and acoustic analysis of speech were used to extract features relevant to emotion. For the automatic classification of unimodal, bimodal and multimodal data, a system based on a Bayesian classifier was used. After performing an automatic classification of each modality, the different modalities were combined using a multimodal approach: fusion of the modalities at the feature level (before running the classifier) was compared with fusion at the results level (combining the results of the classifiers for each modality). Fusing the multimodal data resulted in a large increase in recognition rates compared with the unimodal systems: the multimodal approach improved the recognition rate by more than 10% over the most successful unimodal system. Bimodal emotion recognition based on all combinations of the modalities (i.e., ‘face-gesture’, ‘face-speech’ and ‘gesture-speech’) was also investigated. The results show that the best pairing is ‘gesture-speech’. Using all three modalities yielded a 3.3% improvement in classification over the best bimodal results.
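To make the two fusion strategies concrete, the following is a minimal sketch in Python of feature-level versus results-level (decision-level) fusion. It assumes scikit-learn's GaussianNB as an illustrative stand-in for the paper's Bayesian classifier, and uses random placeholder feature matrices and a product combination rule; these are not the authors' actual features or their exact fusion scheme.

```python
# Sketch of feature-level vs. decision-level fusion for 8 emotion classes.
# GaussianNB is a stand-in for the paper's Bayesian classifier; the feature
# matrices below are random placeholders, not real face/gesture/speech features.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n_samples, n_classes = 80, 8                      # 8 emotions, as in the study
y = rng.integers(0, n_classes, n_samples)         # emotion labels

# Placeholder per-modality feature matrices (dimensions are illustrative).
face    = rng.normal(size=(n_samples, 10))
gesture = rng.normal(size=(n_samples, 6))
speech  = rng.normal(size=(n_samples, 12))

# Feature-level fusion: concatenate all features, train a single classifier.
X_fused = np.hstack([face, gesture, speech])
feature_level = GaussianNB().fit(X_fused, y)
pred_feature = feature_level.predict(X_fused)

# Decision-level fusion: one classifier per modality, then combine posteriors.
models = [GaussianNB().fit(X, y) for X in (face, gesture, speech)]
posteriors = [m.predict_proba(X) for m, X in zip(models, (face, gesture, speech))]
# Multiply class posteriors across modalities (one common combination rule;
# the abstract does not specify the authors' exact rule).
combined = np.prod(posteriors, axis=0)
pred_decision = combined.argmax(axis=1)
```

In this setup, feature-level fusion lets the classifier model cross-modal feature dependencies, while decision-level fusion keeps each modality's classifier independent and only merges their outputs, which is why the two approaches can yield different recognition rates.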

Keywords: Affective body language · Affective speech · Facial expression · Emotion recognition · Multimodal fusion



Copyright information

© OpenInterface Association 2009

Authors and Affiliations

  • Loic Kessous (1)
  • Ginevra Castellano (2)
  • George Caridakis (3)
  1. Marseille, France
  2. Department of Computer Science, School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK
  3. Image, Video and Multimedia Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
