Multimedia Tools and Applications

, Volume 49, Issue 2, pp 277–297 | Cite as

Multimodal information fusion application to human emotion recognition from face and speech

  • Muharram Mansoorizadeh
  • Nasrollah Moghaddam Charkari


A multimedia content is composed of several streams that carry information in audio, video or textual channels. Classification and clustering multimedia contents require extraction and combination of information from these streams. The streams constituting a multimedia content are naturally different in terms of scale, dynamics and temporal patterns. These differences make combining the information sources using classic combination techniques difficult. We propose an asynchronous feature level fusion approach that creates a unified hybrid feature space out of the individual signal measurements. The target space can be used for clustering or classification of the multimedia content. As a representative application, we used the proposed approach to recognize basic affective states from speech prosody and facial expressions. Experimental results over two audiovisual emotion databases with 42 and 12 subjects revealed that the performance of the proposed system is significantly higher than the unimodal face based and speech based systems, as well as synchronous feature level and decision level fusion approaches.


Multimodal feature extraction Multimodal information fusion Human computer interaction Multimodal emotion recognition 



The authors would like to express their sincere thanks to professor Kabir, professor Khaki, Mr Massoud Kimyaei and Colleagues for their valuable help and suggestions.


  1. 1.
    Bassili JN (1979) Emotion recognition: the role of facial movement and the relative importance of upper and lower areas of the face. J Pers Soc Psychol 37(11):2049–2058CrossRefGoogle Scholar
  2. 2.
    Black MJ, Yacoob Y (1997) Recognizing facial expressions in image sequences using local parameterized models of image motion. Int J Comput Vis 25:23–48CrossRefGoogle Scholar
  3. 3.
    Boehner K, DePaula R, Dourish P, Sengers P (2007) How emotion is made and measured. Int J Human Comput Stud 65:275–291CrossRefGoogle Scholar
  4. 4.
    Boersma P, Weenink D (2007) Praat: doing phonetics by computer (version 4.6.12) [computer program]Google Scholar
  5. 5.
    Busso C, Narayanan SS (2007) Interrelation between speech and facial gestures in emotional utterances: a single subject study. IEEE Trans Audio Speech Lang Process 15:2331–2347CrossRefGoogle Scholar
  6. 6.
    Castellano G, Kessous L, Caridakis G (2008) Emotion recognition through multiple modalities: face, body gesture, speech. In: Beale R (ed) Affect and emotion in human-computer interaction, vol 4868. Springer, New YorkGoogle Scholar
  7. 7.
    Cowie R, Douglas-Cowie E, Tsapatsoulis N, Votsis G, Kollias S, Fellenz W, Taylor JG (2001) Emotion recognition in human-computer interaction. Signal Process Mag IEEE 18:32–80CrossRefGoogle Scholar
  8. 8.
    De Silva LC, Pei Chi N (2000) Bimodal emotion recognition. In: Proceedings of the fourth IEEE international conference on automatic face and gesture recognition, vol 1, pp 332–335Google Scholar
  9. 9.
    Dupont S, Luettin J (2000) Audio-visual speech modeling for continuous speech recognition. IEEE Trans Multimedia 2(3):141–151CrossRefGoogle Scholar
  10. 10.
    Ekman P (1993) Facial expression and emotion. Am Psychol 48:384–392CrossRefGoogle Scholar
  11. 11.
    Ekman P, Friesen WV, Hager JC (2002) Facial Action Coding System (FACS), the manual: a human faceGoogle Scholar
  12. 12.
    Fasel B, Luettin J (1999) Automatic facial expression analysis: a survey, vol 36Google Scholar
  13. 13.
    Fragopanagos N, Taylor JG (2005) Emotion recognition in human-computer interaction. Neural Netw 18:389–405CrossRefGoogle Scholar
  14. 14.
    Gunes H, Piccardi M, Pantic M (2008) From the lab to the real world: affect recognition using multiple cues and modalities. In: Or J (ed) Affective computing: focus on emotion expression. InTech Education and Publishing, Vienna, pp 185–218Google Scholar
  15. 15.
    Hager GD, Belhumeur PN (1998) Efficient region tracking with parametric models of geometry and illumination. PAMI 20:1025–1039Google Scholar
  16. 16.
    Hall DL, Llians J (2001) Handbook of multisensor data fusion. CRC, Boca RatonGoogle Scholar
  17. 17.
    Jimenez LO, Morales-Morell A, Creus A (1999) Classification of hyperdimensional data based on feature and decision fusion approaches using projection pursuit, majority voting, and neural networks. IEEE Trans Geosci Remote Sens 37(3 Part 1):1360–1366CrossRefGoogle Scholar
  18. 18.
    Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502CrossRefGoogle Scholar
  19. 19.
    Mansoorizadeh M, Charkari NM (2008) Bimodal person-dependent emotion recognition: comparison of feature level and decision level information fusion. In: HCI/HRI workshop, PETRA’08Google Scholar
  20. 20.
    Mansoorizadeh M, Charkari NM (2009) Audiovisual emotion database in persian language (persian). In: CSI national conference, csicc 2009, CSI, TehranGoogle Scholar
  21. 21.
    Martin O, Kotsia I, Macq B, Pitas I (2006) The enterface’05 audio-visual emotion database. In: Proc. 22nd intl. conf. on data engineering workshops (ICDEW’06)Google Scholar
  22. 22.
    Mehrabian A (1968) Communication without words. Psychol Today 2:53–56Google Scholar
  23. 23.
    Paleari M, Lisetti CL (2006) Toward multimodal fusion of affective cues. In: Proceedings of the 1st ACM international workshop on human-centered multimedia. ACM, New York, pp 99–108CrossRefGoogle Scholar
  24. 24.
    Pantic M, Rothkrantz LJM (2000) Automatic analysis of facial expressions: the state of the art. PAMI 22:124–1445Google Scholar
  25. 25.
    Pierre-Yves O (2003) The production and recognition of emotions in speech: features and algorithms. Int J Human-Comput Stud 59(1–2):157–183CrossRefGoogle Scholar
  26. 26.
    Ross A, Jain A (2003) Information fusion in biometrics. Pattern Recogn Lett 24(13):2115–2125CrossRefGoogle Scholar
  27. 27.
    Sadlier DA, O’Connor NE (2005) Event detection in field sports video using audio-visual features and a support vector machine. IEEE Trans Circuits Syst Video Technol 15(10):1225–1233CrossRefGoogle Scholar
  28. 28.
    Sobottka K, Pitas I (1998) A novel method for automatic face segmentation, facial feature extraction and tracking. Signal Process Image Commun 12:263–281CrossRefGoogle Scholar
  29. 29.
    Song M, You M, Li N, Chen C (1920) A robust multimodal approach for emotion recognition. Neurocomputing 71:1913–2008CrossRefGoogle Scholar
  30. 30.
    Webb A (ed) (2002) Statistical pattern recognition. Wiley, New YorkzbMATHGoogle Scholar
  31. 31.
    Welch G, Bishop G (1995) An introduction to the Kalman filter, University of North Carolina at Chapel Hill, Chapel HillGoogle Scholar
  32. 32.
    Wu Y, Chang EY, Chang KCC, Smith JR (2004) Optimal multimodal fusion for multimedia data analysis. In: Proceedings of the 12th annual ACM international conference on multimedia. ACM, New York, pp 572–579CrossRefGoogle Scholar
  33. 33.
    Yang J, Yang J-y, Zhang D, Lu J-f (2003) Feature fusion: parallel strategy vs. serial strategy. Pattern Recogn 36(6):1369–1381zbMATHCrossRefGoogle Scholar
  34. 34.
    Zeng Z, Pantic M, Roisman GI, Huang TS (2009) A survey of affect recognition methods: audio, visual, and spontaneous expressions. PAMI 31:39–58Google Scholar
  35. 35.
    Zhou ZH, Geng X (2004) Projection functions for eye detection. Pattern Recogn 37:1049–1056zbMATHCrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Muharram Mansoorizadeh
    • 1
  • Nasrollah Moghaddam Charkari
    • 1
  1. 1.Parallel and Image Processing Lab, Faculty of Electrical and Computer EngineeringTarbiat Modares UniversityTehranIran

Personalised recommendations