
Human emotion recognition from videos using spatio-temporal and audio features

Original Article · The Visual Computer

Abstract

In this paper, we present a human emotion recognition system based on audio and spatio-temporal visual features. The proposed system has been tested on an audio-visual emotion data set containing subjects of both genders. Mel-frequency cepstral coefficient (MFCC) and prosodic features are first extracted from the emotional speech, while spatio-temporal features are extracted from the visual streams to capture facial expressions. Principal component analysis (PCA) is applied to reduce the dimensionality of the visual features while retaining 97 % of their variance. Codebooks are constructed for both the audio and visual features using Euclidean distance, and the resulting codeword-occurrence histograms are fed to support vector machine (SVM) classifiers, one per modality. The judgments of the individual classifiers are then combined using the Bayes sum rule (BSR) as a final decision step. The proposed system is evaluated on a public data set. Experimental results show that visual features alone yield an average accuracy of 74.15 %, audio features alone yield an average accuracy of 67.39 %, and combining both modalities improves the overall accuracy to 80.27 %.
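The pipeline described in the abstract is a bag-of-words design applied to two modalities: quantize local descriptors against a codebook, classify the occurrence histograms with an SVM per modality, and fuse the per-class scores with a sum rule. The sketch below illustrates that flow. It is a minimal illustration, not the authors' implementation: k-means is assumed for the Euclidean codebook, an RBF-kernel SVM with probability estimates stands in for the unspecified SVM configuration, and the MFCC/prosodic and spatio-temporal descriptors are stubbed with random arrays.

```python
# Minimal sketch of the two-modality bag-of-words + sum-rule fusion pipeline.
# Assumptions (not specified in the paper): k-means builds the Euclidean
# codebook, the SVM uses an RBF kernel with probability outputs, and the
# Bayes sum rule is a sum of per-class posteriors. Real feature extraction
# is replaced by random stand-in arrays.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
N_CLIPS, N_CLASSES, K = 60, 6, 32          # clips, emotion classes, codebook size

# Stand-ins for per-clip descriptor sets (one row per local descriptor).
audio_descr  = [rng.normal(size=(40, 13)) for _ in range(N_CLIPS)]   # e.g. MFCC frames
visual_descr = [rng.normal(size=(30, 100)) for _ in range(N_CLIPS)]  # spatio-temporal patches
labels = rng.integers(0, N_CLASSES, size=N_CLIPS)

def bag_of_words(descr_sets, pca=None, k=K):
    """Quantize each clip's local descriptors into a k-bin occurrence histogram."""
    stacked = np.vstack(descr_sets)
    if pca is not None:                     # optional dimensionality reduction
        stacked = pca.fit_transform(stacked)
        descr_sets = [pca.transform(d) for d in descr_sets]
    codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(stacked)
    hists = np.array([
        np.bincount(codebook.predict(d), minlength=k) for d in descr_sets
    ], dtype=float)
    return hists / hists.sum(axis=1, keepdims=True)   # normalized histograms

# Visual branch: PCA retaining 97 % of the variance, then codebook histograms.
X_vis = bag_of_words(visual_descr, pca=PCA(n_components=0.97))
# Audio branch: codebook histograms over the (stubbed) MFCC/prosodic descriptors.
X_aud = bag_of_words(audio_descr)

# One SVM per modality; probability=True enables posterior estimates.
svm_vis = SVC(kernel='rbf', probability=True, random_state=0).fit(X_vis, labels)
svm_aud = SVC(kernel='rbf', probability=True, random_state=0).fit(X_aud, labels)

# Sum rule: add the per-class posteriors from both classifiers, take the argmax.
posteriors = svm_vis.predict_proba(X_vis) + svm_aud.predict_proba(X_aud)
fused = svm_vis.classes_[np.argmax(posteriors, axis=1)]
print("fused training accuracy:", (fused == labels).mean())
```

Summing posteriors across modalities, rather than multiplying them, is generally more tolerant of a poor probability estimate from a single classifier, which is consistent with the paper's choice of the sum rule as the fusion step.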



Acknowledgements

The authors would like to thank Universiti Teknologi Malaysia (UTM) and the Ministry of Higher Education (MOHE) for providing financial assistance. The authors are also indebted to Dr. Usman Ullah Sheikh for technical help and discussions.

Author information

Correspondence to Munaf Rashid.


About this article

Cite this article

Rashid, M., Abu-Bakar, S.A.R. & Mokji, M. Human emotion recognition from videos using spatio-temporal and audio features. Vis Comput 29, 1269–1275 (2013). https://doi.org/10.1007/s00371-012-0768-y
