Word-Level Emotion Recognition Using High-Level Features

  • Johanna D. Moore
  • Leimin Tian
  • Catherine Lai
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8404)


In this paper, we investigate the use of high-level features for recognizing human emotions at the word level in natural conversations with virtual agents. Experiments were carried out on the 2012 Audio/Visual Emotion Challenge (AVEC2012) database, where emotions are defined as vectors in the Arousal-Expectancy-Power-Valence emotional space. Our model using 6 novel disfluency features yields significant improvements over models using a large number of low-level spectral and prosodic features, and its overall performance difference from the best model of the AVEC2012 Word-Level Sub-Challenge is not significant. Our visual model using Active Shape Model features also yields significant improvements over models using the low-level Local Binary Patterns visual features. We built a bimodal model by combining our disfluency and visual feature sets and applying Correlation-based Feature-subset Selection. Considering overall performance on all emotion dimensions, our bimodal model outperforms the second best model of the challenge and comes close to the best model. It also gives the best result when predicting Expectancy values.
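The pipeline the abstract describes (feature-level fusion of the disfluency and visual sets, correlation-based feature selection, then one regressor per emotion dimension) could be sketched roughly as follows. This is not the authors' code: all data and feature names are synthetic placeholders, a simple mean-correlation filter stands in for the CFS algorithm, and ridge regression stands in for Support Vector Regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words = 200

# Synthetic stand-ins for the paper's feature sets (names are placeholders).
disfluency = rng.normal(size=(n_words, 6))   # 6 word-level disfluency features
visual = rng.normal(size=(n_words, 10))      # ASM-derived visual features
X = np.hstack([disfluency, visual])          # early (feature-level) fusion
y = rng.normal(size=(n_words, 4))            # Arousal, Expectancy, Power, Valence

# Crude correlation-based filter standing in for CFS: keep the features whose
# mean absolute correlation with the four target dimensions is highest.
d = X.shape[1]
corr = np.abs(np.corrcoef(X, y, rowvar=False)[:d, d:]).mean(axis=1)
selected = np.argsort(corr)[-8:]             # retain the top 8 features
Xs = X[:, selected]

# One regressor per emotion dimension; closed-form ridge regression here
# stands in for the Support Vector Regression used in the challenge setup.
Xb = np.hstack([Xs, np.ones((n_words, 1))])  # add a bias column
models = []
for dim in range(4):
    w = np.linalg.solve(Xb.T @ Xb + 1e-2 * np.eye(Xb.shape[1]),
                        Xb.T @ y[:, dim])
    models.append(w)

preds = np.column_stack([Xb @ w for w in models])
print(preds.shape)  # (200, 4): one A-E-P-V vector per word
```

The per-dimension loop mirrors the challenge evaluation, which scores each of the four emotion dimensions separately.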


Keywords: Visual Feature · Support Vector Regression · Emotion Recognition · Local Binary Pattern · Emotion Dimension





Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Johanna D. Moore
  • Leimin Tian
  • Catherine Lai

  1. School of Informatics, University of Edinburgh, Edinburgh, UK
