The University of Passau Open Emotion Recognition System for the Multimodal Emotion Challenge

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 663)


This paper presents the University of Passau’s approaches for the Multimodal Emotion Recognition Challenge 2016. For audio signals, we exploit Bag-of-Audio-Words techniques combined with Extreme Learning Machines and Hierarchical Extreme Learning Machines. For video signals, we use not only the information from the cropped face in a video frame, but also the broader contextual information from the entire frame. This information is extracted via two Convolutional Neural Networks pre-trained for face detection and object classification. Moreover, we extract facial action units, which reflect facial muscle movements and are known to be important for emotion recognition. Long Short-Term Memory Recurrent Neural Networks are deployed to exploit the temporal information in the video representations. Average late fusion of the audio and video systems is applied to make predictions for multimodal emotion recognition. Experimental results on the challenge database demonstrate the effectiveness of the proposed systems compared with the baseline.
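The final step described above, average late fusion, can be sketched in a few lines: the per-class scores of the audio and video subsystems are averaged, and the emotion with the highest fused score is predicted. This is a minimal illustration, not the authors' implementation; the label set and scores below are hypothetical.

```python
# Sketch of average late fusion over two unimodal emotion classifiers.
# EMOTIONS and the example scores are illustrative placeholders.

EMOTIONS = ["angry", "happy", "neutral", "sad"]

def fuse_average(audio_probs, video_probs):
    """Average the per-class scores of the audio and video subsystems."""
    return [(a + v) / 2.0 for a, v in zip(audio_probs, video_probs)]

def predict(audio_probs, video_probs):
    """Return the emotion label with the highest fused score."""
    fused = fuse_average(audio_probs, video_probs)
    return EMOTIONS[max(range(len(fused)), key=fused.__getitem__)]

if __name__ == "__main__":
    audio = [0.10, 0.60, 0.20, 0.10]  # audio subsystem class posteriors
    video = [0.05, 0.30, 0.50, 0.15]  # video subsystem class posteriors
    print(predict(audio, video))      # prints "happy" (fused score 0.45)
```

Unweighted averaging is the simplest late-fusion rule; it requires no extra training data, which makes it attractive when the challenge development set is small.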


Multimodal emotion recognition · Bag-of-audio-words · Transfer learning · Long short-term memory · Convolutional neural networks



This work has been partially supported by the BMBF IKT2020 Grant under grant agreement No. 16SV7213 (EmotAsS), the European Community’s Seventh Framework Programme through the ERC Starting Grant No. 338164 (iHEARu), the European Union’s Horizon 2020 Programme under grant agreement No. 688835 (DE-ENIGMA), and the European Union’s Horizon 2020 Programme through the Innovative Action No. 645094 (SEWA). It was further partially supported by research grants from the China Scholarship Council (CSC) awarded to Xinzhou Xu.



Copyright information

© Springer Nature Singapore Pte Ltd. 2016

Authors and Affiliations

  1. Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany
  2. Technische Universität München, Munich, Germany
  3. Northwestern Polytechnical University, Xi’an, People’s Republic of China
