Journal on Multimodal User Interfaces, Volume 10, Issue 2, pp 99–111

EmoNets: Multimodal deep learning approaches for emotion recognition in video

  • Samira Ebrahimi Kahou
  • Xavier Bouthillier
  • Pascal Lamblin
  • Caglar Gulcehre
  • Vincent Michalski
  • Kishore Konda
  • Sébastien Jean
  • Pierre Froumenty
  • Yann Dauphin
  • Nicolas Boulanger-Lewandowski
  • Raul Chandias Ferrari
  • Mehdi Mirza
  • David Warde-Farley
  • Aaron Courville
  • Pascal Vincent
  • Roland Memisevic
  • Christopher Pal
  • Yoshua Bengio
Original Paper

Abstract

The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood-style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches that combine features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces; a deep belief net, focusing on the representation of the audio stream; a K-Means-based “bag-of-mouths” model, which extracts visual features around the mouth region; and a relational autoencoder, which addresses spatio-temporal aspects of videos. We explore multiple methods for combining the cues from these modalities into one common classifier, which achieves considerably greater accuracy than predictions from our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test-set accuracy of 47.67 % on the 2014 dataset.
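To make the combination step concrete, below is a minimal sketch of one such strategy: late fusion by a weighted average of per-modality class probabilities. The probability vectors, the fusion weights, and the function name are hypothetical illustrations; the weighted average is only one of the several combination methods explored in the paper, not necessarily the one used in the winning submission.

    import numpy as np

    # The seven EmotiW emotion classes.
    EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

    def fuse_predictions(modality_probs, weights):
        # Stack per-modality distributions into (n_modalities, n_classes),
        # take their weighted average, and renormalize to a distribution.
        stacked = np.stack(modality_probs)
        fused = weights @ stacked
        return fused / fused.sum()

    # Hypothetical per-clip outputs of three modality-specific models,
    # each a probability distribution over the seven classes.
    face_cnn = np.array([0.60, 0.05, 0.05, 0.10, 0.10, 0.05, 0.05])
    audio_dbn = np.array([0.30, 0.10, 0.10, 0.20, 0.15, 0.05, 0.10])
    bag_of_mouths = np.array([0.40, 0.05, 0.10, 0.15, 0.15, 0.05, 0.10])

    # Hypothetical weights; in practice such weights would be tuned on
    # validation data rather than hand-picked.
    fused = fuse_predictions([face_cnn, audio_dbn, bag_of_mouths],
                             np.array([0.5, 0.3, 0.2]))
    print(EMOTIONS[int(np.argmax(fused))])  # prints "angry" for these toy inputs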

Keywords

Emotion recognition · Deep learning · Model combination · Multimodal learning

Notes

Acknowledgments

The authors would like to thank the developers of Theano [2, 3]. We thank NSERC, Ubisoft, the German BMBF (project 01GQ0841), and CIFAR for their support. We also thank Abhishek Aggarwal, Emmanuel Bengio, Jörg Bornschein, Pierre-Luc Carrier, Myriam Côté, Guillaume Desjardins, David Krueger, Razvan Pascanu, Jean-Philippe Raymond, Arjun Sharma, Atousa Torabi, Zhenzhou Wu, and Jeremie Zumer for their work on the 2013 submission.

Compliance with ethical standards

Ethical statement

Authors Yoshua Bengio, Christopher Pal, Pascal Vincent, Roland Memisevic, and Aaron Courville have received research grants from the government of Canada for activities in collaboration with Ubisoft Entertainment Montreal.

References

  1. Adelson EH, Bergen JR (1985) Spatiotemporal energy models for the perception of motion. JOSA A 2(2):284–299
  2. Bastien F, Lamblin P, Pascanu R, Bergstra J, Goodfellow I, Bergeron A, Bouchard N, Warde-Farley D, Bengio Y (2012) Theano: new features and speed improvements. arXiv:1211.5590
  3. Bergstra J, Breuleux O, Bastien F, Lamblin P, Pascanu R, Desjardins G, Turian J, Warde-Farley D, Bengio Y (2010) Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy), vol 4, Austin
  4. Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. JMLR 13:281–305
  5. Carrier PL, Courville A, Goodfellow IJ, Mirza M, Bengio Y (2013) FER-2013 face database. Tech rep 1365, Université de Montréal
  6. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27
  7. Chen J, Chen Z, Chi Z, Fu H (2014) Emotion recognition in the wild with feature fusion and multiple kernel learning. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp 508–513. ACM
  8. Coates A, Lee H, Ng AY (2011) An analysis of single-layer networks in unsupervised feature learning. In: AISTATS
  9. Dahl GE, Sainath TN, Hinton GE (2013) Improving deep neural networks for LVCSR using rectified linear units and dropout. In: Proc ICASSP
  10. Dhall A, Goecke R, Joshi J, Sikka K, Gedeon T (2014) Emotion recognition in the wild challenge 2014: baseline, data and protocol. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp 461–466. ACM
  11. Dhall A, Goecke R, Joshi J, Wagner M, Gedeon T (2013) Emotion recognition in the wild challenge 2013. In: ACM ICMI
  12. Dhall A, Goecke R, Lucey S, Gedeon T (2012) Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia 3:34–41
  13. Gehrig T, Ekenel HK (2013) Why is facial expression analysis in the wild challenging? In: Proceedings of the 2013 Emotion Recognition in the Wild Challenge and Workshop, pp 9–16. ACM
  14. Google (2013) The Google Picasa face detector. http://picasa.google.com. Accessed 1 Aug 2013
  15. Graves A, Mohamed AR, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 6645–6649. IEEE
  16. Hamel P, Lemieux S, Bengio Y, Eck D (2011) Temporal pooling and multiscale learning for automatic annotation and ranking of music audio. In: ISMIR, pp 729–734
  17. Heusch G, Cardinaux F, Marcel S (2005) Lighting normalization algorithms for face verification. IDIAP Communication Com05-03
  18. Hinton G, Deng L, Yu D, Dahl GE, Mohamed AR, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97
  19. Hinton G, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
  20. Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov R (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580
  21. Kahou SE, Froumenty P, Pal C (2015) Facial expression analysis based on high dimensional binary features. In: Agapito L, Bronstein MM, Rother C (eds) Computer Vision - ECCV 2014 Workshops, Lecture Notes in Computer Science, vol 8926
  22. Kahou SE, Pal C, Bouthillier X, Froumenty P, Gulcehre C, Memisevic R, Vincent P, Courville A, Bengio Y, Ferrari RC, Mirza M, Jean S, Carrier PL, Dauphin Y, Boulanger-Lewandowski N, Aggarwal A, Zumer J, Lamblin P, Raymond JP, Desjardins G, Pascanu R, Warde-Farley D, Torabi A, Sharma A, Bengio E, Côté M, Konda KR, Wu Z (2013) Combining modality specific deep neural networks for emotion recognition in video. In: Proceedings of the 15th ACM International Conference on Multimodal Interaction, ICMI ’13
  23. Kalchbrenner N, Grefenstette E, Blunsom P (2014) A convolutional neural network for modelling sentences. arXiv:1404.2188
  24. Konda KR, Memisevic R, Michalski V (2014) The role of spatio-temporal synchrony in the encoding of motion. In: ICLR
  25. Krizhevsky A (2009) Learning multiple layers of features from tiny images. Tech rep
  26. Krizhevsky A (2012) cuda-convnet, Google Code home page. https://code.google.com/p/cuda-convnet/
  27. Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. In: NIPS, pp 1106–1114
  28. Le Q, Zou W, Yeung S, Ng A (2011) Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR
  29. Liu M, Wang R, Huang Z, Shan S, Chen X (2013) Partial least squares regression on Grassmannian manifold for emotion recognition. In: Proceedings of the 15th ACM International Conference on Multimodal Interaction, pp 525–530. ACM
  30. Liu M, Wang R, Li S, Shan S, Huang Z, Chen X (2014) Combining multiple kernel methods on Riemannian manifold for emotion recognition in the wild. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp 494–501. ACM
  31. Neverova N, Wolf C, Taylor GW, Nebout F (2014) ModDrop: adaptive multi-modal gesture recognition. arXiv:1501.00102
  32. Sikka K, Dykstra K, Sathyanarayana S, Littlewort G, Bartlett M (2013) Multiple kernel learning for emotion recognition in the wild. In: Proceedings of the 15th ACM International Conference on Multimodal Interaction, pp 517–524. ACM
  33. Štruc V, Pavešić N (2009) Gabor-based kernel partial-least-squares discrimination features for face recognition. Informatica 20(1):115–138
  34. Sun B, Li L, Zuo T, Chen Y, Zhou G, Wu X (2014) Combining multimodal features with hierarchical classifier fusion for emotion recognition in the wild. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp 481–486. ACM
  35. Susskind J, Anderson A, Hinton G (2010) The Toronto face database. Tech rep UTML TR 2010-001, University of Toronto
  36. Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum in deep learning. In: ICML 2013
  37. Taylor GW, Fergus R, LeCun Y, Bregler C (2010) Convolutional learning of spatio-temporal features. In: Proceedings of the 11th European Conference on Computer Vision: Part VI, ECCV’10
  38. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: CVPR
  39. Štruc V, Pavešić N (2011) Photometric normalization techniques for illumination invariance, pp 279–300. IGI Global
  40. Wang H, Ullah MM, Kläser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. In: BMVC
  41. Zhu X, Ramanan D (2012) Face detection, pose estimation, and landmark localization in the wild. In: CVPR

Copyright information

© OpenInterface Association 2015

Authors and Affiliations

  • Samira Ebrahimi Kahou (1)
  • Xavier Bouthillier (3)
  • Pascal Lamblin (3)
  • Caglar Gulcehre (3)
  • Vincent Michalski (2)
  • Kishore Konda (2)
  • Sébastien Jean (3)
  • Pierre Froumenty (1)
  • Yann Dauphin (3)
  • Nicolas Boulanger-Lewandowski (3)
  • Raul Chandias Ferrari (3)
  • Mehdi Mirza (3)
  • David Warde-Farley (3)
  • Aaron Courville (3)
  • Pascal Vincent (3)
  • Roland Memisevic (3)
  • Christopher Pal (1)
  • Yoshua Bengio (3)
  1. École Polytechnique de Montréal, Université de Montréal, Montreal, Canada
  2. Goethe-Universität Frankfurt, Frankfurt, Germany
  3. Montreal Institute for Learning Algorithms, Université de Montréal, Montreal, Canada
