EmoNets: Multimodal deep learning approaches for emotion recognition in video

  • Original Paper
  • Journal on Multimodal User Interfaces

Abstract

The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood-style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches that combine features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces; a deep belief net, focusing on the representation of the audio stream; a K-Means-based “bag-of-mouths” model, which extracts visual features around the mouth region; and a relational autoencoder, which addresses spatio-temporal aspects of videos. We explore multiple methods for combining the cues from these modalities into one common classifier, which achieves a considerably greater accuracy than predictions from our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test set accuracy of 47.67% on the 2014 dataset.
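
As a concrete illustration of the final fusion step described above, the sketch below shows one simple way to combine per-modality predictions: a weighted average of class probabilities followed by an argmax. This is only a minimal Python/NumPy sketch, not the exact aggregation strategy used in the paper; the array names and the fuse_predictions helper are hypothetical stand-ins for the outputs of the modality-specific models.

    import numpy as np

    # Hypothetical stand-ins for per-modality outputs: each model is assumed
    # to produce class probabilities over the seven EmotiW emotions for every
    # clip, as arrays of shape (n_clips, 7).
    def fuse_predictions(modality_probs, weights=None):
        """Weighted average of per-modality class probabilities (late fusion)."""
        probs = np.stack(modality_probs, axis=0)        # (n_modalities, n_clips, 7)
        if weights is None:
            weights = np.ones(len(modality_probs))
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()               # normalise weights to sum to 1
        fused = np.tensordot(weights, probs, axes=1)    # (n_clips, 7)
        return fused.argmax(axis=1)                     # hard label per clip

    # Toy usage with random placeholder probabilities for two modalities.
    rng = np.random.default_rng(0)
    face_cnn_probs = rng.dirichlet(np.ones(7), size=10)    # e.g. face CNN output
    audio_dbn_probs = rng.dirichlet(np.ones(7), size=10)   # e.g. audio DBN output
    print(fuse_predictions([face_cnn_probs, audio_dbn_probs], weights=[0.6, 0.4]))

The submissions described in the paper go further, comparing several fusion strategies and selecting combination parameters on held-out data, but a weighted average of this kind already conveys how the modality-specific classifiers are merged into a single prediction per clip.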


Notes

  1. Yaafe: audio features extraction toolbox: http://yaafe.sourceforge.net/.

References

  1. Adelson EH, Bergen JR (1985) Spatiotemporal energy models for the perception of motion. JOSA A 2(2):284–299

  2. Bastien F, Lamblin P, Pascanu R, Bergstra J, Goodfellow I, Bergeron A, Bouchard N, Warde-Farley D, Bengio Y (2012) Theano: new features and speed improvements. arXiv:1211.5590

  3. Bergstra J, Breuleux O, Bastien F, Lamblin P, Pascanu R, Desjardins G, Turian J, Warde-Farley D, Bengio Y (2010) Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy), vol 4. Austin

  4. Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. JMLR 13:281–305

  5. Carrier PL, Courville A, Goodfellow IJ, Mirza M, Bengio Y (2013) FER-2013 face database. Tech rep, 1365 (Université de Montréal)

  6. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intel Syst Technol 2:27:1–27:27

  7. Chen J, Chen Z, Chi Z, Fu H (2014) Emotion recognition in the wild with feature fusion and multiple kernel learning. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp. 508–513. ACM

  8. Coates A, Lee H, Ng AY (2011) An analysis of single-layer networks in unsupervised feature learning. In: AISTATS

  9. Dahl GE, Sainath TN, Hinton GE (2013) Improving deep neural networks for LVCSR using rectified linear units and dropout. In: Proc. ICASSP

  10. Dhall A, Goecke R, Joshi J, Sikka K, Gedeon T (2014) Emotion recognition in the wild challenge 2014: Baseline, data and protocol. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp. 461–466. ACM

  11. Dhall A, Goecke R, Joshi J, Wagner M, Gedeon T (2013) Emotion recognition in the wild challenge 2013. In: ACM ICMI

  12. Dhall A, Goecke R, Lucey S, Gedeon T (2012) Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia 3:34–41

  13. Gehrig T, Ekenel HK (2013) Why is facial expression analysis in the wild challenging? In: Proceedings of the 2013 on Emotion recognition in the wild challenge and workshop, pp. 9–16. ACM

  14. Google: The Google Picasa face detector (2013). http://picasa.google.com. Accessed 1-Aug-2013

  15. Graves A, Mohamed AR, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649. IEEE

  16. Hamel P, Lemieux S, Bengio Y, Eck D (2011) Temporal pooling and multiscale learning for automatic annotation and ranking of music audio. In: ISMIR, pp. 729–734

  17. Heusch G, Cardinaux F, Marcel S (2005) Lighting normalization algorithms for face verification. IDIAP Communication Com05-03

  18. Hinton G, Deng L, Yu D, Dahl GE, Mohamed AR, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Sig Proc Magazine 29(6):82–97

  19. Hinton G, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comp 18(7):1527–1554

  20. Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov R (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580

  21. Kahou SE, Froumenty P, Pal C (2015) Facial expression analysis based on high dimensional binary features. In: L Agapito, MM Bronstein, C Rother (eds) Computer vision - ECCV 2014 Workshops, Lecture Notes in Computer Science, vol. 8926

  22. Kahou SE, Pal C, Bouthillier X, Froumenty P, Gulcehre C, Memisevic R, Vincent P, Courville A, Bengio Y, Ferrari RC, Mirza M, Jean S, Carrier PL, Dauphin Y, Boulanger-Lewandowski N, Aggarwal A, Zumer J, Lamblin P, Raymond JP, Desjardins G, Pascanu R, Warde-Farley D, Torabi A, Sharma A, Bengio E, Côté M, Konda KR, Wu Z (2013) Combining modality specific deep neural networks for emotion recognition in video. In: Proceedings of the 15th ACM on International Conference on Multimodal Interaction, ICMI ’13

  23. Kalchbrenner N, Grefenstette E, Blunsom P (2014) A convolutional neural network for modelling sentences. arXiv:1404.2188

  24. Konda KR, Memisevic R, Michalski V (2014) The role of spatio-temporal synchrony in the encoding of motion. In: ICLR

  25. Krizhevsky A (2009) Learning multiple layers of features from tiny images. Tech rep

  26. Krizhevsky A (2012) Cuda-convnet Google code home page. https://code.google.com/p/cuda-convnet/

  27. Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural networks. In: NIPS, pp. 1106–1114

  28. Le Q, Zou W, Yeung S, Ng A (2011) Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR

  29. Liu M, Wang R, Huang Z, Shan S, Chen X (2013) Partial least squares regression on grassmannian manifold for emotion recognition. In: Proceedings of the 15th ACM on International conference on multimodal interaction, pp. 525–530. ACM

  30. Liu M, Wang R, Li S, Shan S, Huang Z, Chen X (2014) Combining multiple kernel methods on riemannian manifold for emotion recognition in the wild. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp. 494–501. ACM

  31. Neverova N, Wolf C, Taylor GW, Nebout F (2014) Moddrop: adaptive multi-modal gesture recognition. arXiv:1501.00102

  32. Sikka K, Dykstra K, Sathyanarayana S, Littlewort G, Bartlett M (2013) Multiple kernel learning for emotion recognition in the wild. In: Proceedings of the 15th ACM on International conference on multimodal interaction, pp. 517–524. ACM

  33. Štruc V, Pavešić N (2009) Gabor-based kernel partial-least-squares discrimination features for face recognition. Informatica 20(1):115–138

  34. Sun B, Li L, Zuo T, Chen Y, Zhou G, Wu X (2014) Combining multimodal features with hierarchical classifier fusion for emotion recognition in the wild. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp. 481–486. ACM

  35. Susskind J, Anderson A, Hinton G (2010) The toronto face database. Tech Rep, UTML TR 2010-001, University of Toronto

  36. Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum in deep learning. In: ICML 2013

  37. Taylor GW, Fergus R, LeCun Y, Bregler C (2010) Convolutional learning of spatio-temporal features. In: Proceedings of the 11th European conference on Computer vision: Part VI, ECCV’10

  38. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: CVPR

  39. Štruc V, Pavešić N (2011) Photometric normalization techniques for illumination invariance, pp. 279–300. IGI-Global

  40. Wang H, Ullah MM, Kläser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. In: BMVC

  41. Zhu X, Ramanan D (2012) Face Detection, Pose Estimation, and Landmark Localization in the Wild. In: CVPR

Acknowledgments

The authors would like to thank the developers of Theano [2, 3]. We thank NSERC, Ubisoft, the German BMBF, project 01GQ0841 and CIFAR for their support. We also thank Abhishek Aggarwal, Emmanuel Bengio, Jörg Bornschein, Pierre-Luc Carrier, Myriam Côté, Guillaume Desjardins, David Krueger, Razvan Pascanu, Jean-Philippe Raymond, Arjun Sharma, Atousa Torabi, Zhenzhou Wu, and Jeremie Zumer for their work on the 2013 submission.

Author information

Corresponding author

Correspondence to Samira Ebrahimi Kahou.

Ethics declarations

Ethical statement

Authors Yoshua Bengio, Christopher Pal, Pascal Vincent, Roland Memisevic, and Aaron Courville have received research grants from the government of Canada for activities in collaboration with Ubisoft Entertainment Montreal.

About this article

Cite this article

Kahou, S.E., Bouthillier, X., Lamblin, P. et al. EmoNets: Multimodal deep learning approaches for emotion recognition in video. J Multimodal User Interfaces 10, 99–111 (2016). https://doi.org/10.1007/s12193-015-0195-2
