Combining feature-level and decision-level fusion in a hierarchical classifier for emotion recognition in the wild

Sun, Bo; Li, Liandong; Wu, Xuewen; Zuo, Tian; Chen, Ying; Zhou, Guoyan; He, Jun; Zhu, Xiaoming

doi:10.1007/s12193-015-0203-6

Original Paper
Published: 18 November 2015

Journal on Multimodal User Interfaces, Volume 10, pages 125–137 (2016)

Abstract

Emotion recognition in the wild is a very challenging task. In this paper, we investigate a variety of multimodal features (acoustic and visual) extracted from video clips and evaluate their discriminative ability for human emotion analysis. For each clip, we extract MSDF BoW, LBP-TOP, PHOG, LPQ-TOP and audio features. We train different classifiers for each type of feature on the AFEW dataset from the ICMI 2014 EmotiW Challenge, and we propose a novel hierarchical classification framework that combines feature-level and decision-level fusion across all of the extracted multimodal features. Our final recognition rate on the AFEW test set is 47.17 %, considerably better than the best baseline recognition rate of 33.7 %. Among all teams participating in the ICMI 2014 EmotiW Challenge, our method won the first runner-up award. We also test our method on the FERA and CK datasets, where the experimental results likewise show good performance.
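The two fusion strategies named in the abstract can be illustrated with a small toy sketch. It is not the authors' framework: the abstract does not specify the classifiers, the grouping of features, or the fusion weights. The synthetic data, the nearest-centroid scorer, the seven-class setup, the 0.7/0.3 weights and all function names below are assumptions made only for illustration. Feature-level fusion is shown as concatenation of feature vectors before classification; decision-level fusion is shown as a weighted average of per-classifier class scores.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    # One mean vector per class (a stand-in for any trained classifier).
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def class_scores(X, centroids):
    # Softmax over negative squared distances gives per-class scores.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def feature_level_fusion(*blocks):
    # Concatenate feature vectors from several descriptors per sample.
    return np.concatenate(blocks, axis=1)

def decision_level_fusion(score_list, weights):
    # Weighted average of the class-score matrices of several classifiers.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * s for wi, s in zip(w, score_list))

# Synthetic stand-ins for real descriptors (assumption: 7 classes, 20 clips each).
rng = np.random.default_rng(0)
n_classes, n_per_class = 7, 20
y = np.repeat(np.arange(n_classes), n_per_class)

def toy_features(dim, noise):
    means = rng.normal(size=(n_classes, dim))
    return means[y] + noise * rng.normal(size=(y.size, dim))

visual_a = toy_features(8, 1.0)   # hypothetical visual descriptor
visual_b = toy_features(6, 1.0)   # hypothetical second visual descriptor
audio = toy_features(5, 1.5)      # hypothetical audio descriptor

# Level 1: feature-level fusion of the two visual descriptors.
visual = feature_level_fusion(visual_a, visual_b)
visual_scores = class_scores(visual, fit_centroids(visual, y, n_classes))
audio_scores = class_scores(audio, fit_centroids(audio, y, n_classes))

# Level 2: decision-level fusion of the visual and audio classifiers.
fused = decision_level_fusion([visual_scores, audio_scores], weights=[0.7, 0.3])
pred = fused.argmax(axis=1)
```

In a real experiment the classifiers would be scored on held-out clips rather than on their own training data, and the fusion weights would be chosen on a validation set.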

Acknowledgments

This work is supported by the Fundamental Research Funds for the Central Universities of China (2014KJJCA15, 2012YBXS10) and the National Education Science Twelfth Five-Year Plan Key Issues of the Ministry of Education (DCA140229).

Author information

Authors and Affiliations

College of Information Science and Technology, Beijing Normal University, Beijing, 100875, China
Bo Sun, Liandong Li, Xuewen Wu, Tian Zuo, Ying Chen, Guoyan Zhou, Jun He & Xiaoming Zhu

Corresponding author

Correspondence to Xuewen Wu.

About this article

Cite this article

Sun, B., Li, L., Wu, X. et al. Combining feature-level and decision-level fusion in a hierarchical classifier for emotion recognition in the wild. J Multimodal User Interfaces 10, 125–137 (2016). https://doi.org/10.1007/s12193-015-0203-6

Received: 28 January 2015
Accepted: 12 October 2015
Published: 18 November 2015
Issue Date: June 2016
DOI: https://doi.org/10.1007/s12193-015-0203-6
