Abstract
Emotion recognition in the wild is a very challenging task. In this paper, we investigate a variety of multimodal features (acoustic and visual) extracted from video clips to evaluate their discriminative ability for human emotion analysis. For each clip, we extract MSDF BoW, LBP-TOP, PHOG, LPQ-TOP and audio features. We train different classifiers for each type of feature on the AFEW dataset from the ICMI 2014 EmotiW Challenge, and we propose a novel hierarchical classification framework that combines feature-level and decision-level fusion strategies over all of the extracted multimodal features. Our final recognition rate on the AFEW test set is 47.17 %, considerably better than the best baseline recognition rate of 33.7 %. Among all teams participating in the ICMI 2014 EmotiW Challenge, our method won the first runner-up award. We also test our method on the FERA and CK datasets, where the experimental results likewise show good performance.
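The two fusion strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the fusion rule, so the weighted average used for decision-level fusion, the function names, and all toy values below are assumptions made purely for illustration.

```python
from typing import List, Sequence


def feature_level_fusion(features: Sequence[Sequence[float]]) -> List[float]:
    """Feature-level fusion: concatenate per-descriptor vectors into one vector."""
    fused: List[float] = []
    for vec in features:
        fused.extend(vec)
    return fused


def decision_level_fusion(
    scores: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Decision-level fusion (assumed rule): weighted average of class scores.

    `scores[i]` holds the per-class scores of classifier i; `weights[i]` is a
    hypothetical reliability weight for that classifier.
    """
    if len(scores) != len(weights):
        raise ValueError("one weight is required per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(scores[0])
    return [
        sum(w * s[c] for w, s in zip(weights, scores)) / total
        for c in range(n_classes)
    ]


def predict(fused_scores: Sequence[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused_scores)), key=lambda c: fused_scores[c])


# Toy example: two visual descriptors are concatenated before classification,
# then the (made-up) scores of a visual and an audio classifier are merged.
visual_vector = feature_level_fusion([[0.1, 0.9], [0.5]])
fused = decision_level_fusion([[0.6, 0.4], [0.2, 0.8]], weights=[1.0, 1.0])
label = predict(fused)
```

In a hierarchical arrangement of this kind, concatenated features feed the lower-level classifiers and their score vectors are merged at the top level; the weights would normally be tuned on a validation set.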
Acknowledgments
This work is supported by the Fundamental Research Funds for the Central Universities of China (2014KJJCA15, 2012YBXS10) and the National Education Science Twelfth Five-Year Plan Key Issues of the Ministry of Education (DCA140229).
Cite this article
Sun, B., Li, L., Wu, X. et al. Combining feature-level and decision-level fusion in a hierarchical classifier for emotion recognition in the wild. J Multimodal User Interfaces 10, 125–137 (2016). https://doi.org/10.1007/s12193-015-0203-6