Abstract
Fully automatic facial expression recognition (FER) is a key component of human behavior analysis. Performing FER from still images is challenging because it involves handling large interpersonal morphological differences and because partial occlusions can occur. Furthermore, labelling expressions is a time-consuming process that is prone to subjectivity, so the training data may not fully cover the variability. In this work, we propose to train random forests on spatially constrained random local subspaces of the face. The resulting local predictions form a categorical, expression-driven high-level representation that we call local expression predictions (LEPs). LEPs can be combined to describe categorical facial expressions as well as action units (AUs). Furthermore, LEPs can be weighted by confidence scores provided by an autoencoder network. This network is trained to locally capture the manifold of the non-occluded training data in a hierarchical way. Extensive experiments show that the proposed LEP representation yields high descriptive power for categorical expressions and AU occurrence prediction, and opens perspectives for the design of occlusion-robust and confidence-aware FER systems.
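The idea of weighting local predictions by confidence can be illustrated with a minimal sketch. It assumes that each LEP is a class-probability vector produced for one local face region and that each region has a non-negative confidence score. The function name `combine_leps`, the weighted-average rule and the toy numbers are illustrative assumptions, not the formulation used in the article.

```python
import numpy as np


def combine_leps(local_probs, confidences=None):
    """Combine local expression predictions into one class distribution.

    local_probs: array of shape (n_local, n_classes); each row is the
        class-probability vector predicted for one local region.
    confidences: array of shape (n_local,) with non-negative weights.
        None means every local prediction gets the same weight.
    (Hypothetical helper: a plain confidence-weighted average.)
    """
    local_probs = np.asarray(local_probs, dtype=float)
    if confidences is None:
        confidences = np.ones(local_probs.shape[0])
    weights = np.asarray(confidences, dtype=float)
    total = weights.sum()
    if total <= 0:
        raise ValueError("at least one confidence must be positive")
    # Normalise the weights so the result is still a probability vector.
    return (weights / total) @ local_probs


# Toy example: three local regions, two classes.
# The third region is assumed occluded and gives a misleading prediction.
leps = np.array([[0.9, 0.1],
                 [0.8, 0.2],
                 [0.1, 0.9]])
conf = np.array([1.0, 1.0, 0.0])  # assumed confidence scores

uniform = combine_leps(leps)         # all regions weighted equally
weighted = combine_leps(leps, conf)  # occluded region ignored
```

In this toy case, giving the occluded region zero confidence removes its misleading vote from the combined prediction, which is the kind of behaviour the confidence weighting is meant to provide.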
Acknowledgements
This work has been supported by the French National Agency (ANR) in the frame of its Technological Research CONTINT program (JEMImE, project number ANR-13-CORD-0004).
Additional information
Communicated by Thomas Brox, Cordelia Schmid.
About this article
Cite this article
Dapogny, A., Bailly, K. & Dubuisson, S. Confidence-Weighted Local Expression Predictions for Occlusion Handling in Expression Recognition and Action Unit Detection. Int J Comput Vis 126, 255–271 (2018). https://doi.org/10.1007/s11263-017-1010-1
DOI: https://doi.org/10.1007/s11263-017-1010-1