Understanding the different categories of facial expressions plays a central role in designing computational models of human perception and affective state. Authoritative studies have revealed that facial expressions in daily life often convey multiple or co-occurring mental states. However, for lack of valid datasets, most previous studies remain restricted to basic emotions with a single label. In this paper, we present a novel multi-label facial expression database, RAF-ML, along with a new deep learning algorithm, to address this problem. Specifically, we crowdsourced 1.2 million annotations from 315 participants to identify the multi-label expressions collected from social networks, and then designed an EM algorithm to filter out unreliable labels. To the best of our knowledge, RAF-ML is the first in-the-wild database to provide crowdsourced annotations for multi-label expressions. To handle the ambiguity and continuity of blended expressions, we propose a new deep manifold learning network, called Deep Bi-Manifold CNN, which learns discriminative features for multi-label expressions by jointly preserving the local affinity of deep features and the manifold structure of emotion labels. Furthermore, a deep domain adaptation method is leveraged to transfer the deep manifold features learned from RAF-ML to other expression databases with different imaging conditions and cultures. Extensive experiments on RAF-ML and other diverse databases (JAFFE, CK+, SFEW and MMI) show that the deep manifold feature is not only superior for multi-label expression recognition in the wild, but also captures elemental and generic components that are effective for a wide range of expression recognition tasks.
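The EM-based filtering of unreliable crowdsourced labels can be sketched roughly as follows. This is an illustrative Dawid–Skene-style aggregation that assumes independent annotators and treats each emotion label as a binary vote; the function name `em_label_aggregation` and its parameters are hypothetical and do not reproduce the exact procedure used for RAF-ML:

```python
import numpy as np

def em_label_aggregation(votes, n_iter=20, thresh=0.5):
    """EM-style aggregation of crowdsourced binary label votes.

    votes: array of shape (n_annotators, n_items) with entries in {0, 1},
    and np.nan where an annotator did not label an item.
    Returns the posterior probability that each item carries the label,
    plus a per-annotator reliability estimate.
    """
    votes = np.asarray(votes, dtype=float)
    n_annot, n_items = votes.shape
    seen = ~np.isnan(votes)

    # Initialization: simple (nan-aware) majority vote as the consensus.
    post = np.nanmean(votes, axis=0)
    reliab = np.full(n_annot, 0.8)

    for _ in range(n_iter):
        # M-step: reliability = agreement rate with the current consensus.
        hard = (post >= thresh).astype(float)
        agree = np.where(seen, votes == hard, np.nan)
        reliab = np.clip(np.nanmean(agree, axis=1), 1e-3, 1 - 1e-3)

        # E-step: posterior via Bayes' rule, annotators assumed independent.
        log_pos = np.zeros(n_items)
        log_neg = np.zeros(n_items)
        for a in range(n_annot):
            m = seen[a]
            v = votes[a, m]
            log_pos[m] += v * np.log(reliab[a]) + (1 - v) * np.log(1 - reliab[a])
            log_neg[m] += (1 - v) * np.log(reliab[a]) + v * np.log(1 - reliab[a])
        post = 1.0 / (1.0 + np.exp(log_neg - log_pos))
    return post, reliab
```

Votes from annotators whose estimated reliability stays near chance contribute little to the posterior, which is the mechanism by which unreliable labels are filtered out.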
As stated in Ekman et al. (2013), “if the stimulus does contain an emotion blend, and the investigator allows only a single choice which does not contain blend terms, low levels of agreement may result, since some of the observers may choose a term for one of the blend components, some for another.”
It is reasonable that RAF-ML contains no images with five or six labels: people can hardly perceive both negative and positive valence (for example, joy and anger) simultaneously in a still image, since these occupy different regions of the valence–activation space (Cowie et al. 2001; Russell and Barrett 1999).
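This observation suggests a simple consistency check that could be applied when cleaning crowdsourced multi-label annotations: reject label sets that mix labels of opposite valence. The valence assignment below is an illustrative convention (surprise treated as valence-neutral), not the rule used to build RAF-ML:

```python
# Illustrative valence signs for the six basic emotions.
VALENCE = {
    "happiness": +1, "surprise": 0,
    "sadness": -1, "anger": -1, "disgust": -1, "fear": -1,
}

def is_valence_consistent(labels):
    """Return True if the label set does not mix positive and negative valence."""
    signs = {VALENCE[l] for l in labels if VALENCE.get(l, 0) != 0}
    return len(signs) <= 1
```

Under this convention, {happiness, surprise} passes while {happiness, anger} would be flagged as an implausible blend for a still image.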
Anitha, C., Venkatesha, M., & Adiga, B. S. (2010). A survey on facial expression databases. International Journal of Engineering Science and Technology, 2(10), 5158–5174.
Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1–27:27. Software Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chang, Y., Hu, C., & Turk, M. (2004). Probabilistic expression analysis on manifolds. In Computer vision and pattern recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE computer society conference on (Vol. 2, pp. II–II). IEEE.
Chen, J., Liu, X., Tu, P., & Aragones, A. (2013). Learning person-specific models for facial expression and action unit recognition. Pattern Recognition Letters, 34(15), 1964–1970.
Chu, W. S., De la Torre, F., & Cohn, J. F. (2013). Selective transfer machine for personalized facial action unit detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3515–3522).
Cour, T., Sapp, B., & Taskar, B. (2011). Learning from partial labels. Journal of Machine Learning Research, 12(May), 1501–1536.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., et al. (2001). Emotion recognition in human–computer interaction. IEEE Signal Processing Magazine, 18(1), 32–80.
Csurka, G. (2017). Domain adaptation for visual applications: A comprehensive survey. CoRR, arXiv:1702.05374.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR (Vol. 1, pp. 886–893). IEEE.
Dhall, A., Goecke, R., & Gedeon, T. (2015a). Automatic group happiness intensity analysis. IEEE Transactions on Affective Computing, 6(1), 13–26.
Dhall, A., Ramana Murthy, O., Goecke, R., Joshi, J., & Gedeon, T. (2015b). Video and image based emotion recognition challenges in the wild: Emotiw 2015. In Proceedings of the 2015 ACM on international conference on multimodal interaction (pp. 423–426). ACM.
Ding, X., Chu, W. S., De la Torre, F., Cohn, J. F., & Wang, Q. (2013). Facial action unit event detection by cascade of tasks. In Proceedings of the IEEE international conference on computer vision (pp. 2400–2407).
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2014). Decaf: A deep convolutional activation feature for generic visual recognition. In ICML (pp. 647–655).
Du, S., Tao, Y., & Martinez, A. M. (2014). Compound facial expressions of emotion. Proceedings of the National Academy of Sciences, 111(15), E1454–E1462.
Ekman, P., & Friesen, W. V. (2003). Unmasking the face: A guide to recognizing emotions from facial clues. Cambridge, MA: Malor Books.
Ekman, P., Friesen, W. V., & Ellsworth, P. (2013). Emotion in the human face: Guidelines for research and an integration of findings. Amsterdam: Elsevier.
Ekman, P., & Rosenberg, E. L. (1997). What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford: Oxford University Press.
Ekman, P., & Scherer, K. (1984). Expression and the nature of emotion. Approaches to Emotion, 3, 19–344.
Eleftheriadis, S., Rudovic, O., & Pantic, M. (2015a). Discriminative shared gaussian processes for multiview and view-invariant facial expression recognition. IEEE Transactions on Image Processing, 24(1), 189–204.
Eleftheriadis, S., Rudovic, O., & Pantic, M. (2015b). Multi-conditional latent variable model for joint facial action unit detection. In Proceedings of the IEEE international conference on computer vision (pp. 3792–3800).
Fabian Benitez-Quiroz, C., Srinivasan, R., & Martinez, A. M. (2016). Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5562–5570).
Fürnkranz, J., Hüllermeier, E., Mencía, E. L., & Brinker, K. (2008). Multilabel classification via calibrated label ranking. Machine Learning, 73(2), 133–153.
Gao, B. B., Xing, C., Xie, C. W., Wu, J., & Geng, X. (2017). Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing, 26(6), 2825–2838.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012a). A kernel two-sample test. Journal of Machine Learning Research, 13(Mar), 723–773.
Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., & Sriperumbudur, B. K. (2012b). Optimal kernel choice for large-scale two-sample tests. In Advances in neural information processing systems (pp. 1205–1213).
Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In Computer vision and pattern recognition, 2006 IEEE computer society conference on (Vol. 2, pp. 1735–1742). IEEE.
Hassin, R. R., Aviezer, H., & Bentin, S. (2013). Inherently ambiguous: Facial expressions of emotions, in context. Emotion Review, 5(1), 60–65.
He, X., & Niyogi, P. (2004). Locality preserving projections. In Advances in neural information processing systems (pp. 153–160).
Hou, P., Geng, X., & Zhang, M. L. (2016). Multi-label manifold learning. In AAAI (pp. 1680–1686).
Huang, S. J., Zhou, Z. H., & Zhou, Z. (2012). Multi-label learning by exploiting label correlations locally. In AAAI (pp. 949–955).
Megvii Inc. (2013). Face++ research toolkit. www.faceplusplus.com.
Izard, C. E. (1972). Anxiety: A variable combination of interacting fundamental emotions. In Anxiety: Current trends in theory and research (Vol. 1, pp. 55–106).
Izard, C. E. (2013). Human emotions. Berlin: Springer.
Jack, R. E., Garrod, O. G., Yu, H., Caldara, R., & Schyns, P. G. (2012). Facial expressions of emotion are not culturally universal. Proceedings of the National Academy of Sciences, 109(19), 7241–7244.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., et al. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on multimedia (pp. 675–678). ACM.
Jung, H., Lee, S., Yim, J., Park, S., & Kim, J. (2015). Joint fine-tuning in deep neural networks for facial expression recognition. In Proceedings of the IEEE international conference on computer vision (pp. 2983–2991).
Kim, B. K., Roh, J., Dong, S. Y., & Lee, S. Y. (2016). Hierarchical committee of deep convolutional neural networks for robust facial expression recognition. Journal on Multimodal User Interfaces, 1–17.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).
Li, S., & Deng, W. (2018). Deep facial expression recognition: A survey. CoRR, arXiv:1804.08348.
Li, S., & Deng, W. (2019). Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition. IEEE Transactions on Image Processing, 28(1), 356–370.
Li, S., Deng, W., & Du, J. (2017). Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In 2017 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2584–2593). IEEE.
Liu, M., Li, S., Shan, S., & Chen, X. (2013). Au-aware deep networks for facial expression recognition. In Automatic face and gesture recognition (FG), 2013 10th IEEE international conference and workshops on (pp. 1–6). IEEE.
Liu, M., Li, S., Shan, S., Wang, R., & Chen, X. (2014a). Deeply learning deformable facial action parts model for dynamic expression analysis. In Asian conference on computer vision (pp. 143–157). Berlin: Springer.
Liu, M., Shan, S., Wang, R., & Chen, X. (2014b). Learning expression lets on spatio-temporal manifold for dynamic facial expression recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1749–1756).
Liu, M., Shan, S., Wang, R., & Chen, X. (2016). Learning expressionlets via universal manifold model for dynamic facial expression recognition. IEEE Transactions on Image Processing, 25(12), 5920–5932.
Long, M., Cao, Y., Wang, J., & Jordan, M. (2015). Learning transferable features with deep adaptation networks. In International conference on machine learning (pp. 97–105).
Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar, Z., & Matthews, I. (2010). The extended Cohn–Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In CVPRW (pp. 94–101). IEEE.
Lv, Y., Feng, Z., & Xu, C. (2014). Facial expression recognition via deep learning. In Smart computing (SMARTCOMP), 2014 international conference on (pp. 303–308). IEEE.
Lyons, M., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Automatic face and gesture recognition, 1998. Proceedings. Third IEEE international conference on (pp. 200–205). IEEE.
Miao, Y. Q., Araujo, R., & Kamel, M. S. (2012) Cross-domain facial expression recognition using supervised kernel mean matching. In Machine learning and applications (ICMLA), 2012 11th international conference on, IEEE (Vol. 2, pp. 326–332).
Mollahosseini, A., Chan, D., & Mahoor, M. H. (2016). Going deeper in facial expression recognition using deep neural networks. In 2016 IEEE Winter conference on applications of computer vision (WACV) (pp. 1–10). IEEE.
Ng, H. W., Nguyen, V. D., Vonikakis, V., & Winkler, S. (2015). Deep learning for emotion recognition on small datasets using transfer learning. In Proceedings of the 2015 ACM on international conference on multimodal interaction (pp. 443–449). ACM.
Nummenmaa, T. (1988). The recognition of pure and blended facial expressions of emotion from still photographs. Scandinavian Journal of Psychology, 29(1), 33–47.
Ojala, T., Pietikäinen, M., & Mäenpää, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987.
Pantic, M., & Rothkrantz, L. J. M. (2000). Automatic analysis of facial expressions: The state of the art. IEEE Transactions on pattern analysis and machine intelligence, 22(12), 1424–1445.
Patel, V. M., Gopalan, R., Li, R., & Chellappa, R. (2015). Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3), 53–69.
Plutchik, R. (1991). The emotions. Lanham: University Press of America.
Russell, J. A., & Barrett, L. F. (1999). Core affect, prototypical emotional episodes, and other things called emotion: Dissecting the elephant. Journal of Personality and Social Psychology, 76(5), 805.
Sariyanidi, E., Gunes, H., & Cavallaro, A. (2015). Automatic analysis of facial affect: A survey of registration, representation, and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(6), 1113–1133.
Sariyanidi, E., Gunes, H., & Cavallaro, A. (2017). Learning bases of activity for facial expression recognition. IEEE Transactions on Image Processing, 26(4), 1965–1978. https://doi.org/10.1109/TIP.2017.2662237.
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition. (pp. 815–823).
Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 806–813).
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, arXiv:1409.1556.
Tomkins, S. S. (1963). Affect imagery consciousness: Volume II: The negative affects (Vol. 2). Berlin: Springer.
Tsoumakas, G., & Katakis, I. (2006). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3), 1–13.
Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multilabel classification. In Machine learning: ECML 2007 (pp. 406–417). Berlin: Springer.
Valstar, M., & Pantic, M. (2010). Induced disgust, happiness and surprise: An addition to the MMI facial expression database. In Proceedings of the 3rd international workshop on EMOTION (satellite of LREC): Corpora for research on emotion and affect (p. 65).
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In CVPR (Vol. 1, pp. I-511–I-518). IEEE.
Wang, S., Liu, Z., Wang, J., Wang, Z., Li, Y., Chen, X., et al. (2014). Exploiting multi-expression dependences for implicit multi-emotion video tagging. Image and Vision Computing, 32(10), 682–691.
Wen, Y., Zhang, K., Li, Z., & Qiao, Y. (2016). A discriminative feature learning approach for deep face recognition. In European conference on computer vision. Berlin: Springer (pp. 499–515).
Whitehill, J., Wu, T. F., Bergsma, J., Movellan, J. R., & Ruvolo, P. L. (2009). Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems (pp. 2035–2043).
Xing, C., Geng, X., & Xue, H. (2016). Logistic boosting regression for label distribution learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4489–4497).
Xiong, X., & De la Torre, F. (2013). Supervised descent method and its applications to face alignment. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 532–539).
Yan, H., Ang, M. H., & Poo, A. N. (2011). Cross-dataset facial expression recognition. In Robotics and automation (ICRA), 2011 IEEE international conference on (pp. 5985–5990). IEEE.
Yin, L., Wei, X., Sun, Y., Wang, J., & Rosato, M. J. (2006). A 3d facial expression database for facial behavior research. In Automatic face and gesture recognition, 2006. FGR 2006. 7th international conference on (pp. 211–216). IEEE.
Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems (pp. 3320–3328).
Yu, Z., & Zhang, C. (2015). Image based static facial expression recognition with multiple deep network learning. In Proceedings of the 2015 ACM on international conference on multimodal interaction (pp. 435–442). ACM.
Zen, G., Porzi, L., Sangineto, E., Ricci, E., & Sebe, N. (2016). Learning personalized models for facial expression analysis and gesture recognition. IEEE Transactions on Multimedia, 18(4), 775–788.
Zeng, J., Chu, W. S., De la Torre, F., Cohn, J. F., & Xiong, Z. (2015). Confidence preserving machine for facial action unit detection. In Proceedings of the IEEE international conference on computer vision (pp. 3622–3630).
Zhang, M. L., & Wu, L. (2015). Lift: Multi-label learning with label-specific features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1), 107–120.
Zhang, M. L., & Yu, F. (2015). Solving the partial label learning problem: An instance-based approach. In IJCAI (pp. 4048–4054).
Zhang, M. L., & Zhou, Z. H. (2007). ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7), 2038–2048.
Zhang, M. L., & Zhou, Z. H. (2014). A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8), 1819–1837.
Zhang, Z., Luo, P., Loy, C. C., & Tang, X. (2018). From facial expression recognition to interpersonal relation prediction. International Journal of Computer Vision, 126(5), 550–569.
Zhao, K., Zhang, H., Ma, Z., Song, Y. Z., & Guo, J. (2015). Multi-label learning with prior knowledge for facial expression analysis. Neurocomputing, 157, 280–289.
Zhong, L., Liu, Q., Yang, P., Liu, B., Huang, J., & Metaxas, D. N. (2012). Learning active facial patches for expression analysis. In Computer vision and pattern recognition (CVPR), 2012 IEEE conference on (pp. 2562–2569). IEEE.
Zhou, Y., Xue, H., & Geng, X. (2015). Emotion distribution recognition from facial expressions. In Proceedings of the 23rd ACM international conference on multimedia (pp. 1247–1250). ACM.
Zhu, R., Sang, G., & Zhao, Q. (2016). Discriminative feature adaptation for cross-domain facial expression recognition. In Biometrics (ICB), 2016 international conference on (pp. 1–7). IEEE.
Zong, Y., Huang, X., Zheng, W., Cui, Z., & Zhao, G. (2017). Learning a target sample re-generator for cross-database micro-expression recognition. In Proceedings of the 2017 ACM on multimedia conference (pp. 872–880). ACM.
Funding was provided by the National Natural Science Foundation of China (Grant Nos. 61573068 and 61471048) and the Beijing Nova Program (Grant No. Z161100004916088).
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Communicated by Rama Chellappa, Xiaoming Liu, Tae-Kyun Kim, Fernando De la Torre, Chen Change Loy.
Cite this article
Li, S., Deng, W. Blended Emotion in-the-Wild: Multi-label Facial Expression Recognition Using Crowdsourced Annotations and Deep Locality Feature Learning. Int J Comput Vis 127, 884–906 (2019). https://doi.org/10.1007/s11263-018-1131-1
- Facial expression recognition
- Deep feature learning
- Multi-label classification
- Crowdsourced database in-the-wild