Abstract
Facial expression recognition from videos in the wild is a challenging task due to the lack of abundant labelled training data. Large DNN (deep neural network) architectures and ensemble methods have improved performance, but they soon saturate because of this data inadequacy. In this paper, we use a self-training method that combines a labelled dataset with an unlabelled dataset (the Body Language Dataset, BoLD). Experimental analysis shows that iteratively training a noisy student network yields significantly better results. Additionally, our model isolates different regions of the face and processes them independently using a multi-level attention mechanism, which further boosts performance. Our results show that the proposed method achieves state-of-the-art performance on the benchmark datasets CK+ and AFEW 8.0 when compared to single models.
V. Kumar and S. Rao—Equal contribution.
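The following is a minimal sketch of the iterative noisy-student self-training loop summarized in the abstract, written in PyTorch. The helpers `make_student` and `train_fn`, the confidence `threshold`, and the number of `iterations` are illustrative assumptions rather than the authors' actual implementation; the "noise" (dropout, data augmentation) is assumed to be applied inside `train_fn` when the student is trained.

```python
# Hedged sketch of noisy-student self-training on labelled + unlabelled video data.
# Names and hyperparameters are assumptions for illustration only.
import torch
from torch.utils.data import ConcatDataset

def pseudo_label(teacher, unlabelled_loader, threshold=0.9):
    """Run the teacher on unlabelled clips (e.g. BoLD) and keep only
    confident predictions as pseudo-labelled examples."""
    teacher.eval()
    pseudo = []
    with torch.no_grad():
        for clips in unlabelled_loader:
            probs = torch.softmax(teacher(clips), dim=1)
            conf, labels = probs.max(dim=1)
            keep = conf > threshold                      # discard low-confidence samples
            pseudo += list(zip(clips[keep], labels[keep]))
    return pseudo                                        # a list works as an indexable dataset

def noisy_student_training(make_student, labelled_ds, unlabelled_loader,
                           train_fn, iterations=3):
    """Iteratively train a noised student on labelled + pseudo-labelled data,
    then promote it to teacher for the next round."""
    teacher = make_student()
    train_fn(teacher, labelled_ds)                       # initial teacher: labelled data only
    for _ in range(iterations):
        pseudo_ds = pseudo_label(teacher, unlabelled_loader)
        student = make_student()                         # noise (dropout, augmentation) applied in train_fn
        train_fn(student, ConcatDataset([labelled_ds, pseudo_ds]))
        teacher = student                                # student becomes the next teacher
    return teacher
```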
Acknowledgements
The authors thank Dr. James Wang for the opportunity to work on this project during his course on Artificial Emotion Intelligence at the Pennsylvania State University.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Kumar, V., Rao, S., Yu, L. (2020). Noisy Student Training Using Body Language Dataset Improves Facial Expression Recognition. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol. 12535. Springer, Cham. https://doi.org/10.1007/978-3-030-66415-2_53
DOI: https://doi.org/10.1007/978-3-030-66415-2_53
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66414-5
Online ISBN: 978-3-030-66415-2