Deep, Landmark-Free FAME: Face Alignment, Modeling, and Expression Estimation
Abstract
We present a novel method for modeling 3D face shape, viewpoint, and expression from a single, unconstrained photo. Our method uses three deep convolutional neural networks, one to estimate each of these components. Importantly, unlike other methods, ours does not use facial landmark detection at test time; instead, it estimates these properties directly from image intensities. In fact, rather than using detectors, we show how accurate landmarks can be obtained as a by-product of our modeling process. We rigorously test our proposed method. To this end, we raise a number of concerns with existing practices used in evaluating face landmark detection methods. In response to these concerns, we propose novel paradigms for testing the effectiveness of rigid and non-rigid face alignment methods without relying on landmark detection benchmarks. We evaluate rigid face alignment by measuring its effects on face recognition accuracy on the challenging IJB-A and IJB-B benchmarks. Non-rigid expression estimation is tested on the CK+ and EmotiW’17 benchmarks for emotion classification. We do, however, report the accuracy of our approach as a landmark detector for 3D landmarks on AFLW2000-3D and 2D landmarks on 300W and AFLW-PIFA. A surprising conclusion of these results is that better landmark detection accuracy does not necessarily translate to better face processing. Parts of this paper were previously published by Tran et al. (2017) and Chang et al. (2017, 2018).
Keywords
3D face modeling, Face alignment, Facial expression estimation, Facial landmark detection
Acknowledgements
This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA 2014-14071600011. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purpose notwithstanding any copyright annotation thereon.
References
- Artizzu, X. P., Perona, P., & Dollár, P. (2013). Robust face landmark estimation under occlusion. In Proceedings of the international conference on computer vision.
- Asthana, A., Zafeiriou, S., Cheng, S., & Pantic, M. (2014). Incremental face alignment in the wild. In Proceedings of the conference on computer vision pattern recognition.
- Baltrusaitis, T., Robinson, P., & Morency, L. P. (2013). Constrained local neural fields for robust facial landmark detection in the wild. In Proceedings of the conference on computer vision pattern recognition workshops.
- Baltrušaitis, T., Robinson, P., & Morency, L. P. (2016). Openface: An open source facial behavior analysis toolkit. In Winter conference on applications of computer vision.
- Bansal, A., Russell, B., & Gupta, A. (2016). Marr revisited: 2D-3D alignment via surface normal prediction. In Proceedings of the conference on computer vision pattern recognition.
- Bas, A., Smith, W. A. P., Bolkart, T., & Wuhrer, S. (2016). Fitting a 3D morphable model to edges: A comparison between hard and soft correspondences. In ACCV workshops.
- Belhumeur, P. N., Jacobs, D. W., Kriegman, D. J., & Kumar, N. (2013). Localizing parts of faces using a consensus of exemplars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12), 2930–2940.
- Bhagavatula, C., Zhu, C., Luu, K., & Savvides, M. (2017). Faster than real-time facial alignment: A 3D spatial transformer network approach in unconstrained poses. In Proceedings of the international conference on computer vision.
- Blanz, V., & Vetter, T. (1999). Morphable model for the synthesis of 3D faces. In Proceedings of ACM SIGGRAPH conference on computer graphics.
- Blanz, V., & Vetter, T. (2003). Face recognition based on fitting a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9), 1063–1074.
- Blanz, V., Romdhani, S., & Vetter, T. (2002). Face identification across different poses and illuminations with a 3d morphable model. In International conference on automatic face and gesture recognition.
- Blanz, V., Scherbaum, K., Vetter, T., & Seidel, H. P. (2004). Exchanging faces in images. Computer Graphics Forum, 23(3), 669–676.
- Booth, J., Antonakos, E., Ploumpis, S., Trigeorgis, G., Panagakis, Y., & Zafeiriou, S. (2017). 3D face morphable models “in-the-wild”. In Proceedings of conference on computer vision pattern recognition.
- Bulat, A., & Tzimiropoulos, G. (2017a). Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In Proceedings of the international conference on computer vision.
- Bulat, A., & Tzimiropoulos, G. (2017b). How far are we from solving the 2d and 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the international conference on computer vision.
- Cao, X., Wei, Y., Wen, F., & Sun, J. (2014). Face alignment by explicit shape regression. International Journal of Computer Vision, 107(2), 177–190.
- Chang, F. J., Tran, A., Hassner, T., Masi, I., Nevatia, R., & Medioni, G. (2017). Faceposenet: Making a case for landmark-free face alignment. In Proceedings of international conference on computer vision workshops.
- Chang, F. J., Tran, A. T., Hassner, T., Masi, I., Nevatia, R., & Medioni, G. (2018). Expnet: Landmark-free, deep, 3D facial expressions. In International conference on automatic face and gesture recognition.
- Chu, B., Romdhani, S., & Chen, L. (2014). 3D-aided face recognition robust to expression and pose variations. In Proceedings of conference on computer vision pattern recognition.
- Crosswhite, N., Byrne, J., Stauffer, C., Parkhi, O., Cao, Q., & Zisserman, A. (2017). Template adaptation for face verification and identification. In International conference on automatic face and gesture recognition.
- Dantone, M., Gall, J., Fanelli, G., & Van Gool, L. (2012). Real-time facial feature detection using conditional regression forests. In Proceedings of conference on computer vision pattern recognition.
- Dhall, A., Goecke, R., Lucey, S., & Gedeon, T. (2012). Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia, 19(3), 34–41.
- Dhall, A., Goecke, R., Ghosh, S., Joshi, J., Hoey, J., & Gedeon, T. (2017). From individual to group-level emotion recognition: Emotiw 5.0. In ACM ICMI.
- Dhall, A., Murthy, O. R., Goecke, R., Joshi, J., & Gedeon, T. (2015). Video and image based emotion recognition challenges in the wild: EmotiW 2015. In ACM ICMI.
- Dong, X., Yan, Y., Ouyang, W., & Yang, Y. (2018a). Style aggregated network for facial landmark detection. In Proceedings of conference on computer vision pattern recognition.
- Dong, X., Yu, S. I., Weng, X., Wei, S. E., Yang, Y., & Sheikh, Y. (2018b). Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors. In Proceedings of conference on computer vision pattern recognition.
- Dong, X., Zheng, L., Ma, F., Yang, Y., & Meng, D. (2018c). Few-example object detection with model communication. IEEE Transactions on Pattern Analysis & Machine Intelligence. https://doi.org/10.1109/TPAMI.2018.2844853.
- Dou, P., Shah, S. K., & Kakadiaris, I. A. (2017). End-to-end 3D face reconstruction with deep neural networks. In Proceedings of conference on computer vision pattern recognition.
- Eidinger, E., Enbar, R., & Hassner, T. (2014). Age and gender estimation of unfiltered faces. IEEE Transactions on Information Forensics and Security, 9(12), 2170–2179.
- Everingham, M., Sivic, J., & Zisserman, A. (2006). “Hello! My name is... Buffy” – Automatic naming of characters in TV video. In Proceedings of British machine vision conference.
- Fabian Benitez-Quiroz, C., Srinivasan, R., & Martinez, A. M. (2016). Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of conference on computer vision pattern recognition.
- Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge: Cambridge University Press.
- Hassner, T. (2013). Viewing real-world faces in 3D. In Proceedings of the international conference on computer vision. Available www.openu.ac.il/home/hassner/projects/poses.
- Hassner, T., & Basri, R. (2006). Example based 3D reconstruction from single 2D images. In Proceedings of conference on computer vision pattern recognition workshops.
- Hassner, T., Harel, S., Paz, E., & Enbar, R. (2015). Effective face frontalization in unconstrained images. In Proceedings of conference on computer vision pattern recognition.
- Hassner, T., Masi, I., Kim, J., Choi, J., Harel, S., Natarajan, P., & Medioni, G. (2016). Pooling faces: Template based face recognition with pooled face images. In Proceedings of conference on computer vision pattern recognition workshops.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of conference on computer vision pattern recognition.
- Huang, G. B., Jain, V., & Learned-Miller, E. (2007). Unsupervised joint alignment of complex images. In Proceedings of the international conference on computer vision.
- Huber, P., Hu, G., Tena, R., Mortazavian, P., Koppen, W., Christmas, W., Rätsch, M., & Kittler, J. (2016). A multiresolution 3D morphable face model and fitting framework. In VISAPP.
- Jackson, A. S., Bulat, A., Argyriou, V., & Tzimiropoulos, G. (2017). Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. In Proceedings of the international conference on computer vision.
- Jeni, L. A., Cohn, J. F., & Kanade, T. (2015). Dense 3D face alignment from 2D videos in real-time. In International conference on automatic face and gesture recognition.
- Jourabloo, A., & Liu, X. (2015). Pose-invariant 3d face alignment. In Proceedings of conference on computer vision pattern recognition.
- Jourabloo, A., & Liu, X. (2016). Large-pose face alignment via cnn-based dense 3D model fitting. In Proceedings of conference on computer vision pattern recognition.
- Kazemi, V., & Sullivan, J. (2014). One millisecond face alignment with an ensemble of regression trees. In Proceedings of conference on computer vision pattern recognition.
- Kemelmacher-Shlizerman, I., & Basri, R. (2011). 3D face reconstruction from a single image using a single reference face shape. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2), 394–405.
- King, D. E. (2009). Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10, 1755–1758.
- Klare, B. F., Klein, B., Taborsky, E., Blanton, A., Cheney, J., Allen, K., Grother, P., Mah, A., Burge, M., & Jain, A. K. (2015). Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark-A. In Proceedings of conference on computer vision pattern recognition.
- Kosti, R., Alvarez, J. M., Recasens, A., & Lapedriza, A. (2017). Emotion recognition in context. In Proceedings of conference on computer vision pattern recognition.
- Köstinger, M., Wohlhart, P., Roth, P. M., & Bischof, H. (2011). Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In Proceedings of the international conference on computer vision workshops.
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Neural information processing systems.
- Kumar, A., Alavi, A., & Chellappa, R. (2017). Kepler: Keypoint and pose estimation of unconstrained faces by learning efficient h-cnn regressors. In Automatic face and gesture recognition.
- Kumar, A., & Chellappa, R. (2018). Disentangling 3D pose in a dendritic cnn for unconstrained 2d face alignment. In Proceedings of conference on computer vision pattern recognition.
- Le, V., Brandt, J., Lin, Z., Bourdev, L., & Huang, T. (2012). Interactive facial feature localization. In European conference on computer vision.
- Levi, G., & Hassner, T. (2015). Emotion recognition in the wild via convolutional neural networks and mapped binary patterns. In ACM ICMI.
- Li, C., Zhou, K., & Lin, S. (2014). Intrinsic face image decomposition with human face priors. In European conference on computer vision.
- Liu, Y., Jourabloo, A., Ren, W., & Liu, X. (2017). Dense face alignment. In Proceedings of conference on computer vision pattern recognition.
- Liu, Z., Luo, P., Wang, X., & Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of the international conference on computer vision.
- Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar, Z., & Matthews, I. (2010). The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In Proceedings of conference on computer vision pattern recognition workshops.
- Masi, I., Ferrari, C., Del Bimbo, A., & Medioni, G. (2014). Pose independent face recognition by localizing local binary patterns via deformation components. In International conference on pattern recognition (pp. 4477–4482). IEEE.
- Masi, I., Chang, F. J., Choi, J., Harel, S., Kim, J., Kim, K., et al. (2018a). Learning pose-aware models for pose-invariant face recognition in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 379–393.
- Masi, I., Hassner, T., Tran, A. T., & Medioni, G. (2017). Rapid synthesis of massive face sets for improved face recognition. In International conference on automatic face and gesture recognition (pp. 604–611). IEEE.
- Masi, I., Rawls, S., Medioni, G., & Natarajan, P. (2016a). Pose-aware face recognition in the wild. In Proceedings of conference on computer vision pattern recognition.
- Masi, I., Tran, A., Hassner, T., Leksut, J. T., & Medioni, G. (2016b). Do we really need to collect millions of faces for effective face recognition? In European conference on computer vision. Available www.openu.ac.il/home/hassner/projects/augmented_faces.
- Masi, I., Wu, Y., Hassner, T., & Natarajan, P. (2018b). Deep face recognition: A survey. In Conference on graphics, patterns and images.
- Parkhi, O. M., Vedaldi, A., & Zisserman, A. (2015). Deep face recognition. In Proceedings of British machine vision conference.
- Paysan, P., Knothe, R., Amberg, B., Romdhani, S., & Vetter, T. (2009). A 3D face model for pose and illumination invariant face recognition. In International conference on advanced video and signal based surveillance.
- Poirson, P., Ammirato, P., Fu, C. Y., Liu, W., Kosecka, J., & Berg, A. C. (2016). Fast single shot detection and pose estimation. In 3DV.
- Ranjan, R., Castillo, C. D., & Chellappa, R. (2017). L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507.
- Ren, S., Cao, X., Wei, Y., & Sun, J. (2014). Face alignment at 3000 fps via regressing local binary features. In Proceedings of conference on computer vision pattern recognition.
- Richardson, E., Sela, M., & Kimmel, R. (2016). 3d face reconstruction by learning from synthetic data. In 3DV.
- Richardson, E., Sela, M., Or-El, R., & Kimmel, R. (2017). Learning detailed face reconstruction from a single image. In Proceedings of conference on computer vision pattern recognition.
- Romdhani, S., & Vetter, T. (2003). Efficient, robust and accurate fitting of a 3D morphable model. In Proceedings of the international conference on computer vision.
- Romdhani, S., & Vetter, T. (2005). Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In Proceedings of conference on computer vision pattern recognition.
- Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., & Pantic, M. (2013). 300 faces in-the-wild challenge: The first facial landmark localization challenge. In Proceedings of conference on computer vision pattern recognition workshops.
- Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., & Pantic, M. (2016). 300 faces in-the-wild challenge: Database and results. Image and Vision Computing, 47, 3–18.
- Sela, M., Richardson, E., & Kimmel, R. (2017). Unrestricted facial geometry reconstruction using image-to-image translation. In Proceedings of the international conference on computer vision.
- Sengupta, S., Kanazawa, A., Castillo, C. D., & Jacobs, D. (2018). SfSNet: Learning shape, reflectance and illuminance of faces in the wild. In Proceedings of conference on computer vision pattern recognition.
- Su, H., Qi, C. R., Li, Y., & Guibas, L. J. (2015). Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In Proceedings of the international conference on computer vision.
- Surace, L., Patacchiola, M., Battini Sönmez, E., Spataro, W., & Cangelosi, A. (2017). Emotion recognition in the wild using deep neural networks and Bayesian classifiers. In ACM ICMI.
- Tang, H., Hu, Y., Fu, Y., Hasegawa-Johnson, M., & Huang, T. S. (2008). Real-time conversion from a single 2d face image to a 3D text-driven emotive audio-visual avatar. In International conference on multimedia and expo.
- Tewari, A., Zollhöfer, M., Garrido, P., Bernard, F., Kim, H., Pérez, P., & Theobalt, C. (2018). Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In Proceedings of conference on computer vision pattern recognition.
- Tran, A., Hassner, T., Masi, I., & Medioni, G. (2017). Regressing robust and discriminative 3D morphable models with a very deep neural network. In Proceedings of conference on computer vision pattern recognition.
- Tran, A. T., Hassner, T., Masi, I., Paz, E., Nirkin, Y., & Medioni, G. (2018). Extreme 3D face reconstruction: Looking past occlusions. In Proceedings of conference on computer vision pattern recognition.
- Vetter, T., & Blanz, V. (1998). Estimating coloured 3D face models from single images: An example based approach. In European conference on computer vision.
- Whitelam, C., Taborsky, E., Blanton, A., Maze, B., Adams, J., Miller, T., Kalka, N., Jain, A. K., Duncan, J. A., Allen, K., et al. (2017). Iarpa janus benchmark-b face dataset. In Proceedings of conference on computer vision pattern recognition workshops.
- Wolf, L., Hassner, T., & Maoz, I. (2011). Face recognition in unconstrained videos with matched background similarity. In Proceedings of conference on computer vision pattern recognition.
- Wu, Y., Hassner, T., Kim, K., Medioni, G., & Natarajan, P. (2017). Facial landmark detection with tweaked convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 3067–3074.
- Xiang, Y., Mottaghi, R., & Savarese, S. (2014). Beyond pascal: A benchmark for 3D object detection in the wild. In Winter conference on applications of computer vision.
- Xiang, Y., Kim, W., Chen, W., Ji, J., Choy, C., Su, H., Mottaghi, R., Guibas, L., & Savarese, S. (2016). Objectnet3D: A large scale database for 3D object recognition. In European conference on computer vision.
- Xie, L., Wang, J., Wei, Z., Wang, M., & Tian, Q. (2016). Disturblabel: Regularizing cnn on the loss layer. In Proceedings of conference on computer vision pattern recognition (pp. 4753–4762).
- Xiong, X., & De la Torre, F. (2013). Supervised descent method and its applications to face alignment. In Proceedings of conference on computer vision pattern recognition.
- Yang, Z., & Nevatia, R. (2016). A multi-scale cascade fully convolutional network face detector. In ICPR.
- Yang, F., Wang, J., Shechtman, E., Bourdev, L., & Metaxas, D. (2011). Expression flow for 3D-aware face component transfer. ACM Transactions on Graphics, 30(4), 60.
- Yi, D., Lei, Z., Liao, S., & Li, S. Z. (2014). Learning face representation from scratch. arXiv preprint arXiv:1411.7923. Available http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html.
- Yu, X., Huang, J., Zhang, S., Yan, W., & Metaxas, D. N. (2013). Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model. In Proceedings of the international conference on computer vision (pp. 1944–1951). IEEE.
- Zadeh, A., Baltrušaitis, T., & Morency, L. P. (2016). Deep constrained local models for facial landmark detection. arXiv preprint arXiv:1611.08657.
- Zafeiriou, S., Chrysos, G. G., Roussos, A., Ververas, E., Deng, J., & Trigeorgis, G. (2017). The 3D menpo facial landmark tracking challenge. In Proceedings of international conference on computer vision workshops.
- Zafeiriou, S., Papaioannou, A., Kotsia, I., Nicolaou, M., & Zhao, G. (2016). Facial affect “in-the-wild”. In Proceedings of conference on computer vision pattern recognition workshops (pp. 36–47).
- Zhang, J., Shan, S., Kan, M., & Chen, X. (2014). Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In European conference on computer vision. Springer.
- Zhang, K., Tan, L., Li, Z., & Qiao, Y. (2016). Gender and smile classification using deep convolutional neural networks. In Proceedings of conference on computer vision pattern recognition workshops (pp. 34–38).
- Zhu, S., Li, C., Change Loy, C., & Tang, X. (2015a). Face alignment by coarse-to-fine shape searching. In Proceedings of conference on computer vision pattern recognition.
- Zhu, S., Li, C., Loy, C. C., & Tang, X. (2016a). Unconstrained face alignment via cascaded compositional learning. In Proceedings of conference on computer vision pattern recognition.
- Zhu, X., Lei, Z., Liu, X., Shi, H., & Li, S. (2016b). Face alignment across large poses: A 3D solution. In Proceedings of conference on computer vision pattern recognition.
- Zhu, X., Lei, Z., Yan, J., Yi, D., & Li, S. Z. (2015b). High-fidelity pose and expression normalization for face recognition in the wild. In Proceedings of conference on computer vision pattern recognition (pp. 787–796).
- Zhu, X., & Ramanan, D. (2012). Face detection, pose estimation, and landmark localization in the wild. In Proceedings of conference on computer vision pattern recognition.