Real-Time Facial Segmentation and Performance Capture from RGB Input

  • Shunsuke Saito
  • Tianye Li
  • Hao Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9912)


Abstract

We introduce the concept of unconstrained real-time 3D facial performance capture through explicit semantic segmentation in the RGB input. To ensure robustness, cutting-edge supervised learning approaches rely on large training datasets of face images captured in the wild. While impressive tracking quality has been demonstrated for faces that are largely visible, any occlusion due to hair, accessories, or hand-to-face gestures would result in significant visual artifacts and loss of tracking accuracy. The modeling of occlusions has been mostly avoided due to its immense space of appearance variability. To address this curse of high dimensionality, we perform tracking in unconstrained images assuming non-face regions can be fully masked out. Along with recent breakthroughs in deep learning, we demonstrate that pixel-level facial segmentation is possible in real-time by repurposing convolutional neural networks designed originally for general semantic segmentation. We develop an efficient architecture based on a two-stream deconvolution network with complementary characteristics, and introduce carefully designed training samples and data augmentation strategies for improved segmentation accuracy and robustness. We adopt a state-of-the-art regression-based facial tracking framework with segmented face images as training data, and demonstrate accurate and uninterrupted facial performance capture in the presence of extreme occlusion and even side views. Furthermore, the resulting segmentation can be directly used to composite partial 3D face models on the input images and enable seamless facial manipulation tasks, such as virtual make-up or face replacement.
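The segmentation mask drives the pipeline in two places: non-face pixels are masked out before the regression-based tracker sees the image, and the same mask alpha-blends a rendered partial face model back onto the input for compositing tasks. As a minimal illustrative sketch (the function names and the toy data below are assumptions, not the authors' implementation), this per-pixel masking and compositing can be written with NumPy broadcasting:

```python
import numpy as np

def mask_out_nonface(frame, face_mask, fill=0.0):
    """Replace every pixel the segmentation labels as non-face with a
    constant fill value, so a downstream regressor only sees face pixels.

    frame: (H, W, 3) float array in [0, 1]
    face_mask: (H, W) float array in [0, 1]; 1 where the face is visible,
    0 where it is occluded (hair, hands, accessories).
    """
    alpha = face_mask[..., None]  # broadcast the mask over RGB channels
    return alpha * frame + (1.0 - alpha) * fill

def composite_with_mask(frame, rendered_face, face_mask):
    """Alpha-blend a rendered partial 3D face model into the input frame,
    using the segmentation mask so occluders stay in front of the face."""
    alpha = face_mask[..., None]
    return alpha * rendered_face + (1.0 - alpha) * frame

# Toy example: a 4x4 grey frame whose right half is occluded.
frame = np.full((4, 4, 3), 0.5)   # grey input frame
render = np.ones((4, 4, 3))       # white "rendered face"
mask = np.zeros((4, 4))
mask[:, :2] = 1.0                 # left half classified as face

out = composite_with_mask(frame, render, mask)
```

Because the mask is soft (values in [0, 1]), the same blend yields seamless boundaries for manipulation tasks such as virtual make-up or face replacement, rather than a hard cut-out.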


Keywords: Real-time facial performance capture · Face segmentation · Deep convolutional neural network · Regression



Acknowledgments

We would like to thank Joseph J. Lim, Qixing Huang, Duygu Ceylan, Lingyu Wei, Kyle Olszewski, Harry Shum, and Gary Bradski for the fruitful discussions and the proofreading. We also thank Rui Saito and Frances Chen for being our capture models. This research is supported in part by Adobe, Oculus & Facebook, Sony, Pelican Imaging, Panasonic, Embodee, Huawei, the Google Faculty Research Award, The Okawa Foundation Research Grant, the Office of Naval Research (ONR)/U.S. Navy, under award number N00014-15-1-2639, the Office of the Director of National Intelligence (ODNI), and Intelligence Advanced Research Projects Activity (IARPA), under contract number 2014-14071600010. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Supplementary material

Supplementary material 1 (mov 26029 KB)

Supplementary material 2 (pdf 253 KB)



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Pinscreen, Santa Monica, USA
  2. University of Southern California, Los Angeles, USA
