Towards Photo-Realistic Facial Expression Manipulation

Abstract

We present a method for photo-realistic face manipulation. Given a single RGB face image with an arbitrary expression, our method can synthesize an image of the same person with any other expression. To achieve this, we first fit a 3D face model and disentangle the face into its texture and shape. We then train separate networks in each of these spaces. In texture space, we use a conditional generative network to change the appearance, and carefully design the input format and loss functions to achieve the best results. In shape space, we use a fully connected network to predict an accurate face shape. When available, the shape branch uses depth data for supervision. Both networks are conditioned on expression coefficients rather than discrete labels, allowing us to generate an unlimited number of expressions. Furthermore, we adopt spatially adaptive denormalization on our texture space representation to improve the quality of the synthesized results. We show the superiority of this disentangling approach through both quantitative and qualitative studies. The proposed method does not require paired data, and is trained on an in-the-wild dataset of videos of talking people. To make this possible, we present a simple yet efficient method to select appropriate key frames from these videos. In a user study, our method is preferred in 83.2% of cases when compared to state-of-the-art alternative approaches.
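To make the spatially adaptive denormalization step concrete: the idea (following Park et al.'s SPADE) is to normalize activations per channel and then modulate them with a per-pixel scale and shift derived from a conditioning input, here the texture-space representation. The sketch below is a minimal NumPy illustration under simplifying assumptions: the modulation maps `gamma_map` and `beta_map` are passed in directly rather than predicted by convolutions over the conditioning input, and the function name is hypothetical, not the authors' implementation.

```python
import numpy as np

def spade_denorm(x, gamma_map, beta_map, eps=1e-5):
    """SPADE-style spatially adaptive denormalization (illustrative sketch).

    x:         activations, shape (C, H, W)
    gamma_map: per-pixel scale, shape (C, H, W); in SPADE this is
               predicted by small convolutions over the conditioning input
    beta_map:  per-pixel shift, shape (C, H, W)
    """
    # Normalize each channel to zero mean / unit variance (instance norm).
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mean) / (std + eps)
    # Modulate with spatially varying scale and shift, so the conditioning
    # signal can act differently at every spatial location.
    return gamma_map * x_norm + beta_map

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
gamma = np.full_like(x, 2.0)   # uniform maps here; spatially varying in general
beta = np.zeros_like(x)
y = spade_denorm(x, gamma, beta)
```

The key difference from plain batch/instance normalization is that `gamma_map` and `beta_map` vary per pixel, which lets the conditioning signal (e.g. the texture representation) steer the generator locally instead of with a single scalar per channel.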




Author information


Corresponding author

Correspondence to Chen Cao.

Additional information


Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.


About this article


Cite this article

Geng, Z., Cao, C. & Tulyakov, S. Towards Photo-Realistic Facial Expression Manipulation. Int J Comput Vis 128, 2744–2761 (2020). https://doi.org/10.1007/s11263-020-01361-8


Keywords

  • Generative adversarial network
  • Graphics