Fast Bi-Layer Neural Synthesis of One-Shot Realistic Head Avatars

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12357)

Abstract

We propose a neural rendering-based system that creates head avatars from a single photograph. Our approach models a person’s appearance by decomposing it into two layers. The first layer is a pose-dependent coarse image that is synthesized by a small neural network. The second layer is defined by a pose-independent texture image that contains high-frequency details. The texture image is generated offline, warped and added to the coarse image to ensure a high effective resolution of synthesized head views. We compare our system to analogous state-of-the-art systems in terms of visual quality and speed. The experiments show significant inference speedup over previous neural head avatar models for a given visual quality. We also report on a real-time smartphone-based implementation of our system.
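The two-layer composition described in the abstract (a coarse pose-dependent image plus a warped pose-independent texture) can be sketched as follows. This is an illustrative toy, not the paper's implementation: the actual system predicts the warp field with a neural network and uses differentiable bilinear sampling, whereas this sketch uses a hand-supplied flow field, nearest-neighbor sampling, and a hypothetical function name `compose_bilayer`:

```python
import numpy as np

def compose_bilayer(coarse, texture, flow):
    """Add a warped high-frequency texture layer to a coarse image.

    coarse:  (H, W, 3) pose-dependent coarse layer
    texture: (H, W, 3) pose-independent texture layer
    flow:    (H, W, 2) per-pixel (dy, dx) sampling offsets
    """
    H, W, _ = coarse.shape
    out = coarse.copy()
    for y in range(H):
        for x in range(W):
            # Nearest-neighbor sampling of the texture, clamped to bounds.
            sy = min(max(int(round(y + flow[y, x, 0])), 0), H - 1)
            sx = min(max(int(round(x + flow[y, x, 1])), 0), W - 1)
            out[y, x] += texture[sy, sx]
    return out

# With zero flow, the texture is added without displacement.
coarse = np.zeros((4, 4, 3))
texture = np.full((4, 4, 3), 0.5)
flow = np.zeros((4, 4, 2))
result = compose_bilayer(coarse, texture, flow)
```

Generating the texture once offline and reusing it across poses is what lets the per-frame network stay small while the output retains high-frequency detail.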

Keywords

Neural avatars · Talking heads · Neural rendering · Head synthesis · Head animation

Supplementary material

Supplementary material 1: 504453_1_En_31_MOESM1_ESM.pdf (9.3 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Samsung AI Center – Moscow, Moscow, Russia
  2. Skolkovo Institute of Science and Technology, Moscow, Russia
  3. University of Cambridge, Cambridge, UK
