Abstract
One-shot talking face generation aims to synthesize a high-quality talking face video from an arbitrary portrait image, driven by a video or an audio segment. In this work, we provide a solution from a novel perspective that differs from existing frameworks. We first investigate the latent feature space of a pre-trained StyleGAN and discover that it exhibits excellent spatial transformation properties. Based on this observation, we propose a unified framework built on a pre-trained StyleGAN that enables a set of powerful functionalities: high-resolution video generation, disentangled control by a driving video or audio, and flexible face editing. Our framework elevates the resolution of the synthesized talking face to 1024 × 1024 for the first time, even though the training dataset has a lower resolution. Moreover, our framework supports two types of facial editing: global editing via GAN inversion and intuitive editing via 3D morphable models. Comprehensive experiments show superior video quality and flexible controllability over state-of-the-art methods. Code is available at https://github.com/FeiiYin/StyleHEAT.
Work done during an internship at Tencent AI Lab.
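To make the key observation concrete, below is a minimal sketch of spatially transforming a StyleGAN intermediate feature map with a dense flow field, the property the framework builds on. It is illustrative only, not the authors' implementation: the helper name `warp_feature` and the assumption that a `(B, 2, H, W)` flow in normalized coordinates is predicted from the driving video or audio are ours.

```python
import torch
import torch.nn.functional as F

def warp_feature(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Spatially warp a StyleGAN intermediate feature map (hypothetical helper).

    feat: (B, C, H, W) feature map taken from an early generator layer.
    flow: (B, 2, H, W) per-pixel offsets in normalized [-1, 1] coordinates,
          assumed here to be predicted from the driving video or audio.
    """
    b, _, h, w = feat.shape
    # Identity sampling grid in normalized coordinates, shape (B, H, W, 2).
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=feat.device),
        torch.linspace(-1.0, 1.0, w, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, h, w, 2)
    # Shift each sampling location by the flow, then bilinearly resample.
    grid = grid + flow.permute(0, 2, 3, 1)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

# The warped features would then be fed to the remaining generator layers,
# which render a high-resolution frame consistent with the driving motion.
```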
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under grant No. 61991450 and the Shenzhen Key Laboratory of Marine IntelliSense and Computation under grant No. ZDSYS20200811142605016. Baoyuan Wu is supported by the Shenzhen Science and Technology Program under grant No. ZDSYS20211021111415025.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Yin, F. et al. (2022). StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13677. Springer, Cham. https://doi.org/10.1007/978-3-031-19790-1_6