Abstract
One-shot talking face generation aims to synthesize a high-quality talking face video from an arbitrary portrait image, driven by a video or an audio segment. In this work, we provide a solution from a novel perspective that differs from existing frameworks. We first investigate the latent feature space of a pre-trained StyleGAN and discover that it exhibits excellent spatial transformation properties. Based on this observation, we propose a unified framework built on a pre-trained StyleGAN that enables a set of powerful functionalities: high-resolution video generation, disentangled control by a driving video or audio, and flexible face editing. Our framework elevates the resolution of the synthesized talking face to 1024 × 1024 for the first time, even though the training dataset has a lower resolution. Moreover, our framework supports two types of facial editing: global editing via GAN inversion and intuitive editing via 3D morphable models. Comprehensive experiments show superior video quality and flexible controllability over state-of-the-art methods. Code is available at https://github.com/FeiiYin/StyleHEAT.
Work done during an internship at Tencent AI Lab.
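To make the key observation concrete, below is a minimal sketch of spatially transforming a StyleGAN intermediate feature map with a dense flow field, the property the framework builds on. It is illustrative only, not the authors' implementation: the helper name `warp_feature` and the assumption that a `(B, 2, H, W)` flow in normalized coordinates is predicted from the driving video or audio are ours.

```python
import torch
import torch.nn.functional as F

def warp_feature(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Spatially warp a StyleGAN intermediate feature map (hypothetical helper).

    feat: (B, C, H, W) feature map taken from an early generator layer.
    flow: (B, 2, H, W) per-pixel offsets in normalized [-1, 1] coordinates,
          assumed here to be predicted from the driving video or audio.
    """
    b, _, h, w = feat.shape
    # Identity sampling grid in normalized coordinates, shape (B, H, W, 2).
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=feat.device),
        torch.linspace(-1.0, 1.0, w, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, h, w, 2)
    # Shift each sampling location by the flow, then bilinearly resample.
    grid = grid + flow.permute(0, 2, 3, 1)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

# The warped features would then be fed to the remaining generator layers,
# which render a high-resolution frame consistent with the driving motion.
```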
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under grant No. 61991450 and the Shenzhen Key Laboratory of Marine IntelliSense and Computation under grant No. ZDSYS20200811142605016. Baoyuan Wu is supported by the Shenzhen Science and Technology Program under grant No. ZDSYS20211021111415025.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Yin, F. et al. (2022). StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13677. Springer, Cham. https://doi.org/10.1007/978-3-031-19790-1_6