Abstract
We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis (video, code, and demo: https://justusthies.github.io/posts/neural-voice-puppetry/). Given an audio sequence from a source person or digital assistant, we generate a photo-realistic output video of a target person whose facial motion is in sync with the source audio. This audio-driven facial reenactment is driven by a deep neural network that employs a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability, while we leverage neural rendering to generate photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor, or even with synthetic voices generated by standard text-to-speech approaches. Neural Voice Puppetry has a variety of use cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples, including comparisons to state-of-the-art techniques and a user study.
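The pipeline the abstract describes can be sketched in three stages: per-frame audio features are mapped into a latent 3D expression space, the expression sequence is temporally stabilized, and the resulting 3D face drives a neural renderer that produces photo-real frames. The following is a minimal illustrative sketch of the first two stages only; all names, dimensions, and the linear stand-in for the audio-to-expression network are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

AUDIO_DIM = 29  # e.g. per-frame speech-feature vector (illustrative size)
EXPR_DIM = 76   # latent 3D face expression coefficients (illustrative size)

# Stage 1: audio-to-expression mapping.
# A random linear map stands in for the learned network.
W_audio2expr = rng.standard_normal((EXPR_DIM, AUDIO_DIM)) * 0.01

def audio_to_expression(audio_feats: np.ndarray) -> np.ndarray:
    """Map per-frame audio features (T, AUDIO_DIM) to latent
    expression coefficients (T, EXPR_DIM)."""
    return audio_feats @ W_audio2expr.T

# Stage 2: temporal stabilization.
# A moving average over time stands in for the learned temporal filtering
# that keeps the predicted expressions free of frame-to-frame jitter.
def temporal_filter(expr_seq: np.ndarray, window: int = 5) -> np.ndarray:
    """Smooth each expression dimension over time with a moving average."""
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(expr_seq[:, d], kernel, mode="same")
         for d in range(expr_seq.shape[1])],
        axis=1,
    )

# Stage 3 (not shown): the smoothed coefficients would pose a 3D face model,
# and a neural renderer would turn its rasterization into photo-real frames.
T = 100  # number of video frames
audio_seq = rng.standard_normal((T, AUDIO_DIM))
expr_seq = temporal_filter(audio_to_expression(audio_seq))
```

The key design point this sketch mirrors is that audio never drives pixels directly: it only drives a compact expression code, which is what gives the method its person-generalization and temporal stability.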
Acknowledgments
We gratefully acknowledge the support by the AI Foundation, Google, Sony, a TUM-IAS Rudolf Mößbauer Fellowship, the ERC Starting Grant Scan2CAD (804724), the ERC Consolidator Grant 4DRepLy (770784), and a Google Faculty Award.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Thies, J., Elgharib, M., Tewari, A., Theobalt, C., Nießner, M. (2020). Neural Voice Puppetry: Audio-Driven Facial Reenactment. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12361. Springer, Cham. https://doi.org/10.1007/978-3-030-58517-4_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58516-7
Online ISBN: 978-3-030-58517-4
eBook Packages: Computer Science (R0)