Abstract
We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis (video, code, and demo: https://justusthies.github.io/posts/neural-voice-puppetry/). Given an audio sequence from a source person or digital assistant, we generate a photo-realistic output video of a target person whose facial motion is in sync with the source audio. This audio-driven facial reenactment is driven by a deep neural network that employs a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability, while we leverage neural rendering to generate photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor, or even with synthetic voices generated by standard text-to-speech approaches. Neural Voice Puppetry has a variety of use cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples, including comparisons to state-of-the-art techniques and a user study.
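The pipeline the abstract describes can be sketched in three stages: per-frame audio features are mapped into a latent 3D expression space, the expression sequence is temporally stabilized, and the resulting 3D face drives a neural renderer that produces photo-real frames. The following is a minimal illustrative sketch of the first two stages only; all names, dimensions, and the linear stand-in for the audio-to-expression network are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

AUDIO_DIM = 29  # e.g. per-frame speech-feature vector (illustrative size)
EXPR_DIM = 76   # latent 3D face expression coefficients (illustrative size)

# Stage 1: audio-to-expression mapping.
# A random linear map stands in for the learned network.
W_audio2expr = rng.standard_normal((EXPR_DIM, AUDIO_DIM)) * 0.01

def audio_to_expression(audio_feats: np.ndarray) -> np.ndarray:
    """Map per-frame audio features (T, AUDIO_DIM) to latent
    expression coefficients (T, EXPR_DIM)."""
    return audio_feats @ W_audio2expr.T

# Stage 2: temporal stabilization.
# A moving average over time stands in for the learned temporal filtering
# that keeps the predicted expressions free of frame-to-frame jitter.
def temporal_filter(expr_seq: np.ndarray, window: int = 5) -> np.ndarray:
    """Smooth each expression dimension over time with a moving average."""
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(expr_seq[:, d], kernel, mode="same")
         for d in range(expr_seq.shape[1])],
        axis=1,
    )

# Stage 3 (not shown): the smoothed coefficients would pose a 3D face model,
# and a neural renderer would turn its rasterization into photo-real frames.
T = 100  # number of video frames
audio_seq = rng.standard_normal((T, AUDIO_DIM))
expr_seq = temporal_filter(audio_to_expression(audio_seq))
```

The key design point this sketch mirrors is that audio never drives pixels directly: it only drives a compact expression code, which is what gives the method its person-generalization and temporal stability.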
Acknowledgments
We gratefully acknowledge the support by the AI Foundation, Google, Sony, a TUM-IAS Rudolf Mößbauer Fellowship, the ERC Starting Grant Scan2CAD (804724), the ERC Consolidator Grant 4DRepLy (770784), and a Google Faculty Award.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Thies, J., Elgharib, M., Tewari, A., Theobalt, C., Nießner, M. (2020). Neural Voice Puppetry: Audio-Driven Facial Reenactment. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12361. Springer, Cham. https://doi.org/10.1007/978-3-030-58517-4_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58516-7
Online ISBN: 978-3-030-58517-4
eBook Packages: Computer Science (R0)