
Neural Voice Puppetry: Audio-Driven Facial Reenactment

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12361)

Abstract

We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis (video, code, and demo: https://justusthies.github.io/posts/neural-voice-puppetry/). Given an audio sequence from a source person or digital assistant, we generate a photo-realistic output video of a target person whose facial motion is in sync with the source audio. This audio-driven facial reenactment is driven by a deep neural network that employs a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability, while neural rendering generates photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor, or even with synthetic voices generated by standard text-to-speech systems. Neural Voice Puppetry has a variety of use cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples, including comparisons to state-of-the-art techniques and a user study.
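
The abstract describes a two-stage pipeline: an audio-to-expression network maps per-frame speech features to coefficients of a latent 3D expression space, and a neural renderer turns the deformed 3D face model into photo-realistic frames. Below is a minimal PyTorch sketch of the first stage only. All module names, layer sizes, the 29-dimensional DeepSpeech-style input features, and the linear expression model are illustrative assumptions for exposition, not the authors' released implementation.

    import torch
    import torch.nn as nn

    class AudioToExpression(nn.Module):
        """Maps a short window of per-frame audio features (e.g. DeepSpeech
        character logits) to coefficients of a latent 3D expression basis."""
        def __init__(self, feat_dim=29, window=16, n_expr=64):
            super().__init__()
            # Temporal convolutions over the window provide the smoothness /
            # temporal stability the abstract attributes to the 3D space.
            self.conv = nn.Sequential(
                nn.Conv1d(feat_dim, 32, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2),
                nn.Conv1d(32, 64, kernel_size=3, stride=2, padding=1),
                nn.LeakyReLU(0.2),
            )
            self.fc = nn.Linear(64 * (window // 2), n_expr)

        def forward(self, audio_feats):                   # (B, window, feat_dim)
            h = self.conv(audio_feats.transpose(1, 2))    # (B, 64, window // 2)
            return self.fc(h.flatten(1))                  # (B, n_expr)

    def deform_mesh(mean_shape, expr_basis, coeffs):
        # Linear 3D face model: vertices = mean + sum_i coeff_i * basis_i.
        # mean_shape: (V*3,), expr_basis: (n_expr, V*3), coeffs: (B, n_expr).
        return mean_shape + coeffs @ expr_basis           # (B, V*3)

    # Example: one window of audio features drives one frame of 3D expression;
    # a neural renderer (not shown) would then synthesize the output image.
    # V = 5023 matches the FLAME face model, used here only as an example.
    net = AudioToExpression()
    coeffs = net(torch.randn(1, 16, 29))
    verts = deform_mesh(torch.zeros(5023 * 3), torch.randn(64, 5023 * 3), coeffs)

Driving a shared low-dimensional expression space, rather than pixels directly, is what lets the audio stage generalize across source voices and target actors; in the paper, only the rendering stage is trained per target person.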


Notes

  1. https://text-to-speech-demo.ng.bluemix.net/.


Acknowledgments

We gratefully acknowledge support from the AI Foundation, Google, Sony, a TUM-IAS Rudolf Mößbauer Fellowship, the ERC Starting Grant Scan2CAD (804724), the ERC Consolidator Grant 4DRepLy (770784), and a Google Faculty Award.

Author information


Corresponding author

Correspondence to Justus Thies.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 4108 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Thies, J., Elgharib, M., Tewari, A., Theobalt, C., Nießner, M. (2020). Neural Voice Puppetry: Audio-Driven Facial Reenactment. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12361. Springer, Cham. https://doi.org/10.1007/978-3-030-58517-4_42


  • DOI: https://doi.org/10.1007/978-3-030-58517-4_42


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58516-7

  • Online ISBN: 978-3-030-58517-4

  • eBook Packages: Computer Science (R0)
