
Neural Voice Puppetry: Audio-Driven Facial Reenactment

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12361)

Abstract

We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis (video, code, and demo: https://justusthies.github.io/posts/neural-voice-puppetry/). Given an audio sequence of a source person or digital assistant, we generate a photo-realistic output video of a target person that is in sync with the audio of the source input. The reenactment is driven by a deep neural network that operates in a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability, while we leverage neural rendering to generate photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor, or even with synthetic voices generated by standard text-to-speech approaches. Neural Voice Puppetry has a variety of use-cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples, including comparisons to state-of-the-art techniques and a user study.
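The abstract describes a two-stage pipeline: a network maps audio features to coefficients in a latent 3D face model space, and a neural renderer turns the resulting 3D representation into photo-realistic frames. The following is a minimal, illustrative PyTorch sketch of that structure; all module names, feature dimensions, and layer choices are assumptions made for illustration and are not the authors' implementation.

# Illustrative sketch only: maps a short window of audio features to latent
# 3D face-model expression coefficients, then decodes them into an image.
# Module names, feature sizes, and layer choices are assumptions made for
# illustration; this is not the authors' implementation.
import torch
import torch.nn as nn

class Audio2Expression(nn.Module):
    """Predicts latent 3D expression coefficients from per-frame audio features."""
    def __init__(self, audio_dim=29, expr_dim=64, hidden=128):
        super().__init__()
        # temporal convolutions over the audio window encourage smooth, stable outputs
        self.temporal = nn.Sequential(
            nn.Conv1d(audio_dim, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
        )
        self.head = nn.Linear(hidden, expr_dim)

    def forward(self, audio_window):                      # (B, T, audio_dim)
        x = self.temporal(audio_window.transpose(1, 2))   # (B, hidden, T)
        x = x.mean(dim=2)                                 # pool over the window
        return self.head(x)                               # (B, expr_dim)

class NeuralRenderer(nn.Module):
    """Toy decoder standing in for the photo-realistic neural renderer."""
    def __init__(self, expr_dim=64, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.decode = nn.Sequential(
            nn.Linear(expr_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * img_size * img_size), nn.Sigmoid(),
        )

    def forward(self, expr):                              # (B, expr_dim)
        img = self.decode(expr)
        return img.view(-1, 3, self.img_size, self.img_size)

if __name__ == "__main__":
    audio2expr = Audio2Expression()
    renderer = NeuralRenderer()
    audio_feats = torch.randn(1, 8, 29)   # one window of 8 audio feature frames
    expr = audio2expr(audio_feats)        # latent expression coefficients
    frame = renderer(expr)                # (1, 3, 64, 64) output frame
    print(frame.shape)

Training such a pipeline per target actor against tracked 3D face parameters and ground-truth frames is one plausible way to obtain the temporal stability and cross-speaker generalization described above; again, this is only a structural sketch.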

Notes

Acknowledgments

We gratefully acknowledge the support by the AI Foundation, Google, Sony, a TUM-IAS Rudolf Mößbauer Fellowship, the ERC Starting Grant Scan2CAD (804724), the ERC Consolidator Grant 4DRepLy (770784), and a Google Faculty Award.

Supplementary material

Supplementary material 1: 504471_1_En_42_MOESM1_ESM.pdf (PDF, 4,108 KB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Technical University of Munich, Munich, Germany
  2. Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
