Lip Movements Generation at a Glance

  • Lele ChenEmail author
  • Zhiheng Li
  • Ross K. Maddox
  • Zhiyao Duan
  • Chenliang Xu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11211)


In this paper, we consider the task: given an arbitrary audio speech and one lip image of arbitrary target identity, generate synthesized lip movements of the target identity saying the speech. To perform well, a model needs to not only consider the retention of target identity, photo-realistic of synthesized images, consistency and smoothness of lip images in a sequence, but more importantly, learn the correlations between audio speech and lip movements. To solve the collective problems, we devise a network to synthesize lip movements and propose a novel correlation loss to synchronize lip changes and speech changes. Our full model utilizes four losses for a comprehensive consideration; it is trained end-to-end and is robust to lip shapes, view angles and different facial characteristics. Thoughtful experiments on three datasets ranging from lab-recorded to lips in-the-wild show that our model significantly outperforms other state-of-the-art methods extended to this task.


Lip movements generation Audio visual correlation 



This work was supported in part by NSF BIGDATA 1741472, NIH grant R00 DC014288 and the University of Rochester AR/VR Pilot Award. We gratefully acknowledge the gift donations of Markable, Inc., Tencent and the support of NVIDIA with the donation of GPUs used for this research. This article solely reflects the opinions and conclusions of its authors and not the funding agents.


  1. 1.
    Assael, Y.M., Shillingford, B., Whiteson, S., de Freitas, N.: LipNet: end-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599 (2017)
  2. 2.
    Chandrasekaran, C., Trubanova, A., Stillittano, S., Caplier, A., Ghazanfar, A.A.: The natural statistics of audiovisual speech. PLOS Comput. Biol. 5(7) (2009)CrossRefGoogle Scholar
  3. 3.
    Charles, J., Magee, D., Hogg, D.: Virtual immortality: reanimating characters from TV shows. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 879–886. Springer, Cham (2016). Scholar
  4. 4.
    Chen, L., Srivastava, S., Duan, Z., Xu, C.: Deep cross-modal audio-visual generation. In: Proceedings of Multimedia Thematic Workshops. ACM (2017)Google Scholar
  5. 5.
    Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In: Proceedings of NIPS. Curran Associates, Inc. (2016)Google Scholar
  6. 6.
    Chung, J.S., Jamaludin, A., Zisserman, A.: You said that? In: Proceedings of BMVC. Springer (2017)Google Scholar
  7. 7.
    Chung, J.S., Zisserman, A.: Lip reading in the wild. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10112, pp. 87–103. Springer, Cham (2017). Scholar
  8. 8.
    Cooke, M., Barker, J., Cunningham, S., Shao, X.: An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 120(5), 2421–2424 (2006)CrossRefGoogle Scholar
  9. 9.
    Cutler, R., Davis, L.S.: Look who’s talking: speaker detection using video and audio correlation. In: Proceedings of ICME. IEEE (2000)Google Scholar
  10. 10.
    Das, P., Xu, C., Doell, R., Corso, J.J.: A thousand frames in just a few words: lingual description of videos through latent topics and sparse object stitching. In: Proceedings of CVPR. IEEE (2013)Google Scholar
  11. 11.
    Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: Proceedings of ICCV. IEEE (2015)Google Scholar
  12. 12.
    Fan, B., Wang, L., Soong, F.K., Xie, L.: Photo-real talking head with deep bidirectional LSTM. In: ICASSP. IEEE (2015)Google Scholar
  13. 13.
    Garrido, P., et al.: VDub: modifying face video of actors for plausible visual alignment to a dubbed audio track. Comput. Graph. Forum 34(2), 193–204 (2015)CrossRefGoogle Scholar
  14. 14.
    Goodfellow, I.J., et al.: Generative adversarial nets. In: Proceedings of NIPS. Curran Associates, Inc. (2014)Google Scholar
  15. 15.
    He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of CVPR. IEEE (2015)Google Scholar
  16. 16.
    Hotelling, H.: Relations between two sets of variates. In: Kotz, S., Johnson, N.L. (eds.) Breakthroughs in Statistics, pp. 162–190. Springer, New York (1992). Scholar
  17. 17.
    Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). Scholar
  18. 18.
    King, D.E.: Dlib-ml: a machine learning toolkit. JMLR 10, 1755–1758 (2009)Google Scholar
  19. 19.
    Kulkarni, G., et al.: Baby talk: understanding and generating simple image descriptions. In: Proceedings of CVPR. IEEE (2011)Google Scholar
  20. 20.
    Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
  21. 21.
    Narvekar, N.D., Karam, L.J.: A no-reference image blur metric based on the cumulative probability of blur detection (CPBD). IEEE TIP 20(9), 2678–2683 (2011)MathSciNetzbMATHGoogle Scholar
  22. 22.
    Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxiliary classifier GANs. In: Proceedings of ICML. PMLR (2017)Google Scholar
  23. 23.
    Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E.H., Freeman, W.T.: Visually indicated sounds. In: Proceedings of CVPR. IEEE (2016)Google Scholar
  24. 24.
    Rasiwasia, N., et al.: A new approach to cross-modal multimedia retrieval. In: Proceedings of Multimedia. ACM (2010)Google Scholar
  25. 25.
    Reed, S.E., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: ICML. PMLR (2016)Google Scholar
  26. 26.
    Richie, S., Warburton, C., Carter, M.: Audiovisual database of spoken American English. Linguistic Data Consortium (2009)Google Scholar
  27. 27.
    Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986)CrossRefGoogle Scholar
  28. 28.
    Son Chung, J., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: Proceedings of CVPR. IEEE (2017)Google Scholar
  29. 29.
    Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing Obama: learning lip sync from audio. ACM Trans. Graph. 36(4), 95:1–95:13 (2017)CrossRefGoogle Scholar
  30. 30.
    Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: Proceedings of NIPS. Curran Associates, Inc. (2016)Google Scholar
  31. 31.
    Waibel, A.H., Hanazawa, T., Hinton, G.E., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37(3), 328–339 (1989)CrossRefGoogle Scholar
  32. 32.
    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP 13(4), 600–612 (2004)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.University of RochesterRochesterUSA

Personalised recommendations