Speech-driven facial animation with spectral gathering and temporal attention


In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining a spectral-dimensional bidirectional long short-term memory (LSTM) network with a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large training datasets. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be synthesized more faithfully than with vertex offsets. Compared with state-of-the-art automatic-speech-recognition-based methods, our model is much smaller yet achieves comparable robustness and quality in most cases, and noticeably better results in certain challenging ones.
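The encoder described above can be pictured with a minimal PyTorch sketch: a bidirectional LSTM scans the spectral (frequency) axis of each spectrogram frame, and a temporal attention layer pools a short window of frames into a single feature vector. The class name, layer sizes, and window length below are illustrative assumptions for exposition, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class SpectralGatherEncoder(nn.Module):
    """Hypothetical sketch: spectral-dimensional BiLSTM + temporal attention.

    All dimensions are illustrative; the paper's real hyperparameters
    are not reproduced here.
    """

    def __init__(self, n_bins: int = 64, hidden: int = 32):
        super().__init__()
        # Bidirectional LSTM that steps over the frequency bins of one frame.
        self.spec_lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                                 bidirectional=True, batch_first=True)
        # Scalar attention score per frame within the temporal window.
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, frames, n_bins) — a short window of spectrogram frames.
        b, t, f = spec.shape
        x = spec.reshape(b * t, f, 1)               # each bin is one LSTM step
        h, _ = self.spec_lstm(x)                    # (b*t, f, 2*hidden)
        frame_feat = h[:, -1, :].reshape(b, t, -1)  # gather final spectral state
        w = torch.softmax(self.attn(frame_feat), dim=1)  # temporal attention
        return (w * frame_feat).sum(dim=1)          # (batch, 2*hidden)


enc = SpectralGatherEncoder()
out = enc(torch.randn(2, 8, 64))   # batch of 2, window of 8 frames, 64 bins
print(out.shape)                   # torch.Size([2, 64])
```

Running the LSTM along frequency rather than time is what keeps the encoder small: the attention step then decides which frames in the window matter for the current mouth shape, a role played by large acoustic models in ASR-based pipelines.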





We would like to thank the VOCA group for publishing their database. This work was partially supported by the National Key Research & Development Program of China (2016YFB1001403) and the National Natural Science Foundation of China (Grant No. 61572429).

Author information



Corresponding author

Correspondence to Yanlin Weng.

Additional information

Yujin Chai received his BS degree from the Department of Computer Science and Technology at Zhejiang University, China, in 2016. He is now a PhD student at the State Key Lab of CAD&CG, Zhejiang University. His research interests include machine learning and facial animation.

Yanlin Weng is currently an associate professor in the School of Computer Science and Technology at Zhejiang University, China. She received her PhD degree in computer science from the University of Wisconsin-Milwaukee, USA, and her master's and bachelor's degrees in control science and engineering from Zhejiang University. Her research interests include computer graphics and multimedia.

Lvdi Wang is currently the Director of the FaceUnity Research Center in Beijing, China. He received his PhD degree in computer science from the Institute for Advanced Study, Tsinghua University, China, in 2011, under the supervision of Professor Baining Guo. He subsequently spent four years at Microsoft Research Asia and two years at Apple Inc. His work focuses on computer graphics, speech, and natural language processing techniques.

Kun Zhou is a professor in the Computer Science Department of Zhejiang University and the Director of the State Key Lab of CAD&CG, China. Prior to joining Zhejiang University in 2008, Dr. Zhou was a lead researcher of the Internet Graphics Group at Microsoft Research Asia. He received his BS and PhD degrees in computer science from Zhejiang University in 1997 and 2002, respectively. His research interests are in visual computing, parallel computing, human-computer interaction, and virtual reality. He is a Fellow of IEEE.



About this article


Cite this article

Chai, Y., Weng, Y., Wang, L. et al. Speech-driven facial animation with spectral gathering and temporal attention. Front. Comput. Sci. 16, 163703 (2022).
