Abstract
In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining a spectral-dimensional bidirectional long short-term memory (BLSTM) with a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without relying on pre-trained speech recognition modules or large amounts of training data. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which captures nuanced local motions better than vertex offsets. Compared with state-of-the-art methods based on automatic speech recognition, our model is much smaller yet achieves comparable robustness and quality in most cases, and noticeably better results in certain challenging ones.
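To make the encoder design in the abstract concrete, the following is a minimal PyTorch sketch of how a BLSTM scanning the spectral (frequency) dimension of each spectrogram frame could be combined with temporal attention over a short window of frames. This is an illustrative assumption, not the authors' released code: the class name SpectralTemporalEncoder, the dimensions n_mels and hidden_dim, and the mean pooling over frequency are all hypothetical choices.

```python
# Illustrative sketch only; names, dimensions, and pooling choices are assumptions.
import torch
import torch.nn as nn

class SpectralTemporalEncoder(nn.Module):
    """Lightweight speech encoder: a BLSTM scans the spectral (frequency) axis of
    each frame, then temporal attention pools a window of frames into one feature."""
    def __init__(self, n_mels=80, hidden_dim=64):
        super().__init__()
        # Bidirectional LSTM applied along the frequency bins of each spectrogram frame.
        self.spectral_blstm = nn.LSTM(input_size=1, hidden_size=hidden_dim,
                                      bidirectional=True, batch_first=True)
        # Scores each frame in the temporal window; softmax turns scores into weights.
        self.attn = nn.Linear(2 * hidden_dim, 1)

    def forward(self, spec):                  # spec: (batch, frames, n_mels)
        b, t, f = spec.shape
        x = spec.reshape(b * t, f, 1)         # treat frequency bins as a sequence
        h, _ = self.spectral_blstm(x)         # (b*t, f, 2*hidden_dim)
        frame_feat = h.mean(dim=1).reshape(b, t, -1)     # gather over frequency
        w = torch.softmax(self.attn(frame_feat), dim=1)  # temporal attention weights
        return (w * frame_feat).sum(dim=1)    # (batch, 2*hidden_dim) per window

# Usage: encode a short window of 80-bin mel-spectrogram frames.
enc = SpectralTemporalEncoder()
feat = enc(torch.randn(4, 25, 80))            # -> torch.Size([4, 128])
```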
Acknowledgements
We would like to thank the VOCA group for publishing their database. This work was partially supported by the National Key Research & Development Program of China (2016YFB1001403) and the National Natural Science Foundation of China (Grant No. 61572429).
Author information
Yujin Chai received his BS degree from the Computer Science and Technology Department of Zhejiang University, China, in 2016. He is now a PhD student at the State Key Lab of CAD&CG, Zhejiang University. His research interests include machine learning and facial animation.
Yanlin Weng is currently an associate professor in the School of Computer Science and Technology at Zhejiang University, China. She received her PhD degree in computer science from the University of Wisconsin-Milwaukee, USA, and her master's and bachelor's degrees in control science and engineering from Zhejiang University. Her research interests include computer graphics and multimedia.
Lvdi Wang is currently the Director of the FaceUnity Research Center in Beijing, China. He received his PhD degree in computer science from the Institute for Advanced Study, Tsinghua University, China, in 2011, under the supervision of Professor Baining Guo. He subsequently spent four years at Microsoft Research Asia and two years at Apple Inc. His work focuses on computer graphics, speech, and natural language processing techniques.
Kun Zhou is a professor in the Computer Science Department of Zhejiang University and the Director of the State Key Lab of CAD&CG, China. Prior to joining Zhejiang University in 2008, Dr. Zhou was a lead researcher of the Internet Graphics Group at Microsoft Research Asia. He received his BS and PhD degrees in computer science from Zhejiang University in 1997 and 2002, respectively. His research interests are in visual computing, parallel computing, human-computer interaction, and virtual reality. He is a Fellow of the IEEE.
Cite this article
Chai, Y., Weng, Y., Wang, L. et al. Speech-driven facial animation with spectral gathering and temporal attention. Front. Comput. Sci. 16, 163703 (2022). https://doi.org/10.1007/s11704-020-0133-7