Multimedia Tools and Applications, Volume 73, Issue 1, pp 397–415

Speech driven photo realistic facial animation based on an articulatory DBN model and AAM features

  • Dongmei Jiang
  • Yong Zhao (Email author)
  • Hichem Sahli
  • Yanning Zhang


This paper presents a photo-realistic facial animation synthesis approach based on an audio-visual articulatory dynamic Bayesian network model (AF_AVDBN), in which the maximum asynchrony between the articulatory features, such as the lips, tongue and glottis/velum, can be controlled. Perceptual Linear Prediction (PLP) features extracted from the audio speech, together with active appearance model (AAM) features extracted from the face images of an audio-visual continuous speech database, are used to train the AF_AVDBN model parameters. Given an input audio speech signal, the trained model estimates the optimal AAM visual features under a maximum likelihood estimation (MLE) criterion, and the estimated features are then used to construct the face images for the animation. In our experiments, facial animations are synthesized for 20 continuous audio speech sentences using the proposed AF_AVDBN model and two state-of-the-art methods: the audio-visual state synchronous DBN model (SS_DBN), which implements a multi-stream Hidden Markov Model, and the state asynchronous DBN model (SA_DBN). Objective evaluations of the learned AAM features show that much more accurate visual features can be learned from the AF_AVDBN model. Subjective evaluations show that the facial animations synthesized with AF_AVDBN are better than those synthesized with the state-based SA_DBN and SS_DBN models, both in overall naturalness and in how accurately the mouth movements match the speech content.
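The final step of the pipeline, constructing face images from estimated AAM features, can be illustrated with a minimal sketch. It assumes the usual linear form of an appearance model, in which a vector is rebuilt as a mean plus a weighted sum of modes of variation. The function names, the toy mean, the modes and the parameter values are all hypothetical and are not taken from the paper. A real AAM also warps the reconstructed texture onto the reconstructed shape, which is omitted here.

```python
# Hypothetical sketch: linear reconstruction from appearance-model parameters.
# All names and numbers are illustrative assumptions, not the paper's values.

def reconstruct(mean, modes, params):
    """Return mean + sum_k params[k] * modes[k] as a flat list of values."""
    if len(modes) != len(params):
        raise ValueError("need one parameter per mode")
    out = list(mean)
    for weight, mode in zip(params, modes):
        if len(mode) != len(mean):
            raise ValueError("mode length must match mean length")
        for i, value in enumerate(mode):
            out[i] += weight * value
    return out


def synthesize_sequence(mean, modes, param_frames):
    """Reconstruct one vector per frame of estimated parameters."""
    return [reconstruct(mean, modes, p) for p in param_frames]


# Toy model: a 4-value "appearance" vector with two modes of variation.
mean = [0.5, 0.5, 0.5, 0.5]
modes = [[1.0, 0.0, -1.0, 0.0],
         [0.0, 1.0, 0.0, -1.0]]

# Two frames of (assumed) estimated parameters, one list per video frame.
frames = synthesize_sequence(mean, modes, [[0.0, 0.0], [0.25, -0.5]])
```

With all parameters at zero the reconstruction is the mean itself. Non-zero parameters move the result along the modes, which is how a sequence of per-frame feature estimates would become a sequence of face images under this assumed model.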


Keywords: Facial animation, AF_AVDBN, Asynchrony, AAM



This work is supported within the framework of the National Natural Science Foundation of China (grant 61273265), the Shaanxi Provincial Key International Cooperation Project (2011KW-04), the LIAMA-CAVSA project, the EU FP7 project ALIZ-E (grant 248116), and the VUB-HOA CaDE project.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Dongmei Jiang (1, 2)
  • Yong Zhao (1, 2), Email author
  • Hichem Sahli (3, 4)
  • Yanning Zhang (1, 2)

  1. School of Computer Science, Northwestern Polytechnical University, Xi’an, People’s Republic of China
  2. Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, Shaanxi, China
  3. Electronics & Informatics Department (ETRO), Vrije Universiteit Brussel, Brussels, Belgium
  4. Interuniversity Microelectronics Center (IMEC), Leuven, Belgium
