Multimedia Tools and Applications, Volume 73, Issue 1, pp 377–396

A statistical parametric approach to video-realistic text-driven talking avatar

  • Lei Xie
  • Naicai Sun
  • Bo Fan


This paper proposes a statistical parametric approach to a video-realistic, text-driven talking avatar. We follow the trajectory HMM approach, in which audio and visual speech are jointly modeled by HMMs and continuous audiovisual speech parameter trajectories are synthesized under the maximum likelihood criterion. Previous trajectory HMM approaches focus only on mouth animation, synthesizing either simple geometric mouth shapes or video-realistic lip motion. Our approach uses the trajectory HMM to generate visual parameters of the lower face and realizes video-realistic animation of the whole face. Specifically, we use an active appearance model (AAM) to model visual speech, which offers a convenient and compact statistical model of both the shape and the appearance variations of the face. To achieve video-realistic effects with high fidelity, we use the Poisson image editing technique to stitch the synthesized lower-face image seamlessly into a whole-face image. Objective and subjective experiments show that the proposed approach can produce natural facial animation.
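
As a rough illustration of what synthesizing a trajectory "under the maximum likelihood criterion" means, the sketch below generates a smooth one-dimensional parameter track from per-frame Gaussian statistics over a static feature and its delta. It is a minimal sketch under stated assumptions, not the authors' implementation: it handles a single scalar stream, uses a backward-difference delta (c[t] - c[t-1]) so that the normal equations are tridiagonal, and the names `mlpg_1d` and `solve_tridiagonal` are hypothetical. The paper's actual feature dimensions, delta windows, and AAM parameters are not given in this abstract.

```python
from typing import List, Sequence


def solve_tridiagonal(lower: Sequence[float], diag: Sequence[float],
                      upper: Sequence[float], rhs: Sequence[float]) -> List[float]:
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused);
    upper[i] multiplies x[i+1] in row i (upper[-1] must be 0).
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def mlpg_1d(static_mean: Sequence[float], static_var: Sequence[float],
            delta_mean: Sequence[float], delta_var: Sequence[float]) -> List[float]:
    """Maximum-likelihood trajectory for one scalar parameter (assumed setup).

    Minimizes  sum_t (c[t] - static_mean[t])^2 / static_var[t]
             + sum_{t>=1} (c[t] - c[t-1] - delta_mean[t])^2 / delta_var[t].
    delta_mean[0] and delta_var[0] are ignored (no previous frame).
    """
    n_frames = len(static_mean)
    lower = [0.0] * n_frames
    diag = [0.0] * n_frames
    upper = [0.0] * n_frames
    rhs = [0.0] * n_frames
    for t in range(n_frames):
        # Contribution of the static Gaussian at frame t.
        diag[t] = 1.0 / static_var[t]
        rhs[t] = static_mean[t] / static_var[t]
        if t >= 1:
            # Delta term that ends at frame t: c[t] - c[t-1].
            w = 1.0 / delta_var[t]
            diag[t] += w
            lower[t] = -w
            rhs[t] += w * delta_mean[t]
        if t + 1 < n_frames:
            # Delta term that starts at frame t: c[t+1] - c[t].
            w_next = 1.0 / delta_var[t + 1]
            diag[t] += w_next
            upper[t] = -w_next
            rhs[t] -= w_next * delta_mean[t + 1]
    return solve_tridiagonal(lower, diag, upper, rhs)
```

For example, calling `mlpg_1d([0, 0, 0, 1, 1, 1], [1.0] * 6, [0.0] * 6, [0.1] * 6)` returns a smoothed transition between the two state means instead of an abrupt step, because the delta statistics penalize frame-to-frame jumps. Making the delta variances very large effectively switches that penalty off and recovers the static means.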


Talking avatar, Visual speech synthesis, Facial animation, Hidden Markov model, Active appearance model



This work is supported by the National Natural Science Foundation of China (61175018), the Natural Science Basic Research Plan of Shaanxi Province (2011JM8009) and the Fok Ying Tung Education Foundation (131059).



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. School of Computer Science, Northwestern Polytechnical University, Xi’an, China
