Multimedia Tools and Applications, Volume 74, Issue 22, pp 9849–9869

HMM trajectory-guided sample selection for photo-realistic talking head

  • Lijuan Wang
  • Frank K. Soong


In this paper, we propose an HMM trajectory-guided, real-image-sample concatenation approach to photo-realistic talking head synthesis. An audio-visual database of a person is first recorded and used to train a statistical hidden Markov model (HMM) of lip movements. The HMM is then used to generate, in the maximum-probability sense, the dynamic trajectory of lip movements for a given speech signal. The generated trajectory serves as a guide for selecting, from the original training database, an optimal sequence of lip images, which are then stitched back onto a background head video. We also propose a minimum generation error (MGE) training method that refines the audio-visual HMM to improve visual speech trajectory synthesis. In contrast to traditional maximum likelihood (ML) estimation, the proposed MGE training explicitly optimizes the quality of the generated visual speech trajectory: the audio-visual HMM is jointly refined by a heuristic method that finds the optimal state alignment and a probabilistic descent algorithm that optimizes the model parameters under the MGE criterion. In objective evaluations, the proposed MGE-based method achieves consistent improvements over the ML-based method in mean square error reduction, correlation increase, and recovery of global variance. From as little as 20 min of recorded audio/video footage, the proposed system can synthesize a highly photo-realistic talking head in sync with given speech signals, natural or TTS-synthesized. The system won first place in the A/V consistency contest of the LIPS Challenge, as perceptually evaluated by recruited human subjects.
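The trajectory-guided sample selection described above can be sketched as a dynamic-programming (Viterbi) search over candidate database frames, trading off a target cost (distance of each candidate's visual features to the HMM-generated trajectory) against a concatenation cost (feature distance between consecutively selected frames). The sketch below is illustrative only: the Euclidean distances, the single weight `w_concat`, and the function name `select_samples` are simplifying assumptions, not the paper's exact cost formulation.

```python
import numpy as np

def select_samples(target_traj, db_feats, w_concat=1.0):
    """Viterbi search for an optimal frame sequence from the database.

    target_traj: (T, D) HMM-generated visual feature trajectory.
    db_feats:    (N, D) visual features of candidate lip images.
    Returns a list of T database indices, one selected frame per step.
    """
    T, N = len(target_traj), len(db_feats)
    # Target cost: distance of every candidate to every trajectory point, (T, N).
    tgt = np.linalg.norm(db_feats[None, :, :] - target_traj[:, None, :], axis=2)
    # Concatenation cost: pairwise distance between candidates, (N, N).
    cat = np.linalg.norm(db_feats[:, None, :] - db_feats[None, :, :], axis=2)

    cost = tgt[0].copy()                      # best cumulative cost ending at each frame
    back = np.zeros((T, N), dtype=int)        # backpointers for path recovery
    for t in range(1, T):
        total = cost[:, None] + w_concat * cat   # (prev, cur) transition costs
        back[t] = np.argmin(total, axis=0)       # cheapest predecessor per frame
        cost = total[back[t], np.arange(N)] + tgt[t]
    # Backtrack the minimum-cost path.
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

For example, with a toy 1-D feature space where the database contains frames at 0, 1, 2, and 5, and the target trajectory is 0 → 1 → 2, a small concatenation weight lets the search follow the trajectory exactly and return indices `[0, 1, 2]`.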


Keywords: Visual speech synthesis · Photo-realistic · Talking head · Trajectory-guided sample selection



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

Microsoft Research Asia, Beijing, China
