Hidden Markov Model Inversion for Audio-to-Visual Conversion in an MPEG-4 Facial Animation System

  • Kyoungho Choi
  • Ying Luo
  • Jenq-Neng Hwang


The MPEG-4 standard allows natural or synthetic video to be composed with facial animation. Based on this standard, an animated face can be inserted into natural or synthetic video to create new virtual working environments such as virtual meetings or virtual collaborative environments. For these applications, audio-to-visual conversion techniques can be used to generate a talking face that is synchronized with the voice. In this paper, we address the audio-to-visual conversion problem by introducing a novel Hidden Markov Model Inversion (HMMI) method. In training audio-visual HMMs, the model parameters {λav} are chosen to optimize a criterion such as maximum likelihood. In inverting audio-visual HMMs, the visual parameters that optimize a criterion are found given the speech and the model parameters {λav}. With the proposed HMMI technique, an animated talking face can be synchronized with audio and driven realistically. To show the role of HMMI in MPEG-4 facial animation applications, we introduce VIRTUAL-FACE, a virtual conference system that combines the HMMI technique with the MPEG-4 standard.
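The inversion idea can be illustrated with a minimal sketch. It is not the formulation used in the paper: it assumes a single diagonal-covariance Gaussian per state over the concatenated audio and visual features, strictly positive initial and transition probabilities, and hypothetical names (`invert_visual`, `state_posteriors`, `log_gauss_diag`) chosen only for illustration. With the model held fixed, it alternates between computing state posteriors from the audio and the current visual estimate, and re-estimating the visual features that maximize the expected log-likelihood under those posteriors.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log density of diagonal Gaussians. x: (T, D); mean, var: (N, D) -> (T, N)."""
    diff = x[:, None, :] - mean[None, :, :]
    return -0.5 * np.sum(np.log(2 * np.pi * var)[None] + diff ** 2 / var[None], axis=2)

def state_posteriors(log_b, log_pi, log_A):
    """Forward-backward in the log domain; returns gamma with shape (T, N)."""
    T, N = log_b.shape
    log_alpha = np.empty((T, N))
    log_beta = np.zeros((T, N))
    log_alpha[0] = log_pi + log_b[0]
    for t in range(1, T):
        # Sum over the previous state i for every current state j.
        log_alpha[t] = log_b[t] + np.logaddexp.reduce(
            log_alpha[t - 1][:, None] + log_A, axis=0)
    for t in range(T - 2, -1, -1):
        # Sum over the next state j for every current state i.
        log_beta[t] = np.logaddexp.reduce(
            log_A + (log_b[t + 1] + log_beta[t + 1])[None, :], axis=1)
    log_gamma = log_alpha + log_beta
    log_gamma -= np.logaddexp.reduce(log_gamma, axis=1, keepdims=True)
    return np.exp(log_gamma)

def invert_visual(audio, mean_a, var_a, mean_v, var_v, pi, A, n_iter=10):
    """Estimate visual features from audio with the model parameters held fixed.

    Assumption: pi and A contain no zeros, so their logs are finite.
    """
    log_pi, log_A = np.log(pi), np.log(A)
    log_b_audio = log_gauss_diag(audio, mean_a, var_a)
    # Initial posteriors use the audio stream only.
    gamma = state_posteriors(log_b_audio, log_pi, log_A)
    prec_v = 1.0 / var_v
    for _ in range(n_iter):
        # Visual update: precision-weighted average of the state means.
        visual = (gamma @ (prec_v * mean_v)) / (gamma @ prec_v)
        # Posterior update: score audio and the current visual estimate jointly.
        log_b = log_b_audio + log_gauss_diag(visual, mean_v, var_v)
        gamma = state_posteriors(log_b, log_pi, log_A)
    return visual

# Toy two-state model with made-up parameters (illustration only).
mean_a = np.array([[-2.0], [2.0]])
var_a = np.array([[0.25], [0.25]])
mean_v = np.array([[0.0], [1.0]])
var_v = np.array([[0.1], [0.1]])
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
audio = np.array([[-2.0], [-2.0], [2.0], [2.0]])

visual = invert_visual(audio, mean_a, var_a, mean_v, var_v, pi, A)
print(visual.ravel())  # expected to lie close to the state means 0, 0, 1, 1
```

In this toy setting the audio separates the two states cleanly, so the estimated visual trajectory follows the visual mean of whichever state the audio supports. A model with full covariances or Gaussian mixtures would need a correspondingly richer visual update.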

HMMI; audio-to-visual conversion; MPEG-4 facial animation





Copyright information

© Kluwer Academic Publishers 2001

Authors and Affiliations

  • Kyoungho Choi (1)
  • Ying Luo (1)
  • Jenq-Neng Hwang (1)
  1. Information Processing Lab., Department of Electrical Engineering, University of Washington, Seattle, USA
