Audio-to-Visual Conversion Via HMM Inversion for Speech-Driven Facial Animation

  • Lucas D. Terissi
  • Juan Carlos Gómez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5249)


In this paper, the inversion of a joint Audio-Visual Hidden Markov Model is proposed to estimate the visual information from speech data in a speech-driven MPEG-4 compliant facial animation system. The inversion algorithm is derived for the general case of full covariance matrices for the audio-visual observations. The system performance is evaluated for both full and diagonal covariance matrices. Experimental results show that full covariance matrices are preferable, since they achieve performance similar to that of diagonal matrices while using a less complex model. The experiments are carried out using audio-visual databases compiled by the authors.
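
As a rough illustration of why the audio-visual cross-covariance matters, the sketch below computes the conditional mean of a visual feature given an audio feature under a single joint Gaussian, such as one HMM state might hold. It is not the inversion algorithm derived in the paper: the function name, the scalar features, and all parameter values are assumptions chosen for illustration only.

```python
def conditional_visual_mean(audio, mu_a, mu_v, var_a, cov_av):
    """E[v | a] for a joint Gaussian over one audio and one visual feature.

    Standard conditional-Gaussian formula, written for scalars:
    mu_v + cov_av / var_a * (audio - mu_a).
    """
    return mu_v + (cov_av / var_a) * (audio - mu_a)


# Hypothetical parameters of a single state (not taken from the paper).
mu_a, mu_v = 1.0, 0.5
var_a = 2.0
cov_av_full = 0.8  # full covariance: audio-visual cross term retained
cov_av_diag = 0.0  # diagonal covariance: cross term forced to zero

audio_frame = 3.0
v_full = conditional_visual_mean(audio_frame, mu_a, mu_v, var_a, cov_av_full)
v_diag = conditional_visual_mean(audio_frame, mu_a, mu_v, var_a, cov_av_diag)

# With the cross term, the estimate follows the audio frame; without it,
# the estimate collapses to the state's visual mean.
print("full covariance estimate:    ", v_full)
print("diagonal covariance estimate:", v_diag)
```

Under these assumptions, a diagonal covariance leaves each state with a fixed visual estimate, so variation within a state can only be captured by adding states or mixture components. This is one plausible reading of the "less complex model" observation, not a result taken from the paper.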


Keywords: Hidden Markov Models, Audio-Visual Speech Processing, Facial Animation





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Lucas D. Terissi (1)
  • Juan Carlos Gómez (1)
  1. Laboratory for System Dynamics and Signal Processing, FCEIA, Universidad Nacional de Rosario, CIFASIS, CONICET, Rosario, Argentina
