International Journal of Computer Vision, Volume 38, Issue 1, pp 45–57

Visual Speech Synthesis by Morphing Visemes

  • Tony Ezzat
  • Tomaso Poggio


We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject which is specifically designed to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every other viseme is computed automatically. By morphing along this correspondence, a smooth transition between viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a photorealistic talking face.
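The last step of the abstract, in which phoneme and timing information selects the viseme transitions and sets the morph rate, can be illustrated with a minimal sketch. This is not the authors' implementation. The phoneme-to-viseme table, the names `PHONEME_TO_VISEME`, `MorphFrame`, and `schedule_transitions`, the 30 fps frame rate, and the choice to spread each transition over the duration of its source phoneme are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical many-to-one phoneme-to-viseme table (illustrative only;
# the paper's actual viseme set is not given in the abstract).
PHONEME_TO_VISEME = {
    "m": "bmp", "b": "bmp", "p": "bmp",
    "k": "kg", "g": "kg",
    "ay": "ay",
    "sil": "sil",
}


@dataclass
class MorphFrame:
    source: str   # viseme the transition starts from
    target: str   # viseme the transition ends at
    alpha: float  # morph position: 0.0 = source image, 1.0 = target image


def schedule_transitions(
    phonemes: List[Tuple[str, float]], fps: float = 30.0
) -> List[MorphFrame]:
    """Turn (phoneme, duration in seconds) pairs into per-frame morph parameters.

    Assumption: the transition from viseme i to viseme i+1 is spread
    uniformly over the duration of phoneme i.
    """
    visemes = [(PHONEME_TO_VISEME[p], d) for p, d in phonemes]
    if not visemes:
        return []

    frames: List[MorphFrame] = []
    # Concatenate one transition per consecutive viseme pair.
    for (src, dur), (dst, _) in zip(visemes, visemes[1:]):
        n = max(1, round(dur * fps))  # at least one frame per transition
        for i in range(n):
            frames.append(MorphFrame(src, dst, i / n))

    # Hold the final viseme for its own duration.
    last, last_dur = visemes[-1]
    for _ in range(max(1, round(last_dur * fps))):
        frames.append(MorphFrame(last, last, 0.0))
    return frames


utterance = [("m", 0.1), ("ay", 0.2), ("k", 0.1)]
schedule = schedule_transitions(utterance, fps=30.0)
```

In a full system, each `alpha` value would set how far to warp along the precomputed optical-flow correspondence between the `source` and `target` viseme images. Because the frame count follows the synthesizer's phoneme durations, the video stays aligned with the audio.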

computer vision, machine learning, facial modelling, facial animation, morphing, optical flow, speech synthesis, lip synchronization





Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Tony Ezzat (1)
  • Tomaso Poggio (1)
  1. Center for Biological and Computational Learning, Artificial Intelligence Laboratory, MIT, Cambridge, USA
