International Journal of Speech Technology, Volume 6, Issue 4, pp 331–346

Audiovisual Speech Synthesis

  • G. Bailly
  • M. Bérar
  • F. Elisei
  • M. Odisio


This paper presents the main approaches used to synthesize talking faces and describes a handful of them in greater detail. It attempts to distinguish between facial synthesis itself (i.e., the manner in which facial movements are rendered on a computer screen) and the way these movements may be controlled and predicted from phonetic input. The two main synthesis techniques (model-based vs. image-based) are contrasted and illustrated by brief descriptions of the most representative existing systems. The challenging issues (evaluation, data acquisition, and modeling) that may drive future models are also discussed and illustrated by our current work at ICP.

Keywords: text-to-speech synthesis, audiovisual synthesis, facial animation, talking faces





Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • G. Bailly (1)
  • M. Bérar (1)
  • F. Elisei (1)
  • M. Odisio (1)

  1. Institut de la Communication Parlée, UMR CNRS no 5009, INPG/Univ. Stendhal, 46 Grenoble Cedex, France
