Multi-modal Speech Synthesis with Applications

  • Björn Granström


This chapter builds on research projects at KTH concerned with the development of multi-modal speech synthesis. The chosen strategy is model-based parametric synthesis for both the auditory and the visual modality, with both modalities controlled from the same rule synthesis framework. The visual model can also be controlled directly for aspects that are not phonetic in nature. This flexible set-up has made it possible to exploit the technology in several different applications, such as spoken dialogue systems. In the Teleface project, the synthetic face is evaluated as a lip-reading support for hard-of-hearing persons, and several studies of multi-modal speech intelligibility have been carried out using different combinations of natural/synthetic and auditory/visual speech.


Keywords: Vocal Tract, Speech Synthesis, Dialogue System, Visual Speech, Facial Animation





Copyright information

© Springer-Verlag London Limited 1999

Authors and Affiliations

  • Björn Granström
    • 1
  1. Department of Speech, Music and Hearing, KTH, Centre for Speech Technology (CTT), Stockholm, Sweden
