Analysis and Synthesis of Multimodal Verbal and Non-verbal Interaction for Animated Interface Agents

  • Jonas Beskow
  • Björn Granström
  • David House
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4775)


The use of animated talking agents is a novel feature of many multimodal spoken dialogue systems. The addition and integration of a virtual talking head has direct implications for the way in which users approach and interact with such systems. However, understanding the interplay between visual expressions, dialogue functions and the acoustics of the corresponding speech presents a substantial challenge. Some visual articulation is closely related to the speech acoustics, while other articulatory movements that affect the acoustics are not visible on the outside of the face. Many facial gestures used for communicative purposes do not affect the acoustics directly, but may nevertheless be connected on a higher communicative level, where the timing of the gestures can play an important role. This chapter examines the communicative function of the animated talking agent and its effect on intelligibility and the flow of the dialogue.


Keywords: ECA · animated agent · audiovisual speech · non-verbal communication · visual prosody




Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Jonas Beskow
  • Björn Granström
  • David House

Centre for Speech Technology, CSC, KTH, Stockholm, Sweden
