Multimedia Tools and Applications, Volume 54, Issue 1, pp 143–164

Multimodal behavior realization for embodied conversational agents

  • Aleksandra Čereković
  • Igor S. Pandžić


Applications with intelligent conversational virtual humans, called Embodied Conversational Agents (ECAs), seek to bring human-like abilities to machines and establish natural human-computer interaction. In this paper we discuss the realization of ECA multimodal behaviors, which include speech and nonverbal behaviors. We present RealActor, an open-source, multi-platform animation system for real-time multimodal behavior realization for ECAs. The system employs a novel solution for synchronizing gestures and speech using neural networks, as well as an adaptive face animation model based on the Facial Action Coding System (FACS) to synthesize facial expressions. Our aim is to provide a generic animation system that helps researchers create believable and expressive ECAs.
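The abstract's FACS-based facial animation can be illustrated with a minimal sketch of how Action Unit (AU) activations might be blended into a composite expression. This is not the paper's actual model; the expression dictionaries and the linear blend below are invented for illustration (AU6/AU12 are the FACS cheek raiser and lip corner puller, AU1/AU2 the inner and outer brow raisers, AU26 the jaw drop).

```python
# Illustrative sketch only: composing a facial expression from FACS
# Action Unit (AU) activations. The expressions and the blending rule
# are hypothetical, not RealActor's implementation.

# Each expression is a dict of {AU id: intensity in [0, 1]}.
SMILE = {6: 0.7, 12: 0.9}              # AU6 cheek raiser, AU12 lip corner puller
SURPRISE = {1: 0.8, 2: 0.8, 26: 0.6}   # AU1/AU2 brow raisers, AU26 jaw drop

def blend(expressions_with_weights):
    """Linearly blend weighted AU activation dicts, clamping to [0, 1]."""
    out = {}
    for aus, weight in expressions_with_weights:
        for au, intensity in aus.items():
            out[au] = min(1.0, out.get(au, 0.0) + weight * intensity)
    return out

# A half-smile mixed with mild surprise.
mixed = blend([(SMILE, 0.5), (SURPRISE, 0.5)])
```

An adaptive model such as the one the paper describes would map the resulting AU intensities onto the face model's animation parameters rather than stop at this dictionary.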


Keywords: Multimodal behavior realization · Virtual characters · Character animation system
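The gesture–speech synchronization mentioned in the abstract relies on neural networks. As a purely illustrative sketch (the weights, features, and network shape below are invented, not the paper's trained model), a small feed-forward network mapping speech-derived features to a gesture timing value could look like:

```python
import math

def mlp_forward(x, w1, b1, w2, b2):
    """Forward pass of a one-hidden-layer network with tanh units."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

# Toy weights, for illustration only; a real system would learn these
# from annotated gesture/speech data.
w1 = [[0.5, -0.2], [0.3, 0.8]]
b1 = [0.0, 0.1]
w2 = [1.0, -0.5]
b2 = 0.2

# Hypothetical input features, e.g. (phoneme count, speaking rate),
# mapped to an estimated gesture stroke duration in seconds.
duration = mlp_forward((4.0, 1.2), w1, b1, w2, b2)
```

The predicted timing value would then be used to schedule the gesture's stroke phase so that it coincides with the corresponding point in the synthesized speech.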



Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
