Multimodal behavior realization for embodied conversational agents

Abstract

Applications featuring intelligent conversational virtual humans, called Embodied Conversational Agents (ECAs), seek to bring human-like abilities into machines and establish natural human-computer interaction. In this paper we discuss the realization of ECA multimodal behaviors, which include speech and nonverbal behaviors. We present RealActor, an open-source, multi-platform animation system for real-time multimodal behavior realization for ECAs. The system employs a novel solution for synchronizing gestures and speech using neural networks. It also uses an adaptive face animation model based on the Facial Action Coding System (FACS) to synthesize facial expressions. Our aim is to provide a generic animation system that helps researchers create believable and expressive ECAs.
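
The abstract names neural networks as the gesture-speech synchronization mechanism but does not detail them, so what follows is a minimal sketch of one plausible reading, not RealActor's actual design: a small feed-forward network estimates per-word speech durations, and a gesture stroke is scheduled at the accumulated start time of the word it should coincide with. The feature set, network shape, and every name below are illustrative assumptions.

    # Hypothetical sketch (not the authors' implementation): a small
    # feed-forward network estimates per-word speech durations, and the
    # estimates are accumulated to find the time at which a gesture
    # stroke should fire so it lands on a chosen word.
    import numpy as np

    rng = np.random.default_rng(0)

    class TinyMLP:
        """One-hidden-layer regressor: word features -> estimated duration (s)."""
        def __init__(self, n_in, n_hidden):
            self.w1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
            self.b1 = np.zeros(n_hidden)
            self.w2 = rng.normal(0.0, 0.1, (n_hidden, 1))
            self.b2 = np.zeros(1)

        def predict(self, x):
            h = np.tanh(x @ self.w1 + self.b1)        # hidden layer
            return float((h @ self.w2 + self.b2)[0])  # scalar duration estimate

    def word_features(word, is_final):
        # Illustrative features; the paper's actual feature set is not given here.
        vowels = sum(c in "aeiou" for c in word.lower())
        return np.array([len(word), vowels, float(is_final)])

    def stroke_time(words, stroke_index, net):
        """Sum estimated durations of the words preceding the stroke word."""
        t = 0.0
        for i, w in enumerate(words[:stroke_index]):
            d = net.predict(word_features(w, i == len(words) - 1))
            t += max(0.05, d)  # clamp to a sane minimum duration
        return t

    # Usage: the network is untrained here, so the number is meaningless;
    # in practice it would be trained on recorded TTS word timings.
    net = TinyMLP(n_in=3, n_hidden=8)
    words = "welcome to the virtual museum".split()
    print(f"trigger stroke at ~{stroke_time(words, 3, net):.2f} s")

In a real pipeline the regressor would be fitted to word timings logged from the speech synthesizer; the clamp merely guards against degenerate estimates from an imperfect model.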

Author information

Correspondence to Aleksandra Čereković.

About this article

Cite this article

Čereković, A., Pandžić, I.S. Multimodal behavior realization for embodied conversational agents. Multimed Tools Appl 54, 143–164 (2011). https://doi.org/10.1007/s11042-010-0530-2
