A multimodal approach of generating 3D human-like talking agent

  • Minghao Yang
  • Jianhua Tao
  • Kaihui Mu
  • Ya Li
  • Jianfeng Che
Original Paper


This paper introduces a multimodal framework for generating a 3D human-like talking agent that can communicate with users through speech, lip movement, head motion, facial expression and body animation. In this framework, lip movements are obtained by searching an audio-visual bimodal database for matching acoustic features, which are represented by Mel-frequency cepstral coefficients (MFCC). Head motion is synthesized by visual prosody, which maps textual prosodic features to rotational and translational parameters. Facial expression and body animation are generated by transferring motion data to a skeleton. A simplified high-level Multimodal Marker Language (MML), in which only a few fields are used to coordinate the agent channels, is introduced to drive the agent. Experiments validate the effectiveness of the proposed multimodal framework.
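
The lip-movement step described in the abstract (searching a database for the closest acoustic features and reusing the paired visual data) can be illustrated with a minimal nearest-neighbour sketch. This is not the authors' implementation: the function name `match_lip_frames`, the toy `DATABASE`, the three-coefficient feature vectors and the two-value lip parameters are all assumptions made for illustration, and a real system would use full MFCC frames extracted from audio.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical database entry: (acoustic feature vector, lip parameter vector).
# The acoustic vector stands in for one MFCC frame; real frames have more
# coefficients than the three used in this toy example.
Entry = Tuple[Sequence[float], Sequence[float]]


def match_lip_frames(query_frames: Sequence[Sequence[float]],
                     database: Sequence[Entry]) -> List[List[float]]:
    """For each query frame, return the lip parameters of the database
    entry whose acoustic features are closest in Euclidean distance."""
    if not database:
        raise ValueError("database must contain at least one entry")
    matched = []
    for frame in query_frames:
        # Linear scan; math.dist computes the Euclidean distance.
        best = min(database, key=lambda entry: math.dist(frame, entry[0]))
        matched.append(list(best[1]))
    return matched


# Toy audio-visual database (all values invented for illustration).
DATABASE: List[Entry] = [
    ([0.0, 0.0, 0.0], [0.0, 0.0]),  # silence -> closed mouth
    ([1.0, 0.5, 0.2], [0.8, 0.3]),  # open vowel -> wide opening
    ([0.2, 1.0, 0.9], [0.3, 0.9]),  # rounded vowel -> rounded lips
]

# Three incoming acoustic frames, each close to one database entry.
query = [[0.9, 0.4, 0.1], [0.1, 0.1, 0.0], [0.3, 0.9, 1.0]]
lips = match_lip_frames(query, DATABASE)
print(lips)
```

A linear scan keeps the sketch short. With a large database, an index structure or a temporal smoothness constraint between consecutive frames would likely be needed, but the abstract does not say which, if any, the framework uses.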


Keywords: Multimodal 3D talking agent; Lip movement; Head motion; MFCC; Facial expression; Gesture animation



Supplementary material

(AVI 1.50 MB)

(AVI 667 kB)

(AVI 1.63 MB)

12193_2011_73_MOESM4_ESM.avi (AVI 2.74 MB)



Copyright information

© OpenInterface Association 2011

Authors and Affiliations

  • Minghao Yang (1), corresponding author
  • Jianhua Tao (1)
  • Kaihui Mu (1)
  • Ya Li (1)
  • Jianfeng Che (2)

  1. National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, P.R. China
  2. Institute of Automation, Chinese Academy of Sciences, Beijing, China
