Real-Time Visual Prosody for Interactive Virtual Agents

  • Herwin van Welbergen
  • Yu Ding
  • Kai Sattler
  • Catherine Pelachaud
  • Stefan Kopp
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9238)


Speakers accompany their speech with incessant, subtle head movements. It is important to implement such “visual prosody” in virtual agents, not only to make their behavior more natural, but also because it has been shown to help listeners understand speech. We contribute a visual prosody model for interactive virtual agents that are capable of live, non-scripted interactions with humans and must therefore use Text-To-Speech (TTS) rather than recorded speech. We present our method for generating visual prosody online from continuous TTS output, and we report results from three crowdsourcing experiments carried out to assess whether, and to what extent, it enhances the experience of interacting with an agent.
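To make the idea of online, speech-driven head motion concrete, the sketch below maps per-frame prosodic features to a head-pitch angle with exponential smoothing. This is a minimal illustration, not the model presented in the paper: the function name, the `(pitch, energy)` frame format, and the `gain` and `alpha` parameters are all assumptions introduced here for clarity.

```python
# Illustrative sketch (not the paper's model): drive a head-pitch
# angle from per-frame speech prosody features. Because each output
# frame depends only on past input, the mapping can run online,
# frame by frame, as TTS audio streams in.

def prosody_to_head_pitch(frames, gain=5.0, alpha=0.2):
    """frames: iterable of (pitch_hz, energy) tuples, one per audio frame.
    Returns a list of head-pitch angles (degrees), one per frame."""
    angles = []
    smoothed = 0.0
    for pitch_hz, energy in frames:
        # Louder, higher-pitched frames drive a larger nod target.
        target = gain * energy * (pitch_hz / 200.0)
        # Exponential smoothing keeps the resulting motion continuous.
        smoothed = (1 - alpha) * smoothed + alpha * target
        angles.append(smoothed)
    return angles
```

A real system of this kind would extract the features with a tool such as openSMILE and apply a learned feature-to-motion mapping rather than this hand-tuned linear one.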


Visual prosody · Nonverbal behavior · Real-time animation · Interactive agents



We would like to thank Kirsten Bergmann and Philipp Kulms for their feedback on the design of the study and their help with the evaluation of the results. This work was partially performed within the Labex SMART (ANR-11-LABX-65) supported by French state funds managed by the ANR within the Investissements d’Avenir programme under reference ANR-11-IDEX-0004-02. It was also partially funded by the EU H2020 project ARIA-VALUSPA; and by the German Federal Ministry of Education and Research (BMBF) within the Leading-Edge Cluster Competition, managed by the Project Management Agency Karlsruhe (PTKA). The authors are responsible for the contents of this publication.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Herwin van Welbergen (1, 2)
  • Yu Ding (2)
  • Kai Sattler (1, 3)
  • Catherine Pelachaud (2)
  • Stefan Kopp (1)
  1. Social Cognitive Systems Group, CITEC, Faculty of Technology, Bielefeld University, Bielefeld, Germany
  2. CNRS-LTCI, Télécom-ParisTech, Paris, France
  3. Department of Psychology, University of Bamberg, Bamberg, Germany
