Journal on Multimodal User Interfaces, Volume 4, Issue 2, pp 97–118

Continuous interaction with a virtual human

  • Dennis Reidsma
  • Iwan de Kok
  • Daniel Neiberg
  • Sathish Chandra Pammi
  • Bart van Straalen
  • Khiet Truong
  • Herwin van Welbergen
Open Access
Original Paper


This paper presents our progress in developing a Virtual Human capable of being an attentive speaker. Such a Virtual Human should be able to attend to its interaction partner while it is speaking, and to modify its communicative behavior on the fly based on what it observes in the partner's behavior. We report new developments concerning several aspects of this capability: scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response-eliciting behavior, and strategies for reacting appropriately to listener responses. Building on this progress, we implemented a task-based setup for a responsive Virtual Human and used it to carry out two user studies, whose results are presented and discussed in this paper.


Keywords: Virtual humans · Attentive speaking · Listener responses · Continuous interaction



Copyright information

© The Author(s) 2011

Open Access. This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Authors and Affiliations

  • Dennis Reidsma (1, corresponding author)
  • Iwan de Kok (1)
  • Daniel Neiberg (2)
  • Sathish Chandra Pammi (3)
  • Bart van Straalen (1)
  • Khiet Truong (1)
  • Herwin van Welbergen (1)
  1. Human Media Interaction, University of Twente, Enschede, Netherlands
  2. Dept. of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
  3. Language Technology Lab, German Research Center for Artificial Intelligence (DFKI), Saarbruecken, Germany