Continuous interaction with a virtual human

Abstract

This paper presents our progress in developing a Virtual Human capable of being an attentive speaker. Such a Virtual Human should be able to attend to its interaction partner while it is speaking, and to modify its communicative behavior on the fly based on what it observes in the behavior of its partner. We report new developments concerning a number of aspects, such as scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response-eliciting behavior, and strategies for generating appropriate reactions to listener responses. Building on this progress, we implemented a task-based setup for a responsive Virtual Human and carried out two user studies, whose results are presented and discussed in this paper.
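
The abstract mentions automatic classification of listener responses; the paper's pipeline uses LIBSVM [9] with openSMILE features [17]. As a hedged illustration of the general shape of such a classifier (not the authors' actual features, data, or model), the sketch below trains an SVM to separate short vocal listener responses ("uh-huh", "mm") from full speaking turns. The feature names and distributions are invented synthetic stand-ins for real prosodic measurements.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
n = 200

# Invented features per speech segment: [duration_s, mean_pitch_hz, rms_energy].
# Listener responses are assumed short and quiet; speaking turns longer and louder.
listener_responses = np.column_stack([
    rng.normal(0.3, 0.1, n),    # short duration
    rng.normal(180, 30, n),
    rng.normal(0.2, 0.05, n),   # low energy
])
speaking_turns = np.column_stack([
    rng.normal(2.0, 0.5, n),    # longer duration
    rng.normal(160, 30, n),
    rng.normal(0.6, 0.1, n),    # higher energy
])
X = np.vstack([listener_responses, speaking_turns])
y = np.array([1] * n + [0] * n)  # 1 = listener response, 0 = speaking turn

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(f"held-out accuracy: {accuracy:.2f}")
```

In the real system the classification runs online under latency constraints [38], so the choice of features is restricted to those computable incrementally from the audio stream rather than over the whole segment.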

References

  1. Allwood J, Cerrato L (2003) A study of gestural feedback expressions. In: Paggio P, Jokinen K, Jönsson K (eds) 1st Nordic symposium on multimodal communication, pp 7–22

  2. Anderson AH, Bader M, Bard EG, Boyle E, Doherty-Sneddon G, Garrod S, Isard S, Kowtko JC, McAllister J, Miller J, Sotillo C, Thompson H, Weinert R (1991) The HCRC Map Task corpus. Lang Speech 34:351–366

  3. Bavelas JB, Coates L, Johnson T (2000) Listeners as co-narrators. J Pers Soc Psychol 79(6):941–952

  4. Bavelas JB, Coates L, Johnson T (2002) Listener responses as a collaborative process: The role of gaze. J Commun 52(3):566–580

  5. Benus S, Gravano A, Hirschberg J (2007) The prosody of backchannels in American English. In: Proceedings of the 16th international congress of phonetic sciences 2007, pp 1065–1068

  6. Black AW, Tokuda K, Zen H (2002) An HMM-based speech synthesis system applied to English. In: Proc of 2002 IEEE SSW, Santa Monica, CA, USA

  7. Brady PT (1968) A statistical analysis of on-off patterns in 16 conversations. Bell Syst Tech J 47:73–91

  8. Carletta JC, Isard S, Doherty-Sneddon G, Isard A, Kowtko JC, Anderson AH (1997) The reliability of a dialogue structure coding scheme. Comput Linguist 23(1):13–31

  9. Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

  10. Clark HH (1996) Using language. Cambridge University Press, Cambridge

  11. Clark HH, Brennan SE (1991) Grounding in communication. In: Resnick LB, Levine JM, Teasly SD (eds) Perspectives on socially shared cognition. American Psychological Association, Washington

  12. Clark HH, Krych MA (2004) Speaking while monitoring addressees for understanding. J Mem Lang 50(1):62–81. doi:10.1016/j.jml.2003.08.004

  13. Dhillon R, Bhagat S, Carvey H, Shriberg E (2004) Meeting recorder project: Dialog act labeling guide. Tech Rep ICSI Technical Report TR-04-002, International Computer Science Institute

  14. Duncan S Jr (1972) Some signals and rules for taking speaking turns in conversation. J Pers Soc Psychol 23(2)

  15. Duncan S Jr (1974) On the structure of speaker-auditor interaction during speaking turns. Lang Soc 3(2):161–180. doi:10.1017/s0047404500004322

  16. Edlund J, Heldner M, Al Moubayed S, Gravano A, Hirschberg J (2010) Very short utterances in conversation. In: Proceedings of fonetik

  17. Eyben F, Woellmer M, Schuller B (2010) openSMILE—the Munich versatile and fast open-source audio feature extractor. In: Proceedings of ACM multimedia, pp 1459–1462

  18. French P, Local J (1983) Turn-competitive incomings. J Pragmat 7:17–38

  19. Fujimoto DT (2007) Listener responses in interaction: a case for abandoning the term, backchannel. J Osaka Jogakuin 2 Year Coll 37:35–54

  20. Goldwater S, Jurafsky D, Manning CD (2010) Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates. Speech Commun 52:181–200

  21. Goodwin C (1981) Conversational organization: interaction between speakers and hearers. Academic Press, San Diego

  22. Goodwin C (1986) Between and within: alternative sequential treatments of continuers and assessments. Hum Stud 9(2–3):205–217. doi:10.1007/bf00148127

  23. Gravano A, Hirschberg J (2009) Backchannel-inviting cues in task-oriented dialogue. In: Proceedings of interspeech, Brighton, pp 1019–1022

  24. Gustafson J, Neiberg D (2010) Prosodic cues to engagement in non-lexical response tokens in Swedish. In: DiSS-LPSS Joint Workshop

  25. Heldner M, Edlund J (2010) Pauses, gaps and overlaps in conversations. J Phonetics 38(4):555–568. doi:10.1016/j.wocn.2010.08.002

  26. Heylen D (2006) Head gestures, gaze and the principles of conversational structure. Int J Humanoid Robot 3(3):241–267

  27. Heylen D, Bevacqua E, Tellier M, Pelachaud C (2007) Searching for prototypical facial feedback signals. In: Pelachaud C, Martin JC, André E, Chollet G, Karpouzis K, Pelé D (eds) Proceedings of the 7th international conference intelligent virtual agents. Lecture notes in computer science, vol 4722. Springer, Berlin, pp 147–153. doi:10.1007/978-3-540-74997-4_14

  28. Kendon A (1967) Some functions of gaze direction in social interaction. Acta Psychol 26:22–63

  29. de Kok I, Heylen D (2011) The MultiLis corpus—dealing with individual differences of nonverbal listening behavior. In: Proceedings of COST 2102: toward autonomous, adaptive, and context-aware multimodal interfaces: theoretical and practical issues, pp 362–375

  30. Kopp S (2010) Social resonance and embodied coordination in face-to-face conversation with artificial interlocutors. Speech Commun 52(6):587–597. doi:10.1016/j.specom.2010.02.007

  31. Kopp S, Krenn B, Marsella SC, Marshall AN, Pelachaud C, Pirker H, Thórisson KR, Vilhjálmsson HH (2006) Towards a common framework for multimodal generation: the behavior markup language. In: Gratch J, Young MR, Aylett RS, Ballin D, Olivier P (eds) Proceedings of the 6th international conference on intelligent virtual agents. Lecture notes in computer science, vol 4133. Springer, Berlin, pp 205–217

  32. Kurtic E, Brown GJ, Wells B (2010) Resources for turn competition in overlap in multi-party conversations: speech rate, pausing and duration. In: Proceedings of interspeech, pp 2550–2553

  33. Lee CC, Lee S, Narayanan SS (2008) An analysis of multimodal cues of interruption in dyadic spoken interactions. In: Proceedings of interspeech, pp 1678–1681

  34. ter Maat M, Truong KP, Heylen D (2010) How turn-taking strategies influence users’ impressions of an agent. In: Allbeck J, Badler NI, Bickmore T, Pelachaud C, Safonova A (eds) Proceedings of the 10th international conference on intelligent virtual agents, Philadelphia, Pennsylvania, USA. Lecture notes in computer science, vol 6356. Springer, Berlin, pp 441–453. doi:10.1007/978-3-642-15892-6_48

  35. Manusov V, Trees AR (2002) “Are you kidding me?”: The role of nonverbal cues in the verbal accounting process. J Commun 52(3):640–656. doi:10.1111/j.1460-2466.2002.tb02566.x

  36. McKinney MF, Moelants D, Davies MEP, Klapuri A (2007) Evaluation of audio beat tracking and music tempo extraction algorithms. J New Music Res 36(1):1–16

  37. Neiberg D, Gustafson J (2010) The prosody of Swedish conversational grunts. In: Proc of Interspeech

  38. Neiberg D, Truong KP (2011) Online detection of vocal listener responses with maximum latency constraints. In: Proc of ICASSP 2011

  39. Nijholt A, Reidsma D, van Welbergen H, op den Akker H, Ruttkay ZM (2008) Mutually coordinated anticipatory multimodal interaction. In: Esposito A, Bourbakis NG, Avouris N, Hatzilygeroudis I (eds) Verbal and nonverbal features of human-human and human-machine interaction. Lecture notes in computer science, vol 5042. Springer, Berlin, pp 70–89

  40. Norwine AC, Murphy OJ (1938) Characteristic time intervals in telephonic conversation. Bell Syst Tech J 17:281–291

  41. Reidsma D (2008) Annotations and subjective machines—of annotators, embodied agents, users, and other humans. PhD thesis, University of Twente. doi:10.3990/1.9789036527262

  42. Reidsma D, Truong K, van Welbergen H, Neiberg D, Pammi S, de Kok I, van Straalen B (2010) Continuous interaction with a virtual human. In: Salah AA, Gevers T (eds) Proceedings of the eNTERFACE’10 summer workshop on multimodal interfaces, pp 24–39

  43. Sacks H, Schegloff E, Jefferson G (1974) A simplest systematics for the organization of turn-taking for conversation. Language 50:696–735

  44. Schegloff E (2000) Overlapping talk and the organization of turn-taking for conversation. Lang Soc 29:1–63

  45. Schlangen D, Skantze G (2009) A general, abstract model of incremental dialogue processing. In: Proceedings of the 12th conference of the European chapter of the Association for Computational Linguistics (EACL-09)

  46. Schröder M (2010) The SEMAINE API: Towards a standards-based framework for building emotion-oriented systems. Adv Hum-Comput Interact 2010:319406. doi:10.1155/2010/319406

  47. Schröder M, Trouvain J (2003) The German text-to-speech synthesis system MARY: a tool for research, development and teaching. Int J Speech Technol 6(4):365–377

  48. Schröder M, Charfuelan M, Pammi S, Türk O (2008) The MARY TTS entry in the Blizzard Challenge 2008. In: Proc of the Blizzard Challenge

  49. Skantze G, Hjalmarsson A (2010) Towards incremental speech generation in dialogue systems. In: Proceedings of SIGdial

  50. Thiebaux M, Marshall AN, Marsella SC, Kallmann M (2008) Smartbody: Behavior realization for embodied conversational agents. In: Proceedings of the 7th international conference on autonomous agents and multiagent systems, pp 151–158

  51. Thórisson KR (2002) Natural turn-taking needs no manual: Computational theory and model, from perception to action. In: Granström B, House D, Karlsson I (eds) Multimodality in language and speech systems. Kluwer Academic, Dordrecht, pp 173–207

  52. Toda T, Tokuda K (2007) A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans Inf Syst E90-D(5):816–824

  53. Walker MB, Trimboli C (1982) Smooth transitions in conversational interactions. J Soc Psychol 117:305–306

  54. Ward N (2006) Non-lexical conversational sounds in American English. Pragmat Cogn 14(1):129–182

  55. Ward N, Tsukahara W (2000) Prosodic features which cue back-channel responses in English and Japanese. J Pragmat 32(8):1177–1207

  56. van Welbergen H, Reidsma D, Ruttkay ZM, Zwiers J (2010a) Elckerlyc: A BML realizer for continuous, multimodal interaction with a virtual human. J Multimodal User Interfaces 3(4):271–284. doi:10.1007/s12193-010-0051-3

  57. van Welbergen H, Reidsma D, Zwiers J (2010b) A demonstration of continuous interaction with Elckerlyc. In: Proceedings of the third workshop on multimodal output generation, CTIT Workshop Proceedings. vol WP2010, pp 51–57

Author information


Corresponding author

Correspondence to Dennis Reidsma.

Additional information

This paper is based upon a project report of the eNTERFACE’10 Summer Workshop on Multimodal Interfaces [42].

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Reidsma, D., de Kok, I., Neiberg, D. et al. Continuous interaction with a virtual human. J Multimodal User Interfaces 4, 97–118 (2011). https://doi.org/10.1007/s12193-011-0060-x

Keywords

  • Virtual humans
  • Attentive speaking
  • Listener responses
  • Continuous interaction