Abstract
This paper presents our progress in developing a Virtual Human capable of being an attentive speaker. Such a Virtual Human should be able to attend to its interaction partner while it is speaking, and to modify its communicative behavior on the fly based on what it observes in its partner's behavior. We report new developments in several areas, including scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response-eliciting behavior, and strategies for generating appropriate reactions to listener responses. Building on this progress, we implemented a task-based setup for a responsive Virtual Human and used it to carry out two user studies, whose results are presented and discussed in this paper.
References
Allwood J, Cerrato L (2003) A study of gestural feedback expressions. In: Paggio P, Jokinen K, Jönsson K (eds) 1st Nordic symposium on multimodal communication, pp 7–22
Anderson AH, Bader M, Bard EG, Boyle E, Doherty-Sneddon G, Garrod S, Isard S, Kowtko JC, McAllister J, Miller J, Sotillo C, Thompson H, Weinert R (1991) The HCRC Map Task corpus. Lang Speech 34:351–366
Bavelas JB, Coates L, Johnson T (2000) Listeners as co-narrators. J Pers Soc Psychol 79(6):941–952
Bavelas JB, Coates L, Johnson T (2002) Listener responses as a collaborative process: The role of gaze. J Commun 52(3):566–580
Benus S, Gravano A, Hirschberg J (2007) The prosody of backchannels in American English. In: Proceedings of the 16th international congress of phonetic sciences 2007, pp 1065–1068
Black AW, Tokuda K, Zen H (2002) An HMM-based speech synthesis system applied to English. In: Proc of 2002 IEEE SSW, Santa Monica, CA, USA
Brady PT (1968) A statistical analysis of on-off patterns in 16 conversations. Bell Syst Tech J 47:73–91
Carletta JC, Isard S, Doherty-Sneddon G, Isard A, Kowtko JC, Anderson AH (1997) The reliability of a dialogue structure coding scheme. Comput Linguist 23(1):13–31
Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Clark HH (1996) Using language. Cambridge University Press, Cambridge
Clark HH, Brennan SE (1991) Grounding in communication. In: Resnick LB, Levine JM, Teasley SD (eds) Perspectives on socially shared cognition. American Psychological Association, Washington
Clark HH, Krych MA (2004) Speaking while monitoring addressees for understanding. J Mem Lang 50(1):62–81. doi:10.1016/j.jml.2003.08.004
Dhillon R, Bhagat S, Carvey H, Shriberg E (2004) Meeting recorder project: Dialog act labeling guide. Tech Rep TR-04-002, International Computer Science Institute
Duncan S Jr (1972) Some signals and rules for taking speaking turns in conversation. J Pers Soc Psychol 23(2)
Duncan S Jr (1974) On the structure of speaker-auditor interaction during speaking turns. Lang Soc 3(2):161–180. doi:10.1017/s0047404500004322
Edlund J, Heldner M, Al Moubayed S, Gravano A, Hirschberg J (2010) Very short utterances in conversation. In: Proceedings of fonetik
Eyben F, Woellmer M, Schuller B (2010) openSMILE—the Munich versatile and fast open-source audio feature extractor. In: Proceedings of ACM multimedia, pp 1459–1462
French P, Local J (1983) Turn-competitive incomings. J Pragmat 7:17–38
Fujimoto DT (2007) Listener responses in interaction: a case for abandoning the term, backchannel. J Osaka Jogakuin 2 Year Coll 37:35–54
Goldwater S, Jurafsky D, Manning CD (2010) Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates. Speech Commun 52:181–200
Goodwin C (1981) Conversational organization: interaction between speakers and hearers. Academic Press, San Diego
Goodwin C (1986) Between and within: alternative sequential treatments of continuers and assessments. Hum Stud 9(2–3):205–217. doi:10.1007/bf00148127
Gravano A, Hirschberg J (2009) Backchannel-inviting cues in task-oriented dialogue. In: Proceedings of interspeech, Brighton, pp 1019–1022
Gustafson J, Neiberg D (2010) Prosodic cues to engagement in non-lexical response tokens in Swedish. In: DiSS-LPSS Joint Workshop
Heldner M, Edlund J (2010) Pauses, gaps and overlaps in conversations. J Phonetics 38(4):555–568. doi:10.1016/j.wocn.2010.08.002
Heylen D (2006) Head gestures, gaze and the principles of conversational structure. Int J Humanoid Robot 3(3):241–267
Heylen D, Bevacqua E, Tellier M, Pelachaud C (2007) Searching for prototypical facial feedback signals. In: Pelachaud C, Martin JC, André E, Chollet G, Karpouzis K, Pelé D (eds) Proceedings of the 7th international conference intelligent virtual agents. Lecture notes in computer science, vol 4722. Springer, Berlin, pp 147–153. doi:10.1007/978-3-540-74997-4_14
Kendon A (1967) Some functions of gaze direction in social interaction. Acta Psychol 26:22–63
de Kok I, Heylen D (2011) The MultiLis corpus—dealing with individual differences of nonverbal listening behavior. In: Proceedings of COST 2102: toward autonomous, adaptive, and context-aware multimodal interfaces: theoretical and practical issues, pp 362–375
Kopp S (2010) Social resonance and embodied coordination in face-to-face conversation with artificial interlocutors. Speech Commun 52(6):587–597. doi:10.1016/j.specom.2010.02.007
Kopp S, Krenn B, Marsella SC, Marshall AN, Pelachaud C, Pirker H, Thórisson KR, Vilhjálmsson HH (2006) Towards a common framework for multimodal generation: the behavior markup language. In: Gratch J, Young MR, Aylett RS, Ballin D, Olivier P (eds) Proceedings of the 6th international conference on intelligent virtual agents. Lecture notes in computer science, vol 4133. Springer, Berlin, pp 205–217
Kurtic E, Brown GJ, Wells B (2010) Resources for turn competition in overlap in multi-party conversations: speech rate, pausing and duration. In: Proceedings of interspeech, pp 2550–2553
Lee CC, Lee S, Narayanan SS (2008) An analysis of multimodal cues of interruption in dyadic spoken interactions. In: Proceedings of interspeech, pp 1678–1681
ter Maat M, Truong KP, Heylen D (2010) How turn-taking strategies influence users’ impressions of an agent. In: Allbeck J, Badler NI, Bickmore T, Pelachaud C, Safonova A (eds) Proceedings of the 10th international conference on intelligent virtual agents, Philadelphia, Pennsylvania, USA. Lecture notes in computer science, vol 6356. Springer, Berlin, pp 441–453. doi:10.1007/978-3-642-15892-6_48
Manusov V, Trees AR (2002) “Are you kidding me?”: The role of nonverbal cues in the verbal accounting process. J Commun 52(3):640–656. doi:10.1111/j.1460-2466.2002.tb02566.x
McKinney MF, Moelants D, Davies MEP, Klapuri A (2007) Evaluation of audio beat tracking and music tempo extraction algorithms. J New Music Res 36(1):1–16
Neiberg D, Gustafson J (2010) The prosody of Swedish conversational grunts. In: Proc of Interspeech
Neiberg D, Truong KP (2011) Online detection of vocal listener responses with maximum latency constraints. In: Proc of ICASSP 2011
Nijholt A, Reidsma D, van Welbergen H, op den Akker H, Ruttkay ZM (2008) Mutually coordinated anticipatory multimodal interaction. In: Esposito A, Bourbakis NG, Avouris N, Hatzilygeroudis I (eds) Verbal and nonverbal features of human-human and human-machine interaction. Lecture notes in computer science, vol 5042. Springer, Berlin, pp 70–89
Norwine AC, Murphy OJ (1938) Characteristic time intervals in telephonic conversation. Bell Syst Tech J 17:281–291
Reidsma D (2008) Annotations and subjective machines—of annotators, embodied agents, users, and other humans. PhD thesis, University of Twente. doi:10.3990/1.9789036527262
Reidsma D, Truong K, van Welbergen H, Neiberg D, Pammi S, de Kok I, van Straalen B (2010) Continuous interaction with a virtual human. In: Salah AA, Gevers T (eds) Proceedings of the eNTERFACE’10 summer workshop on multimodal interfaces, pp 24–39
Sacks H, Schegloff E, Jefferson G (1974) A simplest systematics for the organization of turn-taking for conversation. Language 50:696–735
Schegloff E (2000) Overlapping talk and the organization of turn-taking for conversation. Lang Soc 29:1–63
Schlangen D, Skantze G (2009) A general, abstract model of incremental dialogue processing. In: Proceedings of the 12th conference of the European chapter of the Association for Computational Linguistics (EACL-09)
Schröder M (2010) The SEMAINE API: Towards a standards-based framework for building emotion-oriented systems. Adv Hum-Comput Interact 2010:319406. doi:10.1155/2010/319406
Schröder M, Trouvain J (2003) The German text-to-speech synthesis system MARY: a tool for research, development and teaching. Int J Speech Technol 6(4):365–377
Schröder M, Charfuelan M, Pammi S, Türk O (2008) The MARY TTS entry in the Blizzard Challenge 2008. In: Proc of the Blizzard Challenge
Skantze G, Hjalmarsson A (2010) Towards incremental speech generation in dialogue systems. In: Proceedings of SIGdial
Thiebaux M, Marshall AN, Marsella SC, Kallmann M (2008) Smartbody: Behavior realization for embodied conversational agents. In: Proceedings of the 7th international conference on autonomous agents and multiagent systems, pp 151–158
Thórisson KR (2002) Natural turn-taking needs no manual: Computational theory and model, from perception to action. In: Granström B, House D, Karlsson I (eds) Multimodality in language and speech systems. Kluwer Academic, Dordrecht, pp 173–207
Toda T, Tokuda K (2007) A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans Inf Syst E90-D(5):816–824
Walker MB, Trimboli C (1982) Smooth transitions in conversational interactions. J Soc Psychol 117:305–306
Ward N (2006) Non-lexical conversational sounds in American English. Pragmat Cogn 14(1):129–182
Ward N, Tsukahara W (2000) Prosodic features which cue back-channel responses in English and Japanese. J Pragmat 32(8):1177–1207
van Welbergen H, Reidsma D, Ruttkay ZM, Zwiers J (2010a) Elckerlyc: A BML realizer for continuous, multimodal interaction with a virtual human. J Multimodal User Interfaces 3(4):271–284. doi:10.1007/s12193-010-0051-3
van Welbergen H, Reidsma D, Zwiers J (2010b) A demonstration of continuous interaction with Elckerlyc. In: Proceedings of the third workshop on multimodal output generation. CTIT Workshop Proceedings, vol WP2010, pp 51–57
Additional information
This paper is based upon a project report of the eNTERFACE’10 Summer Workshop on Multimodal Interfaces [42].
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Reidsma, D., de Kok, I., Neiberg, D. et al. Continuous interaction with a virtual human. J Multimodal User Interfaces 4, 97–118 (2011). https://doi.org/10.1007/s12193-011-0060-x
DOI: https://doi.org/10.1007/s12193-011-0060-x