Journal on Multimodal User Interfaces, Volume 8, Issue 1, pp 97–108

An architecture for fluid real-time conversational agents: integrating incremental output generation and input processing

  • Stefan Kopp
  • Herwin van Welbergen
  • Ramin Yaghoubzadeh
  • Hendrik Buschmeier
Original Paper


Abstract

Embodied conversational agents still do not achieve the fluidity and smoothness of natural conversational interaction. One main reason is that current systems often respond with large latencies and in inflexible ways. We argue that, to overcome these problems, real-time conversational agents need to be built on an underlying architecture that provides two essential features for fast and fluent behavior adaptation: a close bi-directional coordination between input processing and output generation, and incrementality of processing at both stages. We propose an architectural framework for conversational agents, the Artificial Social Agent Platform (ASAP), that provides these two ingredients for fluid real-time conversation. The overall architectural concept is described, along with specific means of specifying incremental behavior in BML and technical implementations of the different modules. We show how phenomena of fluid real-time conversation, such as adapting to user feedback or smooth turn-keeping, can be realized with ASAP, and we describe in detail an example real-time interaction with the implemented system.


Keywords

Embodied conversational agents architecture · Fluid real-time interaction · Generation–interpretation coordination · Incremental processing · ASAP · BMLA



This research is supported by the Deutsche Forschungsgemeinschaft (DFG) in the Center of Excellence EXC 277 in “Cognitive Interaction Technology” (CITEC) as well as the German Federal Ministry of Education and Research (BMBF) within the Leading-Edge Cluster it’s OWL, managed by the Project Management Agency Karlsruhe (PTKA). The authors are responsible for the content of this publication.



Copyright information

© OpenInterface Association 2013

Authors and Affiliations

  • Stefan Kopp (1)
  • Herwin van Welbergen (1)
  • Ramin Yaghoubzadeh (1)
  • Hendrik Buschmeier (1)

  1. Sociable Agents Group, CITEC and Faculty of Technology, Bielefeld University, Bielefeld, Germany
