Low-level grounding in a multimodal mobile service robot conversational system using graphical models


The main task of a service robot with a voice-enabled communication interface is to engage a user in a dialogue that provides access to the services the robot is designed for. In managing such an interaction, the key issue is inferring the user goal (intention) from the request for a service at each dialogue turn. Under service robot deployment conditions, the limitations of speech recognition with noisy speech input and inexperienced users may jeopardize user goal identification. In this paper, we introduce a grounding state-based model motivated by reducing the risk of communication failure due to incorrect user goal identification. The model exploits the multiple modalities available in the service robot system to provide evidence for reaching grounding states. For the robot to treat the speech input as sufficiently grounded (correctly understood), four proposed grounding states have to be reached. Bayesian networks that combine speech and non-speech modalities during user goal identification are used to estimate the probability that each grounding state has been reached. These probabilities serve as a basis for detecting whether the user is attending to the conversation, as well as for deciding on an alternative input modality (e.g., buttons) when the speech modality is unreliable. The Bayesian networks used in the grounding model are specially designed for modularity and computationally efficient inference. The potential of the proposed model is demonstrated by comparing a conversational system for the mobile service robot RoboX that employs only speech recognition for user goal identification with a system equipped with multimodal grounding. The evaluation experiments use component-level and system-level metrics for technical (objective) and user-based (subjective) evaluation, with multimodal data collected during conversations between the robot RoboX and users.
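The idea of fusing speech and non-speech evidence to estimate the probability of a grounding state can be illustrated with a minimal sketch. It is not the authors' network: it assumes a single binary state ("user is attending") with two conditionally independent binary observations, and every name, probability value, and the decision threshold below is hypothetical and chosen only for illustration.

```python
# Minimal sketch of Bayesian evidence fusion for one grounding state.
# All numbers are illustrative assumptions, not values from the paper.

P_ATTENDING = 0.5  # assumed prior P(attending)

# Assumed conditional probabilities P(observation is True | attending)
P_FACE_GIVEN = {True: 0.9, False: 0.2}    # a face is detected in front of the robot
P_SPEECH_GIVEN = {True: 0.8, False: 0.3}  # the recognizer reports high confidence


def _likelihood(attending: bool, face_detected: bool, speech_confident: bool) -> float:
    """P(face, speech | attending), assuming the observations are
    conditionally independent given the state."""
    p_face = P_FACE_GIVEN[attending] if face_detected else 1.0 - P_FACE_GIVEN[attending]
    p_speech = P_SPEECH_GIVEN[attending] if speech_confident else 1.0 - P_SPEECH_GIVEN[attending]
    return p_face * p_speech


def posterior_attending(face_detected: bool, speech_confident: bool) -> float:
    """P(attending | face, speech) by Bayes' rule."""
    num = P_ATTENDING * _likelihood(True, face_detected, speech_confident)
    den = num + (1.0 - P_ATTENDING) * _likelihood(False, face_detected, speech_confident)
    return num / den


def choose_modality(p_state: float, threshold: float = 0.7) -> str:
    """Fall back to buttons when the state probability is below a
    hypothetical threshold."""
    return "speech" if p_state >= threshold else "buttons"


if __name__ == "__main__":
    p = posterior_attending(face_detected=True, speech_confident=False)
    print(f"P(attending | evidence) = {p:.3f} -> use {choose_modality(p)}")
```

A full grounding model would hold one such estimate per grounding state and could use richer network structure; the sketch only shows how a posterior over a state can drive the choice between speech and an alternative input modality.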

Author information



Corresponding author

Correspondence to Plamen Prodanov.

About this article

Cite this article

Prodanov, P., Drygajlo, A., Richiardi, J. et al. Low-level grounding in a multimodal mobile service robot conversational system using graphical models. Intel Serv Robotics 1, 3–26 (2008). https://doi.org/10.1007/s11370-006-0001-9


  • Service robots
  • Spoken interaction
  • Grounding
  • Bayesian networks
  • Efficient inference