
Abstract

In this paper, we argue that embodiment can play an important role in the evaluation of systems developed for Human Computer Interaction. To this end, we describe a simulation platform for building Embodied Human Computer Interactions (EHCI). This system, VoxWorld, enables multimodal dialogue systems that communicate through language, gesture, action, facial expressions, and gaze tracking, in the context of task-oriented interactions. A multimodal simulation is an embodied 3D virtual realization of both the situational environment and the co-situated agents, as well as the most salient content denoted by communicative acts in a discourse. It is built on the modeling language VoxML, which encodes objects with rich semantic typing and action affordances, and actions themselves as multimodal programs, enabling contextually salient inferences and decisions in the environment. Through simulation experiments in VoxWorld, we can begin to identify and then evaluate the diverse parameters involved in multimodal communication between agents. VoxWorld enables embodied HCI by situating both human and computational agents within the same virtual simulation environment, where they share perceptual and epistemic common ground. In this first part of the paper series, we discuss the consequences of embodiment and common ground, how they help evaluate parameters of the interaction between humans and agents, and how different behaviors and types of interactions can be demonstrated with different classes of agents.
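The abstract describes VoxML as encoding objects with semantic types and action affordances that support contextual inference. As a schematic illustration only, the following minimal sketch shows the general idea in Python; the `VoxObject` structure and its field names are hypothetical stand-ins, not the actual VoxML specification:

```python
from dataclasses import dataclass, field

# Illustrative sketch: a minimal stand-in for the kind of typed object
# entry VoxML encodes. Field names are hypothetical, not the VoxML spec.

@dataclass
class VoxObject:
    name: str
    semantic_type: str                                # e.g., a physical-object type
    habitats: list = field(default_factory=list)      # configurations the object can occupy
    affordances: list = field(default_factory=list)   # actions the object affords

def affords(obj: VoxObject, action: str) -> bool:
    """Check whether this object can participate in the given action."""
    return action in obj.affordances

cup = VoxObject(
    name="cup",
    semantic_type="physobj",
    habitats=["upright"],
    affordances=["grasp", "lift", "fill"],
)

print(affords(cup, "fill"))   # True: filling is among the cup's affordances
print(affords(cup, "open"))   # False: a cup does not afford opening
```

In an environment like the VoxWorld platform described here, such typed entries would let an agent rule out infelicitous commands (e.g., "open the cup") before attempting to simulate them.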

This work was supported by Contract W911NF-15-C-0238 with the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO). Approved for Public Release, Distribution Unlimited. The views expressed herein are ours and do not reflect the official policy or position of the Department of Defense or the U.S. Government. We would like to thank Ken Lai, Bruce Draper, Ross Beveridge, and Francisco Ortega for their comments and suggestions.



Notes

  1.

    This is similar in many respects to the representations introduced in [16, 32] and [20] for modeling action and control with robots.

References

  1. Anderson, M.L.: Embodied cognition: a field guide. Artif. Intell. 149(1), 91–130 (2003)

  2. Andrist, S., Gleicher, M., Mutlu, B.: Looking coordinated: bidirectional gaze mechanisms for collaborative interaction with virtual characters. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI 2017), pp. 2571–2582. ACM, New York (2017). https://doi.org/10.1145/3025453.3026033

  3. Asher, N.: Common ground, corrections and coordination. J. Semant. (1998)

  4. Asher, N., Lascarides, A.: Logics of Conversation. Cambridge University Press, Cambridge (2003)

  5. Asher, N., Pogodalla, S.: SDRT and continuation semantics. In: Onada, T., Bekki, D., McCready, E. (eds.) JSAI-isAI 2010. LNCS (LNAI), vol. 6797, pp. 3–15. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25655-4_2

  6. Barsalou, L.W.: Perceptions of perceptual symbols. Behav. Brain Sci. 22(4), 637–660 (1999)

  7. Bergen, B.K.: Louder than Words: The New Science of How the Mind Makes Meaning. Basic Books, New York (2012)

  8. Bolt, R.A.: "Put-that-there": voice and gesture at the graphics interface, vol. 14. ACM (1980)

  9. Brennan, S.E., Chen, X., Dickinson, C.A., Neider, M.B., Zelinsky, G.J.: Coordinating cognition: the costs and benefits of shared gaze during collaborative search. Cognition 106(3), 1465–1477 (2008). https://doi.org/10.1016/j.cognition.2007.05.012

  10. Cassell, J.: Embodied Conversational Agents. MIT Press, Cambridge (2000)

  11. Cassell, J., Stone, M., Yan, H.: Coordination and context-dependence in the generation of embodied conversation. In: Proceedings of the First International Conference on Natural Language Generation, vol. 14, pp. 171–178. Association for Computational Linguistics (2000)

  12. Chrisley, R.: Embodied artificial intelligence. Artif. Intell. 149(1), 131–150 (2003)

  13. Clair, A.S., Mead, R., Matarić, M.J., et al.: Monitoring and guiding user attention and intention in human-robot interaction. In: ICRA-ICAIR Workshop, Anchorage, AK, USA, vol. 1025 (2010)

  14. Clark, H.H., Brennan, S.E.: Grounding in communication. In: Resnick, L.B., Levine, J.M., Teasley, S.D. (eds.) Perspectives on Socially Shared Cognition, vol. 13, pp. 127–149. American Psychological Association, Washington DC (1991)

  15. Clark, H.H., Wilkes-Gibbs, D.: Referring as a collaborative process. Cognition 22(1), 1–39 (1986). https://doi.org/10.1016/0010-0277(86)90010-7

  16. Cooper, R., Ginzburg, J.: Type theory with records for natural language semantics. In: Lappin, S., Fox, C. (eds.) The Handbook of Contemporary Semantic Theory, p. 375. Wiley, Hoboken (2015)

  17. Craik, K.J.W.: The Nature of Explanation. Cambridge University Press, Cambridge (1943)

  18. De Groote, P.: Type raising, continuations, and classical logic. In: Proceedings of the Thirteenth Amsterdam Colloquium, pp. 97–101 (2001)

  19. Dillenbourg, P., Traum, D.: Sharing solutions: persistence and grounding in multimodal collaborative problem solving. J. Learn. Sci. 15(1), 121–151 (2006)

  20. Dobnik, S., Cooper, R., Larsson, S.: Modelling language, action, and perception in type theory with records. In: Duchier, D., Parmentier, Y. (eds.) CSLP 2012. LNCS, vol. 8114, pp. 70–91. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41578-4_5

  21. Dumas, B., Lalanne, D., Oviatt, S.: Multimodal interfaces: a survey of principles, models and frameworks. In: Lalanne, D., Kohlas, J. (eds.) Human Machine Interaction. LNCS, vol. 5440, pp. 3–26. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-00437-7_1

  22. Eisenstein, J., Barzilay, R., Davis, R.: Discourse topic and gestural form. In: AAAI, pp. 836–841 (2008)

  23. Eisenstein, J., Barzilay, R., Davis, R.: Gesture salience as a hidden variable for coreference resolution and keyframe extraction. J. Artif. Intell. Res. 31, 353–398 (2008)

  24. Evans, V.: Language and Time: A Cognitive Linguistics Approach. Cambridge University Press, Cambridge (2013)

  25. Feldman, J.: Embodied language, best-fit analysis, and formal compositionality. Phys. Life Rev. 7(4), 385–410 (2010)

  26. Fernando, T.: Situations in LTL as strings. Inf. Comput. 207(10), 980–999 (2009)

  27. Fussell, S.R., Kraut, R.E., Siegel, J.: Coordination of communication: effects of shared visual context on collaborative work. In: Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work (CSCW 2000), pp. 21–30. ACM, New York (2000). https://doi.org/10.1145/358916.358947

  28. Fussell, S.R., Setlock, L.D., Yang, J., Ou, J., Mauer, E., Kramer, A.D.I.: Gestures over video streams to support remote collaboration on physical tasks. Hum. Comput. Interact. 19(3), 273–309 (2004). https://doi.org/10.1207/s15327051hci1903_3

  29. Gergle, D., Kraut, R.E., Fussell, S.R.: Action as language in a shared visual space. In: Proceedings of the 2004 ACM Conference on Computer Supported Cooperative Work (CSCW 2004), pp. 487–496. ACM, New York (2004). https://doi.org/10.1145/1031607.1031687

  30. Gibson, J.J., Reed, E.S., Jones, R.: Reasons for Realism: Selected Essays of James J. Gibson. Lawrence Erlbaum Associates, Mahwah (1982)

  31. Gilbert, M.: On Social Facts. Princeton University Press, Princeton (1992)

  32. Ginzburg, J., Fernández, R.: Computational models of dialogue. In: Clark, A., Fox, C., Lappin, S. (eds.) The Handbook of Computational Linguistics and Natural Language Processing, vol. 57, p. 1. Wiley, Hoboken (2010)

  33. Goldman, A.I.: Interpretation psychologized. Mind Lang. 4(3), 161–185 (1989)

  34. Goldman, A.I.: Simulating Minds: The Philosophy, Psychology, and Neuroscience of Mindreading. Oxford University Press, Oxford (2006)

  35. Gordon, R.M.: Folk psychology as simulation. Mind Lang. 1(2), 158–171 (1986)

  36. Graesser, A.C., Singer, M., Trabasso, T.: Constructing inferences during narrative text comprehension. Psychol. Rev. 101(3), 371 (1994)

  37. Heal, J.: Simulation, theory, and content. In: Carruthers, P., Smith, P.K. (eds.) Theories of Theories of Mind, pp. 75–89. Cambridge University Press, Cambridge (1996)

  38. Johnson-Laird, P.N., Byrne, R.M.: Conditionals: a theory of meaning, pragmatics, and inference. Psychol. Rev. 109(4), 646 (2002)

  39. Johnson-Laird, P.: How could consciousness arise from the computations of the brain. In: Mindwaves, pp. 247–257. Basil Blackwell, Oxford (1987)

  40. Kendon, A.: Gesture: Visible Action as Utterance. Cambridge University Press, Cambridge (2004)

  41. Kennington, C., Kousidis, S., Schlangen, D.: Interpreting situated dialogue utterances: an update model that uses speech, gaze, and gesture information. In: Proceedings of SigDial 2013 (2013)

  42. Kiela, D., Bulat, L., Vero, A.L., Clark, S.: Virtual embodiment: a scalable long-term strategy for artificial intelligence research. arXiv preprint arXiv:1610.07432 (2016)

  43. Kraut, R.E., Fussell, S.R., Siegel, J.: Visual information as a conversational resource in collaborative physical tasks. Hum. Comput. Interact. 18(1), 13–49 (2003). https://doi.org/10.1207/S15327051HCI1812_2

  44. Krishnaswamy, N., Pustejovsky, J.: Multimodal semantic simulations of linguistically underspecified motion events. In: Barkowsky, T., Burte, H., Hölscher, C., Schultheis, H. (eds.) Spatial Cognition/KogWis 2016. LNCS (LNAI), vol. 10523, pp. 177–197. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68189-4_11

  45. Krishnaswamy, N., Pustejovsky, J.: Multimodal continuation-style architectures for human-robot interaction. arXiv preprint arXiv:1909.08161 (2019)

  46. Lascarides, A., Stone, M.: Formal semantics for iconic gesture. In: Proceedings of the 10th Workshop on the Semantics and Pragmatics of Dialogue (BRANDIAL), pp. 64–71 (2006)

  47. Lascarides, A., Stone, M.: Discourse coherence and gesture interpretation. Gesture 9(2), 147–180 (2009). https://doi.org/10.1075/gest.9.2.01las

  48. Lascarides, A., Stone, M.: A formal semantic analysis of gesture. J. Semant. 26, 393–449 (2009)

  49. Lücking, A., Mehler, A., Walther, D., Mauri, M., Kurfürst, D.: Finding recurrent features of image schema gestures: the figure corpus. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 1426–1431 (2016)

  50. Lücking, A., Pfeiffer, T., Rieser, H.: Pointing and reference reconsidered. J. Pragmat. 77, 56–79 (2015)

  51. Marshall, P., Hornecker, E.: Theories of embodiment in HCI. In: Price, S., Jewitt, C., Brown, B. (eds.) The SAGE Handbook of Digital Technology Research, vol. 1, pp. 144–158. Sage, Thousand Oaks (2013)

  52. Matuszek, C., Bo, L., Zettlemoyer, L., Fox, D.: Learning from unscripted deictic gesture and language for human-robot interactions. In: AAAI, pp. 2556–2563 (2014)

  53. Mehlmann, G., Häring, M., Janowski, K., Baur, T., Gebhard, P., André, E.: Exploring a model of gaze for grounding in multimodal HRI. In: Proceedings of the 16th International Conference on Multimodal Interaction (ICMI 2014), pp. 247–254. ACM, New York (2014). https://doi.org/10.1145/2663204.2663275

  54. Narayanan, S.: Mind changes: a simulation semantics account of counterfactuals. Cogn. Sci. (2010)

  55. Naumann, R.: Aspects of changes: a dynamic event semantics. J. Semant. 18, 27–81 (2001)

  56. Pustejovsky, J.: The Generative Lexicon. MIT Press, Cambridge (1995)

  57. Pustejovsky, J.: Dynamic event structure and habitat theory. In: Proceedings of the 6th International Conference on Generative Approaches to the Lexicon (GL2013), pp. 1–10. ACL (2013)

  58. Pustejovsky, J.: From actions to events: communicating through language and gesture. Interact. Stud. 19(1–2), 289–317 (2018)

  59. Pustejovsky, J.: From experiencing events in the action-perception cycle to representing events in language. Interact. Stud. 19 (2018)

  60. Pustejovsky, J., Krishnaswamy, N.: VoxML: a visualization modeling language. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), Paris, May 2016

  61. Pustejovsky, J., Krishnaswamy, N.: Embodied human-computer interactions through situated grounding. In: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, pp. 1–3 (2020)

  62. Pustejovsky, J., Krishnaswamy, N.: Embodied human computer interaction. Künstliche Intelligenz (2021)

  63. Pustejovsky, J., Krishnaswamy, N.: Situated meaning in multimodal dialogue: human-robot and human-computer interactions. Traitement Automatique des Langues 62(1) (2021)

  64. Pustejovsky, J., Moszkowicz, J.: The qualitative spatial dynamics of motion. J. Spatial Cognit. Comput. 11, 15–44 (2011)

  65. Quek, F., et al.: Multimodal human discourse: gesture and speech. ACM Trans. Comput.-Hum. Interact. (TOCHI) 9(3), 171–193 (2002)

  66. Ravenet, B., Pelachaud, C., Clavel, C., Marsella, S.: Automating the production of communicative gestures in embodied characters. Front. Psychol. 9, 1144 (2018)

  67. Shapiro, L.: The Routledge Handbook of Embodied Cognition. Routledge, New York (2014)

  68. Skantze, G., Hjalmarsson, A., Oertel, C.: Turn-taking, feedback and joint attention in situated human-robot interaction. Speech Commun. 65, 50–66 (2014). https://doi.org/10.1016/j.specom.2014.05.005

  69. Stalnaker, R.: Common ground. Linguist. Philos. 25(5–6), 701–721 (2002)

  70. Tomasello, M., Carpenter, M.: Shared intentionality. Dev. Sci. 10(1), 121–125 (2007)

  71. Turk, M.: Multimodal interaction: a review. Pattern Recogn. Lett. 36, 189–195 (2014)

  72. Unger, C.: Dynamic semantics as monadic computation. In: Okumura, M., Bekki, D., Satoh, K. (eds.) JSAI-isAI 2011. LNCS (LNAI), vol. 7258, pp. 68–81. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32090-3_7

  73. Zwaan, R.A., Pecher, D.: Revisiting mental simulation in language comprehension: six replication attempts. PLoS ONE 7(12), e51382 (2012)

  74. Zwaan, R.A., Radvansky, G.A.: Situation models in language comprehension and memory. Psychol. Bull. 123(2), 162 (1998)


Author information


Corresponding author

Correspondence to James Pustejovsky.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Pustejovsky, J., Krishnaswamy, N. (2021). The Role of Embodiment and Simulation in Evaluating HCI: Theory and Framework. In: Duffy, V.G. (ed.) Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management. Human Body, Motion and Behavior. HCII 2021. Lecture Notes in Computer Science, vol 12777. Springer, Cham. https://doi.org/10.1007/978-3-030-77817-0_21


  • DOI: https://doi.org/10.1007/978-3-030-77817-0_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-77816-3

  • Online ISBN: 978-3-030-77817-0

  • eBook Packages: Computer Science (R0)
