Real-Time Human-Robot Communication for Manipulation Tasks in Partially Observed Environments

  • Jacob Arkin
  • Rohan Paul
  • Daehyung Park
  • Subhro Roy
  • Nicholas Roy
  • Thomas M. Howard
Conference paper
Part of the Springer Proceedings in Advanced Robotics book series (SPAR, volume 11)


In human teams, visual and auditory cues are often used to communicate information about the task or environment that may not otherwise be directly observable. Analogously, robots that rely primarily on visual sensors cannot directly observe some object attributes that may be necessary for reference resolution or task execution. The experiments in this paper address natural language interaction in human-robot teams for tasks where multi-modal (e.g., visual, auditory, and haptic) observations are necessary for robust execution. We present a probabilistic model, verified through physical experiments, that allows robots to efficiently acquire knowledge about the latent aspects of the workspace through language and physical interaction. The model's effectiveness is demonstrated on a mobile and a stationary manipulator in real-world scenarios by following instructions under partial knowledge of object states in the environment.



The authors gratefully acknowledge funding support in part by the Robotics Consortium of the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program (RCTA) and by the Toyota Research Institute (TRI), Award Number LP-C000765-SR. We thank our colleagues in the lab for helpful feedback on this paper.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Jacob Arkin (1, corresponding author)
  • Rohan Paul (2)
  • Daehyung Park (2)
  • Subhro Roy (2)
  • Nicholas Roy (2)
  • Thomas M. Howard (1)

  1. University of Rochester, Rochester, USA
  2. MIT, Cambridge, USA
