A Multiview Approach to Learning Articulated Motion Models

  • Andrea F. Daniele
  • Thomas M. Howard
  • Matthew R. Walter
Conference paper
Part of the Springer Proceedings in Advanced Robotics book series (SPAR, volume 10)


For robots to operate effectively in homes and workplaces, they must be able to manipulate the articulated objects common within environments built for and by humans. Kinematic models provide a concise representation of these objects that enables deliberate, generalizable manipulation policies. However, existing approaches to learning these models rely upon visual observations of an object’s motion, and are subject to the effects of occlusions and feature sparsity. Natural language descriptions provide a flexible and efficient means by which humans can provide complementary information in a weakly supervised manner suitable for a variety of different interactions (e.g., demonstrations and remote manipulation). In this paper, we present a multimodal learning framework that incorporates both vision and language information acquired in situ to estimate the structure and parameters that define kinematic models of articulated objects. The visual signal takes the form of an RGB-D image stream that opportunistically captures object motion in an unprepared scene. Accompanying natural language descriptions of the motion constitute the linguistic signal. We model linguistic information using a probabilistic graphical model that grounds natural language descriptions to their referent kinematic motion. By exploiting the complementary nature of the vision and language observations, our method infers correct kinematic models for various multiple-part objects on which the previous state-of-the-art, visual-only system fails. We evaluate our multimodal learning framework on a dataset composed of a variety of household objects, and demonstrate a 23% improvement in model accuracy over the vision-only baseline.
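To make the idea of complementary vision and language observations concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes that vision yields a log-likelihood for each candidate joint type and that a language description yields a probability for each type, and it combines the two with Bayes' rule. The names `MODEL_TYPES`, `fuse`, and `most_likely`, the three joint types, and all numbers are hypothetical; the paper itself grounds language with a probabilistic graphical model, which this sketch does not reproduce.

```python
import math

# Hypothetical candidate joint types for one pair of object parts.
MODEL_TYPES = ("rigid", "prismatic", "revolute")


def fuse(visual_log_lik, language_probs):
    """Combine visual log-likelihoods with language-derived probabilities.

    Both arguments are dicts keyed by model type. Language probabilities
    must be strictly positive. Returns a normalized posterior as a dict.
    """
    log_post = {
        m: visual_log_lik[m] + math.log(language_probs[m]) for m in MODEL_TYPES
    }
    # Normalize with log-sum-exp for numerical stability.
    peak = max(log_post.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in log_post.values()))
    return {m: math.exp(v - log_norm) for m, v in log_post.items()}


def most_likely(posterior):
    """Return the model type with the highest posterior probability."""
    return max(posterior, key=posterior.get)


# Made-up visual evidence that is ambiguous, e.g. because of occlusion:
# prismatic scores slightly higher than revolute.
visual = {"rigid": -10.0, "prismatic": -9.0, "revolute": -9.2}

# Made-up probabilities for a description such as "the door swings open".
language = {"rigid": 0.1, "prismatic": 0.1, "revolute": 0.8}

# With no language information, every type is equally probable.
uniform = {m: 1.0 / len(MODEL_TYPES) for m in MODEL_TYPES}

vision_only = most_likely(fuse(visual, uniform))
multimodal = most_likely(fuse(visual, language))
print(vision_only, multimodal)
```

In this toy setting the visual evidence alone slightly favors the prismatic model, while the description shifts the posterior toward the revolute model. This is the kind of disambiguation the abstract attributes to combining the two modalities.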



This work was supported in part by the National Science Foundation under grants IIS-1638072 and IIS-1637813, and by the Robotics Consortium of the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program Cooperative Agreement W911NF-10-2-0016.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Andrea F. Daniele (1)
  • Thomas M. Howard (2)
  • Matthew R. Walter (1)
  1. Toyota Technological Institute at Chicago, Chicago, USA
  2. University of Rochester, Rochester, USA
