The Curious Robot: Learning Visual Representations via Physical Interactions

  • Lerrel Pinto
  • Dhiraj Gandhi
  • Yuanfeng Han
  • Yong-Lae Park
  • Abhinav Gupta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9906)


What is the right supervisory signal to train visual representations? Current approaches in computer vision use category labels from datasets such as ImageNet to train ConvNets. However, in the case of biological agents, visual representation learning does not require millions of semantic labels. We argue that biological agents use physical interactions with the world to learn visual representations, unlike current vision systems, which use only passive observations (images and videos downloaded from the web). For example, babies push objects, poke them, put them in their mouths and throw them to learn representations. Towards this goal, we build one of the first systems on a Baxter platform that pushes, pokes, grasps and observes objects in a tabletop environment. It uses four different types of physical interactions to collect more than 130K datapoints, with each datapoint providing supervision to a shared ConvNet architecture, allowing us to learn visual representations. We show the quality of the learned representations by observing neuron activations and performing nearest neighbor retrieval on the learned representation. Quantitatively, we evaluate our learned ConvNet on image classification tasks and show improvements compared to learning without external data. Finally, on the task of instance retrieval, our network outperforms the ImageNet network on recall@1 by 3%.
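
The recall@1 metric mentioned above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the choice of cosine similarity, and the toy feature vectors are all assumptions made for illustration, and the abstract does not state which layer or distance the paper uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_1(queries, gallery):
    """Fraction of queries whose nearest gallery item has the same instance id.

    queries, gallery: lists of (instance_id, feature_vector) pairs.
    """
    hits = 0
    for q_id, q_feat in queries:
        # Nearest neighbor = gallery item with the highest cosine similarity.
        best_id, _ = max(gallery, key=lambda item: cosine(q_feat, item[1]))
        hits += best_id == q_id
    return hits / len(queries)


# Toy features, made up for illustration (not data from the paper).
gallery = [
    ("mug", [1.0, 0.0, 0.1]),
    ("ball", [0.0, 1.0, 0.1]),
    ("box", [0.1, 0.1, 1.0]),
]
queries = [
    ("mug", [0.9, 0.1, 0.0]),
    ("ball", [0.2, 0.8, 0.0]),
    ("box", [0.7, 0.0, 0.6]),
]
score = recall_at_1(queries, gallery)
```

In this toy setup the third query lies closer to the "mug" feature than to the "box" feature, so only two of the three queries retrieve the correct instance. Comparing two networks on recall@1 amounts to running the same procedure with features extracted from each network.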


Visual Representation, Image Retrieval, Kernel Size, Deep Belief Network, Convolutional Layer
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This work was supported by ONR MURI N000141612007, NSF IIS-1320083 and a gift from Google. The authors would like to thank Yahoo! and Nvidia for the compute cluster and GPU donations, respectively. The authors would also like to thank Xiaolong Wang for helpful discussions and code.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Lerrel Pinto (1)
  • Dhiraj Gandhi (1)
  • Yuanfeng Han (1)
  • Yong-Lae Park (1)
  • Abhinav Gupta (1)

  1. The Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
