ShapeStacks: Learning Vision-Based Physical Intuition for Generalised Object Stacking

  • Oliver Groth
  • Fabian B. Fuchs
  • Ingmar Posner
  • Andrea Vedaldi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)


Physical intuition is pivotal for intelligent agents to perform complex tasks. In this paper we investigate the passive acquisition of an intuitive understanding of physical principles as well as the active utilisation of this intuition in the context of generalised object stacking. To this end, we provide ShapeStacks (source code and data are available), a simulation-based dataset featuring 20,000 stack configurations composed of a variety of elementary geometric primitives, richly annotated regarding semantics and structural stability. We train visual classifiers for binary stability prediction on the ShapeStacks data and scrutinise their learned physical intuition. Due to the richness of the training data, our approach also generalises favourably to real-world scenarios, achieving state-of-the-art stability prediction on a publicly available benchmark of block towers. We then leverage the physical intuition learned by our model to actively construct stable stacks and observe the emergence of an intuitive notion of stackability - an inherent object affordance - induced by the active stacking task. Our approach performs well, exceeding the stack height observed during training, and even manages to counterbalance initially unstable structures.


Intuitive physics, Stability prediction, Object stacking



This research was funded by the European Research Council under grant ERC 677195-IDIU and the EPSRC AIMS Centre for Doctoral Training at Oxford University.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Oliver Groth (1)
  • Fabian B. Fuchs (1)
  • Ingmar Posner (1)
  • Andrea Vedaldi (1)
  1. Department of Engineering, University of Oxford, Oxford, UK
