Learning to Disentangle Latent Physical Factors for Video Prediction

  • Deyao ZhuEmail author
  • Marco Munderloh
  • Bodo Rosenhahn
  • Jörg Stückler
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11824)


Physical scene understanding is a fundamental human ability. Empowering artificial systems with such understanding is an important step towards flexible and adaptive behavior in the real world. As a step in this direction, we propose a novel approach to physical scene understanding in video. We train a deep neural network for video prediction which embeds the video sequence in a low-dimensional recurrent latent space representation. We optimize the total correlation of the latent dimensions within a variational recurrent auto-encoder framework. This encourages the representation to disentangle the latent physical factors of variation in the training data. To train and evaluate our approach, we use synthetic video sequences in three different physical scenarios with various degrees of difficulty. Our experiments demonstrate that our model can disentangle several appearance-related properties in the unsupervised case. If we add supervision signals for the latent code, our model can further improve the disentanglement of dynamics-related properties.



This work has been supported through Cyber Valley.

Supplementary material

480714_1_En_42_MOESM1_ESM.pdf (18 mb)
Supplementary material 1 (pdf 18400 KB)


  1. 1.
    Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R., Levine, S.: Stochastic variational video prediction. In: ICLR (2018)Google Scholar
  2. 2.
    Battaglia, P., Pascanu, R., Lai, M., Rezende, D., Kavukcuoglu, K.: Interaction networks for learning about objects, relations and physics. In: NIPS (2016)Google Scholar
  3. 3.
    Burgess, C., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., Lerchner, A.: Understanding disentangling in beta -VAE. In: Learning Disentangle Representations: From Perception to Control workshop (2017)Google Scholar
  4. 4.
    Chen, T., Li, X., Grosse, R., Duvenaud, D.: Isolating sources of disentanglement in VAEs. In: NIPS (2018)Google Scholar
  5. 5.
    Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In: NIPS (2016)Google Scholar
  6. 6.
    Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS Workshop (2014)Google Scholar
  7. 7.
    Ebert, F., Finn, C., Lee, X., Levine, S.: Self-supervised visual planning with temporal skip connections. In: CoRL (2017)Google Scholar
  8. 8.
    Finn, C., Goodfellow, I., Levine, S.: Unsupervised learning for physical interaction through video prediction. In: NIPS (2016)Google Scholar
  9. 9.
    Fraccaro, M., Kamronn, S., Paquet, U., Winther, O.: A disentangled recognition and nonlinear dynamics model for unsupervised learning (2017)Google Scholar
  10. 10.
    Goodfellow, I., et al.: Generative adversarial nets. In: NIPS (2014)Google Scholar
  11. 11.
    Hafner, D., et al.: Learning latent dynamics for planning from pixels. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 2555–2565. PMLR (2019)Google Scholar
  12. 12.
    Higgins, I., et al.: beta-VAE: learning basic visual concepts with a constrained variational framework. In: ICLR (2017)Google Scholar
  13. 13.
    Hochreiter, S., Schmidhuber, J.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Neural Computation (1997)Google Scholar
  14. 14.
    Johnson, M., Duvenaud, D.K., Wiltschko, A., Adams, R.P., Datta, S.R.: Composing graphical models with neural networks for structured representations and fast inference. In: Advances in Neural Information Processing Systems 29 (NIPS), pp. 2946–2954 (2016)Google Scholar
  15. 15.
    Kim, H., Mnih, A.: Disentangling by factorising. In: CoRR (2018)Google Scholar
  16. 16.
    Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)Google Scholar
  17. 17.
    Kingma, D., Welling, M.: Auto-encoding variational Bayes. In: CoRR (2013)Google Scholar
  18. 18.
    Kitani, K.M., Ziebart, B.D., Bagnell, J.A., Hebert, M.: Activity forecasting. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 201–214. Springer, Heidelberg (2012). Scholar
  19. 19.
    Larsen, A., Sønderby, S., Larochelle, H., Winther, O.: Autoencoding beyond pixels using a learned similarity metric. In: ICML (2016)Google Scholar
  20. 20.
    Lee, X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., Levine, S.: Stochastic adversarial video prediction. In: arXiv preprint (2018)Google Scholar
  21. 21.
    Lerer, A., Gross, S., Fergus, R.: Learning physical intuition of block towers by example. In: ICML (2016)Google Scholar
  22. 22.
    Mottaghi, R., Bagherinezhad, H., Rastegari, M., Farhadi, A.: Newtonian scene understanding: Unfolding the dynamics of objects in static images. In: CVPR (2016)Google Scholar
  23. 23.
    Piloto, L., et al.: Probing physics knowledge using tools from developmental psychology. In: CoRR (2018)Google Scholar
  24. 24.
    Riochet, R., et al.: IntPhys: a framework and benchmark for visual intuitive physics reasoning. In: arXiv preprint (2018)Google Scholar
  25. 25.
    Sanchez-Gonzalez, A., et al.: Graph networks as learnable physics engines for inference and control. In: ICML (2018)Google Scholar
  26. 26.
    Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: ICML (2015)Google Scholar
  27. 27.
    Walker, J., Doersch, C., Gupta, A., Hebert, M.: An uncertain future: forecasting from static images using variational autoencoders. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 835–851. Springer, Cham (2016). Scholar
  28. 28.
    Watters, N., Tacchetti, A., Weber, T., Pascanu, R., Battaglia, P., Zoran, D.: Visual interaction networks. In: NIPS (2017)Google Scholar
  29. 29.
    Wu, J., Lim, J.J., Zhang, H., Tenenbaum, J.B., Freeman, W.T.: Physics 101: learning physical object properties from unlabeled videos. In: BMVC (2016)Google Scholar
  30. 30.
    Wu, J., Lu, E., Kohli, P., Freeman, W., Tenenbaum, J.: Learning to see physics via visual de-animation. In: NIPS (2017)Google Scholar
  31. 31.
    Ye, T., Wang, X., Davidson, J., Gupta, A.: Interpretable intuitive physics model. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11216, pp. 89–105. Springer, Cham (2018). Scholar
  32. 32.
    Zhang, R., Wu, J., Zhang, C., Freeman, W., Tenenbaum, J.: A comparative evaluation of approximate probabilistic simulation and deep neural networks as accounts of human physical scene understanding. In: Annual Conference of the Cognitive Science Society (2016)Google Scholar
  33. 33.
    Zheng, B., Zhao, Y., Yu, J., Ikeuchi, K., Zhu, S.: Scene understanding by reasoning stability and safety. In: IJCV (2015)Google Scholar
  34. 34.
    Zhou, T., Tulsiani, S., Sun, W., Malik, J., Efros, A.A.: View synthesis by appearance flow. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 286–301. Springer, Cham (2016). Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Deyao Zhu
    • 1
    • 2
    Email author
  • Marco Munderloh
    • 2
  • Bodo Rosenhahn
    • 2
  • Jörg Stückler
    • 1
  1. 1.Max Planck Institute for Intelligent SystemsTübingenGermany
  2. 2.Leibniz Universität HannoverHanoverGermany

Personalised recommendations