Unsupervised separation of dynamics from pixels

Abstract

We present an approach for learning the dynamics of multiple objects from image sequences in an unsupervised way. We introduce a probabilistic model that first generates noisy positions for each object through a separate linear state-space model, and then renders the positions of all objects in the same image through a highly non-linear process. Such a linear representation of the dynamics enables us to propose an inference method that uses exact and efficient inference tools and that can be deployed to query the model in different ways without retraining.
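
The two-stage generative process described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the article: the constant-velocity form of the matrices `A` and `C`, the noise levels `q` and `r`, the Gaussian-blob `render` function (a toy stand-in for the highly non-linear rendering process), and all function names. The Kalman filter at the end is the standard exact filtering recursion for linear Gaussian state-space models; here it runs directly on the sampled noisy positions, whereas inferring positions from pixels is not reproduced.

```python
import numpy as np

def make_lgssm(dt=1.0, q=1e-3, r=1e-2):
    """Constant-velocity linear Gaussian state-space model for one 2-D object
    (an assumed parameterisation, chosen only for illustration)."""
    A = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # state transition
    C = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # state -> position
    Q = q * np.eye(4)                           # transition noise covariance
    R = r * np.eye(2)                           # position noise covariance
    return A, C, Q, R

def sample_positions(A, C, Q, R, s0, T, rng):
    """Sample T noisy positions from one object's state-space model."""
    s = np.asarray(s0, dtype=float)
    positions = np.empty((T, 2))
    for t in range(T):
        if t > 0:
            s = A @ s + rng.multivariate_normal(np.zeros(4), Q)
        positions[t] = C @ s + rng.multivariate_normal(np.zeros(2), R)
    return positions

def render(positions, size=32, width=1.5):
    """Toy non-linear renderer (hypothetical): one Gaussian blob per object,
    with all objects combined into the same image by a pixel-wise max."""
    ys, xs = np.mgrid[0:size, 0:size]
    image = np.zeros((size, size))
    for x, y in positions:
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * width ** 2))
        image = np.maximum(image, blob)
    return image

def kalman_filter(A, C, Q, R, observations, mean0, cov0):
    """Exact filtered state means for a linear Gaussian state-space model."""
    mean = np.asarray(mean0, dtype=float)
    cov = np.asarray(cov0, dtype=float)
    means = []
    for t, a in enumerate(observations):
        if t > 0:                               # predict step
            mean = A @ mean
            cov = A @ cov @ A.T + Q
        S = C @ cov @ C.T + R                   # innovation covariance
        K = cov @ C.T @ np.linalg.inv(S)        # Kalman gain
        mean = mean + K @ (a - C @ mean)        # update step
        cov = (np.eye(len(mean)) - K @ C) @ cov
        means.append(mean)
    return np.array(means)

rng = np.random.default_rng(0)
A, C, Q, R = make_lgssm()
T = 20
starts = [[5, 5, 0.5, 0.3], [25, 10, -0.4, 0.4]]   # (x, y, vx, vy) per object

# One separate state-space model per object, then a shared rendering step.
tracks = np.stack([sample_positions(A, C, Q, R, s0, T, rng) for s0 in starts])
video = np.stack([render(tracks[:, t]) for t in range(T)])

# Exact inference over the first object's states given its noisy positions.
filtered = kalman_filter(A, C, Q, R, tracks[0], starts[0], np.eye(4))
```

Here `tracks` has shape `(N, T, 2)` for `N = 2` objects, `video` has shape `(T, 32, 32)`, and `filtered` holds one four-dimensional state estimate per time step. Because the dynamics stay linear and Gaussian, the same recursion could be extended to smoothing or multi-step prediction without changing the model, which is the kind of flexibility the abstract refers to.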

Notes

  1. Whilst in practice we need to consider all observed sequences in the KL, to simplify the notation we focus the exposition on one sequence only.

  2. In practice, as the state \(s_0^n\) encodes which way we can interrogate \(v_1\) to infer \(a_1^n\), we have obtained better results by learning separate \(\phi _{s_0^n}\) that depend on the number of objects N in the image.

Author information

Corresponding author

Correspondence to Silvia Chiappa.

Appendix: Multi-step ahead generation of images and inference using past and future observations

See Figs. 9, 10, 11 and 12.

Fig. 9

Each plot shows generated (left) versus ground-truth (right) images at time-step 30 (top) and overlaid in time (bottom) for our model

Fig. 10

Each plot shows generated (left) versus ground-truth (right) images at time-step 30 (top) and overlaid in time (bottom) for the ED-LSTM

Fig. 11

Top: Ground-truth (black), inferred (blue), generated (red), and interpolated (cyan) trajectories. Middle: Generated versus ground-truth images and interpolated versus ground-truth images at time-step 30. Bottom: Generated versus ground-truth images and interpolated versus ground-truth images overlaid in time (color figure online)

Fig. 12

Top: Ground-truth (black), inferred (blue), generated (red), and interpolated (cyan) trajectories. Middle: Generated versus ground-truth images and interpolated versus ground-truth images at time-step 30. Bottom: Generated versus ground-truth images and interpolated versus ground-truth images overlaid in time (color figure online)

Cite this article

Chiappa, S., Paquet, U. Unsupervised separation of dynamics from pixels. METRON 77, 119–135 (2019). https://doi.org/10.1007/s40300-019-00155-4

Keywords

  • Variational auto-encoders
  • Linear Gaussian state space models
  • Deep neural networks