Learning Predictive Models from Observation and Interaction

  • Conference paper
  • Computer Vision – ECCV 2020 (ECCV 2020)

Abstract

Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works, and then use this learned model to plan coordinated sequences of actions to bring about desired outcomes. However, learning a model that captures the dynamics of complex skills represents a major challenge: if the agent needs a good model to perform these skills, it might never be able to collect the experience on its own that is required to learn these delicate and complex behaviors. Instead, we can imagine augmenting the training set with observational data of other agents, such as humans. Such data is likely more plentiful, but cannot always be combined with data from the original agent. For example, videos of humans might show a robot how to use a tool, but (i) are not annotated with suitable robot actions, and (ii) contain a systematic distributional shift due to the embodiment differences between humans and robots. We address the first challenge by formulating the corresponding graphical model and treating the action as an observed variable for the interaction data and an unobserved variable for the observation data, and the second challenge by using a domain-dependent prior. In addition to interaction data, our method is able to leverage videos of passive observations in a driving dataset and a dataset of robotic manipulation videos to improve video prediction performance. In a real-world tabletop robotic manipulation setting, our method is able to significantly improve control performance by learning a model from both robot data and observations of humans.
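
The two technical ideas in the abstract, treating the action as an observed variable for robot interaction data but as a latent variable for human observation videos, and regularizing the inferred latent actions with a domain-dependent prior, can be made concrete with a short sketch. The following is a minimal illustrative sketch, not the authors' implementation: the module names, feature dimensions, and the choice of diagonal-Gaussian posterior and prior are assumptions made for the example.

```python
# Illustrative sketch only, assuming precomputed frame features and a
# diagonal-Gaussian latent action; not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionConditionedPredictor(nn.Module):
    """Predicts the next frame feature from the current one and an action."""
    def __init__(self, feat_dim=128, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, feat, action):
        return self.net(torch.cat([feat, action], dim=-1))

class ActionInference(nn.Module):
    """Amortized posterior q(a_t | z_t, z_{t+1}), used only when the true
    action is unobserved (e.g., human videos)."""
    def __init__(self, feat_dim=128, action_dim=4):
        super().__init__()
        self.net = nn.Linear(2 * feat_dim, 2 * action_dim)

    def forward(self, feat_t, feat_tp1):
        mu, log_var = self.net(torch.cat([feat_t, feat_tp1], dim=-1)).chunk(2, dim=-1)
        return mu, log_var

def training_loss(predictor, inference, feat_t, feat_tp1, action=None,
                  prior_mu=None, prior_log_var=None, kl_weight=1e-3):
    """One loss evaluation. `action` is supplied for interaction (robot)
    data; for observation (human) data it is None, and a latent action is
    sampled and pulled toward a Gaussian prior whose parameters depend on
    the data domain."""
    if action is None:
        mu, log_var = inference(feat_t, feat_tp1)
        # Reparameterized sample of the latent action.
        action = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
        # KL(q || p) between diagonal Gaussians; the prior is domain-dependent.
        kl = 0.5 * (
            prior_log_var - log_var
            + (log_var.exp() + (mu - prior_mu) ** 2) / prior_log_var.exp()
            - 1.0
        ).sum(dim=-1).mean()
    else:
        kl = torch.zeros((), device=feat_t.device)
    recon = F.mse_loss(predictor(feat_t, action), feat_tp1)
    return recon + kl_weight * kl
```

Under these assumptions, a robot batch would call `training_loss(..., action=a_t)` with the recorded actions, while a human batch would omit `action` and pass the prior parameters for the human domain, so a single predictive model is trained on both data sources.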


Notes

  1. Data will be made available at https://sites.google.com/view/lpmfoai.


Acknowledgements

We thank Karl Pertsch, Drew Jaegle, Marvin Zhang, and Kenneth Chaney. This work was supported by the NSF GRFP, ARL RCTA W911NF-10-2-0016, ARL DCIST CRA W911NF-17-2-0181, and by Honda Research Institute.

Author information

Corresponding author

Correspondence to Karl Schmeckpeper.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 2101 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Schmeckpeper, K. et al. (2020). Learning Predictive Models from Observation and Interaction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12365. Springer, Cham. https://doi.org/10.1007/978-3-030-58565-5_42

  • DOI: https://doi.org/10.1007/978-3-030-58565-5_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58564-8

  • Online ISBN: 978-3-030-58565-5

  • eBook Packages: Computer Science, Computer Science (R0)
