Learning Predictive Models from Observation and Interaction

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12365)

Abstract

Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works, and then use this learned model to plan coordinated sequences of actions to bring about desired outcomes. However, learning a model that captures the dynamics of complex skills represents a major challenge: if the agent needs a good model to perform these skills, it might never be able to collect the experience on its own that is required to learn these delicate and complex behaviors. Instead, we can imagine augmenting the training set with observational data of other agents, such as humans. Such data is likely more plentiful, but cannot always be combined with data from the original agent. For example, videos of humans might show a robot how to use a tool, but (i) are not annotated with suitable robot actions, and (ii) contain a systematic distributional shift due to the embodiment differences between humans and robots. We address the first challenge by formulating the corresponding graphical model and treating the action as an observed variable for the interaction data and an unobserved variable for the observation data, and the second challenge by using a domain-dependent prior. In addition to interaction data, our method is able to leverage videos of passive observations in a driving dataset and a dataset of robotic manipulation videos to improve video prediction performance. In a real-world tabletop robotic manipulation setting, our method is able to significantly improve control performance by learning a model from both robot data and observations of humans.
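The abstract's core idea is to treat the action as an observed variable when training on interaction data and as a latent variable (inferred from the frames) when training on action-free observation data. The toy sketch below illustrates that pattern under loose assumptions: `encode_action` stands in for a hypothetical learned inference network q(a | frames), `predict_next` for a linear dynamics model, and the gradient step is plain least-squares; none of these names or forms come from the paper itself.

```python
import numpy as np

def encode_action(frame_t, frame_t1):
    # Hypothetical inference network q(a | x_t, x_{t+1}): infers a 2-D
    # latent "action" from the difference between consecutive frames.
    diff = frame_t1 - frame_t
    return np.tanh(diff[:2])

def predict_next(frame, action, W):
    # Hypothetical dynamics model p(x_{t+1} | x_t, a_t), linear for the sketch.
    return frame + W @ action

def training_step(frame_t, frame_t1, action, W, lr=0.1):
    # Interaction data: the robot's action is observed, so condition on it.
    # Observation-only data (e.g. human videos): the action is unobserved,
    # so infer a latent action from the frames instead.
    a = action if action is not None else encode_action(frame_t, frame_t1)
    pred = predict_next(frame_t, a, W)
    err = pred - frame_t1
    grad = np.outer(err, a)  # gradient of 0.5 * ||err||^2 w.r.t. W
    loss = float(0.5 * err @ err)
    return W - lr * grad, loss
```

Both data sources then update the same dynamics model `W`: robot trajectories pass their logged actions, while human videos pass `action=None` and rely on the inferred latent, which is how the shared model can benefit from both.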

Keywords

Video prediction · Visual planning · Action representations · Robotic manipulation

Notes

Acknowledgements

We thank Karl Pertsch, Drew Jaegle, Marvin Zhang, and Kenneth Chaney. This work was supported by the NSF GRFP, ARL RCTA W911NF-10-2-0016, ARL DCIST CRA W911NF-17-2-0181, and by Honda Research Institute.

Supplementary material

Supplementary material 1: 504476_1_En_42_MOESM1_ESM.zip (2.1 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Pennsylvania, Philadelphia, USA
  2. Stanford University, Stanford, USA
  3. University of California, Berkeley, Berkeley, USA