Procedure Planning in Instructional Videos

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12356)


In this paper, we study the problem of procedure planning in instructional videos, which can be seen as a step towards enabling autonomous agents to plan for complex tasks in everyday settings such as cooking. Given the current visual observation of the world and a visual goal, we ask the question “What actions need to be taken in order to achieve the goal?”. The key technical challenge is to learn structured and plannable state and action spaces directly from unstructured videos. We address this challenge by proposing Dual Dynamics Networks (DDN), a framework that explicitly leverages the structured priors imposed by the conjugate relationships between states and actions in a learned plannable latent space. We evaluate our method on real-world instructional videos. Our experiments show that DDN learns plannable representations that lead to better planning performance compared to existing planning approaches and neural network policies.
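The abstract's core loop can be illustrated with a toy sketch. The names and the 2-D latent space below are hypothetical illustrations, not the paper's actual networks: a forward dynamics model `forward_dynamics(state, action)` and a conjugate action predictor `predict_action(state, goal)` are alternated to roll out a plan from the current observation toward the goal, here with fixed linear maps standing in for the learned models.

```python
import numpy as np

# Hypothetical toy instantiation of the dual-dynamics idea: a forward
# model f(state, action) -> next_state and a conjugate action predictor
# g(state, goal) -> action, alternated to produce an action sequence.
# States are points in a 2-D latent space; actions are fixed displacements.

ACTIONS = {                       # discrete action -> latent displacement
    "chop":  np.array([1.0, 0.0]),
    "fry":   np.array([0.0, 1.0]),
    "plate": np.array([1.0, 1.0]),
}

def forward_dynamics(state, action):
    """f: predict the next latent state after taking `action`."""
    return state + ACTIONS[action]

def predict_action(state, goal):
    """g: choose the action whose predicted next state is closest to `goal`."""
    return min(ACTIONS,
               key=lambda a: np.linalg.norm(goal - forward_dynamics(state, a)))

def plan(start, goal, horizon=5, tol=1e-6):
    """Alternate g (pick an action) and f (roll the state forward)."""
    state, actions = start, []
    for _ in range(horizon):
        if np.linalg.norm(goal - state) < tol:
            break                 # goal reached in latent space
        a = predict_action(state, goal)
        actions.append(a)
        state = forward_dynamics(state, a)
    return actions

print(plan(np.array([0.0, 0.0]), np.array([2.0, 1.0])))  # ['plate', 'chop']
```

In the paper the two models operate on representations learned from video, and planning searches over them; the sketch only shows how the two conjugate models interlock during rollout.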


Latent space planning · Task planning · Video understanding · Representation for action and skill



Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

Supplementary material

Supplementary material 1: 504452_1_En_20_MOESM1_ESM.pdf (PDF, 1.5 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

Stanford University, Stanford, USA
