
Learning Hierarchical Planning-Based Policies from Offline Data

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Abstract

Hierarchical policy architectures that incorporate a planning component at the top level have shown superior performance and generalization in agent navigation tasks. Cost or safety reasons may, however, prevent training in an online reinforcement learning (RL) fashion with continuous environment interaction. We therefore propose HORIBLe-VRN, an algorithm to learn a hierarchical policy with a top-level planning-based module from pre-collected data. A key challenge is dealing with the unknown, latent high-level (HL) actions. Our algorithm features an EM-style hierarchical imitation learning stage, incorporating HL action inference, and a subsequent offline RL refinement stage for the top-level policy. We empirically evaluate HORIBLe-VRN in a long-horizon, sparse-reward agent navigation task, investigating performance, generalization capabilities, and robustness with respect to sub-optimal demonstration data.
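To make the two-stage scheme described above concrete, the following is a minimal, self-contained Python sketch of the general idea: an EM-style imitation stage that treats the HL action as a latent variable (E-step: posterior inference over the latent HL action; M-step: posterior-weighted policy updates), followed by an offline refinement of the top level. This is not the authors' implementation; the policy class, the discrete HL/low-level action spaces, and all names (TabularPolicy, em_imitation, N_HL_ACTIONS, etc.) are hypothetical placeholders chosen only for illustration.

```python
"""Minimal sketch of the two-stage scheme from the abstract (illustrative only)."""
import numpy as np

rng = np.random.default_rng(0)

N_HL_ACTIONS = 4   # assumed small discrete HL action space (e.g. sub-goal directions)
N_LL_ACTIONS = 5   # assumed discrete low-level action space
STATE_DIM = 2

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class TabularPolicy:
    """Toy linear-softmax policy used as a stand-in for both hierarchy levels."""
    def __init__(self, n_actions):
        self.w = np.zeros((STATE_DIM, n_actions))

    def probs(self, s):
        return softmax(s @ self.w)

    def grad_step(self, s, a_weights, lr=0.1):
        # weighted maximum-likelihood (M-step style) update towards a_weights
        p = self.probs(s)
        self.w += lr * s[:, None] * (a_weights - p)

def em_imitation(demos, n_iters=50):
    """Stage 1: EM-style hierarchical imitation learning on (state, low-level action)
    demonstrations, treating the HL action z as a latent variable."""
    hl_policy = TabularPolicy(N_HL_ACTIONS)
    # one low-level policy per HL action (an option-like decomposition, assumed here)
    ll_policies = [TabularPolicy(N_LL_ACTIONS) for _ in range(N_HL_ACTIONS)]
    for _ in range(n_iters):
        for s, a in demos:
            # E-step: posterior over the latent HL action given (s, a)
            prior = hl_policy.probs(s)                      # p(z | s)
            lik = np.array([ll_policies[z].probs(s)[a]
                            for z in range(N_HL_ACTIONS)])  # p(a | s, z)
            post = prior * lik
            post /= post.sum()
            # M-step: posterior-weighted updates of both levels
            hl_policy.grad_step(s, post)
            for z in range(N_HL_ACTIONS):
                onehot = np.eye(N_LL_ACTIONS)[a]
                ll_policies[z].grad_step(s, onehot, lr=0.1 * post[z])
    return hl_policy, ll_policies

# Stage 2 (offline RL refinement of the top-level policy) would re-use the inferred
# HL actions together with the logged rewards from the same offline batch; it is
# omitted here for brevity.

if __name__ == "__main__":
    # tiny synthetic demonstration set: (state, low-level action) pairs
    demos = [(rng.normal(size=STATE_DIM), int(rng.integers(N_LL_ACTIONS))) for _ in range(200)]
    hl, lls = em_imitation(demos)
    print(hl.probs(np.zeros(STATE_DIM)))
```

In the paper the top-level module is planning-based (a Value Refinement Network) rather than the toy softmax policy used above, and the offline RL stage refines only that top level; the sketch indicates this stage only in a comment.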


Notes

  1. This is not due to additional optimization steps for initialization. The success rates of our approach and HHIM stop improving significantly long before \(N_\text{iter}=10000\).

  2. This is fewer iterations than a full training run (\(N_\text{iter}=10000\) for BC (VRN), HHIM, stage 1).


Author information


Corresponding author

Correspondence to Jan Wöhlke.


Ethics declarations

Ethical Statement

We did not collect or process any personal data for this work. All data was collected using a physics simulation of a point-mass agent. There are several possible future applications of our research, for example in autonomous vehicles or robotics, which will hopefully have a positive impact on society. There are, however, also risks of negative societal impact, through the nature of the application itself, the impact on the job market, or real-world deployment without proper verification and validation. Such factors should be taken into consideration when designing applications.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wöhlke, J., Schmitt, F., van Hoof, H. (2023). Learning Hierarchical Planning-Based Policies from Offline Data. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_29


  • DOI: https://doi.org/10.1007/978-3-031-43421-1_29

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43420-4

  • Online ISBN: 978-3-031-43421-1

  • eBook Packages: Computer Science, Computer Science (R0)
