
Relabeling and policy distillation of hierarchical reinforcement learning

Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Hierarchical reinforcement learning (HRL) is a promising way to extend traditional reinforcement learning to more complex tasks, as it alleviates the problems of long-term reward sparsity and credit assignment. However, existing HRL methods are trained from scratch for each specific environment and target task, resulting in low sample utilization. In addition, the agent's low-level sub-policies interfere with each other during transfer, resulting in poor policy stability. To address these issues, this paper proposes an HRL method, Relabeling and Policy Distillation of Hierarchical Reinforcement Learning (R-PD-HRL), which integrates meta-learning, shared reward relabeling, and policy distillation to accelerate learning and improve the agent's policy stability. During training, a reward relabeling module acts on the experience buffer: the reward functions of other tasks under the same task distribution are used to relabel each interaction trajectory, so the same experience can also be used to train those tasks. At the low level, policy distillation compresses the sub-policies, reducing interference between them while preserving the behavior of the original low-level sub-policies. Finally, according to the current task, the high-level policy calls the appropriate low-level policy to make decisions. Experimental results in both continuous and discrete state-action environments show that, compared with other methods, the improved sample utilization greatly accelerates learning, and the success rate reaches 0.6.
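
The shared reward-relabeling idea sketched in the abstract can be illustrated with a minimal example, assuming a replay buffer of (state, action, next state) transitions and one reward function per task in the shared distribution; the class and function names below are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch of shared reward relabeling across tasks in the same
# distribution: a trajectory is stored once and relabeled on demand with each
# task's reward function, so experience collected for one task can train others.

from typing import Callable, Dict, List, Tuple

Transition = Tuple[list, int, list]       # (state, action, next_state)
Labeled = Tuple[list, int, float, list]   # (state, action, reward, next_state)


class RelabelingBuffer:
    def __init__(self, reward_fns: Dict[str, Callable[[list, int, list], float]]):
        # One reward function per task in the shared task distribution.
        self.reward_fns = reward_fns
        self.trajectories: List[List[Transition]] = []

    def add_trajectory(self, trajectory: List[Transition]) -> None:
        self.trajectories.append(trajectory)

    def sample_for_task(self, task_id: str) -> List[List[Labeled]]:
        # Relabel every stored trajectory with the requested task's reward,
        # so experience gathered under any task can be reused for this one.
        reward_fn = self.reward_fns[task_id]
        return [
            [(s, a, reward_fn(s, a, s_next), s_next) for (s, a, s_next) in traj]
            for traj in self.trajectories
        ]


if __name__ == "__main__":
    # Two toy tasks over a 1-D state: reach +1 versus reach -1.
    fns = {
        "reach_right": lambda s, a, s_next: 1.0 if s_next[0] > 0.9 else 0.0,
        "reach_left": lambda s, a, s_next: 1.0 if s_next[0] < -0.9 else 0.0,
    }
    buf = RelabelingBuffer(fns)
    buf.add_trajectory([([0.0], 1, [0.5]), ([0.5], 1, [1.0])])
    print(buf.sample_for_task("reach_left"))  # same trajectory, rewards relabeled
```

The key design point is that the trajectory itself is stored only once; relabeling happens at sampling time with whichever task's reward function is requested, which is what raises sample utilization across the task distribution.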


Data availability

The data are not publicly available because they are needed for follow-up research.


Acknowledgements

This work is supported by the National Natural Science Foundation of China (61673084) and the Project of Liaoning Provincial Department of Education (2021LJKZ1180).

Author information

Authors and Affiliations

Authors

Contributions

The authors made the following contributions to the research, analysis, and writing of the paper. Zou: acted as the main supervisor and project leader of the research, provided financial support, participated in discussions and decisions on the research direction, and made important revisions to the whole paper. Zhao: designed the overall research framework and experimental scheme, set up and ran the experiments, performed the data analysis and statistical processing, and wrote the methods and results. Gao: conducted the literature review and survey of related research, assisted in designing the research methods and experimental protocols, analyzed and interpreted part of the results, and wrote the introduction and related work sections. Chen: performed data analysis and statistical processing and assisted in revising and proofreading the paper. Liu: provided many constructive suggestions during revision that helped improve the article. Zhang: assisted in the design and execution of the experiments, analyzed and interpreted part of the results, and participated in revising and proofreading the paper. All listed authors have read and approved the final version of the manuscript and agree to its submission to the journal.

Corresponding author

Correspondence to Xiling Zhao.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Zou, Q., Zhao, X., Gao, B. et al. Relabeling and policy distillation of hierarchical reinforcement learning. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02192-6

