
Relabeling and policy distillation of hierarchical reinforcement learning

Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Hierarchical reinforcement learning (HRL) is a promising way to extend traditional reinforcement learning to more complex tasks, as it alleviates the problems of long-term reward sparsity and credit assignment. However, existing HRL methods are trained from scratch for each specific environment and target task, resulting in low sample utilization. In addition, the agent's low-level sub-policies interfere with each other during transfer, resulting in poor policy stability. To address these issues, this paper proposes an HRL method, Relabeling and Policy Distillation of Hierarchical Reinforcement Learning (R-PD-HRL), which integrates meta-learning, shared reward relabeling, and policy distillation to accelerate learning and improve the agent's policy stability. During training, a reward relabeling module acts on the experience buffer: the reward functions of other tasks under the same task distribution are used to relabel each interaction trajectory, so the same experience can also be used to train those tasks. At the low level, policy distillation compresses the sub-policies, reducing interference between them while preserving the behavior of the original low-level sub-policies. Finally, according to the current task, the high-level policy calls the appropriate low-level policy to make decisions. Experimental results in both continuous and discrete state-action environments show that, compared with other methods, the improved sample utilization greatly accelerates learning, and the success rate reaches 0.6.
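
The shared reward-relabeling idea sketched in the abstract can be illustrated with a minimal example, assuming a replay buffer of (state, action, next state) transitions and one reward function per task in the shared distribution; the class and function names below are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch of shared reward relabeling across tasks in the same
# distribution: a trajectory is stored once and relabeled on demand with each
# task's reward function, so experience collected for one task can train others.

from typing import Callable, Dict, List, Tuple

Transition = Tuple[list, int, list]       # (state, action, next_state)
Labeled = Tuple[list, int, float, list]   # (state, action, reward, next_state)


class RelabelingBuffer:
    def __init__(self, reward_fns: Dict[str, Callable[[list, int, list], float]]):
        # One reward function per task in the shared task distribution.
        self.reward_fns = reward_fns
        self.trajectories: List[List[Transition]] = []

    def add_trajectory(self, trajectory: List[Transition]) -> None:
        self.trajectories.append(trajectory)

    def sample_for_task(self, task_id: str) -> List[List[Labeled]]:
        # Relabel every stored trajectory with the requested task's reward,
        # so experience gathered under any task can be reused for this one.
        reward_fn = self.reward_fns[task_id]
        return [
            [(s, a, reward_fn(s, a, s_next), s_next) for (s, a, s_next) in traj]
            for traj in self.trajectories
        ]


if __name__ == "__main__":
    # Two toy tasks over a 1-D state: reach +1 versus reach -1.
    fns = {
        "reach_right": lambda s, a, s_next: 1.0 if s_next[0] > 0.9 else 0.0,
        "reach_left": lambda s, a, s_next: 1.0 if s_next[0] < -0.9 else 0.0,
    }
    buf = RelabelingBuffer(fns)
    buf.add_trajectory([([0.0], 1, [0.5]), ([0.5], 1, [1.0])])
    print(buf.sample_for_task("reach_left"))  # same trajectory, rewards relabeled
```

The key design point is that the trajectory itself is stored only once; relabeling happens at sampling time with whichever task's reward function is requested, which is what raises sample utilization across the task distribution.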


Data availability

The data are not publicly available because they are needed for follow-up research.


Acknowledgements

This work is supported by the National Natural Science Foundation of China (61673084) and the Project of Liaoning Provincial Department of Education (2021LJKZ1180).

Author information

Authors and Affiliations

Authors

Contributions

The authors made the following contributions to the research, analysis, and writing of the paper. Zou: acted as the main supervisor and project leader of the research, provided financial support, participated in discussions and decisions on the research direction, and made important revisions to the whole paper. Zhao: designed the overall research framework and experimental scheme, set up and ran the experiments, performed the data analysis and statistical processing, and wrote the methods and results. Gao: conducted the literature review and survey of related research, assisted in designing the research methods and experimental protocols, analyzed and interpreted part of the results, and wrote the introduction and related work sections. Chen: performed data analysis and statistical processing and assisted in revising and proofreading the paper. Liu: provided many constructive suggestions during revision that helped improve the article. Zhang: assisted in the design and execution of the experiments, analyzed and interpreted part of the results, and participated in revising and proofreading the paper. All listed authors have read and approved the final version of the manuscript and agree to its submission to the journal.

Corresponding author

Correspondence to Xiling Zhao.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Zou, Q., Zhao, X., Gao, B. et al. Relabeling and policy distillation of hierarchical reinforcement learning. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02192-6

