Abstract
Learning from visual observations is a significant yet challenging problem in reinforcement learning (RL). Two distinct problems, representation learning and task learning, must be solved to infer an optimal policy. Several methods have been proposed that use data augmentation in reinforcement learning to learn directly from images. Although these methods can improve generalization in RL, they often make task learning unstable and can even lead to divergence. We investigate the causes of this instability and find that it is usually rooted in the high variance of Q-functions. In this paper, we propose an easy-to-implement, unified method that addresses both problems: Data-augmented Reinforcement Learning with Ensemble Exploration and Exploitation (DAR-EEE). Bootstrap ensembles are incorporated into data-augmented reinforcement learning and provide uncertainty estimates for both original and augmented states, which can be used to stabilize and accelerate task learning. Specifically, a novel strategy called uncertainty-weighted exploitation is designed for the rational utilization of transition tuples. Moreover, an efficient exploration method using the highest upper confidence bound is used to balance exploration and exploitation. Our experimental evaluation demonstrates the improved sample efficiency and final performance of our method on a range of difficult image-based control tasks. In particular, our method achieves new state-of-the-art performance on the Reacher-easy and Cheetah-run tasks.
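The two ensemble mechanisms named in the abstract can be illustrated in a minimal sketch: an upper-confidence-bound action selection over an ensemble of Q-estimates, and a weighting function that down-weights Bellman targets the ensemble disagrees on. This is an illustrative NumPy sketch, not the paper's implementation; the function names and the sigmoid-based weight (in the style of SUNRISE-type weighted Bellman backups) and the parameters `ucb_lambda` and `temperature` are our own assumptions.

```python
import numpy as np

def ucb_action(q_ensemble, ucb_lambda=1.0):
    """Pick the action with the highest upper confidence bound.

    q_ensemble: array of shape (n_members, n_actions); each row holds one
    bootstrap member's Q-value estimates for the candidate actions.
    """
    mean = q_ensemble.mean(axis=0)
    std = q_ensemble.std(axis=0)  # ensemble disagreement as uncertainty
    return int(np.argmax(mean + ucb_lambda * std))

def uncertainty_weight(td_target_std, temperature=10.0):
    """Weight a transition by the uncertainty of its Bellman target.

    Maps an ensemble standard deviation in [0, inf) to a weight in
    (0.5, 1.0]: confident targets get full weight, uncertain ones are
    attenuated toward 0.5.
    """
    return 1.0 / (1.0 + np.exp(temperature * td_target_std)) + 0.5
```

In this sketch, exploration uses the optimistic bound (mean plus scaled disagreement), while exploitation uses the weight to damp high-variance targets during the Q-function update, which is the stabilization role the abstract describes.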
Data Availability
The datasets generated during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work is partially supported by National Natural Science Foundation of China (61873008, 62103053), Beijing Natural Science Foundation (4192010) and National Key Research and Development Plan (2018YFB1307004).
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, experimental design and data analysis were performed by Guoyu Zuo and Zhipeng Tian. Manuscript writing and organization were done by Gao Huang. Review and commentary were done by Zhipeng Tian and Gao Huang. All authors read and approved this manuscript.
Ethics declarations
Ethical Approval
The authors declare that they have no conflict of interest. This paper has not been previously published; it is published with the permission of the authors’ institution, and all authors are responsible for the authenticity of the data in the paper.
Consent to Participate
All authors of this paper have been informed of the revision and publication of the paper, have checked all data, figures and tables in the manuscript, and are responsible for their truthfulness and accuracy. Names of all contributing authors: Guoyu Zuo; Zhipeng Tian; Gao Huang.
Consent for publication
The publication has been approved by all co-authors.
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Competing Interests
All authors of this paper declare no conflict of interest and agree to submit this manuscript to the journal Applied Intelligence.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zuo, G., Tian, Z. & Huang, G. A stable data-augmented reinforcement learning method with ensemble exploration and exploitation. Appl Intell 53, 24792–24803 (2023). https://doi.org/10.1007/s10489-023-04816-w