Abstract
Learning from visual observations is a significant yet challenging problem in reinforcement learning (RL). Two distinct problems, representation learning and task learning, must be solved to infer an optimal policy. Several methods have been proposed that use data augmentation in reinforcement learning to learn directly from images. Although these methods can improve generalization in RL, they often make task learning unstable and can even lead to divergence. We investigate the causes of this instability and find that it is usually rooted in the high variance of Q-functions. In this paper, we propose an easy-to-implement, unified method that addresses both problems: Data-augmented Reinforcement Learning with Ensemble Exploration and Exploitation (DAR-EEE). Bootstrap ensembles are incorporated into data-augmented reinforcement learning and provide uncertainty estimates for both original and augmented states, which can be used to stabilize and accelerate task learning. Specifically, a novel strategy called uncertainty-weighted exploitation is designed for the rational utilization of transition tuples. Moreover, an efficient exploration method using the highest upper confidence bound is used to balance exploration and exploitation. Our experimental evaluation demonstrates the improved sample efficiency and final performance of our method on a range of difficult image-based control tasks. In particular, our method achieves new state-of-the-art performance on the Reacher-easy and Cheetah-run tasks.
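The two ensemble mechanisms named in the abstract can be illustrated in a minimal sketch: an upper-confidence-bound action selection over an ensemble of Q-estimates, and a weighting function that down-weights Bellman targets the ensemble disagrees on. This is an illustrative NumPy sketch, not the paper's implementation; the function names and the sigmoid-based weight (in the style of SUNRISE-type weighted Bellman backups) and the parameters `ucb_lambda` and `temperature` are our own assumptions.

```python
import numpy as np

def ucb_action(q_ensemble, ucb_lambda=1.0):
    """Pick the action with the highest upper confidence bound.

    q_ensemble: array of shape (n_members, n_actions); each row holds one
    bootstrap member's Q-value estimates for the candidate actions.
    """
    mean = q_ensemble.mean(axis=0)
    std = q_ensemble.std(axis=0)  # ensemble disagreement as uncertainty
    return int(np.argmax(mean + ucb_lambda * std))

def uncertainty_weight(td_target_std, temperature=10.0):
    """Weight a transition by the uncertainty of its Bellman target.

    Maps an ensemble standard deviation in [0, inf) to a weight in
    (0.5, 1.0]: confident targets get full weight, uncertain ones are
    attenuated toward 0.5.
    """
    return 1.0 / (1.0 + np.exp(temperature * td_target_std)) + 0.5
```

In this sketch, exploration uses the optimistic bound (mean plus scaled disagreement), while exploitation uses the weight to damp high-variance targets during the Q-function update, which is the stabilization role the abstract describes.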
Data Availability
The datasets generated during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work is partially supported by National Natural Science Foundation of China (61873008, 62103053), Beijing Natural Science Foundation (4192010) and National Key Research and Development Plan (2018YFB1307004).
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, experimental design and data analysis were performed by Guoyu Zuo and Zhipeng Tian. Manuscript writing and organization were done by Gao Huang. Review and commentary were done by Zhipeng Tian and Gao Huang. All authors read and approved this manuscript.
Ethics declarations
Ethical Approval
The authors declare that they have no conflict of interest. This paper has not been previously published; it is published with the permission of the authors’ institution, and all authors are responsible for the authenticity of the data in the paper.
Consent to Participate
All authors of this paper have been informed of the revision and publication of the paper, have checked all data, figures and tables in the manuscript, and are responsible for their truthfulness and accuracy. Names of all contributing authors: Guoyu Zuo; Zhipeng Tian; Gao Huang.
Consent for publication
The publication has been approved by all co-authors.
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Competing Interests
All authors of this paper declare no conflict of interest and agree to submit this manuscript to the journal Applied Intelligence.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zuo, G., Tian, Z. & Huang, G. A stable data-augmented reinforcement learning method with ensemble exploration and exploitation. Appl Intell 53, 24792–24803 (2023). https://doi.org/10.1007/s10489-023-04816-w