
A stable data-augmented reinforcement learning method with ensemble exploration and exploitation

Published in Applied Intelligence (2023)

Abstract

Learning from visual observations is a significant yet challenging problem in Reinforcement Learning (RL). Two distinct problems, representation learning and task learning, need to be solved to infer an optimal policy. Several methods have been proposed that use data augmentation in reinforcement learning to learn directly from images. Although these methods can improve generalization in RL, they often make task learning unstable and can even lead to divergence. We investigate the causes of this instability and find that it is usually rooted in the high variance of the Q-functions. In this paper, we propose an easy-to-implement, unified method that addresses the above problems: Data-augmented Reinforcement Learning with Ensemble Exploration and Exploitation (DAR-EEE). Bootstrap ensembles are incorporated into data-augmented reinforcement learning to provide uncertainty estimates for both original and augmented states, which are used to stabilize and accelerate task learning. Specifically, a novel strategy called uncertainty-weighted exploitation is designed for the rational utilization of transition tuples. Moreover, an efficient exploration method based on the highest upper confidence bound is used to balance exploration and exploitation. Our experimental evaluation demonstrates the improved sample efficiency and final performance of our method on a range of difficult image-based control tasks. In particular, our method achieves new state-of-the-art performance on the Reacher-easy and Cheetah-run tasks.
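To make the two ensemble mechanisms named in the abstract concrete, the following is a minimal Python sketch of how an upper-confidence-bound action selector over an ensemble of Q-estimates and an uncertainty-based weight for Bellman targets are commonly realized. The function names (ucb_action, uncertainty_weight), the sigmoid weighting form, and the hyperparameters lambda_ucb and temperature are illustrative assumptions for exposition, not the exact formulation of DAR-EEE.

    # Hedged sketch: ensemble-based UCB action selection and
    # uncertainty-weighted Bellman targets. Hyperparameters and the
    # weighting function are illustrative, not the paper's exact method.
    import numpy as np

    def ucb_action(q_values, lambda_ucb=1.0):
        """Pick the action with the highest upper confidence bound.

        q_values: array of shape (n_ensemble, n_actions) holding each
        ensemble member's Q-estimate for every candidate action.
        """
        mean = q_values.mean(axis=0)   # exploitation term
        std = q_values.std(axis=0)     # epistemic-uncertainty bonus
        return int(np.argmax(mean + lambda_ucb * std))

    def uncertainty_weight(target_q_values, temperature=10.0):
        """Down-weight Bellman targets with high ensemble disagreement.

        target_q_values: array of shape (n_ensemble,) with each member's
        target Q-estimate for one (original or augmented) transition.
        Returns a weight in (0.5, 1.0], using a sigmoid of the ensemble
        standard deviation (a SUNRISE-style weighting, assumed here).
        """
        std = target_q_values.std()
        return float(1.0 / (1.0 + np.exp(std * temperature)) + 0.5)

    # Toy usage: 5 ensemble members, 4 discrete candidate actions.
    rng = np.random.default_rng(0)
    q = rng.normal(size=(5, 4))
    a = ucb_action(q, lambda_ucb=1.0)
    w = uncertainty_weight(q[:, a])
    print(f"chosen action {a}, target weight {w:.3f}")

In such a scheme, the weight would multiply each transition's temporal-difference loss so that targets on which the ensemble disagrees contribute less to the update, while the UCB bonus steers data collection toward actions with high epistemic uncertainty.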


Data Availability

The datasets generated during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (61873008, 62103053), the Beijing Natural Science Foundation (4192010), and the National Key Research and Development Plan (2018YFB1307004).


Author information


Contributions

All authors contributed to the study conception and design. Material preparation, experimental design and data analysis were performed by Guoyu Zuo and Zhipeng Tian. Manuscript writing and organization were done by Gao Huang. Review and commentary were done by Zhipeng Tian and Gao Huang. All authors read and approved this manuscript.

Corresponding author

Correspondence to Gao Huang.

Ethics declarations

Ethical Approval

The authors declare that they have no conflict of interest. This paper has not been previously published; it is published with the permission of the authors’ institution, and all authors are responsible for the authenticity of the data in the paper.

Consent to Participate

All authors of this paper have been informed of the revision and publication of the paper, have checked all data, figures and tables in the manuscript, and are responsible for their truthfulness and accuracy. Names of all contributing authors: Guoyu Zuo; Zhipeng Tian; Gao Huang.

Consent for publication

The publication has been approved by all co-authors.

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Competing Interests

All authors of this paper declare no conflict of interest and agree to submit this manuscript to the journal Applied Intelligence.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zuo, G., Tian, Z. & Huang, G. A stable data-augmented reinforcement learning method with ensemble exploration and exploitation. Appl Intell 53, 24792–24803 (2023). https://doi.org/10.1007/s10489-023-04816-w

