Abstract
Efficient exploration in complex environments remains a major challenge for reinforcement learning (RL). Whereas previous Thompson-sampling-inspired mechanisms enable temporally extended exploration, i.e., deep exploration, in scalar RL, we focus on deep exploration in distributional RL. We develop a general-purpose approach, Bag of Policies (BoP), that can be built on top of any return-distribution estimator by maintaining a population of its copies. BoP consists of an ensemble of multiple heads that are updated independently. During training, each episode is controlled by a single head, and the collected state-action pairs are used to update all heads off-policy; each head therefore receives a distinct learning signal, which diversifies learning and behaviour. To test whether an optimistic ensemble method can improve distributional RL as it does scalar RL, we implement the BoP approach with a population of distributional actor-critics using Bayesian Distributional Policy Gradients (BDPG). The population thus approximates a posterior distribution over return distributions along with a posterior distribution over policies. This setup lets us analyse global posterior uncertainty and local curiosity bonuses simultaneously for exploration. As BDPG is itself an optimistic method, the pairing also helps to investigate the extent to which accumulating curiosity bonuses is beneficial. Overall, BoP yields greater robustness and speed of learning, as demonstrated by our experimental results on ALE Atari games.
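The core training loop described above — sample one head to control an episode, then update every head off-policy on the data that head collected — can be sketched in a few lines. The following is a minimal illustrative sketch only, using tabular Q-value heads and a toy chain environment rather than the paper's distributional actor-critics; all names (`BagOfPolicies`, `sample_head`, `update_all`, the `step` function) are hypothetical, not the authors' implementation.

```python
import random

class BagOfPolicies:
    """Illustrative sketch of the Bag-of-Policies idea: keep K independent
    heads; one sampled head controls each episode, while the transitions it
    collects are used to update every head off-policy."""

    def __init__(self, n_heads, n_states, n_actions, seed=0):
        self.rng = random.Random(seed)
        self.n_actions = n_actions
        # Each head keeps its own action-value estimates (tabular stand-in
        # for the paper's per-head return-distribution estimator).
        self.heads = [
            [[self.rng.gauss(0, 0.01) for _ in range(n_actions)]
             for _ in range(n_states)]
            for _ in range(n_heads)
        ]

    def sample_head(self):
        # Thompson-style: one head is drawn to act for a whole episode,
        # giving temporally extended (deep) exploration.
        return self.rng.randrange(len(self.heads))

    def act(self, head, state, eps=0.1):
        if self.rng.random() < eps:
            return self.rng.randrange(self.n_actions)
        q = self.heads[head][state]
        return max(range(self.n_actions), key=lambda a: q[a])

    def update_all(self, transitions, lr=0.1, gamma=0.9):
        # All heads learn from the behaviour head's data (off-policy), so
        # each head receives a distinct learning signal relative to its own
        # current estimates, keeping the population diverse.
        for q in self.heads:
            for s, a, r, s_next in transitions:
                target = r + gamma * max(q[s_next])
                q[s][a] += lr * (target - q[s][a])

# Toy 3-state chain: action 1 moves right (reward 1 for leaving state 1),
# action 0 stays put.
def step(s, a):
    if a == 1:
        return min(s + 1, 2), (1.0 if s == 1 else 0.0)
    return s, 0.0

bop = BagOfPolicies(n_heads=4, n_states=3, n_actions=2)
for _ in range(500):
    head = bop.sample_head()       # behaviour head for this episode
    s, transitions = 0, []
    for _ in range(5):
        a = bop.act(head, s)
        s_next, r = step(s, a)
        transitions.append((s, a, r, s_next))
        s = s_next
    bop.update_all(transitions)    # every head learns from this episode
```

After training, all heads come to prefer moving right in state 1, despite each episode being driven by only one of them.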
References
Barth-Maron, G., et al.: Distributed distributional deterministic policy gradients. In: Proceedings of the 6th International Conference on Learning Representations (ICLR) (2018)
Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013)
Bellemare, M.G., Dabney, W., Munos, R.: A distributional perspective on reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 449–458 (2017)
Chen, R.Y., Sidor, S., Abbeel, P., Schulman, J.: UCB exploration via Q-ensembles. arXiv preprint (2017)
Choi, Y., Lee, K., Oh, S.: Distributional deep reinforcement learning with a mixture of Gaussians. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 9791–9797 (2019)
Dabney, W., Ostrovski, G., Silver, D., Munos, R.: Implicit quantile networks for distributional reinforcement learning. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1096–1105 (2018a)
Dabney, W., Rowland, M., Bellemare, M.G., Munos, R.: Distributional reinforcement learning with quantile regression. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018b)
Doan, T., Mazoure, B., Lyle, C.: GAN Q-learning. arXiv preprint (2018)
Donahue, J., Krähenbühl, P., Darrell, T.: Adversarial feature learning. In: Proceedings of the 5th International Conference on Learning Representations (ICLR) (2017)
Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. CRC Press, Boca Raton (1994)
Espeholt, L., et al.: IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1407–1416, Stockholmsmässan, Stockholm (2018)
Freirich, D., Shimkin, T., Meir, R., Tamar, A.: Distributional multivariate policy evaluation and exploration with the Bellman GAN. In: Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, vol. 97, pp. 1983–1992 (2019)
Kuznetsov, A., Shvechikov, P., Grishin, A., Vetrov, D.: Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In: Proceedings of the 37th International Conference on Machine Learning (2020)
Li, L., Faisal, A.: Bayesian distributional policy gradients. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 1, pp. 8429–8437 (2021)
Liang, J., Makoviychuk, V., Handa, A., Chentanez, N., Macklin, M., Fox, D.: GPU-accelerated robotic simulation for distributed reinforcement learning. In: Conference on Robot Learning, pp. 270–282. PMLR (2018)
Lyle, C., Bellemare, M.G., Castro, P.S.: A comparative analysis of expected and distributional reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4504–4511 (2019)
Martin, J., Lyskawinski, M., Li, X., Englot, B.: Stochastically dominant distributional reinforcement learning. In: Proceedings of the 37th International Conference on Machine Learning (2020)
Mavrin, B., et al.: Distributional reinforcement learning for efficient exploration. In: Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp. 4424–4434 (2019)
Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: Proceedings of The 33rd International Conference on Machine Learning, vol. 48, pp. 1928–1937 (2016)
O’Donoghue, B., Osband, I., Munos, R., Mnih, V.: The uncertainty Bellman equation and exploration. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 3839–3848, Stockholmsmässan, Stockholm (2018)
Osband, I., Blundell, C., Pritzel, A., Van Roy, B.: Deep exploration via Bootstrapped DQN. Adv. Neural. Inf. Process. Syst. 29, 4026–4034 (2016)
Osband, I., Van Roy, B., Russo, D.J., Wen, Z.: Deep exploration via randomized value functions. J. Mach. Learn. Res. 20, 1–62 (2019)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, Hoboken (1994)
Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. In: Proceedings of the 4th International Conference on Learning Representations (ICLR) (2016)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Singh, R., Lee, K., Chen, Y.: Sample-based distributional policy gradient. arXiv preprint (2020)
Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. Adv. Neural. Inf. Process. Syst. 12, 1057–1063 (1999)
Tang, Y., Agrawal, S.: Exploration by distributional reinforcement learning. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2710–2716 (2020)
Thompson, W.R.: On the theory of apportionment. Am. J. Math. 57(2), 450–456 (1935)
Wiering, M.A., van Hasselt, H.P.: Ensemble algorithms in reinforcement learning. IEEE Trans. Syst. Man Cybern. Part B 38(4), 930–936 (2008)
Zhang, Z., Chen, J., Chen, Z., Li, W.: Asynchronous episodic deep deterministic policy gradient: toward continuous control in computationally complex environments. IEEE Trans. Cybern. 51, 604–613 (2019)
Acknowledgment
We are grateful for our funding support. At the time of this work, GL and AF were sponsored by a UKRI Turing AI Fellowship (EP/V025449/1), and LL and FV by PhD scholarships of the Department of Computing, Imperial College London.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nachkov, A., Li, L., Luise, G., Valdettaro, F., Faisal, A.A. (2024). Bag of Policies for Distributional Deep Exploration. In: Cuzzolin, F., Sultana, M. (eds) Epistemic Uncertainty in Artificial Intelligence. Epi UAI 2023. Lecture Notes in Computer Science, vol. 14523. Springer, Cham. https://doi.org/10.1007/978-3-031-57963-9_3
DOI: https://doi.org/10.1007/978-3-031-57963-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57962-2
Online ISBN: 978-3-031-57963-9