Abstract
Efficient exploration in complex environments remains a major challenge for reinforcement learning (RL). Whereas previous Thompson-sampling-inspired mechanisms enable temporally extended exploration, i.e., deep exploration, in scalar RL, we focus on deep exploration in distributional RL. We develop a general-purpose approach, Bag of Policies (BoP), that can be built on top of any return-distribution estimator by maintaining a population of its copies. BoP consists of an ensemble of multiple heads that are updated independently. During training, each episode is controlled by a single head, and the collected state-action pairs are used to update all heads off-policy; each head therefore receives a distinct learning signal, which diversifies learning and behaviour. To test whether an optimistic ensemble method can improve distributional RL as it does scalar RL, we implement the BoP approach with a population of distributional actor-critics using Bayesian Distributional Policy Gradients (BDPG). The population thus approximates a posterior distribution over return distributions along with a posterior distribution over policies. This setup lets us analyse global posterior uncertainty and local curiosity bonuses simultaneously for exploration. As BDPG is itself an optimistic method, the pairing also helps to investigate the extent to which accumulating curiosity bonuses is beneficial. Overall, BoP yields greater robustness and speed of learning, as demonstrated by our experimental results on ALE Atari games.
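The core training loop described above — sample one head to control an episode, then update every head off-policy on the data that head collected — can be sketched in a few lines. The following is a minimal illustrative sketch only, using tabular Q-value heads and a toy chain environment rather than the paper's distributional actor-critics; all names (`BagOfPolicies`, `sample_head`, `update_all`, the `step` function) are hypothetical, not the authors' implementation.

```python
import random

class BagOfPolicies:
    """Illustrative sketch of the Bag-of-Policies idea: keep K independent
    heads; one sampled head controls each episode, while the transitions it
    collects are used to update every head off-policy."""

    def __init__(self, n_heads, n_states, n_actions, seed=0):
        self.rng = random.Random(seed)
        self.n_actions = n_actions
        # Each head keeps its own action-value estimates (tabular stand-in
        # for the paper's per-head return-distribution estimator).
        self.heads = [
            [[self.rng.gauss(0, 0.01) for _ in range(n_actions)]
             for _ in range(n_states)]
            for _ in range(n_heads)
        ]

    def sample_head(self):
        # Thompson-style: one head is drawn to act for a whole episode,
        # giving temporally extended (deep) exploration.
        return self.rng.randrange(len(self.heads))

    def act(self, head, state, eps=0.1):
        if self.rng.random() < eps:
            return self.rng.randrange(self.n_actions)
        q = self.heads[head][state]
        return max(range(self.n_actions), key=lambda a: q[a])

    def update_all(self, transitions, lr=0.1, gamma=0.9):
        # All heads learn from the behaviour head's data (off-policy), so
        # each head receives a distinct learning signal relative to its own
        # current estimates, keeping the population diverse.
        for q in self.heads:
            for s, a, r, s_next in transitions:
                target = r + gamma * max(q[s_next])
                q[s][a] += lr * (target - q[s][a])

# Toy 3-state chain: action 1 moves right (reward 1 for leaving state 1),
# action 0 stays put.
def step(s, a):
    if a == 1:
        return min(s + 1, 2), (1.0 if s == 1 else 0.0)
    return s, 0.0

bop = BagOfPolicies(n_heads=4, n_states=3, n_actions=2)
for _ in range(500):
    head = bop.sample_head()       # behaviour head for this episode
    s, transitions = 0, []
    for _ in range(5):
        a = bop.act(head, s)
        s_next, r = step(s, a)
        transitions.append((s, a, r, s_next))
        s = s_next
    bop.update_all(transitions)    # every head learns from this episode
```

After training, all heads come to prefer moving right in state 1, despite each episode being driven by only one of them.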
References
Barth-Maron, G., et al.: Distributed distributional deterministic policy gradients. In: Proceedings of the 6th International Conference on Learning Representations (ICLR) (2018)
Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013)
Bellemare, M.G., Dabney, W., Munos, R.: A distributional perspective on reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 449–458 (2017)
Chen, R.Y., Sidor, S., Abbeel, P., Schulman, J.: UCB exploration via Q-ensembles. arXiv preprint (2017)
Choi, Y., Lee, K., Oh, S.: Distributional deep reinforcement learning with a mixture of Gaussians. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 9791–9797 (2019)
Dabney, W., Ostrovski, G., Silver, D., Munos, R.: Implicit quantile networks for distributional reinforcement learning. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1096–1105 (2018a)
Dabney, W., Rowland, M., Bellemare, M.G., Munos, R.: Distributional reinforcement learning with quantile regression. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018b)
Doan, T., Mazoure, B., Lyle, C.: GAN Q-learning. arXiv preprint (2018)
Donahue, J., Krähenbühl, P., Darrell, T.: Adversarial feature learning. In: Proceedings of the 5th International Conference on Learning Representations (ICLR) (2017)
Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. CRC Press, Boca Raton (1994)
Espeholt, L., et al.: IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1407–1416, Stockholmsmässan, Stockholm (2018)
Freirich, D., Shimkin, T., Meir, R., Tamar, A.: Distributional multivariate policy evaluation and exploration with the Bellman GAN. In: Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, vol. 97, pp. 1983–1992 (2019)
Kuznetsov, A., Shvechikov, P., Grishin, A., Vetrov, D.: Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In: Proceedings of the 37th International Conference on Machine Learning (2020)
Li, L., Faisal, A.: Bayesian distributional policy gradients. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 1, pp. 8429–8437 (2021)
Liang, J., Makoviychuk, V., Handa, A., Chentanez, N., Macklin, M., Fox, D.: GPU-accelerated robotic simulation for distributed reinforcement learning. In: Conference on Robot Learning, pp. 270–282. PMLR (2018)
Lyle, C., Bellemare, M.G., Castro, P.S.: A comparative analysis of expected and distributional reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4504–4511 (2019)
Martin, J., Lyskawinski, M., Li, X., Englot, B.: Stochastically dominant distributional reinforcement learning. In: Proceedings of the 37th International Conference on Machine Learning (2020)
Mavrin, B., et al.: Distributional reinforcement learning for efficient exploration. In: Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp. 4424–4434 (2019)
Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: Proceedings of The 33rd International Conference on Machine Learning, vol. 48, pp. 1928–1937 (2016)
O’Donoghue, B., Osband, I., Munos, R., Mnih, V.: The uncertainty Bellman equation and exploration. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 3839–3848, Stockholmsmässan, Stockholm (2018)
Osband, I., Blundell, C., Pritzel, A., Van Roy, B.: Deep exploration via Bootstrapped DQN. Adv. Neural. Inf. Process. Syst. 29, 4026–4034 (2016)
Osband, I., Van Roy, B., Russo, D.J., Wen, Z.: Deep exploration via randomized value functions. J. Mach. Learn. Res. 20, 1–62 (2019)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, Hoboken (1994)
Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. In: Proceedings of the 4th International Conference on Learning Representations (ICLR) (2016)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Singh, R., Lee, K., Chen, Y.: Sample-based distributional policy gradient. arXiv preprint (2020)
Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. Adv. Neural. Inf. Process. Syst. 12, 1057–1063 (1999)
Tang, Y., Agrawal, S.: Exploration by distributional reinforcement learning. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2710–2716 (2020)
Thompson, W.R.: On the theory of apportionment. Am. J. Math. 57(2), 450–456 (1935)
Wiering, M.A., van Hasselt, H.P.: Ensemble algorithms in reinforcement learning. IEEE Trans. Syst. Man Cybern. Part B 38(4), 930–936 (2008)
Zhang, Z., Chen, J., Chen, Z., Li, W.: Asynchronous episodic deep deterministic policy gradient: toward continuous control in computationally complex environments. IEEE Trans. Cybern. 51, 604–613 (2019)
Acknowledgment
We are grateful for our funding support. At the time of this work, GL and AF were sponsored by a UKRI Turing AI Fellowship (EP/V025449/1), and LL and FV by PhD scholarships of the Department of Computing, Imperial College London.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nachkov, A., Li, L., Luise, G., Valdettaro, F., Faisal, A.A. (2024). Bag of Policies for Distributional Deep Exploration. In: Cuzzolin, F., Sultana, M. (eds) Epistemic Uncertainty in Artificial Intelligence. Epi UAI 2023. Lecture Notes in Computer Science, vol. 14523. Springer, Cham. https://doi.org/10.1007/978-3-031-57963-9_3
DOI: https://doi.org/10.1007/978-3-031-57963-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57962-2
Online ISBN: 978-3-031-57963-9