Bag of Policies for Distributional Deep Exploration

  • Conference paper
Epistemic Uncertainty in Artificial Intelligence (Epi UAI 2023)

Abstract

Efficient exploration in complex environments remains a major challenge for reinforcement learning (RL). In contrast to previous Thompson-sampling-inspired mechanisms that enable temporally extended exploration, i.e., deep exploration, we focus on deep exploration in distributional RL. We develop a general-purpose approach, Bag of Policies (BoP), that can be built on top of any return-distribution estimator by maintaining a population of its copies. BoP consists of an ensemble of multiple heads that are updated independently. During training, each episode is controlled by only one of the heads, and the collected state-action pairs are used to update all heads off-policy, producing distinct learning signals for each head that diversify learning and behaviour. To test whether an optimistic ensemble method can improve on distributional RL as it does on scalar RL, we implement the BoP approach with a population of distributional actor-critics using Bayesian Distributional Policy Gradients (BDPG). The population thus approximates a posterior distribution over return distributions along with a posterior distribution over policies. Our setup allows us to analyze global posterior uncertainty together with a local curiosity bonus simultaneously for exploration. As BDPG is already an optimistic method, this pairing lets us investigate the extent to which accumulating curiosity bonuses is beneficial. Overall, BoP yields greater robustness and speed during learning, as demonstrated by our experimental results on ALE Atari games.
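
The control/update scheme described above, in which one head acts per episode but every head learns from the resulting data, can be summarised in a short sketch. The Python below is illustrative only, assuming hypothetical names (BoPHead, run_episode, train_bop) and a stubbed toy environment; it is not the authors' BDPG implementation, whose heads would be full distributional actor-critics.

```python
import random

class BoPHead:
    """Stand-in for one return-distribution estimator (e.g. a BDPG actor-critic)."""

    def act(self, state):
        # Placeholder policy: a real head would sample from its learned policy.
        return random.choice([0, 1])

    def update(self, transitions):
        # Placeholder for an off-policy update from another head's rollout.
        pass

def run_episode(env_step, behaviour_head, max_steps=100):
    """Roll out one episode controlled by a single behaviour head."""
    state, transitions = 0, []
    for _ in range(max_steps):
        action = behaviour_head.act(state)
        next_state, reward, done = env_step(state, action)
        transitions.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return transitions

def train_bop(env_step, n_heads=5, n_episodes=20):
    heads = [BoPHead() for _ in range(n_heads)]
    for _ in range(n_episodes):
        # Exactly one randomly chosen head controls the episode...
        behaviour_head = random.choice(heads)
        transitions = run_episode(env_step, behaviour_head)
        # ...but the collected state-action data updates *all* heads
        # off-policy, giving each head a distinct learning signal.
        for head in heads:
            head.update(transitions)
    return heads

# Hypothetical toy environment: random-walk dynamics, random rewards,
# termination with probability 0.1 per step.
def toy_env_step(state, action):
    return state + action, random.random(), random.random() < 0.1

if __name__ == "__main__":
    train_bop(toy_env_step)
```

Because each head both generates some on-policy data and consumes the other heads' off-policy data, the population diversifies while still sharing experience, which is what the abstract credits for the gains in robustness and learning speed.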

References

  • Barth-Maron, G., et al.: Distributed distributional deterministic policy gradients. In: Proceedings of the 6th International Conference on Learning Representations (ICLR) (2018)

  • Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013)

  • Bellemare, M.G., Dabney, W., Munos, R.: A distributional perspective on reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 449–458 (2017)

  • Chen, R.Y., Sidor, S., Abbeel, P., Schulman, J.: UCB exploration via Q-ensembles (2017)

  • Choi, Y., Lee, K., Oh, S.: Distributional deep reinforcement learning with a mixture of Gaussians. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 9791–9797 (2019)

  • Dabney, W., Ostrovski, G., Silver, D., Munos, R.: Implicit quantile networks for distributional reinforcement learning. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1096–1105 (2018a)

  • Dabney, W., Rowland, M., Bellemare, M.G., Munos, R.: Distributional reinforcement learning with quantile regression. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018b)

  • Doan, T., Mazoure, B., Lyle, C.: GAN q-learning (2018)

  • Donahue, J., Krähenbühl, P., Darrell, T.: Adversarial feature learning. In: Proceedings of the 5th International Conference on Learning Representations (ICLR) (2017)

  • Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. CRC Press, Boca Raton (1994)

  • Espeholt, L., et al.: IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1407–1416, Stockholmsmässan, Stockholm (2018)

  • Freirich, D., Shimkin, T., Meir, R., Tamar, A.: Distributional multivariate policy evaluation and exploration with the Bellman GAN. In: Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, vol. 97, pp. 1983–1992 (2019)

  • Kuznetsov, A., Shvechikov, P., Grishin, A., Vetrov, D.: Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In: Proceedings of the 37th International Conference on Machine Learning (2020)

  • Li, L., Faisal, A.: Bayesian distributional policy gradients. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 1, pp. 8429–8437 (2021)

  • Liang, J., Makoviychuk, V., Handa, A., Chentanez, N., Macklin, M., Fox, D.: GPU-accelerated robotic simulation for distributed reinforcement learning. In: Conference on Robot Learning, pp. 270–282. PMLR (2018)

  • Lyle, C., Bellemare, M.G., Castro, P.S.: A comparative analysis of expected and distributional reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4504–4511 (2019)

  • Martin, J., Lyskawinski, M., Li, X., Englot, B.: Stochastically dominant distributional reinforcement learning. In: Proceedings of the 37th International Conference on Machine Learning (2020)

  • Mavrin, B., et al.: Distributional reinforcement learning for efficient exploration. In: Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp. 4424–4434 (2019)

  • Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: Proceedings of The 33rd International Conference on Machine Learning, vol. 48, pp. 1928–1937 (2016)

  • O’Donoghue, B., Osband, I., Munos, R., Mnih, V.: The uncertainty Bellman equation and exploration. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 3839–3848. Stockholmsmässan, Stockholm (2018)

  • Osband, I., Blundell, C., Pritzel, A., Van Roy, B.: Deep exploration via Bootstrapped DQN. Adv. Neural. Inf. Process. Syst. 29, 4026–4034 (2016)

  • Osband, I., Van Roy, B., Russo, D.J., Wen, Z.: Deep exploration via randomized value functions. J. Mach. Learn. Res. 20, 1–62 (2019)

  • Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, Hoboken (1994)

  • Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. In: Proceedings of the 4th International Conference on Learning Representations (ICLR) (2016)

  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  • Singh, R., Lee, K., Chen, Y.: Sample-based distributional policy gradient (2020)

  • Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. Adv. Neural. Inf. Process. Syst. 12, 1057–1063 (1999)

  • Tang, Y., Agrawal, S.: Exploration by distributional reinforcement learning. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2710–2716 (2018)

  • Thompson, W.R.: On the theory of apportionment. Am. J. Math. 57(2), 450–456 (1935)

  • Wiering, M.A., van Hasselt, H.P.: Ensemble algorithms in reinforcement learning. IEEE Trans. Syst. Man Cybern. Part B 38(4), 930–936 (2008)

  • Zhang, Z., Chen, J., Chen, Z., Li, W.: Asynchronous episodic deep deterministic policy gradient: toward continuous control in computationally complex environments. IEEE Trans. Cybern. 51, 604–613 (2019)

Acknowledgment

We are grateful for our funding support. At the time of this work, GL and AF were sponsored by a UKRI Turing AI Fellowship (EP/V025449/1), and LL and FV by PhD scholarships from the Department of Computing, Imperial College London.

Author information

Corresponding author

Correspondence to Asen Nachkov.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nachkov, A., Li, L., Luise, G., Valdettaro, F., Faisal, A.A. (2024). Bag of Policies for Distributional Deep Exploration. In: Cuzzolin, F., Sultana, M. (eds) Epistemic Uncertainty in Artificial Intelligence. Epi UAI 2023. Lecture Notes in Computer Science, vol 14523. Springer, Cham. https://doi.org/10.1007/978-3-031-57963-9_3

  • DOI: https://doi.org/10.1007/978-3-031-57963-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-57962-2

  • Online ISBN: 978-3-031-57963-9

  • eBook Packages: Computer Science, Computer Science (R0)
