Value targets in off-policy AlphaZero: a new greedy backup

Abstract

This article presents and evaluates a family of AlphaZero value targets, subsuming previous variants and introducing AlphaZero with greedy backups (A0GB). Current state-of-the-art algorithms for playing board games use sample-based planning, such as Monte Carlo Tree Search (MCTS), combined with deep neural networks (NN) to approximate the value function. These algorithms, of which AlphaZero is a prominent example, are computationally extremely expensive to train, due to their reliance on many neural network evaluations. This limits their practical performance. We improve the training process of AlphaZero by using more effective training targets for the neural network. We introduce a three-dimensional space to describe a family of training targets, covering the original AlphaZero training target as well as the soft-Z and A0C variants from the literature. We demonstrate that A0GB, using a specific new value target from this family, is able to find the optimal policy in a small tabular domain, whereas the original AlphaZero target fails to do so. In addition, we show that soft-Z, A0C and A0GB achieve better performance and faster training than the original AlphaZero target on two benchmark board games (Connect-Four and Breakthrough). Finally, we juxtapose tabular learning with neural network-based value function approximation in Tic-Tac-Toe, and compare the effects of learning targets therein.
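The contrast between training targets can be illustrated with a minimal sketch. The function names and the simplified targets below are our own illustrative assumptions, not the paper's exact formulation of A0GB: the original AlphaZero target trains every visited state toward the final game outcome z, whereas a greedy bootstrapped backup instead trains toward the value of the best action found by search.

```python
# Illustrative sketch of two value-target styles for a learned value function.
# This simplified version only contrasts "final outcome" targets with
# "greedy bootstrapped" targets; the paper defines the actual A0GB target.

def outcome_target(game_result: float) -> float:
    """Original AlphaZero style: every state in the episode is trained
    toward the final result z, from the current player's perspective."""
    return game_result

def greedy_backup_target(child_values: dict) -> float:
    """Greedy backup style: train toward the value of the best action
    found by search, bootstrapping rather than waiting for z."""
    return max(child_values.values())

# Example: search value estimates for three actions from some state.
q = {"a1": 0.2, "a2": 0.7, "a3": -0.1}
print(greedy_backup_target(q))  # 0.7
print(outcome_target(-1.0))     # -1.0 (the game was eventually lost)
```

The bootstrapped target follows the greedy policy implied by the search values, which is why it can remain consistent even when the behavior policy is exploratory.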

Figures 1–13 (images not included in this text-only version).

Notes

  1. The bias discussed here differs from the bias in the bias–variance trade-off between bootstrapping and Monte Carlo methods. That bias is transient and stems from value initialization; the bias we consider here is permanent and stems from the difference between the actual and the ideal target policy.

  2. The replay buffer used for training the neural network also contains samples drawn from older versions of \(\pi _\mathrm{AlphaZero}\). One could argue that this makes AlphaZero off-policy, but we do not explore the effects of replay buffers in this work.

References

  1. Anthony T, Tian Z, Barber D (2017) Thinking fast and slow with deep learning and tree search. In: Advances in Neural Information Processing Systems, pp 5360–5370

  2. Auger D, Couetoux A, Teytaud O (2013) Continuous upper confidence trees with polynomial exploration–consistency. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp 194–209. Springer

  3. Baier H, Winands MH (2014) Monte-Carlo tree search and minimax hybrids with heuristic evaluation functions. In: Workshop on Computer Games, pp 45–63. Springer

  4. Browne CB, Powley E, Whitehouse D, Lucas SM, Cowling PI, Rohlfshagen P, Tavener S, Perez D, Samothrakis S, Colton S (2012) A survey of Monte Carlo tree search methods. IEEE Trans Comput Intell AI Games 4(1):1–43. https://doi.org/10.1109/TCIAIG.2012.2186810

  5. Campbell M, Hoane AJ, Hsu F (2002) Deep Blue. Artif Intell 134(1):57–83

  6. Carlsson F, Öhman J (2019) AlphaZero to alpha hero: a pre-study on additional tree sampling within self-play reinforcement learning. Bachelor's thesis, KTH, School of Electrical Engineering and Computer Science (EECS)

  7. Coulom R (2006) Efficient selectivity and backup operators in Monte-Carlo tree search. In: International Conference on Computers and Games, pp 72–83. Springer

  8. van Hasselt H (2010) Double Q-learning. In: Advances in Neural Information Processing Systems, pp 2613–2621

  9. Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. In: Fürnkranz J, Scheffer T, Spiliopoulou M (eds) Machine Learning: ECML 2006. Springer, Berlin, Heidelberg, pp 282–293

  10. Lan L, Li W, Wei T, Wu I (2019) Multiple policy value Monte Carlo tree search. arXiv preprint arXiv:1905.13521

  11. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971

  12. Moerland TM, Broekens J, Plaat A, Jonker CM (2018) A0C: Alpha Zero in continuous action space. arXiv preprint arXiv:1805.09613

  13. Osband I, Blundell C, Pritzel A, Van Roy B (2016) Deep exploration via bootstrapped DQN. In: Advances in Neural Information Processing Systems, pp 4026–4034

  14. Osband I, Doron Y, Hessel M, Aslanides J, Sezener E, Saraiva A, McKinney K, Lattimore T, Szepesvári C, Singh S, et al. (2019) Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568

  15. Schaeffer J, Lake R, Lu P, Bryant M (1996) Chinook: the world man-machine checkers champion. AI Magazine 17(1):21

  16. Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, Guez A, Lockhart E, Hassabis D, Graepel T, et al. (2019) Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265

  17. Sharma S, Kobti Z, Goodwin S (2008) Knowledge generation for improving simulations in UCT for general game playing. In: Wobcke W, Zhang M (eds) AI 2008: Advances in Artificial Intelligence. Springer, Berlin, Heidelberg, pp 49–55

  18. Silver D (2015) UCL course on RL, lecture 2: Markov decision processes. https://www.davidsilver.uk/wp-content/uploads/2020/03/MDP.pdf

  19. Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, Hassabis D (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489. https://doi.org/10.1038/nature16961

  20. Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, Sifre L, Kumaran D, Graepel T, et al. (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815

  21. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, Chen Y, Lillicrap T, Hui F, Sifre L, van den Driessche G, Graepel T, Hassabis D (2017) Mastering the game of Go without human knowledge. Nature 550:354–359. https://doi.org/10.1038/nature24270

  22. Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge, MA, USA

  23. Veness J, Silver D, Blair A, Uther W (2009) Bootstrapping from game tree search. In: Bengio Y, Schuurmans D, Lafferty JD, Williams CKI, Culotta A (eds) Advances in Neural Information Processing Systems 22, pp 1937–1945. Curran Associates, Inc. http://papers.nips.cc/paper/3722-bootstrapping-from-game-tree-search.pdf

  24. Willemsen D, Baier H, Kaisers M (2020) Value targets in off-policy AlphaZero: a new greedy backup. In: Adaptive and Learning Agents (ALA) Workshop

Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive feedback, which helped improve the manuscript. This work is part of the project Flexible Assets Bid Across Markets (FABAM, project number TEUE117015), funded within the Dutch Topsector Energie / TKI Urban Energy by Rijksdienst voor Ondernemend Nederland (RvO).

Author information

Corresponding author

Correspondence to Daniel Willemsen.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

An abridged version of this work has previously been presented at the Adaptive and Learning Agents workshop at AAMAS 2020 [24]. This extended version introduces additional results and performance analyses on the Tic-Tac-Toe domain.

About this article

Cite this article

Willemsen, D., Baier, H. & Kaisers, M. Value targets in off-policy AlphaZero: a new greedy backup. Neural Comput & Applic (2021). https://doi.org/10.1007/s00521-021-05928-5

Keywords

  • Reinforcement learning
  • Sample-based planning
  • AlphaZero
  • MCTS