Abstract
This article presents and evaluates a family of AlphaZero value targets, subsuming previous variants and introducing AlphaZero with greedy backups (A0GB). Current state-of-the-art algorithms for playing board games combine sample-based planning, such as Monte Carlo Tree Search (MCTS), with deep neural networks (NNs) that approximate the value function. These algorithms, of which AlphaZero is a prominent example, are computationally extremely expensive to train because they rely on a large number of neural network evaluations; this limits their practical performance. We improve the training process of AlphaZero by using more effective training targets for the neural network. We introduce a three-dimensional space to describe a family of training targets, covering the original AlphaZero training target as well as the soft-Z and A0C variants from the literature. We demonstrate that A0GB, using a specific new value target from this family, is able to find the optimal policy in a small tabular domain, whereas the original AlphaZero target fails to do so. In addition, we show that soft-Z, A0C and A0GB achieve better performance and faster training than the original AlphaZero target on two benchmark board games (Connect-Four and Breakthrough). Finally, we compare tabular learning with neural network-based value function approximation in Tic-Tac-Toe, and examine how the different value targets behave in each setting.
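The distinction between the value targets named above can be made concrete with a small sketch. The following is a hypothetical illustration, not the authors' implementation: the `Node` class, its fields, and the greedy-descent rule are assumptions made for exposition. It contrasts the original AlphaZero target (the final game outcome z), a soft-Z-style target (the MCTS mean value estimate at the root), and an A0GB-style greedy backup (the value found by following the best child down the tree).

```python
# Hypothetical sketch of three value targets computed after an MCTS
# search.  Node structure and the greedy rule are illustrative
# assumptions, not the paper's exact definitions.

class Node:
    def __init__(self, value_sum=0.0, visits=0, children=None):
        self.value_sum = value_sum      # sum of values backed up through this node
        self.visits = visits            # visit count N(s)
        self.children = children or []

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def target_alphazero(game_outcome):
    """Original AlphaZero: train toward the final game outcome z."""
    return game_outcome


def target_soft_z(root):
    """soft-Z style: train toward the MCTS mean value estimate of the root."""
    return root.mean_value()


def target_greedy(root):
    """A0GB-style greedy backup: follow the highest-mean-value child to a
    leaf and train toward that leaf's value estimate, so poorly valued
    siblings do not dilute the target."""
    node = root
    while node.children:
        node = max(node.children, key=Node.mean_value)
    return node.mean_value()


# Toy tree: one good child (rarely visited) and one bad child (often visited).
leaf_good = Node(value_sum=0.9, visits=1)
leaf_bad = Node(value_sum=-0.5, visits=5)
root = Node(value_sum=0.4, visits=6, children=[leaf_good, leaf_bad])

print(target_alphazero(-1.0))          # -1.0: depends only on how the game ended
print(round(target_soft_z(root), 3))   # 0.067: averaged over all visits
print(round(target_greedy(root), 3))   # 0.9: follows the best child only
```

The toy tree shows why the averaged target can be misleading: the root's mean is dragged down by the heavily visited bad child, while the greedy backup recovers the value of the best available move.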
Change history
13 June 2022
A Correction to this paper has been published: https://doi.org/10.1007/s00521-022-07447-3
Notes
The bias we discuss here is distinct from the bias associated with the bias-variance trade-off between bootstrapping and Monte Carlo methods. That bias stems from the (transient) effect of value initialization; the bias we consider here is permanent and stems from the difference between the actual and the ideal target policy.
The replay buffer used for training the neural network also contains samples generated by older versions of \(\pi_{\mathrm{AlphaZero}}\). One could argue that this makes AlphaZero off-policy, but we do not go into the details of the effects of replay buffers in this work.
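The off-policy concern raised in this note can be illustrated with a minimal sketch (a hypothetical toy, not the authors' training code): once the buffer's capacity is exceeded, every training batch can still contain samples generated under policies that are several self-play generations old.

```python
import random
from collections import deque

# Minimal replay-buffer sketch (illustrative assumption, not the paper's
# implementation).  Each sample records the self-play generation that
# produced it, making the staleness of buffered data visible.

buffer = deque(maxlen=4)  # small capacity, so old samples linger

for generation in range(6):  # six self-play "generations" of data
    buffer.append({"state": f"s{generation}", "policy_iter": generation})

batch = random.sample(list(buffer), k=2)

# The oldest surviving samples were produced at generation 2, several
# policy updates behind the current generation 5 -- training on them
# means learning from data the current policy would not have generated.
print(sorted(s["policy_iter"] for s in buffer))  # [2, 3, 4, 5]
```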
References
Anthony T, Tian Z, Barber D (2017) Thinking fast and slow with deep learning and tree search. In: Advances in Neural Information Processing Systems, pp. 5360–5370
Auger D, Couetoux A, Teytaud O (2013) Continuous upper confidence trees with polynomial exploration–consistency. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 194–209. Springer
Baier H, Winands MH (2014) Monte-Carlo tree search and minimax hybrids with heuristic evaluation functions. In: Workshop on Computer Games, pp. 45–63. Springer
Browne CB, Powley E, Whitehouse D, Lucas SM, Cowling PI, Rohlfshagen P, Tavener S, Perez D, Samothrakis S, Colton S (2012) A survey of Monte Carlo tree search methods. IEEE Trans Comput Intell AI Games 4(1):1–43. https://doi.org/10.1109/TCIAIG.2012.2186810
Campbell M, Hoane AJ Jr, Hsu F (2002) Deep Blue. Artif Intell 134(1):57–83
Carlsson F, Öhman J (2019) AlphaZero to Alpha Hero: a pre-study on additional tree sampling within self-play reinforcement learning. Bachelor's thesis, KTH, School of Electrical Engineering and Computer Science (EECS)
Coulom R (2006) Efficient selectivity and backup operators in Monte-Carlo tree search. In: International Conference on Computers and Games, pp. 72–83. Springer
van Hasselt H (2010) Double Q-learning. In: Advances in Neural Information Processing Systems, pp. 2613–2621
Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. In: Fürnkranz J, Scheffer T, Spiliopoulou M (eds) Machine Learning: ECML 2006. Springer, Berlin Heidelberg, pp 282–293
Lan L, Li W, Wei T, Wu I (2019) Multiple policy value Monte Carlo tree search. arXiv preprint arXiv:1905.13521
Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971
Moerland TM, Broekens J, Plaat A, Jonker CM (2018) A0C: AlphaZero in continuous action space. arXiv preprint arXiv:1805.09613
Osband I, Blundell C, Pritzel A, Van Roy B (2016) Deep exploration via bootstrapped DQN. In: Advances in Neural Information Processing Systems, pp. 4026–4034
Osband I, Doron Y, Hessel M, Aslanides J, Sezener E, Saraiva A, McKinney K, Lattimore T, Szepesvári C, Singh S, et al. (2019) Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568
Schaeffer J, Lake R, Lu P, Bryant M (1996) Chinook: the world man-machine checkers champion. AI Magazine 17(1):21
Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, Guez A, Lockhart E, Hassabis D, Graepel T, et al. (2019) Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265
Sharma S, Kobti Z, Goodwin S (2008) Knowledge generation for improving simulations in UCT for general game playing. In: Wobcke W, Zhang M (eds) AI 2008: Advances in Artificial Intelligence. Springer, Berlin, Heidelberg, pp 49–55
Silver D (2015) UCL course on RL, lecture 2: Markov decision processes. https://www.davidsilver.uk/wp-content/uploads/2020/03/MDP.pdf
Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, Hassabis D (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489. https://doi.org/10.1038/nature16961
Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, Sifre L, Kumaran D, Graepel T, et al. (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815
Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, Chen Y, Lillicrap T, Hui F, Sifre L, van den Driessche G, Graepel T, Hassabis D (2017) Mastering the game of Go without human knowledge. Nature 550:354–359. https://doi.org/10.1038/nature24270
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge, MA, USA
Veness J, Silver D, Blair A, Uther W (2009) Bootstrapping from game tree search. In: Bengio Y, Schuurmans D, Lafferty JD, Williams CKI, Culotta A (eds) Advances in Neural Information Processing Systems 22, pp. 1937–1945. Curran Associates, Inc. http://papers.nips.cc/paper/3722-bootstrapping-from-game-tree-search.pdf
Willemsen D, Baier H, Kaisers M (2020) Value targets in off-policy AlphaZero: a new greedy backup. In: Adaptive and Learning Agents (ALA) Workshop
Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive feedback, which helped improve the manuscript. This work is part of the project Flexible Assets Bid Across Markets (FABAM, project number TEUE117015), funded within the Dutch Topsector Energie / TKI Urban Energy by Rijksdienst voor Ondernemend Nederland (RvO).
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
An abridged version of this work has previously been presented at the Adaptive and Learning Agents workshop at AAMAS 2020 [24]. This extended version introduces additional results and performance analyses on the Tic-Tac-Toe domain.
Cite this article
Willemsen, D., Baier, H. & Kaisers, M. Value targets in off-policy AlphaZero: a new greedy backup. Neural Comput & Applic 34, 1801–1814 (2022). https://doi.org/10.1007/s00521-021-05928-5