Abstract
This article presents and evaluates a family of AlphaZero value targets, subsuming previous variants and introducing AlphaZero with greedy backups (A0GB). Current state-of-the-art algorithms for playing board games combine sample-based planning, such as Monte Carlo Tree Search (MCTS), with deep neural networks (NNs) that approximate the value function. These algorithms, of which AlphaZero is a prominent example, are computationally extremely expensive to train because they rely on a large number of neural network evaluations; this limits their practical performance. We improve the training process of AlphaZero by using more effective training targets for the neural network. We introduce a three-dimensional space to describe a family of training targets, covering the original AlphaZero training target as well as the soft-Z and A0C variants from the literature. We demonstrate that A0GB, using a specific new value target from this family, is able to find the optimal policy in a small tabular domain, whereas the original AlphaZero target fails to do so. In addition, we show that soft-Z, A0C and A0GB achieve better performance and faster training than the original AlphaZero target on two benchmark board games (Connect-Four and Breakthrough). Finally, we compare tabular learning with neural network-based value function approximation in Tic-Tac-Toe, and examine how the different value targets behave in each setting.
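The distinction between the value targets named above can be made concrete with a small sketch. The following is a hypothetical illustration, not the authors' implementation: the `Node` class, its fields, and the greedy-descent rule are assumptions made for exposition. It contrasts the original AlphaZero target (the final game outcome z), a soft-Z-style target (the MCTS mean value estimate at the root), and an A0GB-style greedy backup (the value found by following the best child down the tree).

```python
# Hypothetical sketch of three value targets computed after an MCTS
# search.  Node structure and the greedy rule are illustrative
# assumptions, not the paper's exact definitions.

class Node:
    def __init__(self, value_sum=0.0, visits=0, children=None):
        self.value_sum = value_sum      # sum of values backed up through this node
        self.visits = visits            # visit count N(s)
        self.children = children or []

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def target_alphazero(game_outcome):
    """Original AlphaZero: train toward the final game outcome z."""
    return game_outcome


def target_soft_z(root):
    """soft-Z style: train toward the MCTS mean value estimate of the root."""
    return root.mean_value()


def target_greedy(root):
    """A0GB-style greedy backup: follow the highest-mean-value child to a
    leaf and train toward that leaf's value estimate, so poorly valued
    siblings do not dilute the target."""
    node = root
    while node.children:
        node = max(node.children, key=Node.mean_value)
    return node.mean_value()


# Toy tree: one good child (rarely visited) and one bad child (often visited).
leaf_good = Node(value_sum=0.9, visits=1)
leaf_bad = Node(value_sum=-0.5, visits=5)
root = Node(value_sum=0.4, visits=6, children=[leaf_good, leaf_bad])

print(target_alphazero(-1.0))          # -1.0: depends only on how the game ended
print(round(target_soft_z(root), 3))   # 0.067: averaged over all visits
print(round(target_greedy(root), 3))   # 0.9: follows the best child only
```

The toy tree shows why the averaged target can be misleading: the root's mean is dragged down by the heavily visited bad child, while the greedy backup recovers the value of the best available move.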
Change history
13 June 2022
A Correction to this paper has been published: https://doi.org/10.1007/s00521-022-07447-3
Notes
The bias we discuss here is distinct from the bias associated with the bias-variance trade-off between bootstrapping and Monte Carlo methods. That bias stems from the (transient) effect of value initialization; the bias we consider here is permanent and stems from the difference between the actual and the ideal target policy.
The replay buffer used for training the neural network also contains samples generated by older versions of \(\pi_{\mathrm{AlphaZero}}\). One could argue that this makes AlphaZero off-policy, but we do not go into the details of the effects of replay buffers in this work.
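The off-policy concern raised in this note can be illustrated with a minimal sketch (a hypothetical toy, not the authors' training code): once the buffer's capacity is exceeded, every training batch can still contain samples generated under policies that are several self-play generations old.

```python
import random
from collections import deque

# Minimal replay-buffer sketch (illustrative assumption, not the paper's
# implementation).  Each sample records the self-play generation that
# produced it, making the staleness of buffered data visible.

buffer = deque(maxlen=4)  # small capacity, so old samples linger

for generation in range(6):  # six self-play "generations" of data
    buffer.append({"state": f"s{generation}", "policy_iter": generation})

batch = random.sample(list(buffer), k=2)

# The oldest surviving samples were produced at generation 2, several
# policy updates behind the current generation 5 -- training on them
# means learning from data the current policy would not have generated.
print(sorted(s["policy_iter"] for s in buffer))  # [2, 3, 4, 5]
```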
References
Anthony T, Tian Z, Barber D (2017) Thinking fast and slow with deep learning and tree search. In: Advances in Neural Information Processing Systems, pp. 5360–5370
Auger D, Couetoux A, Teytaud O (2013) Continuous upper confidence trees with polynomial exploration–consistency. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 194–209. Springer
Baier H, Winands MH (2014) Monte-Carlo tree search and minimax hybrids with heuristic evaluation functions. In: Workshop on Computer Games, pp. 45–63. Springer
Browne CB, Powley E, Whitehouse D, Lucas SM, Cowling PI, Rohlfshagen P, Tavener S, Perez D, Samothrakis S, Colton S (2012) A survey of Monte Carlo tree search methods. IEEE Trans Comput Intell AI Games 4(1):1–43. https://doi.org/10.1109/TCIAIG.2012.2186810
Campbell M, Hoane AJ Jr, Hsu F (2002) Deep Blue. Artif Intell 134(1):57–83
Carlsson F, Öhman J (2019) AlphaZero to Alpha Hero: a pre-study on additional tree sampling within self-play reinforcement learning. Bachelor's thesis, KTH, School of Electrical Engineering and Computer Science (EECS)
Coulom R (2006) Efficient selectivity and backup operators in Monte-Carlo tree search. In: International Conference on Computers and Games, pp. 72–83. Springer
van Hasselt H (2010) Double Q-learning. In: Advances in Neural Information Processing Systems, pp. 2613–2621
Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. In: Fürnkranz J, Scheffer T, Spiliopoulou M (eds) Machine Learning: ECML 2006. Springer, Berlin Heidelberg, pp 282–293
Lan L, Li W, Wei T, Wu I (2019) Multiple policy value Monte Carlo tree search. arXiv preprint arXiv:1905.13521
Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971
Moerland TM, Broekens J, Plaat A, Jonker CM (2018) A0C: AlphaZero in continuous action space. arXiv preprint arXiv:1805.09613
Osband I, Blundell C, Pritzel A, Van Roy B (2016) Deep exploration via bootstrapped DQN. In: Advances in Neural Information Processing Systems, pp. 4026–4034
Osband I, Doron Y, Hessel M, Aslanides J, Sezener E, Saraiva A, McKinney K, Lattimore T, Szepesvári C, Singh S, et al. (2019) Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568
Schaeffer J, Lake R, Lu P, Bryant M (1996) Chinook: the world man-machine checkers champion. AI Magazine 17(1):21
Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, Guez A, Lockhart E, Hassabis D, Graepel T, et al. (2019) Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265
Sharma S, Kobti Z, Goodwin S (2008) Knowledge generation for improving simulations in UCT for general game playing. In: Wobcke W, Zhang M (eds) AI 2008: Advances in Artificial Intelligence. Springer, Berlin, Heidelberg, pp 49–55
Silver D (2015) UCL course on RL, lecture 2: Markov decision processes. https://www.davidsilver.uk/wp-content/uploads/2020/03/MDP.pdf
Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, Hassabis D (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489. https://doi.org/10.1038/nature16961
Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, Sifre L, Kumaran D, Graepel T, et al. (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815
Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, Chen Y, Lillicrap T, Hui F, Sifre L, van den Driessche G, Graepel T, Hassabis D (2017) Mastering the game of Go without human knowledge. Nature 550:354–359. https://doi.org/10.1038/nature24270
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge, MA, USA
Veness J, Silver D, Blair A, Uther W (2009) Bootstrapping from game tree search. In: Bengio Y, Schuurmans D, Lafferty JD, Williams CKI, Culotta A (eds) Advances in Neural Information Processing Systems 22, pp. 1937–1945. Curran Associates, Inc. http://papers.nips.cc/paper/3722-bootstrapping-from-game-tree-search.pdf
Willemsen D, Baier H, Kaisers M (2020) Value targets in off-policy AlphaZero: a new greedy backup. In: Adaptive and Learning Agents (ALA) Workshop
Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive feedback, which helped improve the manuscript. This work is part of the project Flexible Assets Bid Across Markets (FABAM, project number TEUE117015), funded within the Dutch Topsector Energie / TKI Urban Energy by Rijksdienst voor Ondernemend Nederland (RvO).
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
An abridged version of this work has previously been presented at the Adaptive and Learning Agents workshop at AAMAS 2020 [24]. This extended version introduces additional results and performance analyses on the Tic-Tac-Toe domain.
Cite this article
Willemsen, D., Baier, H. & Kaisers, M. Value targets in off-policy AlphaZero: a new greedy backup. Neural Comput & Applic 34, 1801–1814 (2022). https://doi.org/10.1007/s00521-021-05928-5