Evolving Game-Specific UCB Alternatives for General Video Game Playing
Abstract
At the core of the most popular version of the Monte Carlo Tree Search (MCTS) algorithm is the UCB1 (Upper Confidence Bound) equation. This equation decides which node to explore next, and therefore shapes the behavior of the search process. If the UCB1 equation is replaced with another equation, the behavior of the MCTS algorithm changes, which might increase its performance on certain problems (and decrease it on others). In this paper, we use genetic programming to evolve replacements to the UCB1 equation targeted at playing individual games in the General Video Game AI (GVGAI) Framework. Each equation is evolved to maximize playing strength in a single game, but is then also tested on all other games in our test set. For every game included in the experiments, we found a UCB replacement that performs significantly better than standard UCB1. Additionally, evolved UCB replacements also tend to improve performance in some GVGAI games for which they are not evolved, showing that improvements generalize across games to clusters of games with similar game mechanics or algorithmic performance. Such an evolved portfolio of UCB variations could be useful for a hyper-heuristic game-playing agent, allowing it to select the most appropriate heuristics for particular games or problems in general.
Keywords
General AI · Genetic programming · Monte-Carlo Tree Search

1 Introduction
Monte Carlo Tree Search (MCTS) is a relatively new and very popular stochastic tree search algorithm, which has been used with great success to solve a large number of single-agent and adversarial planning problems [5]. Unlike most tree search algorithms, MCTS builds unbalanced trees; it spends more time exploring those branches which seem most promising. To do this, the algorithm must balance exploitation and exploration when deciding which node to expand next.
In the canonical formulation of MCTS, the UCB1 equation is used to select which node to expand [1]. It does this by trying to maximize expected reward while also making sure that nodes are not underexplored, so that promising paths are not missed.
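Concretely, the UCB1 rule (Eq. 1) selects the child \(j\) that maximizes

```latex
UCB1 = \overline{X}_j + C \sqrt{\frac{2 \ln n}{n_j}}
```

where \(\overline{X}_j\) is the average reward of child \(j\), \(n_j\) its visit count, \(n\) the parent's visit count, and \(C\) an exploration constant. The first term favors exploitation of high-value children; the second grows for under-visited children, enforcing exploration.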
While MCTS is a general-purpose algorithm, in practice there are modifications to the algorithm that make it perform better on various problems. In the decade since MCTS was invented (in the context of Computer Go [22]), numerous modifications have been proposed to allow it to play better in games as different as Chess [3] and Super Mario Bros [10], and for tasks as different as real-value optimization and real-world planning. While some of these modifications concern relatively peripheral aspects of the algorithm, others change or replace the UCB1 equation at the heart of it.
The large number of different MCTS modifications that have been shown to improve performance on different problems poses the question whether we can automate the search for modifications suitable for particular problems. If we could do that, we could drastically simplify the effort of adapting MCTS to work in a new domain. It also poses the question whether we can find modifications that improve performance compared to the existing UCB1 not just on a single problem, but on a larger class of problems. If we can identify the class of problems on which a particular MCTS version works better, we can then use algorithm selection [12, 21] or hyper-heuristics [6] to select the best MCTS version for a particular problem. And regardless of practical improvements, searching the space of node selection equations helps us understand the MCTS algorithm by characterizing the space of viable modifications.
In this paper, we describe a number of experiments in generating replacements for the UCB1 equation using genetic programming. We use the General Video Game AI (GVGAI) framework as a testbed. We first evolve UCB replacements with the target being performance on individual games, and then we investigate the performance of the evolved equations on all games within the framework. We evolve equations under three different conditions: (1) only given access to the same information as UCB1 (\(UCB_{+}\)); (2) given access to additional game-independent information (\(UCB_{++}\)); and (3) given access to game-specific information (\(UCB_{\#}\)).
2 Background
2.1 Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) is a relatively recently proposed algorithm for planning and game playing. It is a tree search algorithm which selects which nodes to explore in a best-first manner, which means that unlike Minimax (for two-player games) and breadth-first search (for single-player games) Monte Carlo Tree Search focuses on promising parts of the search tree first, while still conducting targeted exploration of under-explored parts. This balance between exploitation and exploration is usually handled through the application of the Upper Confidence Bound for Trees (UCT) algorithm which applies UCB1 to the search tree.
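As a minimal sketch of the selection step described above (not the paper's implementation), the following function applies UCB1 to pick a child node; the dictionary-based node representation is purely illustrative:

```python
import math

def uct_select(children, c=math.sqrt(2)):
    """Pick the child maximizing the UCB1 score.

    Each child is a dict with 'value' (total accumulated reward) and
    'visits'. Unvisited children are selected first (infinite score).
    """
    parent_visits = sum(ch["visits"] for ch in children)

    def ucb1(ch):
        if ch["visits"] == 0:
            return float("inf")
        exploitation = ch["value"] / ch["visits"]  # mean reward of this child
        exploration = c * math.sqrt(math.log(parent_visits) / ch["visits"])
        return exploitation + exploration

    return max(children, key=ucb1)
```

During the selection phase, MCTS calls such a function repeatedly while descending from the root until it reaches a node with unexpanded children.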
Ideally, we need some way of searching through the different possible variations of tree selection policies to find one that is well suited for the particular game in question. We propose addressing this problem by evolving tree selection policies to find specific formulations that are well suited for specific games. If successful, this would allow us to automatically generate adapted versions of UCB for previously unseen games, potentially leading to better general game-playing performance.
2.2 Combinations of Evolution and MCTS
Evolutionary computation is the use of algorithms inspired by Darwinian evolution for search, optimization, and/or design. Such algorithms have a very wide range of applications due to their domain-generality; with an appropriate fitness function and representation, evolutionary algorithms can be successfully applied to optimization tasks in a variety of fields.
There are several different ways in which evolutionary computation could be combined with MCTS for game playing. Perhaps the most obvious combination is to evolve game state evaluators. In many cases, it is not possible for the rollouts of MCTS to reach a terminal game state; in those cases, the search needs to “bottom out” in some kind of state evaluation heuristic. This state evaluator needs to correctly estimate the quality of a game state, which is a non-trivial task. Therefore the state evaluator can be evolved; the fitness function is how well the MCTS agent plays the game using the state evaluator [19].
Of particular interest for the current investigation is Cazenave’s work on evolving UCB1 alternatives for Go [7]. It was found that it was possible to evolve heuristics that significantly outperformed the standard UCB1 formulation; given the appropriate primitives, it could also outperform more sophisticated UCB variants specifically aimed at Go. While successful, Cazenave’s work only concerned a single game, and one which is very different from a video game.
MCTS can be used for many of the same tasks as evolutionary algorithms, such as content generation [4, 5] and continuous optimization [14]. Evolutionary algorithms have also been used for real-time planning in single-player [16] and two-player games [11].
2.3 General Video Game Playing
The problem of General Video Game Playing (GVGP) [13] is to play unseen games. Agents are evaluated on their performance on a number of games which the designer of the agent did not know about before submitting the agent. Whereas General Game Playing targets turn-based board games, GVGP focuses on real-time video games. In this paper, we use the General Video Game AI (GVGAI) framework, the software framework associated with the GVGP competition [17, 18]. In the competition, competitors submit agents which are scored on playing ten unseen games that resemble (and in some cases are modeled on) classic arcade games from the seventies and eighties.
Our experiments use the following five games:
Boulderdash: a VGDL (Video Game Description Language) port of Boulder Dash. The player's goal is to collect at least ten diamonds and then reach the goal while not getting killed by enemies or boulders.
Butterflies: an arcade game developed specifically for the framework. The player's goal is to collect all the butterflies before they destroy all the flowers.
Missile Command: a VGDL port of Missile Command. The player's goal is to protect at least one city building from being destroyed by the incoming missiles.
Solar Fox: a VGDL port of Solar Fox. The player's goal is to collect all the diamonds while avoiding the side walls and enemy bullets. The player has to move continuously, which makes the game harder.
Zelda: a VGDL port of the dungeon system of The Legend of Zelda. The goal is to reach the exit without getting killed by enemies. The player can kill enemies with a sword.
Table 1: Descriptive statistics for all tested games. Mean, Median, Min, Max, and SD refer to the score attained under each tree policy (UCB1, \(UCB_+\), \(UCB_{++}\), and \(UCB_{\#}\)). A bold value is significantly better than UCB1 (\(p<0.05\)).

Game | Tree policy | Mean | Median | Min | Max | SD | Win ratio
---|---|---|---|---|---|---|---
Boulderdash | UCB1 | 5.30 | 4.00 | 0 | 186 | 5.22 | 0 |
Boulderdash | \(UCB_+\) | 5.05 | 4.00 | 0 | 18 | 2.85 | 0 |
Boulderdash | \(UCB_{++}\) | 28.48 | 3.0 | −1.0 | 36.0 | 23.92 | 0.018 |
Boulderdash | \(UCB_{\#}\) | 27.03 | 3.0 | −1.0 | 36.0 | 21.85 | 0.014 |
Butterflies | UCB1 | 37.39 | 32.00 | 8 | 86 | 18.92 | 0.902 |
Butterflies | \(UCB_+\) | 36.34 | 30.00 | 8 | 88 | 18.68 | 0.89 |
Butterflies | \(UCB_{++}\) | 35.84 | 30.00 | 8 | 80 | 18.43 | 0.914 |
Butterflies | \(UCB_{\#}\) | 22.302 | 18.0 | 12.0 | 48.0 | 8.13 | 0.993 |
Missile Command | UCB1 | 2.88 | 2.00 | 2 | 8 | 1.37 | 0.641 |
Missile Command | \(UCB_+\) | 3.03 | 2.00 | 2 | 8 | 1.44 | 0.653 |
Missile Command | \(UCB_{++}\) | 4.95 | 5.00 | 2 | 8 | 2.13 | 0.785 |
Missile Command | \(UCB_{\#}\) | 8.0 | 8.0 | 8.0 | 8.0 | 0.0 | 1.0 |
Solarfox | UCB1 | 6.31 | 5.00 | 0 | 32 | 6.06 | 0.00565 |
Solarfox | \(UCB_+\) | 6.49 | 5.00 | 0 | 32 | 5.81 | 0.0075 |
Solarfox | \(UCB_{++}\) | 7.765 | 6.0 | −7.0 | 32.0 | 9.152 | 0.067 |
Solarfox | \(UCB_{\#}\) | 18.57 | 18.0 | −5.0 | 32.0 | 12.318 | 0.412 |
Zelda | UCB1 | 3.58 | 4.00 | 0 | 8 | 1.85 | 0.088 |
Zelda | \(UCB_{+}\) | 6.32 | 6.00 | 0 | 8 | 1.26 | 0.155 |
Zelda | \(UCB_{++}\) | 6.906 | 8.0 | −1.0 | 8.0 | 1.623 | 0.633 |
Zelda | \(UCB_{\#}\) | 6.661 | 7.0 | −1.0 | 8.0 | 1.731 | 0.613 |
2.4 Genetic Programming
Genetic Programming (GP) [20] is a branch of evolutionary algorithms [2, 8] which evolves computer programs as solutions to a given problem. GP is essentially the application of genetic algorithms (GAs) [23] to computer programs. Like GAs, GP evolves solutions based on the Darwinian theory of evolution. A GP run starts with a population of candidate solutions called chromosomes, each of which is evaluated for its fitness (how well it solves the problem). In GP, chromosomes are most commonly represented as syntax trees, where inner nodes are functions (e.g. addition, subtraction, conditions) while leaf nodes are terminals (e.g. constants, variables). Fitness is calculated by running the program and measuring how well it solves the problem. GP uses crossover and mutation to produce new chromosomes: crossover combines two programs by swapping the subtrees rooted at selected nodes, while mutation alters the value of a single node.
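The subtree-swapping crossover and point mutation just described can be sketched as follows; the nested-list tree encoding and function set are illustrative choices, not the paper's implementation:

```python
import random

# A chromosome is a nested-list syntax tree: [op, child1, child2] for
# functions, ['var', name] or ['const', value] for terminals.
FUNCTIONS = ["add", "sub", "mul", "div"]

def all_subtrees(tree, path=()):
    """Yield (path, subtree) for every node in the tree."""
    yield path, tree
    if tree[0] in FUNCTIONS:
        yield from all_subtrees(tree[1], path + (1,))
        yield from all_subtrees(tree[2], path + (2,))

def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def set_subtree(tree, path, new):
    """Return a copy of `tree` with the node at `path` replaced by `new`."""
    if not path:
        return new
    node = list(tree)
    node[path[0]] = set_subtree(node[path[0]], path[1:], new)
    return node

def crossover(a, b, rng=random):
    """Graft a random subtree of `b` onto a random point of `a`."""
    pa, _ = rng.choice(list(all_subtrees(a)))
    pb, _ = rng.choice(list(all_subtrees(b)))
    return set_subtree(a, pa, get(b, pb))

def point_mutate(tree, rng=random):
    """Alter the value of one randomly chosen node."""
    path, node = rng.choice(list(all_subtrees(tree)))
    if node[0] in FUNCTIONS:
        new = [rng.choice(FUNCTIONS)] + list(node[1:])
    else:
        new = ["const", rng.uniform(-1.0, 1.0)]
    return set_subtree(tree, path, new)
```

A full GP system would add depth limits and typed primitives, but the two operators above are the core of how new candidate equations arise.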
3 Methods
The variables available to the GP fall into three categories:
Tree Variables: represent the state of the tree built by MCTS, i.e. the variables used by the UCB1 formula;
Agent Variables: relate to the agent's behavior;
Game Variables: describe the state of the game.
The population of the first generation contains one UCB1 chromosome and 99 random ones. UCB1 is injected into the initial population to push the GP to converge faster toward a better equation. We run the GP for 30 generations. Between one generation and the next, 10% elitism is applied to guarantee that the best chromosomes are carried over to the next generation.
The performance of the best equations, among all chromosomes evolved by the GP, is validated through the simulation of 2000 playthroughs.
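The generational scheme described above (a UCB1 seed, 99 random chromosomes, 30 generations, 10% elitism) can be sketched as follows; the `fitness`, `random_chromosome`, and `breed` callables are placeholders for the problem-specific pieces (game simulation, tree generation, crossover plus mutation):

```python
import random

def evolve(ucb1_seed, random_chromosome, fitness, breed,
           pop_size=100, generations=30, elite_frac=0.10, rng=random):
    """Generational GP loop: seed with UCB1, keep the top 10% each generation.

    `fitness` scores a chromosome (e.g. win rate over simulated games),
    `random_chromosome` builds a random syntax tree, and `breed` combines
    two parents via crossover and mutation. All three are assumptions of
    this sketch, not APIs of the paper's system.
    """
    population = [ucb1_seed] + [random_chromosome() for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        n_elite = int(elite_frac * pop_size)
        elites = ranked[:n_elite]                      # carried over unchanged
        offspring = [breed(rng.choice(ranked), rng.choice(ranked))
                     for _ in range(pop_size - n_elite)]
        population = elites + offspring
    return max(population, key=fitness)
```

A real run would use fitness-proportionate or tournament selection rather than uniform parent choice, and would cache the noisy game-based fitness evaluations.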
We evolve three classes of equations, distinguished by the variable sets they may use:
\(UCB_{+}\), using only Tree Variables;
\(UCB_{++}\), using both Tree and Agent Variables;
\(UCB_{\#}\), using Tree, Agent and Game Variables.
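The three conditions differ only in the terminal set available to the GP. Grouping the variables of Table 2 (the Python-set packaging below is illustrative):

```python
# Terminal sets for the three experimental conditions.
# Variable names follow Table 2.
TREE_VARS  = {"d_j", "n", "n_j", "X_j"}
AGENT_VARS = {"U_j", "E_xy", "R_j", "RV_j"}
GAME_VARS  = {"D_mov", "D_immov", "D_npc", "D_port", "N_port"}

TERMINALS = {
    "UCB+":  TREE_VARS,
    "UCB++": TREE_VARS | AGENT_VARS,
    "UCB#":  TREE_VARS | AGENT_VARS | GAME_VARS,
}
```

Each condition's terminal set strictly contains the previous one, which is why the later conditions can only add expressive power.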
4 Results
Table 1 shows the results of all the evolved equations compared to UCB1. We can see that \(UCB_{\#}\) is almost always better than all the other equations, followed by \(UCB_{++}\), then \(UCB_{+}\), and finally UCB1. This was expected, as the later equations have access to more information about the current game than the earlier ones. Interestingly, almost all evolved equations implement the core idea behind UCB1 (Eq. 1), in that they consist of two parts: exploitation and exploration.
4.1 Evolved Equations
In this section we will discuss and try to interpret every equation evolved for each game. We describe the best formula found in each experimental condition (\(UCB_{+}\), \(UCB_{++}\), and \(UCB_{\#}\)) for each game. All variables and their meanings can be found in Table 2.
4.2 Testing Evolved Equations on All Other Games
We ran all evolved equations on all 62 public games in the GVGAI framework to see whether any of these equations generalize. All results are shown in Fig. 2. The average win ratio of the evolved equations is generally similar to that of UCB1. In particular, the two \(UCB_{\#}\) equations evolved for Butterflies and Missile Command are better than UCB1 over all games, in the latter case with a gain of 0.02 (2%) in win ratio. The only \(UCB_{\#}\) that performs poorly compared to the others is the one evolved for Zelda. The probable reason is that the formula evolved for Zelda, a complex game, overfitted to this game (as evidenced by the complexity and length of the formula).
For over 20 games, excluding the ones used in the evolutionary process, at least one of the evolved equations improved the win ratio by 0.05 (5%). In some cases the gain is more than 0.6 (60%), resulting in the algorithm mastering the game. For example, the \(UCB_{\#}\) equation evolved for Boulderdash performs remarkably in Frogs, Camel Race and Zelda, taking MCTS to win rates of 96%, 100% and 70% respectively. This happens because the evolved formula embeds important information about the game: the benefit of decreasing the distance to portals.
Table 2: Variables used in the formulas, their meanings, and the type of each variable (tree, agent, or game).

Variable | Type | Meaning
---|---|---
\(d_{j}\) | Tree | Child depth |
n | Tree | Parent visits |
\(n_{j}\) | Tree | Child visits |
\(X_{j}\) | Tree | Child value |
\(U_{j}\) | Agent | Useless moves for this child |
\(E_{xy}\) | Agent | Number of visits of the current tile |
\(R_{j}\) | Agent | Number of repetitions of the current action
\(RV_{j}\) | Agent | Number of actions opposite to the current one
\(D_{mov}\) | Game | Distance from movable object |
\(D_{immov}\) | Game | Distance from immovable object |
\(D_{npc}\) | Game | Distance from NPC |
\(D_{port}\) | Game | Distance from portal |
\(N_{port}\) | Game | Number of portals |
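To illustrate how the game variables of Table 2 can enter a selection rule, consider a hypothetical \(UCB_{\#}\)-style score that augments UCB1 with a portal-distance bonus. This is not one of the paper's evolved formulas, only a sketch of the kind of term the Boulderdash result suggests:

```python
import math

def ucb_sharp_example(X_j, n, n_j, D_port, c=math.sqrt(2), eps=1.0):
    """Hypothetical UCB#-style score (illustrative, not an evolved result).

    Adds to the standard UCB1 score a term that grows as the distance to
    the nearest portal (D_port) shrinks, biasing the search toward goals.
    """
    exploitation = X_j                                    # child value
    exploration = c * math.sqrt(math.log(n) / n_j)        # UCB1 exploration
    portal_bonus = 1.0 / (D_port + eps)                   # larger when closer
    return exploitation + exploration + portal_bonus
```

Under such a rule, two children with identical tree statistics are ranked by how much closer their states are to a portal.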
5 Discussion and Conclusion
In this paper we have presented a system to evolve heuristics, used in the node selection phase of MCTS, for specific games. The evolutionary process implements a Genetic Programming technique that promotes chromosomes with the highest win rate. The goal is to examine the possibility of systematically finding UCB alternatives.
Our data supports the hypothesis that it is possible to find significantly enhanced heuristics. Moreover, we can argue that embedding knowledge about the game (\(UCB_{\#}\)) and about the agent's behavior (both \(UCB_{++}\) and \(UCB_{\#}\)) yields exceptional improvements. With either \(UCB_{++}\) or \(UCB_{\#}\) we were able to beat UCB1 in all five games used in our experiments; \(UCB_{+}\) was able to beat UCB1 in three games while using the same information as UCB1. This supports the idea of developing a portfolio of equations that can be selected by a hyper-heuristic agent or through algorithm selection to achieve higher performance.
Many of the evolved UCB alternatives still resemble the exploitation/exploration structure of the original UCB. While the exploitation term is largely unchanged, though occasionally swapped or enhanced by the mixmax modification, the equations push the concept of exploration toward different meanings, such as spatial exploration and game-element hunting, embedding some general knowledge of the domain of games in the equation. One might even use the evolved formulas to better understand the games themselves.
Subsequently, we tested the evolved equations across all games currently available in the GVGAI framework. We observed an overall slight improvement for the \(UCB_{\#}\) equations evolved for Missile Command and Butterflies. We also noted that a single clean equation can perform very well on games that share some design aspect. This encourages us to evolve heuristics not just for a single game but for clusters of games.
References
- 1. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2–3), 235–256 (2002)
- 2. Bäck, T., Schwefel, H.P.: An overview of evolutionary algorithms for parameter optimization. Evol. Comput. 1(1), 1–23 (1993)
- 3. Baier, H., Winands, M.H.: Monte-Carlo Tree Search and minimax hybrids. In: 2013 IEEE Conference on Computational Intelligence in Games (CIG), pp. 1–8. IEEE (2013)
- 4. Browne, C.: Towards MCTS for creative domains. In: Proceedings of the International Conference on Computational Creativity, Mexico City, Mexico, pp. 96–101 (2011)
- 5. Browne, C.B., Powley, E., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A survey of Monte Carlo Tree Search methods. IEEE Trans. Comput. Intell. AI Games 4(1), 1–43 (2012)
- 6. Burke, E.K., Gendreau, M., Hyde, M., Kendall, G., Ochoa, G., Özcan, E., Qu, R.: Hyper-heuristics: a survey of the state of the art. J. Oper. Res. Soc. 64(12), 1695–1724 (2013)
- 7. Cazenave, T.: Evolving Monte Carlo Tree Search algorithms. Dept. Inf., Univ. Paris 8 (2007)
- 8. Eiben, A.E., Smith, J.E.: Introduction to Evolutionary Computing, vol. 53. Springer, Heidelberg (2003)
- 9. Frydenberg, F., Andersen, K.R., Risi, S., Togelius, J.: Investigating MCTS modifications in general video game playing. In: 2015 IEEE Conference on Computational Intelligence and Games (CIG), pp. 107–113. IEEE (2015)
- 10. Jacobsen, E.J., Greve, R., Togelius, J.: Monte Mario: platforming with MCTS. In: Proceedings of the 2014 Conference on Genetic and Evolutionary Computation, pp. 293–300. ACM (2014)
- 11. Justesen, N., Mahlmann, T., Togelius, J.: Online evolution for multi-action adversarial games. In: Squillero, G., Burelli, P. (eds.) EvoApplications 2016. LNCS, vol. 9597, pp. 590–603. Springer, Cham (2016). doi:10.1007/978-3-319-31204-0_38
- 12. Kotthoff, L.: Algorithm selection for combinatorial search problems: a survey. AI Mag. 35(3), 48–60 (2014)
- 13. Levine, J., Congdon, C.B., Ebner, M., Kendall, G., Lucas, S.M., Miikkulainen, R., Schaul, T., Thompson, T.: General video game playing. Dagstuhl Follow-Ups 6 (2013)
- 14. McGuinness, C.: Monte Carlo Tree Search: Analysis and Applications. Ph.D. thesis (2016)
- 15. Park, H., Kim, K.J.: MCTS with influence map for general video game playing. In: 2015 IEEE Conference on Computational Intelligence and Games (CIG), pp. 534–535. IEEE (2015)
- 16. Perez, D., Samothrakis, S., Lucas, S., Rohlfshagen, P.: Rolling horizon evolution versus tree search for navigation in single-player real-time games. In: Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pp. 351–358. ACM (2013)
- 17. Perez, D., Samothrakis, S., Togelius, J., Schaul, T., Lucas, S., Couëtoux, A., Lee, J., Lim, C.U., Thompson, T.: The 2014 General Video Game Playing Competition (2015)
- 18. Perez-Liebana, D., Samothrakis, S., Togelius, J., Schaul, T., Lucas, S.M.: General Video Game AI: competition, challenges and opportunities (2016)
- 19. Pettit, J., Helmbold, D.: Evolutionary learning of policies for MCTS simulations. In: Proceedings of the International Conference on the Foundations of Digital Games, pp. 212–219. ACM (2012)
- 20. Poli, R., Langdon, W.B., McPhee, N.F., Koza, J.R.: A Field Guide to Genetic Programming. Lulu.com, Raleigh (2008)
- 21. Rice, J.R.: The algorithm selection problem. Adv. Comput. 15, 65–118 (1976)
- 22. Rimmel, A., Teytaud, O., Lee, C.S., Yen, S.J., Wang, M.H., Tsai, S.R.: Current frontiers in computer Go. IEEE Trans. Comput. Intell. AI Games 2(4), 229–238 (2010)
- 23. Whitley, D.: A genetic algorithm tutorial. Stat. Comput. 4(2), 65–85 (1994)