Introduction

Being able to transfer previously acquired knowledge to a new domain is one of the hallmarks of human intelligence. This ability relies on important cognitive building blocks, such as an abstract representation of concepts underlying tasks (Lake et al., 2017). One way to form these representations when the task involves interactions with others is to build a model of the person we are interacting with that offers predictions of the actions they are likely to take next. There is evidence that people learn such models of their opponents when playing repeated economic games (Stahl & Wilson, 1995). A model of the opponent can help increase performance in a particular game, but learning more general characteristics of an opponent may also help increase performance in other games. In this paper, we are specifically interested in the latter: How do people build and use models of their opponent to facilitate learning transfer?

Repeated games, in which players interact repeatedly with the same opponent and have the ability to learn about their opponent’s strategies and preferences (Mertens, 1990), are particularly useful to address this question. The early literature on learning transfer in repeated games has mostly focused on the proportion of people who play normatively optimal actions (e.g. Nash equilibrium play) or salient actions (e.g. risk-dominant choices) in later games, after prior experience with a similar game environment (Ho et al., 1998; Knez & Camerer, 2000). As is well known, in a Nash equilibrium all players in a game act such that no-one can unilaterally improve their performance by deviating from their strategy. When playing against an opponent with a Nash-optimal strategy, you can do no better than play according to the Nash equilibrium strategy as well. However, when faced with a player who deviates from the Nash-optimal strategy, you may be able to exploit this by deviating from the equilibrium strategy yourself, increasing your performance beyond what is expected at the Nash equilibrium. Of course, this comes with a risk, as your own deviation from Nash-optimal play may leave you open to similar exploitation.

Studies that focused on whether people can learn to exploit deviations from Nash equilibrium play have mostly looked at the ability of players to detect and exploit contingencies in their opponent’s actions (Dyson et al., 2016; Spiliopoulos, 2013; Shachat & Swarthout, 2004). These studies used computer opponents that do not adapt their strategy to their human opponent, mostly playing each action with a fixed probability (a mixed strategy) or following a pre-determined sequence. The findings show that humans are capable of adapting to non-Nash equilibrium play and of detecting patterns in an opponent’s history of play. However, the use of mixed strategies may have limited people’s ability to form accurate opponent models.

The game of Rock-Paper-Scissors (RPS) has emerged as a central paradigm to test sequential adversarial reasoning in repeated interactions. Beyond action frequencies, Dyson (2019) identifies cycle-based and outcome-based dependencies as important strategies in repeated RPS games. A positive cycle is when a player chooses in trial t + 1 an action that would have beaten the action in trial t (e.g. playing Paper after Rock), whilst a negative cycle goes in the opposite direction by choosing an action in trial t + 1 that would have been beaten by the action on trial t (e.g. playing Scissors after Rock). These cycle-based strategies can be based on the opponent’s or a player’s own past choices. For example, Eyler et al. (2009) show that players have a tendency to repeat their previous round actions. Brockbank and Vul (2021) defined a hierarchy of strategies based on previous actions (base rate of choices) as well as cycle transitions (base rate of positive and negative cycles). Using data on dyads playing 300 rounds of RPS against the same opponent, they show that exploitation of non-random play leverages simple regularities between a player’s next action and their own previous action, and between a player’s next action and their opponent’s previous action, whilst more complex regularities remain unexploited. Outcome-based dependencies are another common basis for RPS strategies. The idea is to change the likelihood of future actions based on the outcome (win, loss, or tie) of the previous round. Heuristics such as win-stay/lose-shift are an example of such outcome dependencies. Players can also combine outcome- and cycle-based strategies, choosing to cycle in a positive or negative direction depending on whether the previous round was won or lost (Wang et al., 2014; Xu et al., 2013).
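
To make these cycle definitions concrete, the following short Python sketch (our own illustration, not code from the studies cited above) classifies a pair of consecutive RPS actions as a repeat, a positive cycle, or a negative cycle:

```python
# Illustrative sketch: classify consecutive RPS actions as a repeat,
# a positive cycle (the new action beats the previous one) or a
# negative cycle (the new action is beaten by the previous one).
BEATS = {"R": "S", "P": "R", "S": "P"}  # key beats value

def classify_transition(prev_action, next_action):
    if next_action == prev_action:
        return "repeat"
    if BEATS[next_action] == prev_action:
        return "positive cycle"   # e.g. Paper after Rock
    return "negative cycle"       # e.g. Scissors after Rock

assert classify_transition("R", "P") == "positive cycle"
assert classify_transition("R", "S") == "negative cycle"
```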

Another way to frame these strategies is in terms of iterative reasoning. People may think of their opponents as applying a limited number of recursive steps to determine their next action. The type of reasoning we refer to takes the form of “I believe that you believe that I believe …”. For example, if I believe that you believe I will play Rock (perhaps because I played Rock in the previous round), then I would expect you to play Paper, as this beats my Rock. I would therefore choose to play Scissors, as this beats your Paper. This results in a self-directed negative cycle strategy (playing the action that one’s previous action would beat), but it is the result of applying one level of recursive reasoning (player A believes player B believes A will repeat their last action, and A best responds to player B’s best response to this belief).

Strategies based on iterative reasoning can explain non-equilibrium play in a range of games (Camerer, 2003), such as the p-beauty contest game (Nagel, 1995), dominance solvable games (Camerer et al., 2004), and various standard normal form games (Costa-Gomes et al., 2001). They also underlie successful models in behavioural economics, such as Level-k and Cognitive Hierarchy models (Camerer et al., 2004), which posit that people assume their opponent applies a limited level of iterative reasoning, taking actions that are best responses to those of their (modelled) opponent. In Level-k theory, a level-0 player uses a fixed strategy without explicitly considering the strategy of their opponent. A level-1 player assumes their opponent is a level-0 player, and chooses actions to best respond to the strategy of their opponent, without considering what their opponent might believe that they will play. A level-2 player, on the other hand, takes their opponent’s belief about their actions into account, assuming they face a level-1 player, and choosing actions to best respond to the actions of that player. Cognitive Hierarchy theory is based on similar principles, but rather than assuming an opponent always adopts a particular level-k strategy, they are assumed to adopt each of the level-k strategies with a particular probability (i.e. the opponent’s strategy is a mixture over pure level-k strategies).
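
The recursion underlying level-k reasoning can be made explicit in a few lines of code. The sketch below is our own illustration, assuming a level-0 anchor who simply repeats their previous action (one of several possible anchors); the function and variable names are ours:

```python
# Illustrative sketch of level-k reasoning in RPS, assuming the level-0
# anchor simply repeats their own previous action.
BEATS = {"R": "S", "P": "R", "S": "P"}          # key beats value
BEATEN_BY = {v: k for k, v in BEATS.items()}    # BEATEN_BY[x] beats x

def level_k_action(k, my_last, their_last):
    """Action a level-k player takes, given both players' previous actions."""
    if k == 0:
        return my_last  # level-0: repeat own previous action
    # A level-k player best responds to what a level-(k-1) opponent
    # (whose 'my_last' is this player's 'their_last') would play.
    opponent_prediction = level_k_action(k - 1, their_last, my_last)
    return BEATEN_BY[opponent_prediction]

# A level-1 player whose opponent played Rock last round plays Paper;
# a level-2 player who played Rock last round expects Paper and plays Scissors.
print(level_k_action(1, my_last="P", their_last="R"))  # -> "P"
print(level_k_action(2, my_last="R", their_last="P"))  # -> "S"
```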

Batzilis et al. (2019) analysed data from a million online RPS games and found that players strategically used information on their opponents’ previous play. Whilst the majority of play (74%) is consistent with a level-0 strategy, a significant proportion is consistent with either a level-1 (19%) or level-2 strategy (7%). As noted above, iterative reasoning results in specific action contingencies which correspond to particular action-, outcome-, and cycle-based strategies. Prior research analysing behaviour in terms of these latter strategies has not explicitly related these to levels of iterative reasoning. That particular patterns are observed more frequently, and are exploited more readily (Brockbank and Vul, 2021), may be due to people adopting specific forms of iterative reasoning. For example, using a sequential move game, Hedden and Zhang (2002) found that people initially adopt a level-1 strategy, thus assuming their opponent to adopt a level-0 strategy, but can learn over time to adopt a level-2 strategy if their opponent consistently uses a level-1 strategy. In a different version of this sequential move game, Goodie et al. (2012) found that participants initially adopt a level-2 strategy, but over time can learn to adopt a level-1 strategy if the opponent consistently uses a level-0 strategy. To what extent these results generalise to simultaneous move games such as RPS is unclear. Zhang et al. (2021) found that people can learn to beat computer opponents adopting a win-stay lose-change or win-change lose-stay strategy, which are both level-0 strategies.

There are three main ways in which you can learn to exploit a level-k opponent. One way is to explicitly learn the depth of their iterative reasoning, for example that the opponent’s actions are based on a level-2 strategy. The second way is to learn the contingencies between previous round play and an opponent’s next action (e.g. that your opponent is likely to play Scissors if your previous action was Paper). Rather than learning to predict an opponent’s next actions, and then deciding upon the best counter-move, a third strategy is to directly learn rewarding actions. Unlike learning contingencies between actions, or which current action is most rewarding, learning the depth of iterative reasoning allows for generalisation to other games with different actions. This can then provide an early advantage, allowing you to exploit the opponent’s strategy before you have played long enough to reliably establish contingencies within a game. In what we call the Bayesian Cognitive Hierarchy model, we propose that people use Bayesian inference to determine the depth of iterative reasoning of their opponent, and use this to predict their opponent’s actions and determine the best response to this. We contrast this to a Reinforcement Learning model, which learns which actions are most rewarding given previous round play, by learning state-action values (the expected reward from taking an action in a given state). The Experience-Weighted Attraction (EWA) model (Ho et al., 2007), popular in behavioural game theory, includes an additional mechanism to learn about the consequences of actions that were not actually taken.

As both RL and EWA models learn values for actions in states (here, we define the state as the previous play), they do not allow for generalisation to new games with different actions, as there is no immediate way in which to map a set of actions in one game to a set of actions in another. For example, consider the game of Fire-Water-Grass (FWG), where water beats fire, grass beats water, and fire beats grass. This game is structurally the same as RPS. However, if you have learned in RPS to play Rock after a previous play of Paper, there is no direct way to transfer this to knowing that Fire is a rewarding move after previously having played Water. Learning about the new actions in FWG requires experience with the FWG game. This is in contrast to learning the level of iterative reasoning of the opponent. If you know they expect you to repeat your previous action and choose their action to beat this, then you can infer that they will play Grass after your previous choice of Water, even if you have never played FWG before. Whilst providing a way to transfer knowledge about an opponent to new games, iterative reasoning is likely to be more cognitively demanding than learning action contingencies. Once sufficient experience has been gained within a game, iterative reasoning may no longer provide an advantage over simpler reinforcement learning. Therefore, it might be more efficient to use an RL strategy in the later stages of a game. In our modelling, we will allow for such switching between strategies during games.

In the present study, we let humans repeatedly face computer agents endowed with a limited ability for iterative reasoning based on previous round play. As in previous studies discussed above, we use computer opponents to enable precise experimental control over the strategies used and the transfer of depth of iterative reasoning between games. We aim to assess (1) whether human players adapt their strategy to exploit this limited reasoning of their opponent, and (2) whether they are able to generalise a learned opponent model to other games. In two experiments, participants repeatedly face the same opponent (Experiment 1) or two different opponents (Experiment 2) in three consecutive games: the well-known Rock-Paper-Scissors game, the structurally similar Fire-Water-Grass game, and a less similar Numbers (Experiment 1) or Shootout (Experiment 2) game. To foreshadow our results, we find evidence that participants transfer the learned strategy of their opponent to new games, providing them with an early advantage before gaining enough experience within a game to enable learning contingencies between previous and winning actions. Computational modelling shows that participants, when first encountering an opponent in a new game, employ Bayesian inference about the level of iterative reasoning of their opponent to predict their actions and determine the best response. However, in later rounds of the games, they switch to a cognitively less demanding reinforcement learning strategy.

Experiment 1

In the first experiment, we aim to test learning transfer by making participants face the same computer opponent with a limited level of iterative reasoning in three sequential games that vary in similarity. If participants are able to learn the limitations of their opponent’s iterative reasoning and generalise this to new games, their performance in (the early stages of) later games should be higher than expected if they were to completely learn a new strategy in each game.

Methods

Participants and Design

A total of 52 (28 female, 24 male) participants were recruited on the Prolific Academic platform. The mean age of participants was 31.2 years. Participants were paid a fixed fee of £2.5 plus a bonus dependent on their performance (£1.06 on average). The experiment had a 2 (computer opponent: level-1 or level-2) by 3 (games: rock-paper-scissors, fire-water-grass, numbers) design, with repeated measures on the second factor. Participants were randomly assigned to one of the two levels of the first factor.

Tasks

Participants played three games against their computer opponent: Rock-Paper-Scissors (RPS), Fire-Water-Grass (FWG), and the Numbers game. RPS is a 3 × 3 zero-sum game, with a cyclical hierarchy between the two players’ actions: Rock blunts Scissors, Paper wraps Rock, and Scissors cut Paper. If one player chooses an action which dominates their opponent’s action, the player wins (receives a reward of 1) and the other player loses (receives a reward of − 1). Otherwise, it is a draw and both players receive a reward of 0. RPS has a unique mixed-strategy Nash equilibrium, which consists of each player in each round randomly selecting from the three options with uniform probability.

The FWG game is identical to RPS in all but action labels: Fire burns Grass, Water extinguishes Fire, and Grass absorbs Water. We use this game as we are interested in whether learning is transferred to a fundamentally similar game, where the only difference is in the labels of the possible actions. This should make it relatively easy to generalise knowledge of the opponent’s strategy, provided this knowledge is on a sufficiently abstract level, such as knowing the opponent is a level-1 or level-2 player. Crucially, learning simple contingencies such as “If I played Rock on the previous round, playing Scissors next will likely result in a win” is not generalisable to this similar game, as these contingencies are tied to the labels of the actions.

The Numbers game is a generalisation of RPS. In the variant we use, both players concurrently pick a number between 1 and 5. To win in this game, a participant needs to pick a number exactly 1 higher than the number chosen by their opponent. For example, if a participant thinks their opponent will pick 3, they ought to choose 4 to win the round. To make the strategies cyclical as in RPS, the game stipulates that the lowest number (1) beats the highest number (5), so if the participant thinks the opponent will play 5, then the winning choice is to pick 1. This game thus has a structure similar to RPS, in which every action is dominated by exactly one other action. All other possible combinations of choices are considered ties. Similar to RPS and FWG, the mixed-strategy Nash equilibrium is to randomly choose each action with equal probability.
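
The cyclical win rule of the Numbers game can be written compactly with modular arithmetic. The following sketch is our own illustration of the rule described above; the function name is ours:

```python
# Illustrative sketch of the Numbers game outcome: a player wins by choosing
# exactly 1 more than their opponent, with the lowest number (1) beating the
# highest (5); every other combination is a tie.
def numbers_outcome(player, opponent):
    """Return +1 (win), -1 (loss) or 0 (tie) for choices in 1..5."""
    if player % 5 == (opponent + 1) % 5:    # player is exactly one above opponent
        return 1
    if opponent % 5 == (player + 1) % 5:    # opponent is exactly one above player
        return -1
    return 0

assert numbers_outcome(4, 3) == 1   # 4 beats 3
assert numbers_outcome(1, 5) == 1   # 1 beats 5 (wrap-around)
assert numbers_outcome(2, 5) == 0   # any other combination is a tie
```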

The computer opponent was programmed to use either a level-1 or level-2 strategy in all the games. A level-1 player is defined as a player who best responds to a level-0 player. A level-0 player does not consider their opponent’s beliefs. Here, we assume a level-0 player simply repeats their previous action. There are other ways to define a level-0 player, for instance as repeating their action if it resulted in a win and choosing randomly from the remaining actions otherwise, or as choosing randomly from all actions. As any action is a best response to a uniformly random strategy, and both alternatives involve random choices, defining a level-0 player in such a way would make a level-1 opponent’s strategy much harder to discern. Because we are mainly interested in generalisation of knowledge of an opponent’s strategy to other games, which requires good knowledge of this strategy, we opted for the more deterministic formulation of a level-0 player. Repeating previous actions is also in line with the findings of Eyler et al. (2009). The level-1 computer agent thus expects their (human) opponent to repeat their previous action, and chooses the action that would beat this. The level-2 computer opponent assumes in turn that the participant is a level-1 player, playing according to the strategy just described. To make the computer opponent’s strategy not too obvious, we introduced some randomness in their actions, making them play randomly in 10% of all trials. Note that at all levels, the strategies are contingent on the actions taken in the previous round. The choice of this type of strategy is consistent with evidence that humans strategically use information from the last round of play of their opponents in zero-sum games (Batzilis et al., 2019; Wang et al., 2014). It is also rational to do so given that the opponent bases their play on the previous round, as was the case for our computer players (Jones and Zhang, 2004). Table 1 shows an example of the computer opponent’s actions in response to the previous round play.

Table 1 Example of how a level-1 and level-2 computer agent plays in response to actions taken in the previous round
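
For concreteness, a minimal sketch of how such an opponent could be implemented for RPS is given below. This is our own illustration of the strategy described above (including the 10% random play), not the code actually used in the experiment:

```python
import random

# Illustrative sketch of the level-1 / level-2 computer opponent in RPS:
# deterministic best responses to the previous round, playing uniformly at
# random in 10% of trials (and on the first round, when there is no history).
ACTIONS = ["R", "P", "S"]
BEATEN_BY = {"R": "P", "P": "S", "S": "R"}   # BEATEN_BY[x] beats x

def opponent_action(level, participant_last, opponent_last, p_random=0.1):
    if participant_last is None or random.random() < p_random:
        return random.choice(ACTIONS)
    if level == 1:
        # Expects the participant to repeat their last action, and beats it.
        return BEATEN_BY[participant_last]
    if level == 2:
        # Expects a level-1 participant, who would beat the opponent's own
        # last action; beats that anticipated best response.
        return BEATEN_BY[BEATEN_BY[opponent_last]]
    raise ValueError("level must be 1 or 2")
```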

Procedure

Participants were informed they would play three different games against the same computer opponent. Participants were told that the opponent cannot cheat and will choose its actions simultaneously with them, without prior knowledge of the participant’s choice. After providing informed consent and reading the instructions, participants answered a number of comprehension questions. They then played the three games against their opponent in the order RPS, FWG, and Numbers. An example of the interface for the RPS game is provided in Fig. 1. On each round, the human player chooses an action, and after a random delay (between 0.5 and 3 s) is shown the action chosen by the computer opponent and the outcome of that round. A total of 50 rounds of each game were played, with the player’s score displayed at the end of each game. The score was calculated as the number of wins minus the number of losses. Ties did not affect the score. In order to incentivise the participants to maximise the number of wins against their opponent, players were paid a bonus at the end of the experiment proportional to their final score (each point was worth £0.02). After playing all the games, participants were asked questions about their beliefs about the computer opponent, related to whether they thought they learned their opponent’s strategy, and how difficult they found playing against their opponent. They were then debriefed and thanked for their participation.

Fig. 1
figure 1

Screenshot of the Rock-Paper-Scissors game in Experiment 2. Shown here is the feedback stage, after both the human (left) and computer (right) players have chosen their action. The interface was similar in Experiment 1, but excluded the history of game play in the centre panel

Behavioural Results

Participants’ scores in each half of each game are depicted in Fig. 2. Overall, scores in each game were significantly higher than 0, the expected score of uniformly random play (RPS: M = 0.29, 95% CI [0.21, 0.37], t(51) = 7.26, p < .001; FWG: M = 0.45, 95% CI [0.36, 0.54], t(51) = 10.05, p < .001; Numbers: M = 0.31, 95% CI [0.22, 0.40], t(51) = 7.18, p < .001). As uniformly random play is the Nash equilibrium, this indicates successful deviation from a Nash-optimal strategy. Additional analysis (see Supplementary Information) indicates better performance against the level-1 compared to level-2 opponent. This indicates that participants may have found it more difficult to predict the actions of the more sophisticated level-2 opponent, even though both types of opponent are equally consistent and hence equally exploitable in principle.

Fig. 2
figure 2

Performance per game and block across conditions in Experiment 1. Points are scores of individual participants and boxes reflect the 95% confidence intervals of the mean (centre line equals the mean)

For an initial assessment of learning transfer, we focus on participants’ scores in the initial 5 rounds after the first round (rounds 2–6) of each game (see Fig. 3). We exclude the first round because the computer opponent played randomly on this round, so there was no opportunity yet for the human player to exploit their opponent’s strategy. Players with no knowledge of their opponent’s strategy would be expected to perform at chance level in these early rounds, whilst positive scores in rounds 2–6 are consistent with generalisation of prior experience. The early-round score in both FWG and Numbers is significantly higher than 0 (FWG: M = 0.24, 95% CI [0.12, 0.36], t(51) = 4.13, p < .001; Numbers: M = 0.15, 95% CI [0.07, 0.24], t(51) = 3.48, p = .001). We did not expect positive early scores for the RPS game, as it was the first game played and there was no opportunity for learning about the opponent’s strategy. Scores in this game were indeed not significantly different from 0 (M = 0.06, 95% CI [− 0.05, 0.18], t(51) = 1.08, p = .285). Additional analysis (see Supplementary Information) provided no evidence that learning transfer differed between the two types of opponent.

Fig. 3
figure 3

Performance in early rounds (2–6) per game and block across conditions in Experiment 1. Points are scores of individual participants and boxes reflect the 95% confidence intervals of the mean (centre line equals the mean)

Experiment 2

The results of Experiment 1 indicate that participants were able to learn successful strategies which exploited the deviation from Nash-optimal play of their opponents. Moreover, they were able to transfer knowledge about their opponent to later games. In Experiment 2, we aimed to obtain a stronger test of learning transfer. Instead of facing a single level-1 or level-2 opponent throughout all games, participants now faced both types of opponent. To perform well against both opponents, participants would need to learn distinct strategies against these opponents. To reduce effects of increased memory load due to facing distinct opponents, we provided participants with access to the history of play against an opponent within each game (see Fig. 1). Finally, we changed the third game to a penalty Shootout game, with participants aiming to score a goal and opponents playing the role of goalkeepers. Whilst this game has the same number of actions as the first two (aim left, centre, or right), it is strategically dissimilar. Unlike the Numbers game in Experiment 1, the Shootout game does not have a cyclical hierarchy between actions, making it harder to win through a heuristic based on this cyclicity.

Methods

Participants and Design

A total of 50 participants (21 females, 28 males, 1 unknown) were recruited via the Prolific platform, none of whom had taken part in Experiment 1. The average age was 30.2 years, and the mean duration to complete the task was 39 min. Participants received a fixed fee of £2.5 for completing the experiment and a performance-dependent bonus (£1.32 on average).

Tasks

Participants played three games: Rock-Paper-Scissors (RPS), Fire-Water-Grass (FWG), and the penalty Shootout game. The first two games were identical to the ones used in Experiment 1. In the Shootout game, participants took the role of a football (soccer) player in a penalty situation, with the computer opponent taking the role of the goalkeeper. Players choose between three actions: shooting the ball to the left, right, or centre of the goal. Similarly, the goalkeeper chooses between defending the left, right, or centre of the goal. If participants shoot in a different direction than where the goalkeeper defends, they win the round and the goalkeeper loses. Otherwise, the goalkeeper catches the ball and the player loses the round. There is no possibility of ties in this game. Figure 4 shows a snapshot of play in the Shootout game. What makes this game different from the other games is that there are now two ways to beat the opponent: if the shooter thinks their opponent is going to defend “right” in the next round, they can win by choosing to shoot either “left” or “centre”. A level-1 shooter who thinks that their goalkeeper opponent will repeat their last action thus has two possible best responses. A level-1 goalkeeper, however, has only a single best response (defending where their opponent aimed in the last round). A level-2 goalkeeper, who believes their opponent is a level-1 shooter, will however have two best responses. We programmed the level-2 computer player to choose randomly between these two best responses.

Fig. 4
figure 4

Screenshot of the shootout game
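
A minimal sketch of the goalkeeper strategies described above is given below. This is our own illustration, not the code used in the experiment; the direction labels and function names are ours:

```python
import random

# Illustrative sketch of the goalkeeper opponents in the Shootout game.
# Directions are labelled "L", "C", "R"; the shooter scores whenever the
# keeper defends a different direction.
DIRECTIONS = ["L", "C", "R"]

def keeper_action(level, shooter_last, keeper_last, p_random=0.1):
    if shooter_last is None or random.random() < p_random:
        return random.choice(DIRECTIONS)
    if level == 1:
        # Expects the shooter to repeat their last shot: defend that direction.
        return shooter_last
    if level == 2:
        # Expects a level-1 shooter, who avoids the keeper's previous direction;
        # either of the two remaining directions is an equally good defence.
        return random.choice([d for d in DIRECTIONS if d != keeper_last])
    raise ValueError("level must be 1 or 2")
```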

As in Experiment 1, all the games have a unique mixed-strategy Nash equilibrium consisting of uniformly random actions. If participants follow this strategy, or simply do not engage in learning how the opponent plays, they would score 0 on average against both level-1 and level-2 players. Evidence of sustained wins would indicate that participants have learned to exploit patterns in their opponents’ play.

Procedure

Participants played 3 games sequentially against both level-1 and level-2 computer opponents. As in Experiment 1, the computer opponents retained the same strategy throughout the 3 games. Participants faced each opponent twice in each game. Each game was divided into 4 stages, numbered 1 to 4, consisting of 20, 20, 10, and 10 rounds respectively, for a total of 60 rounds per game. Participants started by facing one of the opponents in stage 1, then the other in stage 2. This was repeated in the same order in stages 3 and 4. Which opponent they faced first was counterbalanced. All participants engaged in the three games (RPS, FWG and Shootout) in this exact order, and were aware that their opponent could not cheat and chose their action simultaneously with the player, without knowing their choices beforehand. In order to encourage participants to think about their next choice, a countdown timer of 3 s was introduced at the beginning of each round. During those 3 s, participants could not choose an action and had to wait for the timer to run out. A random delay between 0.5 and 3 s was again introduced before the choice of the computer agent was revealed, as a way of simulating a real human opponent’s decision time. After each round, participants were given detailed feedback about their opponent’s action and whether they won or lost the round. Further information about the history of play in previous rounds was also provided and participants could scroll down to recall the full history of each interaction against an opponent in a particular stage of a game. The numbers of wins, losses, and ties in each game were displayed at the top of the screen, and this scoreboard was reset to zero at the onset of a new stage.

Behavioural Results

Participants’ scores are depicted in Fig. 5. Average (adjusted) scores differed from 0 in all three games (RPS: M = 0.20, 95% CI [0.14, 0.27], t(49) = 6.26, p < .001; FWG: M = 0.28, 95% CI [0.20, 0.36], t(49) = 7.25, p < .001; Shootout: M = 0.35, 95% CI [0.30, 0.41], t(49) = 13.62, p < .001). As in Experiment 1, this indicates that participants successfully deviated from Nash-optimal (random) strategies. In contrast to Experiment 1, additional analysis (see Supplementary Information) showed no overall difference in performance between the two types of opponent. However, participants performed better against the level-2 than level-1 opponent in RPS, but better against the level-1 opponent in Shootout.

Fig. 5
figure 5

Performance per game and interaction across opponents in Experiment 2. Points are scores of individual participants and boxes reflect the 95% confidence intervals of the mean (centre line equals the mean)

As an initial assessment of transfer between games, we again focus on participants’ scores in rounds 2–6 of each game, for both stage 1 and stage 2, in which participants interact with each opponent for the first time in that game (see Fig. 6). We analyse these early-round scores separately for each stage, as early-round scores in the second stage could have benefited from experience in stage 1. Early-round scores differed significantly from 0 for both the FWG (stage 1: M = 0.36, 95% CI [0.24, 0.49], t(49) = 5.77, p < .001; stage 2: M = 0.19, 95% CI [0.07, 0.31], t(49) = 3.15, p = .003) and Shootout (stage 1: M = 0.30, 95% CI [0.20, 0.40], t(49) = 5.90, p < .001; stage 2: M = 0.28, 95% CI [0.17, 0.40], t(49) = 4.98, p < .001) games. As expected, early scores in stage 1 of RPS did not differ significantly from 0, M = 0.07, 95% CI [− 0.05, 0.19], t(49) = 1.16, p = .252. In stage 2, however, they did, M = 0.14, 95% CI [0.02, 0.26], t(49) = 2.28, p = .027. This indicates that the experience with the other opponent in stage 1 may have provided an advantage when playing against a different opponent in stage 2. Additional analysis (see Supplementary Information) indicates transfer to the Shootout game may have been easier for the level-1 than the level-2 opponent. No reliable effect of stage was found. The latter is important, because if early-round scores were due to general practice effects rather than transfer of an opponent model, we would expect early-round scores to increase between stage 1 and 2 in a game.

Fig. 6
figure 6

Performance in early rounds (2–6) per game and opponent in Experiment 2. Points are scores of individual participants and boxes reflect the 95% confidence intervals of the mean (centre line equals the mean)

Computational Modelling

The results of Experiment 2 confirm findings from Experiment 1 on learning transfer in a situation where participants need to learn about two distinct opponents. In both Experiment 1 and 2, participants adapted their strategies to level-1 and level-2 opponents to exploit deviations from Nash-optimal play. Moreover, from early-round performance, we found initial evidence that knowledge about both types of opponents was transferred to similar (FWG) and dissimilar (Numbers or Shootout) games.

To gain more insight into how participants adapted their strategies to their computer opponents, we constructed and tested several computational models of strategy learning. The baseline model assumes play is random, and each potential action is chosen with equal probability. Note that this corresponds to the Nash equilibrium strategy. The other models adapt their play to the opponent, either by reinforcing successful actions in each game (reinforcement learning), or by determining the type of opponent through Bayesian learning (Bayesian Cognitive Hierarchy models). We also include the (self-tuning) Experience-Weighted Attraction (EWA) model, which is a popular model in behavioural economics.

In the following, we will describe the models in more detail, and provide some intuition into how they implement learning about the game and/or the opponent. Throughout, we use the following notation: In each game g ∈ {RPS, FWG, Numbers, Shootout}, on each trial t, the participant chooses an action $a_{t} \in \mathcal{A}_{g}$, and the opponent chooses an action $o_{t} \in \mathcal{A}_{g}$, where $\mathcal{A}_{g}$ is the set of allowed actions in game g, e.g. $\mathcal{A}_{\text{RPS}} = \{R,P,S\}$. The participant then receives reward $r_{t} \in \{1, 0, -1\}$, and the opponent receives $-r_{t}$. We use the state variable $s_{t} = \{a_{t-1}, o_{t-1}\}$ to denote the actions taken in the previous round t − 1 by the participant and opponent. The initial state is empty, and we assume that the action at a first encounter of an opponent in a game is chosen at random.

Reinforcement Learning (RL) Model

We first consider a model-free reinforcement learning algorithm, where actions that have led to positive rewards are reinforced, and the likelihood of actions that led to a negative reward is lowered. Since the computer players in this experiment based their play on the actions in the previous round, a suitable RL model for this situation is one which learns the value of actions contingent on plays in the previous round, i.e. by defining the state st as above. The resulting RL model learns a Q-value (Watkins & Dayan, 1992) for each state-action pair:

$$Q_{t+1}(s_{t},a_{t}) = Q_{t}(s_{t},a_{t}) + \alpha \left( r_{t} - Q_{t}(s_{t},a_{t}) \right) ,$$

where $Q_{t}(s_{t},a_{t})$ is the value of taking action $a_{t}$ in state $s_{t}$ at time t, and α ∈ [0,1] is the learning rate. For instance, $Q_{t}(\{R,S\},P)$ denotes the value of taking action “Paper” this round if the player’s last action was “Rock” and the opponent played “Scissors”. Actions are taken according to a softmax rule:

$$ P_{t}(a|s_{t}) = \frac{\exp \{ \lambda Q_{t}(a,s_{t}) \}}{{\sum}_{a' \in \mathcal{A}_{g}} \exp \{\lambda Q_{t}(a',s_{t}) \}}, $$
(1)

where the inverse temperature parameter λ determines the consistency of the strategy (the higher λ, the more often the action with the highest Q-value is chosen). Whilst this RL model allows players to compute the values of actions conditional on past play, crucially, it will not be able to transfer learning between games, as each game has a different state space $\mathcal{S}_{g}$ and action space $\mathcal{A}_{g}$, and there is no simple way to map states and actions across games. The RL model has two free parameters: the learning rate (α) and the inverse temperature (λ).
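
As an illustration, a minimal implementation of this RL model might look as follows. This is a sketch under our own naming and with arbitrary example parameter values, not the code used for model fitting:

```python
import math
import random
from collections import defaultdict

# Sketch of the state-based RL model: Q-values over (previous-round play,
# action) pairs, updated with a delta rule and converted to choice
# probabilities with a softmax.
class QLearner:
    def __init__(self, actions, alpha=0.1, lam=3.0):
        self.actions = actions
        self.alpha = alpha            # learning rate
        self.lam = lam                # softmax inverse temperature
        self.Q = defaultdict(float)   # Q[(state, action)], initialised at 0

    def choice_probs(self, state):
        weights = [math.exp(self.lam * self.Q[(state, a)]) for a in self.actions]
        total = sum(weights)
        return [w / total for w in weights]

    def choose(self, state):
        return random.choices(self.actions, weights=self.choice_probs(state))[0]

    def update(self, state, action, reward):
        # Delta-rule update of the Q-value for the chosen state-action pair.
        self.Q[(state, action)] += self.alpha * (reward - self.Q[(state, action)])
```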

Experience-Weighted Attraction (EWA) Model

As discussed in the introductory section on computational models, the self-tuning Experience-Weighted Attraction (EWA) model combines two seemingly different approaches, namely reinforcement learning and belief learning. The EWA model is based on updating “Attractions” for each action over time given a particular state. The attraction of action a at time t given state s is denoted $Q_{t}(a,s)$ and updated as

$$Q_{t+1}(a,s) = \frac{\phi(t) \, N(t) \, Q_{t}(a,s) + \left[ \delta_{a}(t) + (1-\delta_{a}(t)) \, I(a_{t} = a) \right] R(a,o_{t})}{\phi(t) \, N(t) + 1} ,$$

where I(x) is an indicator function which takes the value 1 if its argument is true and 0 otherwise, and $R(a,o_{t})$ is the reward that would be obtained from playing action a against opponent action $o_{t}$. $R(a,o_{t})$ equals the actually obtained reward when $a = a_{t}$, and otherwise is the counterfactual reward that would have been obtained if a different action had been taken. Unlike reinforcement learning, this uses knowledge of the rules of the game to reinforce actions that were not actually taken by the rewards they would have provided. The parameter δ reflects the weight given to such counterfactual rewards. Setting δ = 0 leads to reinforcement only of actions taken, whilst positive values of δ make the update rule take foregone payoffs into account, which is similar to weighted fictitious play (Cheung & Friedman, 1994). N(t) represents an experience weight and can be interpreted as the number of “observation-equivalents” of past experience. We initialise it to 1 so that initial attractions and reinforcement from payoffs are weighted equally.

In the earlier version of the EWA model (Camerer & Ho, 1999), ϕ and δ were time-invariant free parameters. In the self-tuning EWA model (Ho et al., 2007), the values of δ and ϕ are learnt from experience. Over time, δ is updated as

$$\delta_{a}(t) = \left\{\begin{array}{ll} 1 & \text{if } R(a,o_{t}) \geq r_{t} \\ 0 & \text{otherwise} \end{array}\right.$$

The ϕ(t) parameter can be interpreted as a discount of prior experience, modelling either limited agent memory or changes in the game conditions. At its core, ϕ(t) depends on a surprise index $S_{p}(t)$:

$$\phi(t) = 1 - \frac{1}{2}S_{p}(t) ,$$

where $S_{p}(t)$ quantifies how much the opponent deviates from past play. It is calculated from the cumulative history of play across the opponent’s possible actions k, $h^{k}(t)$:

$$h^{k}(t)= \frac{ {\sum}_{\tau = 1}^{t} I(o_{\tau} = o^{k} )} {t}, $$

as well as an indicator vector of the most recent opponent play:

$$r^{k}(t) = I(o^{k}=o_{t}), $$

where I is the indicator function as defined above. To get the surprise index, we simply sum the squared deviations between the cumulative history vector $h^{k}(t)$ and the immediate history $r^{k}(t)$:

$$S_{p}(t) = \sum\limits_{k=1}^{|\mathcal{A}_{g}|} (h^{k}(t) - r^{k}(t))^{2} $$

For more details on the self-tuning EWA model, we refer the reader to Ho et al. (2007). As in the RL model above, actions are chosen based on a softmax decision rule (Eq. 1). The self-tuning EWA has one free parameter: the inverse temperature of the softmax decision rule (λ).
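
A sketch of the self-tuning EWA update, following the equations above, is given below. This is our own illustration (action selection would apply the same softmax as in Eq. 1 to the attractions, omitted here); the class and parameter names are ours:

```python
from collections import defaultdict

# Sketch of the self-tuning EWA update for a single observed round,
# following the equations above. `payoff(a, o)` must return the (possibly
# counterfactual) reward of playing action a against opponent action o.
class SelfTuningEWA:
    def __init__(self, actions, payoff):
        self.actions = actions
        self.payoff = payoff
        self.Q = defaultdict(float)       # attractions Q[(state, action)]
        self.N = 1.0                      # experience weight, initialised at 1
        self.history = defaultdict(int)   # cumulative counts of opponent actions
        self.t = 0

    def update(self, state, own_action, opp_action):
        self.t += 1
        self.history[opp_action] += 1
        # Surprise index: squared deviations between the cumulative history
        # of opponent play and an indicator of their most recent action.
        surprise = sum(
            (self.history[k] / self.t - (1.0 if k == opp_action else 0.0)) ** 2
            for k in self.actions)
        phi = 1.0 - 0.5 * surprise
        r_t = self.payoff(own_action, opp_action)     # actually obtained reward
        new_N = phi * self.N + 1.0
        for a in self.actions:
            R = self.payoff(a, opp_action)            # (counterfactual) payoff
            delta = 1.0 if R >= r_t else 0.0
            weight = delta + (1.0 - delta) * (1.0 if a == own_action else 0.0)
            self.Q[(state, a)] = (phi * self.N * self.Q[(state, a)] + weight * R) / new_N
        self.N = new_N
```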

Bayesian Cognitive Hierarchy (BCH) Model

In what we call the Bayesian Cognitive Hierarchy (BCH) model, the participant attempts to learn the type of opponent they are facing through Bayesian learning. For present purposes, we assume participants consider the opponent could be either a level 0, level 1, or level 2 player, and start with a prior belief that each of these types is equally likely. They then use observations of the opponent’s actions to infer a posterior probability of each type:

$$P(\text{level}=k | \mathcal{D}_{t}) \propto P(\mathcal{D}_{t}|\text{level}=k ) \times P(\text{level}=k) ,$$

where $\mathcal{D}_{t} = \{s_{1},\ldots,s_{t}\}$ is the data available at time t. The likelihood is defined as

$$P(\mathcal{D}_{t}|\text{level}=k) = \prod\limits_{j=1}^{t} \left( \theta \frac{1}{|\mathcal{A}_{g}|} + (1-\theta) f_{k}(o_{j}|s_{j})\right) ,$$

where $f_{k}(o_{t}|s_{t}) = 1$ if $o_{t}$ is the action taken by a level-k player when the previous round play was $s_{t} = (a_{t-1}, o_{t-1})$, and 0 otherwise. Note that the likelihood assumes (correctly) that there is a probability 𝜃 ∈ [0,1] that the opponent takes a random action. The posterior at time t − 1 forms the prior at time t. The probability that an action a is the best response on trial t is defined as:

$$B_{t}(a) = \sum\limits_{k = 0}^{2} \sum\limits_{o \in \mathcal{A}_{g}} b(a,o) f_{k}(o|s_{t}) P(\text{level}=k|\mathcal{D}_{t-1}), $$

where b(a,o) = 1 if action a is a best response to the opponent’s action o (i.e. it leads to a win), and 0 otherwise.

We assume participants choose actions through a softmax over these probabilities of best responses:

$$P_{t}(a|s_{t}) = \frac{\exp\{\lambda B_{t}(a) \}}{{\sum}_{a' \in \mathcal{A}_{g}} \exp \{ \lambda B_{t}(a')\}}.$$

Unlike the models above, the BCH model allows for between-game transfer, as knowledge of the level of the opponent can be used to generate predictions in games that have not been played before. This generalisation is done simply by using the posterior $P(\text{level}=k|\mathcal{D}_{T})$ from the final trial T of the previous game as the prior distribution in the next game. However, the participant might also assume that the level of reasoning of their opponent does not generalise over games. This would mean starting with a “fresh” prior P(level = k) at the start of each game. We hence distinguish between two versions of the BCH model. In the No-Between-Transfer (BCH_NT) variant, participants assume a uniform probability of the different levels at the start of each game (and hence do not transfer knowledge of their opponent between games). In the Between-Transfer (BCH_BT) variant, participants use the posterior probability over the levels of their opponent as the prior at the start of a new game (i.e. complete transfer of the knowledge of their opponent). Both versions of the BCH model have two free parameters: the assumed probability that the opponent chooses a random action (𝜃), and the inverse temperature parameter of the softmax function (λ).
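
A minimal sketch of the BCH updating scheme for an RPS-like game is given below. This is our own illustration with our own function names; it assumes the same opponent definitions as above (level-0 repeats, level-1 beats a repeating participant, level-2 beats a level-1 participant):

```python
# Sketch of the BCH updating scheme for an RPS-like game.
BEATEN_BY = {"R": "P", "P": "S", "S": "R"}   # BEATEN_BY[x] beats x

def predicted_opponent_action(level, my_last, opp_last):
    if level == 0:
        return opp_last                        # repeats their own last action
    if level == 1:
        return BEATEN_BY[my_last]              # beats my (assumed repeated) action
    return BEATEN_BY[BEATEN_BY[opp_last]]      # level 2: beats a level-1 me

def update_posterior(posterior, my_last, opp_last, observed, theta=0.1):
    """One Bayesian update of P(level | data) after observing the opponent."""
    n_actions = len(BEATEN_BY)
    new_post = {}
    for level, prior in posterior.items():
        predicted = predicted_opponent_action(level, my_last, opp_last)
        likelihood = theta / n_actions + (1 - theta) * (1.0 if observed == predicted else 0.0)
        new_post[level] = prior * likelihood
    norm = sum(new_post.values())
    return {k: v / norm for k, v in new_post.items()}

def best_response_probs(posterior, my_last, opp_last):
    """B_t(a): probability that each action is a best response, fed into Eq. 1."""
    scores = {a: 0.0 for a in BEATEN_BY}
    for level, p in posterior.items():
        predicted = predicted_opponent_action(level, my_last, opp_last)
        scores[BEATEN_BY[predicted]] += p      # b(a, o) = 1 iff a beats o
    return scores
```

Starting a new game with the posterior carried over from the previous game corresponds to the BCH_BT variant; resetting to a uniform prior at the start of each game corresponds to the no-between-transfer variant.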

Estimation

For both experiments, we fitted all models to individual participant data by maximum likelihood estimation using the DEoptim R package (Mullen et al., 2011). We use the Bayesian Information Criterion (BIC) to determine the best fitting model for each participant. For Experiment 1, we fitted a total of 5 models: a baseline model assuming random play (Nash), the Bayesian Cognitive Hierarchy model allowing transfer between games (BCH_BT) and without transfer between games (BCH_NT), a reinforcement learning model (RL), and finally a self-tuning EWA model with the same state space (EWA). In Experiment 2, because participants interacted with each opponent twice within each game, we need to distinguish between two types of opponent model transfer. We can have transfer within games, between the first and second interaction with an opponent. In addition, we can have transfer between games, e.g. transferring a learned opponent model from RPS to FWG. We therefore fitted a total of three versions of the Bayesian Cognitive Hierarchy model: BCH_BT allows for both within- and between-game transfer (between-game transfer without within-game transfer is implausible); BCH_WT allows only for within-game transfer, but not between-game transfer; BCH_NT allows for no transfer within or between games. As the RL model cannot account for between-game transfer due to the change in state and action space, we only include versions that allow for within-game transfer (RL_WT) or no transfer within games (RL_NT). Likewise, we fit self-tuning EWA models with (EWA_WT) and without (EWA_NT) within-game transfer. Including the baseline model of random play (Nash), we therefore fitted a total of 8 models for Experiment 2.
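
For reference, the BIC used for model comparison combines a model’s maximised log-likelihood with a penalty for its number of free parameters; a minimal sketch (our own illustration, with hypothetical numbers):

```python
import math

# Sketch of the model comparison criterion: BIC computed from a model's
# maximised log-likelihood, its number of free parameters, and the number
# of observed choices. The numbers below are hypothetical.
def bic(log_likelihood, n_params, n_observations):
    return n_params * math.log(n_observations) - 2.0 * log_likelihood

print(bic(log_likelihood=-140.0, n_params=2, n_observations=150))
```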

Model Comparison

When considering which single model fits participants’ behaviour over the whole experiment best, we find that in both experiments the RL model outperforms the other models, followed by the random (Nash) model (Fig. 7). Relatively few participants were best described by one of the BCH models or the EWA model.

Fig. 7
figure 7

Number of participants best fitted by each model in Experiment 1 and 2 according to the BIC. Results are separated by opponent (Experiment 1) and which opponent was faced first (Experiment 2)

As the RL models do not transfer learning between games, this is at odds with the behavioural results, which showed evidence for such transfer. In order to investigate this discrepancy further, we plot the likelihood by trial for each game and three strategies: reinforcement learning (RL), Bayesian Cognitive Hierarchy with between-game transfer (BCH_BT), and the random (Nash) strategy. Figure 8 shows that in the initial rounds of the later games in Experiment 1, the likelihood for the BCH model is higher than that of the other two models. However, over time, the likelihood of the RL model increases and exceeds that of the BCH model. The same pattern holds for Experiment 2 (Fig. 9). Again, the BCH_BT model with between-game transfer has the highest likelihood in the early stages of the later games (apart from stage 1 of the Shootout game, where the RL model is better). In later rounds, the likelihood of the RL model exceeds that of the BCH model.

Fig. 8
figure 8

Average likelihood of the random (Nash), reinforcement learning (RL) and Bayesian Cognitive Hierarchy model by trial, game, and opponent faced in Experiment 1

Fig. 9
figure 9

Average likelihood of the random (Nash), reinforcement learning (RL) and Bayesian Cognitive Hierarchy model by trial, stage, and game in Experiment 2

The fact that the likelihoods of the main strategies considered cross over in both experiments is consistent with participants switching between strategies as the games progress. According to this interpretation, participants base their responses in early rounds on the learned level of their opponent’s iterative reasoning, switching later to learning good actions through reinforcement. We did not initially predict this strategy switching. Whilst the likelihoods are suggestive, we sought an approach to formally test for strategy switching.

Hidden Markov Model Analysis of Strategy Switching

We use hidden Markov models (HMMs) to test for strategy switching in participants’ play. In these models, the three strategies (RL, BCH with between-game transfer, and Nash) correspond to latent states which determine the overt responses (actions chosen). The models allow for switching between the states (i.e. strategies) over time. Hidden Markov models assume that the observable action at time t depends on a latent state at time t, which in turn depends on the latent state at the previous time t − 1. The model is specified by the state-conditional action distributions (here provided by the likelihoods of the fitted models), an initial state distribution (the distribution over the strategies in the initial round), and the state-transition probabilities (the probabilities of switching from one state/strategy to another). Initial state probabilities and transition probabilities were estimated with the depmixS4 package (Visser and Speekenbrink, 2010). As a statistical test of strategy switching, we compare the hidden Markov model to a constrained version in which the probability of switching from one strategy to a different one is 0. This constrained model thus assumes that when players start with a particular strategy, they continue using it throughout the experiment.
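
For intuition, the likelihood of such a hidden Markov model can be computed with the standard forward algorithm, using the per-trial likelihoods of the fitted models as state-conditional (“emission”) probabilities. The sketch below is our own illustration, not the depmixS4 implementation:

```python
import math

# Sketch of the forward algorithm for the strategy-switching HMM: the
# "emission" probability of a trial under each latent strategy is the
# per-trial likelihood of the corresponding fitted model (RL, BCH, Nash).
def hmm_log_likelihood(emissions, init_probs, trans_probs):
    """emissions: list over trials of dicts {strategy: P(action | strategy)}."""
    strategies = list(init_probs)
    # Initialise with the initial state distribution times the first emission.
    alpha = {s: init_probs[s] * emissions[0][s] for s in strategies}
    log_lik = 0.0
    for t in range(1, len(emissions) + 1):
        norm = sum(alpha.values())          # scaling for numerical stability
        log_lik += math.log(norm)
        alpha = {s: alpha[s] / norm for s in strategies}
        if t == len(emissions):
            break
        # Propagate through the transition matrix and weight by the next emission.
        alpha = {s: emissions[t][s] * sum(alpha[r] * trans_probs[r][s] for r in strategies)
                 for s in strategies}
    return log_lik
```

Constraining the transition matrix so that all off-diagonal switching probabilities are 0 gives the comparison model without strategy switching.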

In Experiment 1, a likelihood-ratio test shows that the HMM with switching fits significantly better than the non-switching one, χ2(6) = 211.09, p < .001. We should note that as the no-switch model restricts parameters of the switching model to the bounds of the parameter space, the p-value of this test may not correspond to the true Type-1 error rate. We therefore also consider model selection criteria. These also show that the switching model (AIC = 14636.87, BIC = 14692.56) fits better than the non-switching model (AIC = 14835.96, BIC = 14849.88). This provides further statistical evidence in favour of the hypothesis that participants switch between strategies. Figure 10 depicts the average posterior probability of each strategy, as a function of trial and opponent faced. As can be seen, there is evidence of strategy switching in the FWG and Numbers games: Initially, participants appear to use a random strategy (in the first round of a game, there is no way to predict the opponent’s action), after which the BCH strategy becomes dominant. In the later rounds of the games, however, the RL strategy becomes dominant.

Fig. 10
figure 10

Average posterior probability of each strategy by trial, game and opponent faced, in Experiment 1

In Experiment 2, the switching model (AIC = 16879.97, BIC = 16936.81) again fits better than the restricted non-switching model (AIC = 16938.41, BIC = 16952.62), χ2(6) = 70.44, p < .001. The posterior probabilities of the strategies (Fig. 11) show very clear evidence of strategy switching across games and stages, from using a BCH model in the initial rounds to an RL strategy later on.

Fig. 11
figure 11

Average posterior probability of each strategy by trial, stage, and game in Experiment 2

Discussion

In this study, we investigated human learning transfer across games by making human participants play against computer agents with limited levels of iterated reasoning. We were interested in whether participants learn about the strategy of their opponent and transfer such knowledge between games.

The results of our first experiment show that the majority of participants learnt to adapt to their opponent’s strategy over multiple rounds of the Rock-Paper-Scissors (RPS) game, and generalised this learning to a similar (Fire-Water-Grass) and less similar (Numbers) game. In the second experiment, participants faced both types of opponents, allowing for a stronger test of opponent modelling, as participants would need to learn a different strategy for each opponent. In Experiment 1, participants could learn a single strategy for each game, making opponent modelling possibly less pertinent. In Experiment 2, we again found clear evidence of transfer in the early rounds of the later games.

What exactly did the players learn in the first game (RPS) that allowed them to beat their opponent in the later games (FWG, and Numbers or Shootout)? What did the players learn specifically about their opponent’s strategy and what form did this learning take? One possible answer is that participants learned simple rules based on last round play. For instance, “play Scissors whenever my opponent played Rock in the last round”, or “play Paper whenever the last round play was either Rock or Scissors”. These are the type of strategies that are learned by the model-free reinforcement learning model we used in our computational modelling. Whilst this strategy fitted participants’ actions the best overall, there are at least two reasons why this account is not satisfactory as a complete description of participants’ behaviour. Firstly, the learned strategies are not transferable to new games. There is no simple way to map “play Scissors whenever my opponent played Rock in the last round” in the RPS game to “play Grass whenever my opponent played Fire in the last round”. Such a mapping may be possible by translating the rules and structure from RPS to FWG, but model-free reinforcement learning lacks the tools to do this: Model-free reinforcement learning would need to start from scratch in each new game, yet we found evidence that participants could successfully exploit their opponent’s strategy in early rounds of new games. Secondly, a reinforcement learning strategy would fare equally well against the level-1 and level-2 opponent. Although the specific winning actions differ, the contingency between the state (last round play) and the rewarded action is equally predictable for both opponents. Yet, in Experiment 1, we found that participants performed better against the level-1 opponent compared to the level-2 opponent. The difference in performance between the two types of opponent indicates that the actions of the more sophisticated level-2 opponent, or the best response to these, were somehow more difficult to predict.

We are left with two possible explanations. First, it is possible that participants discovered a heuristic that allowed them to beat their opponent without explicitly modelling their strategy, and that this heuristic is transferable to new games. Because of the cyclicity in action choices (e.g. Rock beats Scissors, Scissors beats Paper, Paper beats Rock), it is possible to beat a level-2 opponent most of the time by following a simple rule: Play in the next round whatever the opponent played in the last round. This rule wins and is transferable to other games, as it does not depend on action labels. In the same vein, a heuristic that beats a level-1 player can be stated as “Choose the action that would have been beaten by my previous action”. Intuitively, the heuristic for beating a level-2 player seems simpler than that for beating a level-1 player, which is at odds with the finding that participants performed better against the level-1 opponent.

A second explanation is that participants engaged in iterative reasoning, inferring their opponent’s beliefs and countering the resulting actions. For instance, this would be reasoning of the form “My opponent expects me to repeat my last action, choosing an action that would beat my last action. Hence, I will choose the action that beats their best response” or “My opponent thinks I expect them to repeat their action, hence expecting me to choose the action that beats their last action. They will therefore choose the action that beats this one, and hence I should choose the action that beats their best response”. Beating a level-1 player, in this account, requires being a level-2 player, and beating a level-2 player requires being a level-3 player. Intuitively, the additional step of iterative reasoning involved in beating a level-2 player makes the level-3 strategy more demanding and difficult to implement, which is consistent with the lower performance against the level-2 opponent. We did not find a clear difference in performance against the two opponents in Experiment 2, where participants faced both consecutively. Learning about the distinct strategies of two different opponents is likely more difficult than learning about a single opponent, and switching between opponents may place additional demands on memory and impose task-switching costs. These factors may have masked the performance difference found in Experiment 1. Nevertheless, we found that participants learned to successfully exploit the strategies of both opponents in Experiment 2, and to transfer their learning to new games.

The difference in performance against the two opponents, coupled with the evidence for successful transfer, points to participants engaging in iterative reasoning and learning something useful about their opponent’s limitations in this regard. This is the type of learning encapsulated by our Bayesian Cognitive Hierarchy model. It involves the evaluation of explicit hypotheses and results in better problem-solving skills (Mandler, 2004). Since it is less context dependent, this type of learning is generalisable to new situations, akin to the more general framework of rule-based learning explored by Stahl (2000, 2003). When asked to describe the strategy of their opponents at the end of the study, many participants made reference to a form of recursive reasoning, which provides additional support for this account.

We admit that our implementation of the BCH models does not predict a performance difference between the types of opponents. Starting with an equal prior belief over the different levels of sophistication, a BCH player would perform equally well against the level-1 and level-2 opponent. There are two routes to explain the difference in performance. Firstly, prior beliefs might be biased against higher-level opponents (i.e. participants might have believed it more likely that they would face a level-1 opponent than a level-2 opponent). Secondly, if the actions of a level-2 opponent are more difficult to predict than those of a level-1 opponent, this might introduce more noise in the likelihood of the opponent’s actions given their level of sophistication. Either of these mechanisms would explain why learning the strategy of the level-2 opponent is more difficult and slower than learning the strategy of the level-1 opponent.

Using hidden Markov models, we found evidence of strategy switching between the BCH and RL strategies, and such switching seems more consistent with the latter idea. If predicting an opponent’s actions through iterative reasoning is cognitively demanding and error-prone, it is resource-rational to switch to less costly yet equally successful strategies when these are available (Lieder & Griffiths, 2020). Initially, a model-free reinforcement learning strategy will be less successful than an iterative reasoning one. However, given enough experience, it will be on par with an iterative reasoning strategy. As it involves simple state-action contingencies, a model-free RL strategy may also be computationally less costly, making it overall more effective to rely on this than iterative reasoning. This is similar to the arbitration between model-free and model-based RL (Daw et al., 2005; Simon & Daw, 2011; Kool et al., 2011). In repeated and well-practised situations, relying on habits allows one to save cognitive resources for other demands. However, when the environment—or game—changes, it is prudent to use all available resources to reorient oneself.

A limitation of the design of both experiments is that the order of the games was not counterbalanced. We chose the order used as we believed it would be most conducive to learning the strategy of the opponent. Starting with a familiar game (RPS) would allow participants to focus on how their opponent plays, rather than also having to learn the rules of the game. A new but isomorphic game (FWG) allows for the most straightforward transfer. We reasoned that transfer to a less similar game (Numbers or Shootout), a stronger test of generalisation, would be more likely given sufficient experience in two similar games. Determining the level of iterative reasoning in these latter games is more difficult (the Numbers game has many ties, whilst in Shootout actions are often consistent with different level-k opponents). Counterbalancing the order of the games would mean many participants would start with these more difficult games. Whilst we believe these games are good candidates for testing generalisation of previously learned opponent models, they are poor environments to learn them in the first place. Since participants cannot transfer what they have not learned, we expect that starting with these games might prohibit learning opponent models and thus their transfer. However, we cannot rule out that part of the transfer effects found for early-round scores is due to general practice effects, or differences in the difficulty of the games. We find it unlikely that the transfer effects we found are entirely due to such confounded practice or difficulty effects. Firstly, general practice or difficulty effects should not depend on the level of the opponent. However, we found evidence that overall scores (Experiment 1) and early-round scores (Experiment 2) differed between the two types of opponents. Secondly, practice effects would be evident within a game, but we found no main effect of stage on early-round scores in Experiment 2.

Conclusion

In two experiments we found that people successfully deviate from Nash equilibrium play to exploit deviations from such play by their opponents. Moreover, people can transfer knowledge about the underlying limitations of their opponents’ degree of iterative reasoning to new situations. Within games, we found evidence that people start new games with a reasoning-based strategy that allows them to transfer knowledge from other games. After gaining sufficient experience with the game, they move to a reinforcement learning strategy which does not require reasoning about the beliefs of their opponent. This is consistent with a resource-rational trade-off between the goal of maximising performance and minimising the cost of computing the best possible actions.