Modeling opponent learning in multiagent repeated games

Multiagent reinforcement learning (MARL) has been used extensively in game environments. One of the main challenges in MARL is that the environment of the agent system is dynamic, and the other agents are also updating their strategies. Therefore, modeling the opponents’ learning process and adopting specific strategies to shape learning is an effective way to obtain better training results. Previous studies such as DRON, LOLA and SOS approximated the opponent’s learning process and gave effective applications. However, these studies modeled only transient changes in opponent strategies and lacked stability in the improvement of equilibrium efficiency. In this article, we design the MOL (modeling opponent learning) method based on the Stackelberg game. We use best response theory to approximate the opponents’ preferences for different actions and explore stable equilibria with higher rewards. We find that MOL achieves better results in several games with classical structures (the Prisoner’s Dilemma, Stackelberg Leader game and Stag Hunt with 3 players), and in randomly generated bimatrix games. MOL performs well in competitive games played against different opponents and converges to stable points that score above the Nash equilibrium in repeated game environments. The results may provide a reference for the definition of equilibrium in multiagent reinforcement learning systems, and contribute to the design of learning objectives in MARL to avoid locally disadvantageous equilibria and improve general efficiency.


Congying Han (hancy@ucas.ac.cn), School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China

Introduction
The interaction and learning process of multiple agents in game environments has been an important area of research. However, current learning algorithms for agents in noncooperative game environments usually lack generalization capabilities. Before the proposal of machine learning and reinforcement learning, this topic was usually discussed as part of game theory. The Theory of Learning in Games [1] summarized the early relevant results. It mainly consisted of the method of updating strategies using fictitious play [2] and the application of stochastic dynamical systems theory to explain the learning process [3]. SCE (self-confirming equilibrium) [4], as an extension of the Nash equilibrium, is usually considered the convergence result of this learning process [5]. Equilibrium selection theory [6] noted the core difference between the theory of learning and classical games: the existence of nonstationary equilibrium sets can serve as an asymptotic description for the long-term behavior of a system. Studies of this topic were usually based on equilibrium as well as convergence, but numerical experiments were difficult to perform due to technical limitations.
In recent studies, there have been good applications of deep reinforcement learning in multiagent systems [7], particularly in cooperation scenarios [8]. Agent communication has yielded many achievements in multiagent cooperative systems [9,10]. The goal is to achieve the efficient operation of the system through higher-order communication between agents [11]. Another important area of multiagent cooperation is value function decomposition [12]. QMIX [13] combined with deep Q-learning has been successful in more complex tasks. It also inspired many subsequent works on value function decomposition [14,15].
However, CTDE (centralized training decentralized execution) class algorithms [16,17] can only be applied to cooperative tasks. For studies of competitive or mixed environments, progress is more concentrated around zero-sum games. A famous study concerned the AlphaZero algorithm [18], which beat the top human players in Go. The algorithm was able to learn the Nash equilibrium strategy in a complete information game. Texas Hold'em, as a typical example of an incomplete information game, was also solved with the introduction of the CFR (Counterfactual Regret Minimization) [19,20] algorithm. Recent research includes the application of DMC (deep Monte Carlo) algorithms to solve the traditional Chinese game Fight the Landlord, which involves both cooperation and competition [21,22].
Multiagent systems typically train strategies with repeated games. Unlike the single-agent environment, the reward received by each agent for taking an action in the game is influenced by the actions of other agents. Therefore, the difficulty is in making the learning process adapt to the non-Markovian environment [23] and avoid poorer local solutions [24]. Modeling other agents' strategies is necessary when there are no shared targets or communication. DRON [25] is a method for estimating opponent strategies using historical samples, but its shortcoming is that it ignores changes in the opponent's strategy. LOLA [26] was the first method to attempt to model and shape the learning process of opponents, and it had success in classical games such as the Prisoner's Dilemma. This inspired subsequent work on opponent shaping (CGD [27], COLA [28]). However, these algorithms focus only on transient changes in the opponent's strategy and usually require all agents to adopt the same learning behavior. Their advantages are that they are easy to implement and have good convergence guarantees for the Nash equilibrium. The problem is that short-term modeling may lead to local results such as an inferior Nash equilibrium. A balance between instantaneous rewards and opponent shaping is difficult to achieve.
In this paper, we design an algorithm based on MOL (modeling opponent learning), which uses best response theory to estimate the stable outcome of the opponent learning process and adopts specific strategies. MOL is executed in two phases. The goal of the first phase is to explore the opponent's preferences for different actions, and the goal of the second phase is to learn a stable equilibrium with higher rewards. We find that MOL has good convergence properties, and achieves good results in some classical games (the Prisoner's Dilemma game, Stackelberg Leader game [29], Matching Pennies game and Tandem game) as well as in randomly generated bimatrix games. Most of the algorithms that can be applied to two-agent games will fail in scenarios with more agents. However, in the multiagent Stag Hunt experiment we found that MOL can also be effective in scenarios with more than two agents. MOL also achieved higher average rewards in a competitive game environment and simultaneously improved the social welfare of the system. These experiments show that MOL improves over the original algorithms when dealing with diverse game environments or opponents.
The main contributions of this article are summarized as follows.
1) This paper is the first to propose a MARL algorithm (MOL) from the perspective of modeling the long-term learning process of the opponent. MOL not only improves efficiency but also avoids using private information from other agents in the training process, which shows that the algorithm is efficient and generalizable.
2) We provide theoretical support for the convergence results of the two stages of MOL. The reward of the stable outcome obtained by MOL will not be lower than that of a Nash equilibrium.
3) For the first time, we introduce a competitive experiment in evaluating the MARL algorithm, where many algorithms based on opponent information cannot be applied. Our algorithm achieves good results in this experiment as well.
The remainder of this article is organized as follows. Section 2 introduces related works. Section 3 provides basic concepts about repeated games, learning systems, and opponent modeling to help explain the design of MOL. In Section 4, we describe in detail how the two stages of the MOL algorithm work. The experimental evaluation and results are discussed in Section 5. Finally, Section 6 concludes this work.

Related work
The application of reinforcement learning in decentralized multiagent systems has become a popular research area [30][31][32]. Algorithms applied to adaptive strategy learning in a noncooperative game environment have achieved remarkable success. The convergence of these algorithms in a general game environment was the main early result. WoLF (Win or Learn Fast) [33] successfully achieved convergence to the Nash equilibrium in 2-dimensional matrix games. The AWESOME (Adapt When Everybody is Stationary, Otherwise Move to Equilibrium) [34] algorithm extended this result to the high-dimensional case. In both of these algorithms, convergence is achieved by starting from a stochastic initialization and improving the strategy in the direction of equilibrium. The Nash equilibrium has good stability and usually corresponds to dominant strategies. Therefore, these algorithms can learn the best response to a static opponent, and converge to the Nash equilibrium in self-play.
To cope with the diversity of opponent models in practical applications, theoretical research based on opponent modeling has made new progress. Unlike WoLF, these algorithms need to build a predictive model of the opponent's strategy and select actions based on that model. DRON [25] based on DQN (deep Q-network) [35] was an early attempt to estimate the opponent's strategy using neural networks [36,37]. However, DRON has limitations because it cannot predict changes in the opponent's strategy. The efficiency improvement over DQN is minimal, and the algorithm can only be applied in certain specific environments. Similar works have predicted the reward functions of other agents (bottom-up MARL) [38].
A number of subsequent studies attempted to model and shape the learning process of other agents. LOLA [26] is one of the works that have made landmark breakthroughs. By constructing a gradient algorithm based on a first-order differential approximation of the opponent's strategy updates, LOLA achieves cooperation in the Prisoner's Dilemma game. The SOS [39] (stable opponent shaping) algorithm combines LOLA with LookAhead [40] from the perspective of differential games to obtain better convergence results. The subsequent work COLA [28] improved this algorithm, and the core of its effectiveness is to guarantee the consistency of the agent's estimation of the opponent's learning process. Combining it with the trust region algorithm MATRL [41] and the meta-game algorithm Meta-MAPG [42] has also yielded good results. This class of methods also includes SOM [43] and PGP [44]. All of these methods use the propagation of gradients with approximations to changes in the opponent's strategy.
Although these algorithms have achieved good results in many game environments, we point out two aspects that could be improved. The first is that most of these approaches are based on a model of instantaneous changes in the opponent's strategy, and the convergence result depends heavily on the consistency of higher-order lookahead. This makes the efficiency of the algorithm sensitive to the parameter settings and the opponent's strategy update process. In addition, gradient-based algorithms require the private information of other agents (the payoff matrix, strategies and even the Hessian matrix). Using this information makes multiagent systems centralized or cooperative. This can lead to some restrictions on the environment in which the algorithm can be applied. Comparisons between prior algorithms and MOL are listed in Table 1.

Repeated game environments and agent learning systems
In this section, we present definitions of repeated games, agent learning systems, and optimization objectives. We also illustrate the design of MOL by comparing it with other algorithms.

Repeated games with learning systems
Different definitions of concepts such as a state and action space in MDP (Markov Decision Process) may lead to different results for the algorithm. Therefore, we will introduce the definition of a repeated game based on the Markov game.
A Markov game is specified by a tuple $G = \langle S, U, P, R, O, n, \gamma \rangle$, which consists of $n$ agents. In the current state $s_i \in S$, each agent chooses its own action $a_i \in A$. The obtained joint action $u \in U$ leads to a state transition $P(s' \mid s, u) : S \times U \times S \to [0, 1]$ and assigns different rewards $r_i$ to each agent as well as the observation $o_i \in O$. $\gamma \in [0, 1)$ is the discount factor.
A repeated game can be viewed as the game above repeated as time $t \to \infty$. For round $t$ of the game, the observation $o_i^t$ obtained by agent $i$ is the joint action $u^t$ as well as its own reward $r_i^t$ (the reward is private information). The state is the history of observations, $s_i^t = (o_i^1, \cdots, o_i^{t-1})$. To avoid an infinite increase in the dimension of the state space, we assume that $s_i^t = o_i^{t-1}$. For each agent, the state transition function is determined by the joint action composed of all agents and its own reward: $s_i^{t+1} = o_i^t = (u^t, r_i^t)$. The general reinforcement learning algorithm is not applicable to the above non-Markovian game system because the state transition function is uncertain. The iteration process of the RL algorithm is a combination of the instantaneous reward and the reward expectation at the next moment: where $s_{t+1} \sim P(s_t, a_t)$. However, in a multiagent system, since the reward in each round is influenced by the actions of other agents, $A^-_{t+1}$ may be unsolvable (because other agents are updating their strategies). If we consider a reward-independent repeated game (the reward depends only on joint actions), the above equation becomes: which is influenced by subsequent rewards (the strategies of other agents at the next moment, $A^-_{t+1} = A^-_{t+1}(a_t, A^-_t)$).

Table 1 (notes): INRL is a class of algorithms that is summarized in [26]. The Tandem game is an example commonly used in opponent shaping to test the stability of algorithms. In this paper, we also consider whether the algorithm uses the private information of other agents (reward functions, true strategies) and whether it allows other agents to adopt arbitrary learning methods.

Strategies and equilibrium
There are two types of strategies: mixed and pure strategies. A mixed strategy is actually a distribution built on the set of actions $A = \{a_1, \cdots, a_n\}$, denoted as $f = (p_{a_1}, \cdots, p_{a_n})$, where $\sum_i p_{a_i} = 1$. A pure strategy can be seen as a special form of mixed strategy, where only one element in the action set ($a_i$) has a nonzero positive probability: $p_{a_i} = 1$ and $p_{a_j} = 0$ for $j \neq i$. In this paper, we will train a pure strategy for the agent instead of a mixed strategy. This is because our algorithm is based on the exploration of the best response, which is a pure strategy in most cases. Modeling pure strategies can significantly reduce the dimension of the prediction space. Additionally, training mixed strategies requires gradient propagation in most cases, which uses the private information of opponents (the reward function or Hessian matrix) [45]. This article uses the same assumptions as in the AWESOME [34] algorithm: only the actions of all agents and their own rewards are observed. We also perform an analysis of the algorithm performance based on experiments in which only a mixed strategy equilibrium exists.
The equilibrium is used to characterize the stabilization point or convergence result in a game. The most important definition is the Nash equilibrium (NE), which is also the training objective of most algorithms. The NE is defined as a strategy combination such that each agent's strategy is the best response to the other agents' strategies. It is denoted as: $r_i(f_i^*, f_{-i}^*) \geq r_i(f_i, f_{-i}^*)$ for every agent $i$ and every strategy $f_i$, where $f_{-i}^*$ is the joint strategy of the agents other than agent $i$. Similarly, when all agents' strategies are pure strategies and are the best responses to each other, we have: $r_i(a_i^*, a_{-i}^*) \geq r_i(a_i, a_{-i}^*)$ for every agent $i$ and every action $a_i$ (6), and the joint action $a^*$ is called a PSNE (pure strategy Nash equilibrium). In 1996, Athey proved that when a game satisfies the SCC (Single Crossing Condition), it has a PSNE [46]. Other approaches to strategy selection will lead to different equilibrium definitions, such as the correlated equilibrium and Pareto equilibrium. The application of the correlated equilibrium in meta-games [47] provided a good solver for multiplayer general-sum games.
In classical game theory, an important example of agent learning is the Stackelberg Leader game, which provides the possible convergence results when the agent knows that other players update their strategies by learning. It contributes to the promotion of cooperative behavior in repeated games [48].
According to the theory of the Stackelberg Leader game (the game structure is given in Table 2), when the row player knows that its opponent is engaged in a learning process, it will find that choosing action U will lead the opponent to an equilibrium (U,R), while choosing action D will lead to (D,L). Based on this consideration, the row player will choose U and thus ensure a reward of 3 for itself at convergence. This means that (U,R) is more reasonable as an equilibrium of an agent system with learning capability, although (D,L) is a unique Nash equilibrium point.
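The leader's reasoning can be sketched numerically. Table 2 is not reproduced in this text, so the payoff matrices below are an assumed instance chosen only to match the description (a unique Nash equilibrium at (D,L) and a row reward of 3 at (U,R)); the function names are illustrative and not part of MOL.

```python
import numpy as np

# Assumed payoffs (hypothetical stand-in for Table 2).
# Rows are the row player's actions (U, D); columns are the opponent's (L, R).
ROW_REWARD = np.array([[1, 3],
                       [2, 4]])
COL_REWARD = np.array([[0, 2],
                       [1, 0]])
ROW_ACTIONS = ("U", "D")
COL_ACTIONS = ("L", "R")


def learned_response(row_action: int) -> int:
    """Column action a learning opponent settles on if the row action stays fixed."""
    return int(np.argmax(COL_REWARD[row_action]))


def stable_outcome(row_action: int):
    """Joint action and row reward once the opponent has adapted to row_action."""
    col_action = learned_response(row_action)
    joint = (ROW_ACTIONS[row_action], COL_ACTIONS[col_action])
    return joint, int(ROW_REWARD[row_action, col_action])


def leader_action() -> str:
    """Row action with the highest reward at the opponent's learned response."""
    best = max(range(len(ROW_ACTIONS)), key=lambda a: stable_outcome(a)[1])
    return ROW_ACTIONS[best]


for a in range(len(ROW_ACTIONS)):
    print(stable_outcome(a))
print("leader plays", leader_action())
```

In this assumed instance D is the myopically better row action against either column, which is why (D,L) is the Nash equilibrium, yet holding U fixed yields the higher stable reward.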

Optimization objectives
In this section, we introduce the optimization objectives of agents for different algorithms and give our motivation for designing algorithms based on the best responses. For agent $i$ in a repeated game, the predicted accumulated reward for action $a_t$ is a combination of the instantaneous reward and subsequent reward expectations: Since the opponents' strategies $A^-_t, A^-_{t+1}$ are updated, the agent needs to approximate the above optimization objective. A direct approximation is obtained by assuming that $\hat{A}^-_{t+m}$ ($m > 0$) is independent of $a_t$; then we have: The agent is optimized for instantaneous rewards, and it only needs to estimate the opponent's current strategy. Both the INRL and LOLA algorithms are based on this assumption.
INRL [26]: This is also called a naive learner. The agent optimizes its own reward by choosing the best response, assuming that the opponent's actions remain unchanged: $a^i_{t+1} = \arg\max_{a} r_i(a, A^-_t)$. We take a 2-player game as an example. Assuming the joint action of agents 1 and 2 at the current moment $t$ is $(a_t, b_t)$, if agent 1 is a naive learner, then its action at $t + 1$ is $a_{t+1} = \arg\max_{a} r_1(a, b_t)$. If we take moment $t$ as a starting point and assume that the opponent is also undergoing a learning process, then INRL optimizes the reward at moment $t + 0$ (when the opponent has not changed).
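The naive-learner rule can be illustrated with a short sketch. The Prisoner's Dilemma payoffs and the function names below are assumptions made for this example only; both agents are taken to update simultaneously, each best-responding to the other's previous action.

```python
import numpy as np

# Assumed Prisoner's Dilemma payoffs; index 0 = cooperate, 1 = defect.
R1 = np.array([[-1, -3],
               [0, -2]])   # reward of agent 1, indexed [a, b]
R2 = R1.T                  # symmetric game: reward of agent 2, indexed [a, b]


def naive_step(reward_1: np.ndarray, b_t: int) -> int:
    """Best response of agent 1 to the opponent's last action b_t."""
    return int(np.argmax(reward_1[:, b_t]))


def naive_self_play(reward_1, reward_2, a0, b0, steps=5):
    """Both agents act as naive learners and update at the same time."""
    a, b = a0, b0
    history = [(a, b)]
    for _ in range(steps):
        # The right-hand side is evaluated with the old (a, b) before assignment.
        a, b = naive_step(reward_1, b), int(np.argmax(reward_2[a, :]))
        history.append((a, b))
    return history


print(naive_self_play(R1, R2, a0=0, b0=0))
```

Starting from mutual cooperation, both naive learners move to defection after one step and stay there, which is the short-sighted outcome that opponent shaping tries to avoid.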
LOLA, SOS, COLA: This class of algorithms is based on the premise that the opponents are also updating their strategies, so that when the moment shifts from $t$ to $t + 1$, the agent needs to estimate the changes in the opponents' strategies and predict the next joint actions. We assume that at moment $t$ the strategy $f(\theta^1_t)$ of agent 1 is controlled by a parameter $\theta^1_t$ and agent 2's strategy $f(\theta^2_t)$ is controlled by $\theta^2_t$. Then, at the next moment $t + 1$, the strategy parameter of agent 2 will become Therefore, if agent 1 wants to optimize its reward at the next moment, its strategy parameter should become LOLA implements this training process by assuming that the opponent is a naive learner and using the first-order approximation of $\theta^2$: (where $\eta$ is a constant), which achieves good results in the Prisoner's Dilemma game. As we can see from the previous discussion, the LOLA class of algorithms considers the reward at moment $t + 1$ as the optimization objective (when the opponent has changed its strategy according to its learning process). The above algorithms consider only instantaneous rewards and ignore the effects of actions on other agents' strategies. Since estimating the opponent's strategy at each subsequent moment is a complex process, we can derive a stable point of the strategy change process with best response theory.
Best Response: The best response is an important reference for agents when making strategic choices. It has good properties: the best response allows the agent to obtain the maximum reward when the opponent's strategy is known. If in a joint action $(a, b) \in A \times B$ both actions are the best response to the other action, that is: $r_1(a, b) = \max_{a' \in A} r_1(a', b)$ and $r_2(a, b) = \max_{b' \in B} r_2(a, b')$, then the PSNE has been reached. The best response is also connected to the fixed point of learning. If there exists $T_1$ such that $A^-_t \equiv A^-_0$ for $t > T_1$, then the agent's strategy will converge to its best response.
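The mutual best-response condition can be checked directly on a bimatrix game. The sketch below is illustrative; the function name and the two example games (a 2-player Stag Hunt and Matching Pennies with assumed payoffs) are not taken from the paper's tables.

```python
import numpy as np


def pure_nash_equilibria(r1: np.ndarray, r2: np.ndarray):
    """Joint actions (a, b) where a is a best response to b and b to a.

    r1[a, b] and r2[a, b] are the rewards of agents 1 and 2.
    """
    a_is_best = r1 == r1.max(axis=0, keepdims=True)  # best row for each column
    b_is_best = r2 == r2.max(axis=1, keepdims=True)  # best column for each row
    return [(int(a), int(b)) for a, b in np.argwhere(a_is_best & b_is_best)]


# Assumed 2-player Stag Hunt: action 0 = stag, 1 = hare.
stag_r1 = np.array([[4, 0],
                    [3, 3]])
stag_r2 = stag_r1.T

# Assumed Matching Pennies: zero-sum, no pure strategy equilibrium.
mp_r1 = np.array([[1, -1],
                  [-1, 1]])
mp_r2 = -mp_r1

print(pure_nash_equilibria(stag_r1, stag_r2))
print(pure_nash_equilibria(mp_r1, mp_r2))
```

The Stag Hunt instance has two PSNE with different rewards, which is the situation where the choice of convergence point matters; the Matching Pennies instance has none.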
When the discount factor $\gamma$ is not sufficiently small, we can give an approximation of the accumulated reward: where $A^-_0$ is the joint best response (BR) according to $a_0$. Then we have Without restricting the learning rate $\delta$, we can define the best response as the opponent strategy at moment $\infty$. Therefore, modeling the best response of the opponent can be seen as optimizing the reward at moment $t + \infty$.

Modeling opponent learning with two phases in multiagent repeated games
To implement the modeling and shaping of the opponent learning process, we design an algorithm named MOL (Modeling Opponent Learning). It can be divided into two phases. In the first phase, the agents optimize the instantaneous reward to explore the game structure as well as the best response of the opponent. In the second phase, the agents guide the learning process of the opponent to a stable point with a higher fixed reward.

Phase I of MOL
Phase I of MOL (MOL-1) starts from a random initialization, where the agent does not have any information about the game structure or the opponent. Thus, the objective of MOL-1 is to explore the game structure and the opponents' preferences for different actions. We denote the opponents' joint action at moment $t$ as $a^-_t$. At the end of each round of the game, the agents are able to observe the joint action $(a_t, a^-_t)$ and their own rewards $r_i(a_t, a^-_t)$. In the first phase we create a Q-table to record the rewards for taking a certain action when the opponents' action is fixed (approximating the structure of the reward matrix for each agent). We denote the agent's expected reward as $\hat{q}(a|a^-)$ for taking action $a$ when the opponents are determined to take action $a^-$, and we denote the actual value by $r(a|a^-)$.

Lemma 1 $\hat{q}(a|a^-) \to r(a|a^-)$.
Proof Denote by $N(a, a^-)$ the number of times the trajectory passes $s(a, a^-)$, then which implies $\hat{q}(a|a^-) \to r(a|a^-)$, thus Lemma 1 holds.
We implement this process using the following approach: before round $t$, the agent predicts the opponents' actions $\hat{a}^-_t$ based on the previously tracked information (taking a maximum-likelihood approach). Then, the agent chooses the best response based on $\hat{q}$ and $\hat{a}^-_t$. We assume that the set of joint actions corresponding to the PSNE is $U_0 \subset A_1 \times \cdots \times A_n$; then the prediction convergence set is $\hat{U}_0 \subset \hat{A}_1 \times \cdots \times \hat{A}_n$.
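A minimal sketch of this bookkeeping is given below, under stated assumptions: the Q-table entry is kept as a running mean of observed rewards, and the maximum-likelihood prediction is taken to be the opponents' most frequent past action. The class and method names are hypothetical, and exploration is deliberately left out.

```python
from collections import Counter


class Phase1Tracker:
    """Running-mean Q-table plus an empirical model of the opponents' action."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.q = {}                  # (own action, opponents' action) -> mean reward
        self.visits = Counter()      # visit count of each joint action
        self.opp_counts = Counter()  # how often each opponent action was seen

    def update(self, action, opp_action, reward):
        """Record one observed round (own action, opponents' action, own reward)."""
        key = (action, opp_action)
        self.visits[key] += 1
        old = self.q.get(key, 0.0)
        self.q[key] = old + (reward - old) / self.visits[key]
        self.opp_counts[opp_action] += 1

    def predict_opponent(self):
        """Assumed maximum-likelihood prediction: the most frequent past action."""
        if not self.opp_counts:
            raise ValueError("no opponent action observed yet")
        return self.opp_counts.most_common(1)[0][0]

    def best_response(self):
        """Greedy action against the predicted opponent action (visited entries only)."""
        predicted = self.predict_opponent()
        return max(self.actions,
                   key=lambda a: self.q.get((a, predicted), float("-inf")))


tracker = Phase1Tracker(["C", "D"])
for a, b, r in [("C", "C", -1), ("D", "C", 0), ("C", "D", -3),
                ("D", "D", -2), ("D", "C", 0)]:
    tracker.update(a, b, r)
print(tracker.predict_opponent(), tracker.best_response())
```

Unvisited entries are ranked last here so that the greedy choice relies only on observed rewards; in the full algorithm such entries would be reached through exploration.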

Lemma 2
Assume that the random exploration rate $\gamma \to 0$ as the number of iterative steps increases ($\gamma \approx 0$ when $T > N$). Then the joint action $U_t$ converges to the PSNE set $U_0$ when, at moment $t$ ($> N$), the agents' joint prediction of the opponents' actions is in $\hat{U}_0$.
Proof We assume that $(a_0, a^-_0)$ is a PSNE. According to Lemma 1 we know that the Q-table eventually converges to the true reward matrix, $\hat{q}_t(a|a^-) \to r(a|a^-)$. Then we have Assuming that the prediction of the joint action is the same for all agents, $\hat{u}_0 = (\hat{a}_1, \cdots, \hat{a}_n) \in \hat{U}_0$, after the $t$th step. At this time $\forall i = 1, \cdots, n$, Similarly we have Thus we prove that the convergence of the prediction is a sufficient condition for the convergence of the actual joint action. Considering that therefore it is also a necessary condition. At moment $t$ we assume that the exploration rate $\gamma \approx 0$ and that every agent's prediction corresponds to a PSNE $u_0 = (a_1, \cdots, a_n)$. Then $s_{t+1} = u_t = u_0$ and the prediction is unchanged. So the joint action converges to $u_0 \in U_0$.
In the MOL-1 period we also want to explore the learned best response of the opponent when the agent takes different actions. Therefore, we need to increase the frequency of the corresponding nonequilibrium actions. Our approach is to use the UCB (upper confidence bound) [49] function and to focus on exploration at the beginning of the training process. The UCB was chosen as the sampling function because of its ability to improve the exploration rate compared to the EI (expected improvement) and PI (probability of improvement), and thus to allow for more accurate estimations of the best responses of other agents. The action of an agent at moment $t$ is selected by: where $N$ denotes the total number of rounds and $n_a$ denotes the number of rounds in which the agent takes action $a$.
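A UCB-style selection rule of this kind can be sketched as follows. This is the generic form (estimated value plus an exploration bonus that shrinks with the visit count), not necessarily the exact expression used in MOL; the exploration constant `c` and the function name are assumptions.

```python
import math


def ucb_select(q_values, counts, total_rounds, c=1.0):
    """Pick the action maximizing q + c * sqrt(ln(N) / n_a).

    q_values: estimated reward of each action against the predicted opponent action.
    counts:   n_a, the number of rounds in which each action was taken.
    total_rounds: N, the total number of rounds played so far.
    c: exploration constant (hypothetical; not specified in the text).
    """
    # Try every action at least once before applying the bonus formula.
    for action in q_values:
        if counts.get(action, 0) == 0:
            return action
    return max(
        q_values,
        key=lambda a: q_values[a] + c * math.sqrt(math.log(total_rounds) / counts[a]),
    )


q = {"a": 1.0, "b": 0.9}
n = {"a": 100, "b": 1}
print(ucb_select(q, n, total_rounds=101))         # rarely tried action gets the bonus
print(ucb_select(q, n, total_rounds=101, c=0.0))  # no bonus: purely greedy
```

Early in training the bonus dominates and nonequilibrium actions are sampled often; as the counts grow the rule approaches the greedy best response.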
We give a proof of the algorithm's convergence only for games with 2 players, due to a nice property of such games: when there are both pure and mixed strategy equilibria in the game, there must be a linear combination relationship between them. Most existing algorithms, such as DRON, LOLA, and SOS, are based on the background of two-player games. Additionally, solving the Nash equilibrium for multiplayer general-sum games is an unsolved problem.
In the proof, we use the definition of the inferior action: if for two actions a and b, regardless of the opponent action, the reward for action a is never higher than that of b, then a is said to be an inferior action.

Lemma 3 When there exists a unique pure strategy Nash equilibrium (a, b), then for at least one of the agents it can obtain the action corresponding to the equilibrium (a) by repeatedly eliminating the inferior action.
Proof First, in a 2-dimensional bimatrix game, we may assume that the unique pure strategy equilibrium is (L,U), and that the other optional actions are (R,D). Since (L,U) is a Nash equilibrium, the game has an equivalent matrix form as in Table 3, where M ≥ 0 and N ≥ 0.
Since (L,U) is the only pure strategy equilibrium, c ≥ a and d ≥ b cannot be both true (otherwise (R,D) is also a Nash equilibrium). It may be assumed that c < a, then, D is an inferior strategy for the row player, and the action under equilibrium can be obtained by eliminating the inferior strategy. For the case in which the agent has more than two possible actions, we can obtain the elimination method similarly using the dominance relation of the action combination.
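The elimination procedure used in Lemma 3 can be sketched for a bimatrix game as follows. The definition of an inferior action is the one given above (never higher than some other action); the payoff matrices are an assumed example and the function name is illustrative.

```python
import numpy as np


def eliminate_inferior(r1: np.ndarray, r2: np.ndarray):
    """Repeatedly remove inferior actions; returns surviving (rows, cols).

    r1[a, b] is the row player's reward, r2[a, b] the column player's.
    One action is removed per pass, so identical actions are never both dropped.
    """
    rows = list(range(r1.shape[0]))
    cols = list(range(r1.shape[1]))
    changed = True
    while changed:
        changed = False
        for a in rows:
            # a is inferior if some other row is at least as good in every column
            if any(b != a and np.all(r1[a, cols] <= r1[b, cols]) for b in rows):
                rows.remove(a)
                changed = True
                break
        if changed:
            continue
        for a in cols:
            if any(b != a and np.all(r2[rows, a] <= r2[rows, b]) for b in cols):
                cols.remove(a)
                changed = True
                break
    return rows, cols


# Assumed example with a unique pure strategy equilibrium at (row 1, column 0).
r1 = np.array([[1, 3],
               [2, 4]])
r2 = np.array([[0, 2],
               [1, 0]])
print(eliminate_inferior(r1, r2))
```

In this example row 0 is inferior and is removed first, after which column 1 becomes inferior for the column player; only the equilibrium joint action survives. Games such as the Stag Hunt have no inferior actions, so nothing is removed there.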

Theorem 1 MOL-1 converges to Nash equilibrium in games with 2 players.
Proof 1) When there is a unique pure strategy Nash equilibrium of the game, we may assume that $(a, a^-)$ is the PSNE, and let $b = a^-$. From Lemma 3 we can see that there must be such an agent at this time, denoted as agent 1. When the set of its available actions is $A^*$ ($A^*$ is derived by eliminating the inferior strategies of $A$), for each $b \in B^*$, $a$ is the best response to $b$. From Lemma 1 we have $\hat{q}_t(a, b) \to r(a, b)$, which indicates At this point, regardless of the value of $\hat{b}$, we have $\hat{q}_t(a, \hat{b}) = \max_{a' \in A^*} \hat{q}_t(a', \hat{b})$. Thus, when $t \to \infty$, the action of agent 1 will converge to $a$. Then for agent 2, the opponent's action prediction $\hat{a}$ will converge to $a$. And $(a, b)$ is a PSNE, so Then the action of agent 2 will converge to $b$, which implies that MOL-1 converges. 2) When the game has more than one PSNE, we want to prove that it will converge to one PSNE with probability 1 (as $t \to \infty$). We assume that $\gamma \approx 0$ when $t > T$. The action of agent 1 is selected by $\max_{a \in A} \hat{q}_t(a, \hat{b}) = \max_{a \in A} r(a, \hat{b})$ from Lemma 1. We consider the set of pure strategy Nash equilibria , which means that when $(\hat{a}_t, \hat{b}_t) \in U_0$ the algorithm reaches a fixed point. We assume that the unstable point set is the complementary set $\bar{U}_0$, and that there exists $T_0$ such that $\hat{Q} = R$ when $t > T_0$ for all players. And we assume that there exists $T > T_0$, Since $p_T$ is solved by using the previous track sample for the maximum likelihood, $a_T = \hat{a}_{T+1} \Rightarrow N^T_{a_i} = N^T_{a_j}$ ($N^T_{a_i}$ means the total number of occurrences of $a_i$ before moment $T$).
Due to the presence of random exploration in the algorithm controlled by the UCB function, we have Thus the joint action will not cycle between multiple pure strategy equilibria. Considering that $(a_i, b_j)$ or $(a_j, b_i)$ are not available as a result of the convergence of the algorithm, we only need to exclude the case where the state cycles between nonstable points. For the case $\hat{u}_t = (a_i, b_j)$, there exists a minimum $n$ satisfying $\hat{u}_{t+n} = (a_i, b_j)$. Similarly to equation (29) we have So we can assume that $\hat{u}_{t+n} = (a_i, b_m)$. Considering that the random exploration rate $\gamma \to 0$ and $b_i$ is the best response to $a_i$, we have Thus we prove that the joint prediction converges to the set corresponding to the PSNE. So when $t \to \infty$ the joint action will reach the PSNE set $U_0$ with probability 1, which implies that MOL-1 converges.
3) In the previous discussion of the strategy section we mentioned that for each agent in a mixed equilibrium, its optional actions correspond to an identical reward. Thus for a game in which there is no pure strategy equilibrium (the existence of a mixed equilibrium is proved by the existence of NE), the corresponding set of joint actions is $A' \times B'$. For agent 1, when $t \to \infty$, the reward of actions in $A'$ must be higher than that of actions not in $A'$. So the joint action will be centred on $A' \times B'$. Since only mixed strategy Nash equilibria exist, when $b \in B'$, So agent 1 will only pick among certain actions with almost identical Q-values.
Purification theorem of Nash equilibrium (Harsanyi [50]): for almost all games $G$, the following statement is true. Let $s = (s_1, \cdots, s_n)$ be the mixed equilibrium, and $G^*(\epsilon)$ be a family of games whose payoff matrix differs from $G$ by only a small perturbation, with $G^*(0) = G$. Then there exists a family $\{s(\epsilon)\}$ of $n$-tuples of mixed strategies such that, for any $\epsilon$, $s(\epsilon)$ is a pure strategy point in $G^*(\epsilon)$, with limit $\lim_{\epsilon \to 0} s(\epsilon) = s$. This shows that the mixed equilibrium can be viewed as the limit of a sequence of pure strategy equilibria. Since the algorithm can converge to a pure strategy equilibrium, it can also converge asymptotically to a mixed equilibrium, which implies that MOL-1 converges.

Phase II of MOL
In the MOL-2 phase, our strategy choice is based on the model of the opponents' learning process. To predict the expected stable reward for each action, we evaluate the reward after predicting the best response of the opponent and then build a long-term value model. Our initialization of the value $V_a$ is obtained from the weighted average of the Q-value $q(a_0|a^-)$ when the opponent reacts differently. This process relies on the approximation of the best response for each action in the first phase. The value function for action $a_0$ is defined as follows: $V_{a_0} = \sum_{a^-} \lambda_{a^-} \, q(a_0|a^-)$. The weight $\lambda_{a^-}$ is derived from the approximated probability of different actions $a^-$ being the best response learned by the opponent when the agent takes action $a_0$. To set a constraint on the process of value iteration to ensure its convergence, we set the value of the joint action corresponding to the Nash equilibrium converged to in the first phase as fixed. Assume that $u_0 = (a_1, \cdots, a_n) = (a_i, a_{-i})$ is the Nash equilibrium converged to in the first phase (for a mixed equilibrium, this is a component of the equilibrium corresponding to the linear combination). We define the value of the actions in this equilibrium $V_{a_i}$ as $V_{a_i} = r_i(a_i, a_{-i})$. This value is fixed, and when the value of any other action is lower than it, that action will no longer be considered among the agent's optional strategies. The choice of the agent strategy in each iteration step is based on a linear combination of the long-term value $V$ and the instantaneous Q-value: $a_t = \arg\max_{a} \left[ \beta V_a + (1 - \beta) \, q(a|\hat{a}^-_t) \right]$. We use the value of $\beta$ to control the long- and short-term value weights, thus ensuring the balance of convergence and reward. The algorithm switches from the first phase to the second after $k$ rounds are performed. The initial value of $\beta$ is close to 1. When the value of $\beta$ approaches 0, the agent's strategy will change back to greedy. Since we set the Nash equilibrium to correspond to a constant action-value, the algorithm will return to the result obtained in the first phase. Therefore we can set a termination step $N$ and reduce the weight of the long-term value $\beta$ in strategy selection when the agent finds that it is not converging to a better result even after exceeding $N$ steps in the second stage. Eventually, convergence is ensured by returning to the Nash equilibrium. If we consider the joint reward as the target, we expect the score of the MOL-2 phase to improve compared to that in the first phase. Additionally, if there is a significant global optimum of the game, we expect the joint action to avoid convergence to a local solution and reach the global optimum.
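The second-phase selection rule can be sketched under explicit assumptions: the long-term value of an action is taken as the weighted average of its Q-values over the opponent's estimated best responses, the equilibrium action keeps a fixed value, and the score of an action is `beta * V + (1 - beta) * q`. The function names, the weights, and the example payoffs (a Stackelberg-style row player) are hypothetical.

```python
def long_term_values(q, br_weights):
    """V[a] = sum over opponent actions b of weight(b | a) * q[(a, b)]."""
    return {
        a: sum(w * q[(a, b)] for b, w in weights.items())
        for a, weights in br_weights.items()
    }


def phase2_action(q, values, predicted_opp, beta, eq_action, eq_value):
    """Choose the action maximizing beta * V[a] + (1 - beta) * q[(a, predicted_opp)].

    The equilibrium action keeps the fixed value eq_value, and actions whose
    long-term value falls below it are no longer considered.
    """
    values = dict(values)
    values[eq_action] = eq_value
    candidates = [a for a, v in values.items() if v >= eq_value]
    return max(
        candidates,
        key=lambda a: beta * values[a] + (1 - beta) * q[(a, predicted_opp)],
    )


# Assumed Q-table of the row player, keyed by (own action, opponent action).
q = {("U", "L"): 1, ("U", "R"): 3, ("D", "L"): 2, ("D", "R"): 4}
# Assumed phase-one estimate: the opponent learns R against U and L against D.
weights = {"U": {"R": 1.0}, "D": {"L": 1.0}}
V = long_term_values(q, weights)

# Phase one is assumed to have converged to (D, L), with reward 2 for the row player.
print(phase2_action(q, V, "L", beta=0.9, eq_action="D", eq_value=2))
print(phase2_action(q, V, "L", beta=0.0, eq_action="D", eq_value=2))
```

With `beta` close to 1 the agent prefers U, whose long-term value is higher, while `beta = 0` recovers the greedy choice D and hence the first-phase equilibrium.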

Theorem 2
The average reward of the convergence result of MOL-2 is not lower than that of the Nash equilibrium converged to in the first phase. Furthermore, if there exists a joint action $u_p$ that is the unique Pareto equilibrium of the game, then MOL-2 converges to $u_p$.
Proof If there exists a pure strategy Nash equilibrium (PSNE) in the game, we can assume that the PSNE converged to in the first phase is $u_0 = (a_1, \dots, a_n)$, and that its corresponding reward is $(r_1, \dots, r_n)$. For agent 1, the value of its action $a_1$ is then fixed at $r_1$, so its preference transfers to another action only if the value of that action is higher than $r_1$. Since $\beta_1$ is monotonically decreasing, there are two cases. If the algorithm converges with $\beta_1^{\infty} > 0$, the stable point $(a_1^{\infty}, a_{-1}^{\infty})$ satisfies $r_1^{\infty} \ge r_1$. Otherwise $\beta_1 \to 0$ as $t \to \infty$, and the convergence result is the same as in the first phase. In both cases the reward at convergence does not decrease.
For a game with only mixed equilibria, each optional action in the equilibrium state corresponds to the same reward. Therefore, no matter which action value is fixed at the end of the first phase of MOL, its joint score is the same as that of the equilibrium, and the convergence result of the MOL-2 phase is not lower than the fixed value.
When the Pareto equilibrium $u_p$ exists uniquely, we denote the corresponding joint action as $u_p = (a_i^*, a_{-i}^*)$ and the corresponding reward as $(r_i^*)$, $i \in \{1, \dots, n\}$. Then for any joint action $(a_i, a_{-i})$, the reward satisfies $r_i \le r_i^*$. This shows that $(a_i^*, a_{-i}^*)$ is also a Nash equilibrium.
When the algorithm converges to $(a_i^*, a_{-i}^*)$ in the first phase, this point is already fixed because its reward is maximal. When the first phase converges to another equilibrium $(a_i, a_{-i})$, its reward satisfies $r_i \le r_i^*$, while the actions in $(a_i^*, a_{-i}^*)$ are the expected optimal responses of the agents to each other. Thus the agents share the consensus that $(a_i^*, a_{-i}^*)$ is a high-value state, so in the second phase they jointly explore this state and find that the reward meets expectations. This indicates that the convergence result is $u_p$. Therefore MOL-2 converges to Pareto optimality.
From Theorem 2 we can see that the convergence result of the MOL algorithm is never worse than the Nash equilibrium reached in its first phase, and that the global optimal solution can be obtained in some special game structures. Therefore, the convergence of the MOL algorithm is guaranteed and its efficiency is bounded from below. The pseudocode of MOL is given in Algorithm 1.
Algorithm 1 Modeling opponent learning algorithm.
The first phase of MOL can be seen as an exploration step, in which agents explore the preferences of other agents while maintaining a greedy policy. In the second phase the agent combines the long-term value estimate $V_{a_i}$ obtained from the exploration with the Q-table to form a new objective function. The prediction of $\hat{a}_{-}^{t}$ is obtained by the maximum likelihood method. The initial value of $\beta$ is close to 1, and $\beta$ is multiplied by a decay factor $c = 0.99$ in each round in which MOL-2 has not converged, so it gradually decreases to 0 if MOL-2 does not converge.
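The two mechanical steps in this paragraph, predicting the opponent's action by maximum likelihood and decaying β by c = 0.99, can be sketched as follows. This is a hedged illustration: the history layout (pairs of own action and observed opponent response), the function names, and the stopping threshold are assumptions, and the prediction uses the fact that the maximum likelihood estimate under a categorical model is the most frequently observed response.

```python
from collections import Counter
from typing import List, Optional, Tuple


def predict_opponent_action(history: List[Tuple[str, str]],
                            my_action: str) -> Optional[str]:
    """Most frequent opponent response observed after my_action.

    history holds (own action, opponent response) pairs; this layout is
    hypothetical. Returns None when my_action was never played.
    """
    counts = Counter(opp for mine, opp in history if mine == my_action)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def decay_beta(beta: float, converged: bool, c: float = 0.99) -> float:
    """Multiply beta by c in every round in which MOL-2 has not converged."""
    return beta if converged else beta * c


def rounds_until_below(beta0: float, threshold: float, c: float = 0.99) -> int:
    """Count the non-converged rounds needed for beta to fall to threshold."""
    beta, rounds = beta0, 0
    while beta > threshold:
        beta = decay_beta(beta, converged=False, c=c)
        rounds += 1
    return rounds


HISTORY = [("U", "R"), ("U", "R"), ("U", "L"), ("D", "L"), ("D", "L")]
```

Because the decay is geometric, β never reaches 0 exactly; a practical run would treat β below a small threshold as the return to the greedy first-phase policy.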

Experimental results and discussion
Our experimental setup is based on partially observed repeated games. The agents do not have access to information other than their own rewards and joint actions. We conduct experiments from two perspectives: the convergence efficiency when the agents adopt the same algorithm and adaptability to other algorithms.

Classical games
(a) Stackelberg Leader game: We consider the game in Table 2. Stackelberg's theory assumes that the agents of the game are not homogeneous, but are classified as leaders (who act first) and followers (who act later). The outcome (U,R) is obtained under this assumption. In our experiment, the goal is to converge to this equilibrium without using the assumption. This is possible because, in the MOL-1 phase, the row player learns that the preference of the column player is U → R and D → L. Because the reward $r_1(U,R)$ is higher, the row player tends to choose action U in the second phase, which leads to the (U,R) result. Figure 1 shows the training results in the Stackelberg Leader game, where the joint reward represents the average reward of the agent system in each episode (rollout). From Fig. 1, we can see that DRON and LOLA do not converge to a fixed point. MOL-1 and the other opponent shaping algorithms (SOS, COLA) converge to the NE (D,L). The MOL-2 algorithm obtains the expected result of the Stackelberg game: the joint action converges to (U,R) and obtains an average reward of 2.5.
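The difference between the Nash outcome (D,L) and the Stackelberg outcome (U,R) can be checked on a small example. The payoff matrix below is an assumption, since Table 2 is not reproduced in this section; it is a commonly used form of the Stackelberg Leader game, chosen because it agrees with the outcomes and the average reward of 2.5 reported above. Function names are hypothetical.

```python
from typing import List, Tuple

ROW_ACTIONS = ("U", "D")
COL_ACTIONS = ("L", "R")

# Assumed payoffs: (row player's reward, column player's reward).
PAYOFF = {
    ("U", "L"): (1, 0), ("U", "R"): (3, 2),
    ("D", "L"): (2, 1), ("D", "R"): (4, 0),
}


def col_best_response(row_action: str) -> str:
    """Column player's best response to a fixed row action."""
    return max(COL_ACTIONS, key=lambda c: PAYOFF[(row_action, c)][1])


def pure_nash() -> List[Tuple[str, str]]:
    """Joint actions from which no player gains by deviating alone."""
    result = []
    for r in ROW_ACTIONS:
        for c in COL_ACTIONS:
            row_ok = all(PAYOFF[(r, c)][0] >= PAYOFF[(r2, c)][0]
                         for r2 in ROW_ACTIONS)
            col_ok = all(PAYOFF[(r, c)][1] >= PAYOFF[(r, c2)][1]
                         for c2 in COL_ACTIONS)
            if row_ok and col_ok:
                result.append((r, c))
    return result


def stackelberg_outcome() -> Tuple[str, str]:
    """Row player picks the action whose anticipated response pays best."""
    row = max(ROW_ACTIONS,
              key=lambda r: PAYOFF[(r, col_best_response(r))][0])
    return row, col_best_response(row)


def average_reward(joint: Tuple[str, str]) -> float:
    return sum(PAYOFF[joint]) / 2
```

Under these assumed payoffs, D dominates U for the row player in the one-shot game, yet anticipating the column player's responses makes U the better choice, which is the effect the second phase of MOL relies on.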
(b) Iterated Prisoner's Dilemma game (IPD): This is a classic game related to social dilemmas (Table 4). There is a unique Nash equilibrium (U,L) from the static game perspective, but TFT (Tit for Tat) is a better equilibrium concept in the repeated game environment. This equilibrium corresponds to a strategy of cooperating in the first round and repeating the opponent's previous action in each subsequent round. General algorithms converge to the inferior local solution (U,L), while LOLA was the first to achieve the TFT equilibrium. Similarly, we want to avoid reaching the local Nash equilibrium and instead achieve cooperation. For this, both agents need to find that cooperation is a strategy with a higher fixed reward than acting greedily. Figure 2 shows the training results in the Prisoner's Dilemma game. From Fig. 2, we can see that our algorithm obtains the same result as LOLA in achieving cooperation in the IPD. In contrast, the algorithms aiming to reach the Nash equilibrium remain in the local inferior equilibrium (U,L).
(c) Pennies Matching game (Table 5): The only mixed equilibrium in this game occurs when the agent chooses between its two actions at random with equal probability. We express the probability of an agent's action in terms of its frequency up to each moment and test whether it converges to 0.5. Figure 3 shows the training results in the Pennies Matching game (since the game is symmetric, we only need to consider whether the strategy of agent 1 converges). The result in Fig. 3 is consistent with our conjecture of asymptotic convergence to the mixed equilibrium. MOL-2 is less stable than the short-sighted MOL-1 and the gradient-based algorithms (LOLA, SOS, COLA), but its convergence result does not differ from theirs. In the subsequent randomly generated matrix games, we find that our algorithm converges to a stable point with a higher score than the equilibrium in games where only mixed equilibria exist. This supports the rationality of training pure strategies.
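The convergence test used for the Pennies Matching game, approximating the strategy probability by the running frequency and comparing it with 0.5, can be sketched as below. The payoffs are an assumption (the standard zero-sum form in which agent 1 receives +1 when the actions match and −1 otherwise), since Table 5 is not reproduced here, and the function names are hypothetical.

```python
from typing import List


def expected_payoff(action: str, opp_prob_l: float) -> float:
    """Agent 1's expected reward when the opponent plays L with opp_prob_l.

    Assumed payoffs: +1 if both actions match, -1 otherwise.
    """
    p_match = opp_prob_l if action == "L" else 1.0 - opp_prob_l
    return p_match * 1.0 + (1.0 - p_match) * -1.0


def running_frequency(actions: List[str], target: str = "L") -> List[float]:
    """Frequency of target among the actions played up to each round."""
    freqs, hits = [], 0
    for t, action in enumerate(actions, start=1):
        hits += action == target
        freqs.append(hits / t)
    return freqs


def near_mixed_equilibrium(actions: List[str], tol: float = 0.05) -> bool:
    """True when the final running frequency of L is within tol of 0.5."""
    return abs(running_frequency(actions)[-1] - 0.5) <= tol
```

At an opponent probability of 0.5 both actions have the same expected reward, which is why equal mixing is the only equilibrium; the tolerance `tol` is an arbitrary illustrative choice.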
(d) Tandem game: The Tandem game is introduced in the article on the SOS algorithm, where it serves as a counterexample that exposes the nonconvergence of the LOLA algorithm. This game is characterized by the fact that when both agents "arrogantly" shape the opponent's action, the result is the worst equilibrium. Each of the two participants has a loss function defined on the actions x and y. When both x and y are positive integers, increasing their sum decreases the reward. Thus, x = y = 0 is the optimal solution to the game. However, acting greedily or "arrogantly" shaping the action of the opponent can lead to worse results.
We use this game environment to test the robustness of the algorithm in the opponent shaping process. Figure 4 shows the training results in the Tandem game. The MOL algorithm does not lead both agents to act "arrogantly", but eventually converges to an optimal result. It achieves the same result as SOS and COLA without using the opponent's real strategy information.
(e) Stag Hunt game (with multiple agents) [24]: In multiagent systems, increasing the number of agents beyond two significantly enlarges the state space and makes the equilibria more complex. Therefore, previous articles rarely include experiments with three or more agents. We refer to the classic Stag Hunt game (given in Table 6) and extend it to three agents, indexed by $i \in \{1, 2, 3\}$. The previous algorithms were only tested in the two-player Stag Hunt, and we apply MOL to a multiagent scenario for comparison. The results are shown in Fig. 5. It can be seen that MOL is also effective in multiagent scenarios.
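To illustrate why a three-player Stag Hunt has both a high-reward and a low-reward equilibrium, the sketch below enumerates the pure Nash equilibria of a hypothetical three-player variant. The payoff rule (hunting the stag pays 4 only if all three agents hunt it and 0 otherwise, while hunting the hare always pays 1) is an assumption made for illustration and is not the extension used in the paper.

```python
from itertools import product
from typing import List, Tuple

ACTIONS = ("S", "H")  # S = hunt the stag, H = hunt the hare
N_PLAYERS = 3


def payoff(i: int, joint: Tuple[str, ...]) -> float:
    """Reward of player i under the hypothetical payoff rule."""
    if joint[i] == "H":
        return 1.0
    return 4.0 if all(a == "S" for a in joint) else 0.0


def is_pure_nash(joint: Tuple[str, ...]) -> bool:
    """True if no single player gains by changing only its own action."""
    for i in range(N_PLAYERS):
        current = payoff(i, joint)
        for alt in ACTIONS:
            deviated = joint[:i] + (alt,) + joint[i + 1:]
            if payoff(i, deviated) > current:
                return False
    return True


def pure_nash_equilibria() -> List[Tuple[str, ...]]:
    return [joint for joint in product(ACTIONS, repeat=N_PLAYERS)
            if is_pure_nash(joint)]
```

Under this rule the joint action space already has 2^3 = 8 entries, and a learner that settles on the all-hare equilibrium is stuck in exactly the kind of inferior local solution that the second phase of MOL is meant to leave.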

Randomly generated games
We examine the learning ability of an agent system with the MOL algorithm in randomly generated games. We generate 2,000 bimatrix games, with the reward corresponding to each of their joint actions being chosen from a determined set of integers. We allow agent systems with different algorithms to be trained in these environments and record their convergence results. Agents in this environment do not have access to information about other agents or their strategies. Therefore, avoiding a local inferior Nash equilibrium is the main objective. We use the Nash equilibrium with the highest average reward as the theoretical optimal result. Figure 6 shows the joint score for different algorithms in randomly generated games. From the figure, we can see that the joint score of MOL is higher and is close to the optimal Nash equilibrium. This indicates that as a learning agent system MOL is more efficient and robust.
Fig. 3 The training results of the agent systems in the Pennies Matching game. The strategy probability parameters of actions L and R are approximated by the frequency. The game has a unique mixed Nash equilibrium P(A = L) = P(A = R) = 0.5. We can assume that the system converges to this equilibrium when the strategy parameters converge to 0.5
Fig. 4 The training results of the agent systems in the Tandem game. The line with a joint reward of 0 corresponds to the optimal equilibrium x = y = 0
We also record the performance of the algorithm in environments with different types of Nash equilibria in Table 7. Here $r$ is the joint score of the algorithm, $r(NE)$ is the joint reward of the optimal Nash equilibrium and $k$ is a multiplier. For example, $k = 0.9$ in the PE column means that the joint score of the algorithm is at least 0.9 times the score of the optimal pure strategy equilibrium. We find that MOL achieves good results in exploring equilibria that approach or exceed the optimal Nash equilibrium reward. Training pure strategies in game environments where only mixed equilibria exist does not cause a significant decrease in the joint scores. Our algorithm achieves training results close to the optimal Nash equilibrium in the general case.
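The evaluation described above (random bimatrix games, the best pure Nash equilibrium as a reference, and the frequency P(r ≥ k · r(NE))) can be sketched as follows. This is a minimal sketch under assumptions: the function names, the game size, and the integer set are hypothetical, only pure equilibria are handled, and games without a pure equilibrium are skipped in the frequency.

```python
import random
from typing import List, Optional, Sequence, Tuple

Matrix = List[List[int]]


def random_bimatrix(n: int, values: Sequence[int],
                    rng: random.Random) -> Tuple[Matrix, Matrix]:
    """Draw n x n reward matrices for the row and column players."""
    a = [[rng.choice(values) for _ in range(n)] for _ in range(n)]
    b = [[rng.choice(values) for _ in range(n)] for _ in range(n)]
    return a, b


def pure_nash_equilibria(a: Matrix, b: Matrix) -> List[Tuple[int, int]]:
    """Cells (i, j) where neither player gains by deviating alone."""
    rows, cols = len(a), len(a[0])
    result = []
    for i in range(rows):
        for j in range(cols):
            row_ok = all(a[i][j] >= a[k][j] for k in range(rows))
            col_ok = all(b[i][j] >= b[i][k] for k in range(cols))
            if row_ok and col_ok:
                result.append((i, j))
    return result


def best_nash_joint_reward(a: Matrix, b: Matrix) -> Optional[float]:
    """Highest average reward over pure equilibria, or None if none exist."""
    eqs = pure_nash_equilibria(a, b)
    if not eqs:
        return None
    return max((a[i][j] + b[i][j]) / 2 for i, j in eqs)


def frequency_above(scores: Sequence[float],
                    nash_rewards: Sequence[Optional[float]],
                    k: float) -> float:
    """Fraction of games with score >= k * r(NE), skipping missing r(NE)."""
    pairs = [(s, r) for s, r in zip(scores, nash_rewards) if r is not None]
    if not pairs:
        return 0.0
    return sum(s >= k * r for s, r in pairs) / len(pairs)
```

In a full experiment the `scores` would come from the trained agent systems; the sketch only covers the reference value and the frequency statistic.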

Competitive environments
Since we cannot require agents to use the same learning method in a repeated game environment, the ability to cope with different opponents is also important. We therefore also conduct an experiment for evaluating the algorithm performance against different opponents. The environment used is the same generated bimatrix game, but random agents are used for training in each game. In this environment the agents do not have access to the (learning) strategy that the opponent will adopt and can only observe joint actions after each round of the game. Algorithms based on the private information of the opponent or on consistency (SOS, COLA) cannot be applied in this scenario. There are 20,000 randomly generated games, each with two selected agents (which may or may not use the same algorithm), that are added to the training over 100 episodes. We evaluate the performance of the algorithm from two perspectives. The first is the average score (the results are shown in Fig. 7) of the evolution process for agents with different algorithms over 100 episodes (obtained by training in 20,000 matrix games). This reflects the ability of the agents to adapt when faced with a diverse set of opponents. From Fig. 7, we can see that our algorithm achieves higher scores against diversified opponents. We also record the training results of each particular combination of agents in randomly generated games. Since the structure of the game is symmetric and randomized, agents who perform better have stronger applicability in noncooperative game environments. The experimental results are presented in Fig. 8. From Fig. 8, we find that MOL-2 does not receive a higher reward in a competitive game environment when facing LOLA or MOL-1, two algorithms that aim for short-term rewards. However, MOL-2 generally receives higher rewards against a variety of opponents (and simultaneously makes the opponents' rewards higher). We can see that MOL performs well in striking a balance between increasing its own rewards and maximizing social welfare (the average reward for the agent system).
Fig. 6 Results of agent systems with different algorithms trained in 2,000 random game environments. The line labeled best ne represents the average reward for the optimal Nash equilibrium in these games. The joint score is obtained by averaging the rewards for the last episode
Table 7 The frequency of the convergence result that the algorithm scores above the optimal pure strategy equilibrium (PE) and mixed equilibrium (ME) in randomly generated games. Column headers: P(r ≥ k · r(NE)), PE, ME. The probability P is represented by the frequency and k is a multiplier. The bold numbers are the maximum value of each column
Fig. 7 The training results for agents with different opponents. The curves represent the scores of the strategies learned by different algorithms within 100 episodes when facing a random opponent (averaged over 20,000 matrix games)
From the experiments above, we can see that MOL achieves good results in terms of improving both the agent reward and social welfare. In the cooperative form of the game, MOL can converge to an equilibrium with higher joint rewards, and in competitive environments it copes effectively with different kinds of opponents. Since MOL is based on the best response (a pure strategy in most cases), the stability of MOL is reduced in game environments where only mixed equilibria exist. However, we can see from the experimental results that this does not affect the final convergence results. Therefore MOL is well suited for application in the repeated game environments of multiagent learning systems, and as a baseline for opponent modeling due to its prosociality and its generalization capability.

Conclusion
This article proposes an MOL method for multiagent repeated game environments. By modeling the stable points of the opponent learning process and taking actions to guide the opponent, MOL can achieve a solution with a high reward equilibrium. Since there is no restriction on the opponent's model during training and no private information of the other agents is used, MOL is more feasible for noncooperative game environments than other algorithms. We provide a proof of the convergence of MOL by dividing the algorithm into two looping phases and using the Nash equilibrium as a boundary constraint. MOL achieves good results in the classical game structures as well as in the randomly generated games, and obtains a higher joint score when dealing with different opponents. Additionally, we discuss the definition of the equilibrium concept in repeated games with learning processes. We argue that there are better convergence results that can be used as optimization objectives for multiagent systems and establish the relationship between our algorithm and Pareto optimality. We hope to provide a reference for the design of optimization objectives in the learning processes of agent systems, and to help build a general model for noncooperative game environments.
Fig. 8 Training results against different opponents. The row labels indicate the algorithm used by the agent, and the column labels indicate the algorithm used by the opponent. The values in the corresponding cells indicate the score of that agent against this class of opponents. A cell with a lighter color indicates a higher score
Tiande Guo is a professor and doctoral supervisor at the University of Chinese Academy of Sciences. His current research interests include image processing, reinforcement learning, and optimization methods and applications. He has published articles in journals such as IEEE TPAMI, TIP and PR.