# Stochastic Games and Learning


**DOI:** https://doi.org/10.1007/978-1-4471-5102-9_33-2

## Abstract

A stochastic game was introduced by Lloyd Shapley in the early 1950s. It is a dynamic game with *probabilistic transitions* played by one or more players. The game is played in a sequence of stages. At the beginning of each stage, the game is in a certain *state*. The players select actions, and each player receives a *payoff* that depends on the current state and the chosen actions. The game then moves to a new random state whose distribution depends on the previous state and the actions chosen by the players. The procedure is repeated at the new state, and the play continues for a finite or infinite number of stages. The total payoff to a player is often taken to be the discounted sum of the stage payoffs or the limit inferior of the averages of the stage payoffs.

A learning problem arises when the agent does not know the reward function or the state transition probabilities. If an agent learns its optimal policy directly, without knowing either the reward function or the state transition function, the approach is called *model-free reinforcement learning*. *Q*-learning is an example of such an approach.

*Q*-learning has been extended to a noncooperative multi-agent context, using the framework of general-sum stochastic games. A learning agent maintains *Q*-functions over joint actions and performs updates based on assuming Nash equilibrium behavior over the current *Q*-values. The challenge is convergence of the learning protocol.

## Keywords

Markov decision process · Repeated game · Equilibrium · Dynamic programming · Reinforcement learning · Asynchronous dynamic programming · *Q*-learning

## Introduction

### A Stochastic Game

### Definition 1 (Stochastic games).

A stochastic game is a dynamic game with probabilistic transitions played by one or more players. The game is played in a sequence of stages. At the beginning of each stage, the game is in a certain state. The players select *actions*, and each player receives *a payoff* that depends on the current state and the chosen actions. The game then moves to a new random state whose distribution depends on the previous state and the actions chosen by the players. The process is repeated at the new state, and the play continues for a finite or infinite number of stages.

*The total payoff* to a player can be defined in various ways; it depends on the payoffs at each stage and the strategies chosen by the players. The aim of each player is to control the total payoff in the game through an appropriate choice of actions.

The notion of a stochastic game was introduced by Lloyd Shapley (1953) in the early 1950s. Stochastic games generalize both Markov decision processes (see also MDP) and repeated games. A repeated game is equivalent to a stochastic game with a single state. The stochastic game is played in discrete time with past history as common knowledge for all the players. An *individual strategy* for a player is a map which associates with each given history a probability distribution on the set of actions available to that player. The players’ actions at stage *n* determine the players’ payoffs at this stage and the state \(s \in\mathfrak{S}\) at stage *n* + 1.

### Learning

Learning is acquiring new, or modifying and reinforcing existing, knowledge, behaviors, skills, values, or preferences, and may involve synthesizing different types of information. The ability to learn is possessed by humans, animals, and some machines, which will later be called *agents*. In the context of this entry, learning refers to a particular class of stochastic game theoretical models.

### Definition 2 (Learning in stochastic games).

A learning problem arises when an agent does not know the reward function or the state transition probabilities. If the agent learns its optimal policy directly, without knowing either the reward function or the state transition function, the approach is called *model-free reinforcement learning*. *Q*-learning is an example of such an approach.

Learning models constitute a branch of a larger literature. Players follow a form of behavioral rule, such as imitation, regret minimization, or reinforcement. Learning models are most appropriate in settings where players have a good understanding of their strategic environment and where the stakes are high enough to make forecasting and optimization worthwhile. The known approaches are formulated as *minimax-Q* (Littman 1994), Nash-*Q* (Hu and Wellman 1998), tinkering with learning rates (“Win or Learn Fast”, WoLF; Bowling and Veloso 2001), and multiple-timescale *Q*-learning (Leslie and Collins 2005).

## Model of Stochastic Game

An *N-person stochastic game* is described by the objects \(\left (\mathfrak{N},\mathfrak{S},X_{k},A_{k},r_{k},q\right )\) with the interpretation that:

- 1.
  \(\mathfrak{N}\) is the set of players, with \(\left \vert \mathfrak{N}\right \vert= N \in\mathbb{N}\).

- 2.
  \(\mathfrak{S}\) is the *set of states* of the game, and it is finite.

- 3.
  \(\overrightarrow{X} = X_{1} \times X_{2} \times \ldots \times X_{N}\) is the *space of actions*, where *X*_{ k } is a nonempty, finite space of actions for player *k*.

- 4.
  *A*_{ k }’s are correspondences from \(\mathfrak{S}\) into nonempty subsets of *X*_{ k }. For each \(s \in\mathfrak{S}\), *A*_{ k }(*s*) represents the *set of actions* available to player *k* in state *s*. For \(s \in\mathfrak{S}\), denote \(\overrightarrow{A}(s) = A_{1}(s) \times A_{2}(s) \times \ldots \times A_{N}(s)\).

- 5.
  \(r_{k} : \mathfrak{S} \times \overrightarrow{ X} \rightarrow \mathfrak{R}\) is the payoff function for player *k*.

- 6.
  *q* is a transition probability from \(\mathfrak{S} \times \overrightarrow{ X}\) to \(\mathfrak{S}\), called the *law of motion* among states. If *s* is the state at a certain stage of the game and the players select \(\overrightarrow{x} \in \overrightarrow{ A}(s)\), then \(q\left (\cdot \left \vert s,\right.\overrightarrow{x}\right )\) is the probability distribution of the next state of the game.

Each play of the game generates two sequences of random variables:

- 1.
  the states \(\{\sigma _{n}\}_{n=1}^{T}\) with values in \(\mathfrak{S}\)

- 2.
  the actions \(\{\alpha _{n}\}_{n=1}^{T}\) with values in \(\overrightarrow{X}\)
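The objects \(\left (\mathfrak{N},\mathfrak{S},X_{k},A_{k},r_{k},q\right )\) and the state/action sequences can be made concrete with a minimal simulator. The toy game below is purely illustrative: the names `reward`, `transition`, `play` and all numbers are hypothetical, not from the source.

```python
import random

# A toy two-player, two-state stochastic game illustrating the objects
# (N, S, X_k, A_k, r_k, q) from the definition above.
states = ["s0", "s1"]
actions = {0: ["a", "b"], 1: ["a", "b"]}  # A_k(s), taken state-independent here

def reward(k, s, joint):
    # r_k : S x X -> R; a pure coordination payoff for illustration
    return 1.0 if joint[0] == joint[1] else 0.0

def transition(s, joint):
    # q(. | s, x): probability distribution of the next state
    p_s0 = 0.8 if joint[0] == "a" else 0.2
    return {"s0": p_s0, "s1": 1.0 - p_s0}

def play(T, policy, s="s0", seed=0):
    """Simulate T stages; policy(k, s) returns an action for player k."""
    rng = random.Random(seed)
    history, payoffs = [], [0.0, 0.0]
    for _ in range(T):
        joint = tuple(policy(k, s) for k in (0, 1))
        for k in (0, 1):
            payoffs[k] += reward(k, s, joint)
        history.append((s, joint))
        dist = transition(s, joint)
        s = rng.choices(list(dist), weights=list(dist.values()))[0]
    return history, payoffs

# Both players always play "a": every stage pays 1.0 to each player.
hist, pay = play(5, lambda k, s: "a")
```

Here `hist` is the realized sequence of states and joint actions (the processes \(\{\sigma _{n}\}\) and \(\{\alpha _{n}\}\)), and `pay` accumulates the stage payoffs.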

### Strategies

Let \(\mathfrak{H} = \mathfrak{S}_{1} \times \overrightarrow{ X}_{1} \times \mathfrak{S}_{2} \times \cdots \) be the space of all infinite histories of the game and \(\mathfrak{H}_{n} = \mathfrak{S}_{1} \times \overrightarrow{ X}_{1} \times \mathfrak{S}_{2} \times \overrightarrow{ X}_{2} \times \cdots \times \mathfrak{S}_{n}\) the space of histories up to stage *n*.

### Definition 3.

A player’s *strategy* \(\pi =\{\alpha _{n}\}_{n=1}^{T}\) consists of random maps \(\alpha _{n} : \Omega\times \mathfrak{H}_{n} \rightarrow X\). In other words, the strategy associates with each given history a probability distribution on the set of actions available to the player. If *α* _{ n } depends on the history only, the strategy is called deterministic.

- 1.
  For player \(k \in\mathfrak{N}\), a deterministic strategy specifies a choice of actions for the player at every stage of every possible history.

- 2.
  A mixed strategy is a probability distribution over deterministic strategies.

- 3. Restricted classes of strategies:
  - 1.
    A behavioral strategy – a mixed strategy in which the mixing takes place at each history independently.

  - 2.
    A Markov strategy – a behavioral strategy such that for each time *t*, the distribution over actions depends only on the current state, but the distribution may differ at time *t* and at time *t*^{ ′ } ≠ *t*.

  - 3.
    A stationary strategy – a Markov strategy in which the distribution over actions depends only on the current state (not on the time *t*).
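The distinction between these classes can be made concrete in code. The sketch below is illustrative only: the states, actions, and distributions are hypothetical.

```python
import random

# Illustrative only: hypothetical states, actions, and distributions.
states, acts = ["s0", "s1"], ["a", "b"]

# Stationary strategy: state -> distribution over actions (time-free).
stationary = {"s0": {"a": 0.7, "b": 0.3}, "s1": {"a": 0.5, "b": 0.5}}

# Markov strategy: (time, state) -> distribution; may vary with time t.
def markov(t, s):
    return {"a": 1.0, "b": 0.0} if t % 2 == 0 else stationary[s]

# Behavioral strategy: whole history -> distribution; the mixing is
# done independently at each history.
def behavioral(history):
    last = history[-1][1] if history else None
    return {"a": 0.9, "b": 0.1} if last == "a" else {"a": 0.4, "b": 0.6}

def sample(dist, rng):
    return rng.choices(list(dist), weights=list(dist.values()))[0]

rng = random.Random(0)
action = sample(markov(0, "s0"), rng)  # degenerate at even stages: "a"
```

Each class is a special case of the next: a stationary strategy ignores the time index, a Markov strategy ignores everything in the history except the current state, and a behavioral strategy may use the whole history.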

### The Total Payoff Types

The random variables *σ* _{ n } and *α* _{ n } describe the state and the actions chosen by the players, respectively, at the *n*th stage of the game. Let \(E_{s}^{\pi }\) denote the expectation operator with respect to the probability measure \(P_{s}^{\pi }\). For each profile of strategies *π* = (*π* _{1}, *…*, *π* _{ N }) and every initial state \(s \in\mathfrak{S}\), the following total payoffs are considered:

- 1.
  The *expected T-stage payoff* to player *k*, for any finite horizon *T*, defined as
  $$\Phi _{k}^{T}(\pi )(s) = E_{ s}^{\pi }\left (\sum \limits _{ n=1}^{T}r_{ k}(\sigma _{n},\alpha _{n})\right )$$

- 2.
  The *β-discounted expected payoff* to player *k*, where *β* ∈ (0, 1) is called the *discount factor*, defined as
  $$\Phi _{k}^{\beta }(\pi )(s) = E_{ s}^{\pi }\left (\sum \limits _{ n=1}^{\infty }\beta ^{n-1}r_{ k}(\sigma _{n},\alpha _{n})\right )$$

- 3.
  The *average payoff per unit time* for player *k*, defined as
  $$\Phi _{k}(\pi )(s) =\limsup \limits _{T\rightarrow \infty } \frac{1} {T}\Phi _{k}^{T}(\pi )(s)$$
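On a single realized stream of stage payoffs, the three criteria reduce to elementary sums. The sketch below is a hypothetical illustration: a finite stream stands in for the infinite sum and the limsup.

```python
# Hypothetical realized stage payoffs r_k(sigma_n, alpha_n) for one player.
rewards = [1.0, 0.0, 1.0, 0.0]

def t_stage_payoff(rewards, T):
    """Phi_k^T: sum of the first T stage payoffs."""
    return sum(rewards[:T])

def discounted_payoff(rewards, beta):
    """Phi_k^beta: sum over n of beta^(n-1) * r_n (n starting at 1)."""
    return sum(beta ** n * r for n, r in enumerate(rewards))

def average_payoff(rewards, T):
    """One term of the limsup sequence: Phi_k^T / T."""
    return t_stage_payoff(rewards, T) / T

phi_T = t_stage_payoff(rewards, 4)          # 1 + 0 + 1 + 0 = 2.0
phi_beta = discounted_payoff(rewards, 0.5)  # 1 + 0.25 = 1.25
phi_avg = average_payoff(rewards, 4)        # 2.0 / 4 = 0.5
```

The expectations \(E_{s}^{\pi }\) in the definitions average these quantities over all plays induced by the strategy profile and the law of motion.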

### Equilibria

Let \(\pi ^{{\ast}} = \left (\pi _{1}^{{\ast}},\ldots,\pi _{N}^{{\ast}}\right ) \in\Pi \) be a fixed profile of the players’ strategies. For any strategy \(\pi _{k} \in\Pi _{k}\) of player *k*, we write \(\left (\pi _{-k}^{{\ast}},\pi _{k}\right )\) to denote the strategy profile obtained from *π* ^{ ∗} by replacing \(\pi _{k}^{{\ast}}\) with \(\pi _{k}\).

### Definition 4 (A Nash equilibrium).

A profile of strategies \(\pi ^{{\ast}}\in \Pi \) is called a *Nash equilibrium* (in \(\Pi \)) for the average payoff stochastic game if no unilateral deviation from it is profitable, that is, for each \(s \in\mathfrak{S}\), each player *k*, and any strategy \(\pi _{k} \in\Pi _{k}\),

$$\Phi _{k}\left (\pi ^{{\ast}}\right )(s) \geq \Phi _{k}\left (\pi _{-k}^{{\ast}},\pi _{k}\right )(s)$$

### Definition 5 (An \(\boldsymbol{\epsilon }\)-Nash equilibrium).

A profile of strategies \(\pi ^{{\ast}}\in \Pi \) is an \(\epsilon \)-*(Nash) equilibrium* of the average payoff stochastic game if for every \(k \in\mathfrak{N}\), every \(s \in\mathfrak{S}\), and any strategy \(\pi _{k} \in\Pi _{k}\), we have

$$\Phi _{k}\left (\pi ^{{\ast}}\right )(s) \geq \Phi _{k}\left (\pi _{-k}^{{\ast}},\pi _{k}\right )(s) -\epsilon $$

Nash equilibria and \(\epsilon \)-Nash equilibria are analogously defined for the *T*-stage stochastic games, *β*-discounted stochastic games, and the average payoff per unit time stochastic games.

### Construction of an Equilibrium

For stochastic games with a finite state space and finite action spaces, the existence of a stationary equilibrium has been shown (cf. Herings and Peeters 2004). The stationary strategies at time *t* do not depend on the entire history of the game up to that time. This allows reduction of the problem of finding discounted stationary equilibria in a general *n*-person stochastic game to that of finding a global minimum in a nonlinear program with linear constraints. Solving this nonlinear program is equivalent to solving a certain nonlinear system for which it is known that the objective value in the global minimum is zero (cf. Filar et al. 1991). However, as is noted by Breton (1991), the convergence of an optimization algorithm to the global optimum is not guaranteed.

For a two-person *β*-discounted stochastic game, the following two conditions on a pair of stationary strategies \(\left (\pi _{1}^{{\ast}},\pi _{2}^{{\ast}}\right )\) are equivalent:

- 1.
  \(\left (\pi _{1}^{{\ast}},\pi _{2}^{{\ast}}\right )\) is an equilibrium point in the discounted stochastic game with equilibrium payoffs \(\left (\Phi _{1}^{\beta }\left (\overrightarrow{\pi }^{{\ast}}\right ),\Phi _{2}^{\beta }\left (\overrightarrow{\pi }^{{\ast}}\right )\right )\).

- 2.
  For each \(s \in\mathfrak{S}\), the pair \(\left (\pi _{1}^{{\ast}}(s),\pi _{2}^{{\ast}}(s)\right )\) constitutes an equilibrium point in the static bimatrix game \(\left (B_{1}(s),B_{2}(s)\right )\) with equilibrium payoffs \(\left (\Phi _{1}^{\beta }\left (s,\overrightarrow{\pi }^{{\ast}}\right ),\Phi _{2}^{\beta }\left (s,\overrightarrow{\pi }^{{\ast}}\right )\right )\), where for players *k* = 1, 2 and pure actions \((a_{1},a_{2}) \in A_{1}(s) \times A_{2}(s)\), the admissible action space at state *s*, the elements of \(B_{k}(s)\) related to \((a_{1},a_{2})\) are given by
  $$b_{k}(s,a_{1},a_{2}) := (1-\beta )r_{k}(s,a_{1},a_{2}) +\beta E_{s}^{(a_{1},a_{2})}\Phi _{k}^{\beta }\left (\overrightarrow{\pi }^{{\ast}}\right )\qquad (1)$$

An algorithm for recursive computation of stationary equilibria in stochastic games can be derived from (1). It starts with bimatrix games with *β* = 0, and then a careful equilibrium selection process guarantees its convergence under mild assumptions on the model (see, e.g., Herings and Peeters 2004).
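Equation (1) can be read as a recipe for assembling the auxiliary bimatrix games from given continuation values. The sketch below is illustrative: `phi` stands in for \(\Phi _{k}^{\beta }\), and all names and numbers are hypothetical.

```python
# Assembling the auxiliary bimatrix games B_k(s) of Eq. (1) from given
# continuation values; every object below is a hypothetical stand-in for
# the corresponding symbol in the text.
def auxiliary_bimatrix(s, actions, r, q, phi, beta):
    """b_k(s,a1,a2) = (1-beta)*r_k(s,a) + beta * sum_s' q(s'|s,a)*phi_k(s')."""
    B = [{}, {}]
    for a1 in actions[0]:
        for a2 in actions[1]:
            a = (a1, a2)
            for k in (0, 1):
                cont = sum(q(s, a)[s2] * phi[k][s2] for s2 in q(s, a))
                B[k][a] = (1 - beta) * r(k, s, a) + beta * cont
    return B

actions = (["a", "b"], ["a"])                            # A_1(s), A_2(s)
phi = [{"s0": 1.0, "s1": 0.0}, {"s0": 0.0, "s1": 1.0}]   # continuation values
q = lambda s, a: {"s0": 0.5, "s1": 0.5}                  # law of motion
r = lambda k, s, a: 1.0 if a[0] == "a" else 0.0          # stage payoffs
B = auxiliary_bimatrix("s0", actions, r, q, phi, beta=0.5)
# B[0][("a", "a")] = (1 - 0.5)*1.0 + 0.5*0.5 = 0.75
```

The recursive algorithm then repeatedly solves such static bimatrix games, feeding the resulting equilibrium payoffs back in as new continuation values.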

### A Brief History of the Research on Stochastic Games

The notion of a stochastic game was introduced by Shapley (1953) in the early 1950s. It is a dynamic game with *probabilistic transitions* played by one or more players. The game is played in a sequence of stages. At the beginning of each stage, the game is in a certain *state*. The players select actions, and each player receives a *payoff* that depends on the current state and the chosen actions. The game then moves to a new random state whose distribution depends on the previous state and the actions chosen by the players. The process is repeated at the new state, and the play continues for a finite or an infinite number of stages. The total payoff to a player is often taken to be the discounted sum of the stage payoffs or the limit inferior of the averages of the stage payoffs.

The theory of nonzero-sum stochastic games with the average payoffs per unit time for the players started with the papers by Rogers (1969) and Sobel (1971). They considered finite state spaces only and assumed that the transition probability matrices induced by any stationary strategies of the players are irreducible. Until now, only special classes of nonzero-sum average payoff stochastic games have been shown to possess Nash equilibria (or \(\epsilon \)-equilibria). A review of various cases and results for generalization to infinite state spaces can be found in the survey paper by Nowak and Szajowski (1998).

## Learning in Stochastic Games

- 1.
  The goals of single-agent reinforcement learning are to determine the optimal value and a control policy which maximizes the payoff. The model of such a system can be built based on the framework of Markov decision processes with discounted payoff. Suppose the policy is stationary and defined by a function \(h : \mathfrak{S} \rightarrow X\). Such a policy defines what action should be taken in each state: \(\alpha _{n}(\cdot ) := h(\cdot )\). There are various ways to learn the optimal policy. The most straightforward way is based on the *Q*-values \(Q^{h}(s,a) = E\left (\sum \limits _{j=0}^{\infty }\beta ^{j}r_{j+1}\right )\), the expected discounted payoff of taking action *a* in state *s* and following the policy *h* thereafter. The greedy action is \(a =\mathop{ \arg \max }\limits _{a'\in A(s)}Q^{h}(s,a')\) (see the article on *Q*-learning in Reinforcement learning).

- 2.
  Multi-agent reinforcement learning can be employed to solve a single task, or an agent may be required to perform a task in an environment with other agents, whether human, robot, or software ones. In either case, from an agent’s perspective, the world is not stationary. In particular, the behavior of the other agents may change as they also learn to better perform their tasks. This type of multi-agent nonstationary world creates a difficult problem for learning to act in these environments. Such a nonstationary scenario can be viewed as a game with multiple players. In game theory, in the study of such problems, there is generally an underlying assumption that the players have similar adaptation and learning abilities. Therefore, the actions of each agent affect the task achievement of the other agents. This makes it possible to construct the value of the game and an equilibrium strategy profile in successive steps.
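The single-agent *Q*-value approach of step 1 can be sketched in tabular form. The toy MDP, constants, and names below are hypothetical illustrations, not from the source.

```python
import random

# Toy deterministic MDP (hypothetical): "move" flips the state, "stay"
# keeps it; landing in s1 pays 1. Constants are illustrative choices.
states, acts = ["s0", "s1"], ["stay", "move"]
beta, alpha, eps = 0.9, 0.1, 0.1   # discount, learning rate, exploration

def step(s, a):
    s2 = s if a == "stay" else ("s1" if s == "s0" else "s0")
    return s2, (1.0 if s2 == "s1" else 0.0)

Q = {(s, a): 0.0 for s in states for a in acts}
rng, s = random.Random(0), "s0"
for _ in range(5000):
    # epsilon-greedy choice between exploring and the greedy action
    if rng.random() < eps:
        a = rng.choice(acts)
    else:
        a = max(acts, key=lambda x: Q[(s, x)])
    s2, rwd = step(s, a)
    # Q-learning update toward reward plus discounted greedy continuation
    Q[(s, a)] += alpha * (rwd + beta * max(Q[(s2, x)] for x in acts) - Q[(s, a)])
    s = s2

greedy = {st: max(acts, key=lambda x: Q[(st, x)]) for st in states}
```

With these hypothetical constants, the greedy policy extracted from *Q* moves to *s1* and stays there, matching the optimal policy of this toy MDP.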

Stochastic games can be seen as an extension of the single-agent Markov decision process framework to multiple agents whose actions all impact the resulting rewards and the next state. They can also be viewed as an extension of the framework of matrix games. Such a view emphasizes the difficulty of finding the optimal behavior in stochastic games, since the optimal behavior of any one agent depends on the behavior of the other agents. A comprehensive study of the multi-agent learning techniques for stochastic games does not yet exist. The interested reader may consult the monographs by Fudenberg and Levine (1998) and Shoham and Leyton-Brown (2009) and the special issue of the AI journal (Vohra and Wellman 2007).

Despite its interesting properties, *Q*-learning is a very slow method that requires a long period of training to learn an acceptable policy. In practice, to mitigate this problem, there are parallel computing implementations of *Q*-learning.

## Summary and Future Directions

Details concerning solution concepts for stochastic games can be found in Filar and Vrieze (1997). The refinements of the Nash equilibrium concept have been known in the economic dynamic games (see Myerson 1978). The Nash equilibrium concept may be extended gradually when the rules of the game are interpreted in a broader sense, so as to allow preplay or even intraplay communication. A well-known extension of the Nash equilibrium is Aumann’s correlated equilibrium (see Aumann 1987), which depends only on the normal form of the game. Two other solution concepts for multistage games have been proposed by Forges (1986): the extensive form correlated equilibrium, where the players can observe private exogenous signals at every stage, and the communication equilibrium, where the players are furthermore allowed to transmit inputs to an appropriate device at every stage. An application of the notion of correlated equilibria for stochastic games can be found in Nowak and Szajowski (1998).

In economics, in the context of economic growth problems, Ramsey (1928) introduced *overtaking optimality*; the criterion was introduced independently for repeated games by Rubinstein (1979). It has been investigated for some stochastic games by Carlson and Haurie (1995), Nowak (2008), and others. The existence of overtaking optimal strategies is a subtle issue, and there are counterexamples showing that one has to be careful with making statements on overtaking optimality.

Regarding stochastic games and learning, let us mention that the first ideas can be found in the papers by Brown (1951) and Robinson (1951). Some convergence results for fictitious play are given by Shoham and Leyton-Brown (2009, Theorem 7.2.5). An important example showing non-convergence was given by Shapley (1964). In multi-person stochastic games and learning, convergence to equilibria is a basic stability requirement (see, e.g., Greenwald and Hall 2003; Hu and Wellman 2003): the agents’ strategies should eventually converge to a coordinated equilibrium. Nash equilibrium is the solution concept most frequently used, but its usefulness has been questioned. For instance, Shoham and Leyton-Brown (2009) argue that the link between stage-wise convergence to Nash equilibria and performance in stochastic games is unclear.

## Bibliography

- Aumann RJ (1987) Correlated equilibrium as an expression of Bayesian rationality. Econometrica 55:1–18. doi:10.2307/1911154
- Bowling M, Veloso M (2001) Rational and convergent learning in stochastic games. In: Proceedings of the 17th international joint conference on artificial intelligence (IJCAI), Seattle, pp 1021–1026
- Breton M (1991) Algorithms for stochastic games. In: Raghavan TES, Ferguson TS, Parthasarathy T, Vrieze OJ (eds) Stochastic games and related topics: in honor of Professor L. S. Shapley, vol 7. Springer Netherlands, Dordrecht, pp 45–57. doi:10.1007/978-94-011-3760-7_5
- Brown GW (1951) Iterative solution of games by fictitious play. In: Koopmans TC (ed) Activity analysis of production and allocation. Wiley, New York, Chap. XXIV, pp 374–376
- Buşoniu L, Babuška R, Schutter BD (2010) Multi-agent reinforcement learning: an overview. In: Srinivasan D, Jain LC (eds) Innovations in multi-agent systems and applications – 1. Springer, Berlin, pp 183–221
- Carlson D, Haurie A (1995) A turnpike theory for infinite horizon open-loop differential games with decoupled controls. In: Olsder GJ (ed) New trends in dynamic games and applications. Annals of the international society of dynamic games, vol 3. Birkhäuser, Boston, pp 353–376
- Filar J, Vrieze K (1997) Competitive Markov decision processes. Springer, New York
- Filar JA, Schultz TA, Thuijsman F, Vrieze OJ (1991) Nonlinear programming and stationary equilibria in stochastic games. Math Program 50(2, Ser A):227–237. doi:10.1007/BF01594936
- Forges F (1986) An approach to communication equilibria. Econometrica 54:1375–1385. doi:10.2307/1914304
- Fudenberg D, Levine DK (1998) The theory of learning in games, vol 2. MIT Press, Cambridge
- Greenwald A, Hall K (2003) Correlated-Q learning. In: Proceedings of the 20th international conference on machine learning (ICML-03), Washington, DC, 21–24 Aug 2003, pp 242–249
- Herings PJ-J, Peeters RJAP (2004) Stationary equilibria in stochastic games: structure, selection, and computation. J Econ Theory 118(1):32–60. doi:10.1016/j.jet.2003.10.001
- Hu J, Wellman MP (1998) Multiagent reinforcement learning: theoretical framework and an algorithm. In: Proceedings of the 15th international conference on machine learning, New Brunswick, pp 242–250
- Hu J, Wellman MP (2003) Nash Q-learning for general-sum stochastic games. J Mach Learn Res 4:1039–1069
- Leslie DS, Collins EJ (2005) Individual *Q*-learning in normal form games. SIAM J Control Optim 44(2):495–514. doi:10.1137/S0363012903437976
- Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: Proceedings of the 13th international conference on machine learning, New Brunswick, pp 157–163
- Myerson RB (1978) Refinements of the Nash equilibrium concept. Int J Game Theory 7(2):73–80. doi:10.1007/BF01753236
- Nowak AS (2008) Equilibrium in a dynamic game of capital accumulation with the overtaking criterion. Econ Lett 99(2):233–237. doi:10.1016/j.econlet.2007.05.033
- Nowak AS, Szajowski K (1998) Nonzerosum stochastic games. In: Bardi M, Raghavan TES, Parthasarathy T (eds) Stochastic and differential games: theory and numerical methods. Annals of the international society of dynamic games, vol 4. Birkhäuser, Boston, pp 297–342. doi:10.1007/978-1-4612-1592-9_7
- Ramsey F (1928) A mathematical theory of saving. Econ J 38:543–559
- Robinson J (1951) An iterative method of solving a game. Ann Math 2(54):296–301. doi:10.2307/1969530
- Rogers PD (1969) Nonzero-sum stochastic games. PhD thesis, University of California, Berkeley. ProQuest LLC, Ann Arbor
- Rubinstein A (1979) Equilibrium in supergames with the overtaking criterion. J Econ Theory 21:1–9. doi:10.1016/0022-0531(79)90002-4
- Shapley L (1953) Stochastic games. Proc Natl Acad Sci USA 39:1095–1100. doi:10.1073/pnas.39.10.1095
- Shapley L (1964) Some topics in two-person games. Ann Math Stud 52:1–28
- Shoham Y, Leyton-Brown K (2009) Multiagent systems: algorithmic, game-theoretic, and logical foundations. Cambridge University Press, Cambridge. doi:10.1017/CBO9780511811654
- Sobel MJ (1971) Noncooperative stochastic games. Ann Math Stat 42:1930–1935. doi:10.1214/aoms/1177693059
- Tijms H (2012) Stochastic games and dynamic programming. Asia Pac Math Newsl 2(3):6–10
- Vohra R, Wellman M (eds) (2007) Foundations of multi-agent learning. Artif Intell 171:363–452
- Weiß G, Sen S (eds) (1996) Adaption and learning in multi-agent systems. Proceedings of the IJCAI’95 workshop, Montréal, 21 Aug 1995, vol 1042. Springer, Berlin. doi:10.1007/3-540-60923-7