An epistemic approach to stochastic games

In this paper we focus on stochastic games with finitely many states and actions. For this setting we study the epistemic concept of common belief in future rationality, which is based on the condition that players always believe that their opponents will choose rationally in the future. We distinguish two different versions of the concept: one for the discounted case with a fixed discount factor δ, and one for the case of uniform optimality, where optimality is required for all discount factors close enough to 1. We show that both versions of common belief in future rationality are always possible in every stochastic game, and always allow for stationary optimal strategies. That is, for both versions we can always find belief hierarchies that express common belief in future rationality, and that have stationary optimal strategies. We also provide an epistemic characterization of subgame perfect equilibrium for two-player stochastic games, showing that it is equivalent to mutual belief in future rationality together with some "correct beliefs assumption".


Introduction
The literature on stochastic games is massive, and has concentrated mostly on the question whether Nash equilibria, subgame perfect equilibria, or other types of equilibria exist in such games. To the best of our knowledge, this paper is the first to analyze stochastic games from an epistemic point of view.
A distinctive feature of an equilibrium approach to games is the assumption that every player believes that the opponents are correct about his beliefs (see Dekel 1987, 1989; Tan and Werlang 1988; Aumann and Brandenburger 1995; Asheim 2006; Perea 2007). The main idea of this paper is to analyze stochastic games without imposing the correct beliefs assumption, while at the same time preserving the spirit of subgame perfection. This leads to a concept called common belief in future rationality, an extension of the corresponding concept by Perea (2014), which has been defined for dynamic games of finite duration. Very similar concepts have been introduced in Baltag et al. (2009) and Penta (2015).
Common belief in future rationality states that, after every history, the players continue to believe that their opponents will choose rationally in the future, that they believe that their opponents believe that their opponents will choose rationally in the future, and so on, ad infinitum. The crucial feature that common belief in future rationality has in common with subgame perfect equilibria is that the players uphold the belief that the opponents will be rational in the future, even if this belief has been violated in the past. What distinguishes common belief in future rationality from subgame perfect equilibrium is that the former allows the players to have erroneous beliefs about their opponents, while the latter incorporates the condition of correct beliefs in the sense that we make precise.
We introduce our solution concept using the language of epistemic models with types, following Harsanyi (1967, 1968a). An epistemic model specifies, for each player, the set of possible types, and for each type and each history of the game, a probability distribution over the opponents' strategy-type combinations. An epistemic model succinctly describes the entire belief hierarchy after each history of the game. This model is essentially the same as the epistemic models used by Ben-Porath (1997), Siniscalchi (1999, 2002) and Perea (2012, 2014) to encode conditional belief hierarchies in finite dynamic games.
For a given discount factor δ, we say that a player believes in the opponents' future δ-rationality if he always believes that his opponents maximize their expected utility, given the discount factor δ, now and in the future. More precisely, a type in the epistemic model believes in the opponents' future δ-rationality if, at every history, it assigns probability 1 to the set of opponents' strategy-type combinations where the strategy maximizes the type's expected utility, given the discount factor δ, at the present and every future history.
A player is said to believe in the opponents' future uniform rationality if he always believes that his opponents maximize their expected utility, for all discount factors large enough, now and in the future. Formally, we say that the type believes in the opponents' future uniform rationality if it assigns probability 1 to the set of opponents' strategy-type combinations where the strategy maximizes the type's expected utility, for all discount factors larger than some threshold, at the present and every future history. Common belief in future δ-rationality requires that the type not only believes in the opponents' future δ-rationality, but also believes, throughout the game, that his opponents always believe in their opponents' future δ-rationality, and so on, ad infinitum. Similarly, we can define common belief in future uniform rationality.
In this paper we show that common belief in future rationality is always possible in a stochastic game with finitely many states, and always allows for stationary optimal strategies. More precisely, we prove in Theorem 5.1 that for every discount factor δ < 1, we can always construct an epistemic model in which all types express common belief in future δ-rationality, and have stationary optimal strategies. A similar result holds for the uniform optimality case; see Theorem 5.2.
The fact that stationary optimal strategies exist for common belief in future rationality is important both from a conceptual and an applied point of view. Conceptually, stationary strategies are very attractive since they are memory-less. Indeed, in a stationary strategy a player need not keep track of the choices made by his opponents or himself in the past, but need only look at the current state, and base his decision solely on the state he is at. Also from an applied perspective stationarity is an important virtue, as it makes the strategies much easier to describe and compute in concrete applications.
A second objective of this paper is to relate common belief in future rationality in stochastic games to the well-known concept of subgame perfect equilibrium (Selten 1965). In Theorems 6.1 and 6.2 we provide an epistemic characterization of subgame perfect equilibrium for two-player stochastic games. We show that a behavioral strategy profile $(\sigma_1, \sigma_2)$ is a subgame perfect equilibrium if and only if it is induced by a pair of types $(t_1, t_2)$ where type $t_1$ (a) always believes that the opponent's type is $t_2$, and (b) believes in the opponent's future rationality, and similarly for type $t_2$. We refer to condition (a) as the correct beliefs condition, and to condition (b) as mutual belief in future rationality. Indeed, condition (a) for types $t_1$ and $t_2$ implies that type $t_1$ always believes that player 2 always believes that 1's type is $t_1$ and no other, and hence that player 2 is correct about 1's beliefs. Similarly for player 2.
It is exactly this correct beliefs condition that separates subgame perfect equilibrium from common belief in future rationality, at least for the case of two players. The reason is that the correct beliefs condition, together with mutual belief in future rationality, implies common belief in future rationality. Hence, our characterization theorem shows, in particular, that subgame perfect equilibrium is a refinement of common belief in future rationality. Our characterization result is analogous to the epistemic characterizations of Nash equilibrium as presented in Dekel (1987, 1989), Tan and Werlang (1988), Aumann and Brandenburger (1995), Asheim (2006) and Perea (2007).
The equilibrium counterpart of common belief in future uniform rationality is the concept we term uniform subgame perfect equilibrium. A uniform subgame perfect equilibrium is a strategy profile that is a subgame perfect equilibrium under a discounted evaluation for all sufficiently high values of the discount factor. It is well-known that uniform subgame perfect equilibria may fail to exist in some stochastic games. Indeed, every uniform subgame perfect equilibrium is also a subgame perfect equilibrium under the limiting average reward. It is well-known that subgame perfect equilibria, and in fact even Nash equilibria, may fail to exist in stochastic games under the limiting average reward criterion. This is for instance the case in the famous Big Match game (Gillette 1957), a game we discuss in detail in this paper. Our existence results in Theorems 5.1 and 5.2, which guarantee that common belief in future rationality is always possible in a stochastic game, even for the uniform optimality case, do not rely on any form of equilibrium existence. Instead, we explicitly construct an epistemic model where each type exhibits common belief in future (δ- or uniform) rationality.
The paper is structured as follows. In Sect. 2 we provide a preliminary discussion of the concept of common belief in future rationality, and its relation to subgame perfect equilibrium, by means of the famous Big Match game (Gillette 1957). In Sect. 3 we give a formal definition of stochastic games. In Sect. 4 we introduce epistemic models and define the concept of common belief in future rationality. In Sect. 5 we prove that common belief in future δ- (and uniform) rationality is always possible in a stochastic game, and always allows for stationary optimal strategies. In Sect. 6 we present our epistemic characterizations of subgame perfect equilibrium. All proofs are collected in Sect. 7.

The Big Match
Before presenting our formal model and definitions, we will illustrate the concept of common belief in future rationality, and its relation to subgame perfect equilibrium, by means of the well-known Big Match game by Gillette (1957). This game was originally considered under the limiting average reward criterion, and has no Nash equilibrium, and hence no subgame perfect equilibrium, under this criterion.
In dynamic games of finite duration, subgame perfect equilibrium can be viewed as the equilibrium analogue to common belief in future rationality. Similarly, within stochastic games, uniform subgame perfect equilibrium is the equilibrium counterpart to common belief in future uniform rationality. Uniform subgame perfect equilibrium is defined as a strategy profile that is a subgame perfect equilibrium for all sufficiently high values of the discount factor. As uniform optimality implies optimality under the limiting average reward criterion, each uniform subgame perfect equilibrium is also a subgame perfect equilibrium under the limiting average reward criterion. Hence, the Big Match does not admit a uniform subgame perfect equilibrium either. Nevertheless, we will show that in this game we can construct belief hierarchies that express common belief in future rationality with respect to the uniform optimality criterion.
The Big Match, introduced by Gillette (1957), has become a real classic in the literature on stochastic games. It is a two-player zero-sum game with three states, two of which are absorbing. Here, by "absorbing" we mean that if the game reaches this state, it will never leave this state thereafter. In state 1 each player has only one action, and the instantaneous utilities are (1, −1). From state 1 the transition to state 1 occurs with probability 1, so state 1 is absorbing. In state 2 each player has only one action, and the instantaneous utilities are (0, 0). From state 2 the transition to state 2 occurs with probability 1, so also state 2 is absorbing. In state 0 player 1 can play C (continue) or S (stop), while player 2 can play L (left) or R (right), the instantaneous utilities being given by the table in Fig. 1. After actions (C, L) or (C, R), the transition to state 0 occurs, after (S, L) transition to state 1 occurs, while after (S, R) transition to state 2 occurs. So, the * in the table represents a situation where the game enters an absorbing state.
[Fig. 1: table of instantaneous utilities at state 0; its entries include (1, −1) for (C, R), (1, −1)* for (S, L) and (0, 0)* for (S, R).]
It is well-known that for the limiting average reward case, and hence also for the uniform optimality case, there is no subgame perfect equilibrium, nor a Nash equilibrium, in this game. An important reason for this is the fact that the best-response correspondence is not upper-hemicontinuous in the opponent's mixed strategy. For instance, R is the unique optimal choice for player 2, under the uniform optimality criterion, whenever he believes that player 1 chooses a mixed stationary strategy that assigns positive probability to both C and S. This even holds when player 1 chooses S with a very low probability. Indeed, under the uniform optimality criterion player 2 exclusively focuses on the long run, and therefore must make sure that he makes the "right choice" whenever the game enters an absorbing state. However, if he believes that player 1 will always choose C with probability 1, then only L is optimal for player 2. Blackwell and Ferguson (1968) have shown, however, how to construct an ε-(subgame perfect) equilibrium for the limiting average reward case for every ε > 0.
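The jump in player 2's best reply can be checked numerically. The following is a minimal sketch under stated assumptions: player 1 uses a stationary mixed strategy that plays S with probability q at state 0, player 2 compares the two pure stationary strategies L and R, payoffs are normalised by (1 − δ), and the (C, L) entry of the state-0 table is taken to be (0, 0) as in the standard Big Match. The function names are ours and purely illustrative.

```python
def player2_values(q, delta):
    """Normalised discounted payoffs of player 2 at state 0 from always
    playing L or always playing R, when player 1 plays S with probability q.

    Assumed utilities for player 2: (C, L) = 0 (ASSUMPTION), (C, R) = -1,
    (S, L) = -1 forever after, (S, R) = 0 forever after.
    """
    stay = 1.0 - (1.0 - q) * delta   # 1 - delta * Prob(game stays in state 0)
    v_left = -q / stay               # solves v = (1 - q) * delta * v - q
    v_right = -(1.0 - q) * (1.0 - delta) / stay
    return v_left, v_right


def best_reply(q, delta):
    """Return the better of the two pure stationary strategies of player 2."""
    v_left, v_right = player2_values(q, delta)
    if v_left == v_right:
        return "L or R"
    return "L" if v_left > v_right else "R"


if __name__ == "__main__":
    for q in (0.0, 0.01, 0.5):
        print(q, [best_reply(q, d) for d in (0.9, 0.99, 0.999)])
```

Under these assumptions R beats L exactly when (1 − q)(1 − δ) < q, so for any q > 0 the reply R becomes optimal once δ is close enough to 1, while at q = 0 only L is optimal for every δ.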
Consider now the belief hierarchy for player 1 in which (a) player 1 always believes that player 2 will always choose L at state 0 in the future, (b) player 1 always believes that player 2 always believes that player 1 will always choose C at state 0 in the future, (c) player 1 always believes that player 2 always believes that player 1 always believes that player 2 will always choose R at state 0 in the future, (d) player 1 always believes that player 2 always believes that player 1 always believes that player 2 always believes that player 1 will choose S at state 0 in the future, (e) player 1 always believes that player 2 always believes that player 1 always believes that player 2 always believes that player 1 always believes that player 2 will always choose L at state 0 in the future, and so on.
Then, it can be verified that player 1 always believes that player 2 will choose rationally in the future, that player 1 always believes that player 2 always believes that player 1 will always choose rationally in the future, and so on. Here, rationality is taken with respect to the uniform optimality criterion. That is, the belief hierarchy above expresses common belief in future rationality with respect to the uniform optimality criterion. In a similar way, we can construct a belief hierarchy for player 2 that expresses common belief in future rationality with respect to the uniform optimality criterion.
Note, however, that in player 1's belief hierarchy above, player 1 believes that player 2 is wrong about his actual beliefs: on the one hand, player 1 believes that player 2 will always choose L in the future, but at the same time player 1 believes that player 2 believes that player 1 believes that player 2 will always choose R in the future. This is something that can never happen in a subgame perfect equilibrium: there, players are always assumed to believe that the opponent is correct about the actual beliefs they hold. We will see in Sect. 6 of this paper that this correct beliefs assumption is exactly what separates the concept of common belief in future rationality from subgame perfect equilibrium.
The belief hierarchy for player 1 constructed above is special, as it allows for a stationary optimal strategy for player 1, in which he always chooses S at state 0, no matter what happened in the past. The reason for this is that the belief hierarchy constructed above is also essentially "stationary", since player 1 always believes at state 0 that player 2 will be implementing the same stationary strategy, no matter what happened in the past. Moreover, this "stationary" belief hierarchy expressing common belief in future rationality has been constructed on the basis of a cycle of stationary strategies, connected by "best-response properties". Such a cycle of stationary strategies can always be built as long as there are finitely many states in the game, since then the number of stationary strategies is finite. This fact is heavily exploited in the proofs of our existence theorems for common belief in future rationality, where we show that such best-response cycles of stationary strategies are always possible, and always lead to "stationary" belief hierarchies that express common belief in future rationality and that allow for stationary optimal strategies.
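The finiteness argument can be made concrete for the Big Match. The sketch below is illustrative only: it assumes the (C, L) entry of the state-0 table is (0, 0), as in the standard Big Match, and it compares pure stationary strategies only, which suffices against a stationary opponent because the resulting one-player problem has a stationary optimum. The names `PAYOFF`, `best_reply_1`, `best_reply_2` and `best_response_cycle` are hypothetical.

```python
# Normalised payoffs (u1, u2) at state 0 for pure stationary strategy pairs.
# After C the game stays in state 0 forever; after S it is absorbed in a state
# with the same utilities, so these values do not depend on the discount factor.
# ASSUMPTION: the (C, L) entry is (0, 0), as in the standard Big Match.
PAYOFF = {
    ("C", "L"): (0, 0), ("C", "R"): (1, -1),
    ("S", "L"): (1, -1), ("S", "R"): (0, 0),
}
ACTIONS_1 = ("C", "S")
ACTIONS_2 = ("L", "R")


def best_reply_1(a2):
    """Player 1's best pure stationary reply to player 2's strategy a2."""
    return max(ACTIONS_1, key=lambda a1: PAYOFF[(a1, a2)][0])


def best_reply_2(a1):
    """Player 2's best pure stationary reply to player 1's strategy a1."""
    return max(ACTIONS_2, key=lambda a2: PAYOFF[(a1, a2)][1])


def best_response_cycle(start):
    """Follow (a1, a2) -> (BR1(a2), BR2(a1)) until a pair repeats.

    Since there are finitely many pairs, a repetition must occur; the pairs
    from the first repeated one onwards form a best-response cycle.
    """
    path = [start]
    while True:
        a1, a2 = path[-1]
        nxt = (best_reply_1(a2), best_reply_2(a1))
        if nxt in path:
            return path[path.index(nxt):]
        path.append(nxt)


if __name__ == "__main__":
    print(best_response_cycle(("C", "L")))
```

Starting from (C, L), the map visits (S, L), (S, R) and (C, R) before returning to (C, L), which is the cycle of stationary strategies underlying the belief hierarchy above.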
Note also that for constructing the belief hierarchies above it does not matter whether the best-response correspondence is upper-hemicontinuous or not. Indeed, in the construction we only make use of "pure" belief hierarchies that always assign probability 1 to one opponent's pure stationary strategy. This suffices for creating belief hierarchies that express common belief in future rationality with respect to the uniform optimality criterion. Theorem 5.2 and its proof show that this is true not only for the Big Match, but for every stochastic game with finitely many states and actions. This, in part, explains why common belief in future rationality with respect to the uniform optimality criterion is always possible in every stochastic game with finitely many states and actions, although the best-response correspondence is not always upper-hemicontinuous in such games.

Stochastic games
A finite stochastic game consists of the following ingredients: (1) a finite set of players $I$, (2) a finite, non-empty set of states $X$, (3) for every state $x$ and player $i \in I$, a finite, non-empty set of actions $A_i(x)$, (4) for every state $x$ and every profile of actions $a$ in $\times_{i \in I} A_i(x)$, an instantaneous utility $u_i(x, a)$ for every player $i$, and (5) a transition probability $p(y \mid x, a) \in [0, 1]$ for every two states $x, y \in X$ and every action profile $a$ in $\times_{i \in I} A_i(x)$. Here, the transition probabilities should be such that $\sum_{y \in X} p(y \mid x, a) = 1$ for every $x \in X$ and every action profile $a$ in $\times_{i \in I} A_i(x)$.
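As an illustration, ingredients (2) to (5) can be stored in plain dictionaries and checked mechanically. The sketch below is one possible encoding, not a prescribed one: the function name, the dictionary layout and the action label "stay" for the absorbing states are our own choices, and the (C, L) utility of the Big Match is assumed to be (0, 0).

```python
from itertools import product


def check_stochastic_game(states, actions, utility, transition, tol=1e-9):
    """Validate ingredients (2)-(5) of a finite stochastic game.

    actions[x]       : one tuple of actions per player, at state x
    utility[x, a]    : tuple of instantaneous utilities, one per player
    transition[x, a] : dict mapping next states y to p(y | x, a)
    """
    if not states:
        raise ValueError("the set of states must be non-empty")
    for x in states:
        if any(len(acts) == 0 for acts in actions[x]):
            raise ValueError(f"empty action set at state {x}")
        for a in product(*actions[x]):
            if len(utility[x, a]) != len(actions[x]):
                raise ValueError(f"need one utility per player at {(x, a)}")
            probs = transition[x, a]
            if any(y not in states or not 0 <= p <= 1 for y, p in probs.items()):
                raise ValueError(f"invalid transition at {(x, a)}")
            if abs(sum(probs.values()) - 1.0) > tol:
                raise ValueError(f"probabilities at {(x, a)} do not sum to 1")
    return True


# The Big Match of Sect. 2 in this encoding.
STATES = (0, 1, 2)
ACTIONS = {
    0: (("C", "S"), ("L", "R")),
    1: (("stay",), ("stay",)),
    2: (("stay",), ("stay",)),
}
UTILITY = {
    (0, ("C", "L")): (0, 0),    # ASSUMPTION: standard Big Match entry
    (0, ("C", "R")): (1, -1),
    (0, ("S", "L")): (1, -1),
    (0, ("S", "R")): (0, 0),
    (1, ("stay", "stay")): (1, -1),
    (2, ("stay", "stay")): (0, 0),
}
TRANSITION = {
    (0, ("C", "L")): {0: 1.0}, (0, ("C", "R")): {0: 1.0},
    (0, ("S", "L")): {1: 1.0}, (0, ("S", "R")): {2: 1.0},
    (1, ("stay", "stay")): {1: 1.0}, (2, ("stay", "stay")): {2: 1.0},
}

if __name__ == "__main__":
    print(check_stochastic_game(STATES, ACTIONS, UTILITY, TRANSITION))
```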
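At every state $x$, we write $A(x) := \times_{i \in I} A_i(x)$ for the set of action profiles at $x$. A history of length $k$ is a sequence $h = ((x^1, a^1), \ldots, (x^{k-1}, a^{k-1}), x^k)$ such that (1) $x^m \in X$ for every $m \in \{1, \ldots, k\}$, (2) $a^m \in A(x^m)$ for every $m \in \{1, \ldots, k-1\}$, and (3) for every period $m \in \{2, \ldots, k\}$ the state $x^m$ can be reached with positive probability given that at period $m-1$ state $x^{m-1}$ and action profile $a^{m-1} \in A(x^{m-1})$ have been realized. By $x(h) := x^k$ we denote the last state that occurs in history $h$. Let $H^k$ denote the set of all possible histories of length $k$. Let $H := \cup_{k \in \mathbb{N}} H^k$ be the set of all (finite) histories.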
A strategy for player $i$ is a function $s_i$ that assigns to every history $h \in H$ some action $s_i(h) \in A_i(x(h))$. By $S_i$ we denote the set of all strategies for player $i$. Note that the set $S_i$ of strategies is typically uncountably infinite. We say that the strategy $s_i$ is stationary if $s_i(h) = s_i(h')$ for all histories $h, h' \in H$ with $x(h) = x(h')$. So, the prescribed action only depends on the state, and not on the specific history. A stationary strategy can thus be summarized as $s_i = (s_i(x))_{x \in X}$.
During the game, players always observe what their opponents have done in the past, but face uncertainty about what the opponents will do now and in the future, and also about what these opponents would have done at histories that are no longer possible. That is, after every history $h$ all players know that their opponents have chosen a combination of strategies that could have resulted in this particular history $h$. To model this precisely, consider a history $h = ((x^1, a^1), \ldots, (x^{k-1}, a^{k-1}), x^k)$ of length $k$, and let $S_i(h)$ be the set of strategies $s_i \in S_i$ with $s_i((x^1, a^1), \ldots, (x^{m-1}, a^{m-1}), x^m) = a^m_i$ for every $m \in \{1, \ldots, k-1\}$. Here, $a^m_i$ is the action of player $i$ in the action profile $a^m \in A(x^m)$. Hence, $S_i(h)$ contains precisely those strategies for player $i$ that are compatible with the history $h$.
So, after every history $h$, every player $i$ knows that each of his opponents $j$ is implementing a strategy from $S_j(h)$, without knowing precisely which one. This uncertainty can be modelled by conditional belief vectors. Formally, a conditional belief vector $b_i$ for player $i$ specifies for every history $h \in H$ some probability distribution $b_i(h) \in \Delta(S_{-i}(h))$, where $S_{-i}(h) := \times_{j \neq i} S_j(h)$. To define the space $\Delta(S_{-i}(h))$ formally we must first specify a σ-algebra $\Sigma_{-i}(h)$ on $S_{-i}(h)$, since $S_{-i}(h)$ is typically an uncountably infinite set. Let $h \in H^k$ be a history of length $k$. For a given player $j$, strategy $s_j \in S_j(h)$, and $m \geq k$, let $[s_j]_m$ be the set of strategies that coincide with $s_j$ at all histories of length at most $m$. As $m \geq k$, every strategy in $[s_j]_m$ must in particular coincide with $s_j$ at all histories that precede $h$, and hence every strategy in $[s_j]_m$ is in $S_j(h)$. Let $\Sigma_j(h)$ be the σ-algebra on $S_j(h)$ generated by the sets $[s_j]_m$ with $s_j \in S_j(h)$ and $m \geq k$. By $\Sigma_{-i}(h)$ we denote the product σ-algebra on $S_{-i}(h)$ generated by the σ-algebras $\Sigma_j(h)$ with $j \neq i$, and this is precisely the σ-algebra we will use. So, when we say $\Delta(S_{-i}(h))$ we mean the set of probability distributions on $S_{-i}(h)$ with respect to this specific σ-algebra $\Sigma_{-i}(h)$.
Suppose that the game has reached history $h \in H^k$. Consider for every player $i$ some strategy $s_i \in S_i(h)$ which is compatible with the history $h$. Let $s = (s_i)_{i \in I}$. Then, for every $m \geq k$, and every history $h' \in H^m$, we denote by $p(h' \mid h, s)$ the probability that history $h' \in H^m$ will be realized, conditional on the event that the game has reached history $h \in H^k$ and the players choose according to $s$. The corresponding expected utility for player $i$ at period $m \geq k$ would be given by $U_i^m(h, s) := \sum_{h' \in H^m} p(h' \mid h, s)\, u_i(x(h'), s(h'))$, where $s(h') \in A(x(h'))$ is the combination of actions chosen by the players at state $x(h')$ after history $h'$, if they choose according to the strategy profile $s$. The expected discounted utility for player $i$ would be $U_i^\delta(h, s) := \sum_{m \geq k} \delta^{m-k}\, U_i^m(h, s)$. For a conditional belief vector $b_i$, we write $U_i^\delta(h, s_i, b_i(h))$ for the expectation of $U_i^\delta(h, (s_i, s_{-i}))$ when $s_{-i}$ is distributed according to $b_i(h)$. The strategy $s_i$ is δ-optimal under the conditional belief vector $b_i$ if $U_i^\delta(h, s_i, b_i(h)) \geq U_i^\delta(h, s_i', b_i(h))$ for every history $h \in H$ and every strategy $s_i' \in S_i(h)$. The strategy $s_i$ is said to be uniformly optimal under $b_i$ if there is some $\bar{\delta} \in (0, 1)$ such that $s_i$ is δ-optimal under $b_i$ for every $\delta \in [\bar{\delta}, 1)$. Note that every strategy $s_i$ which is uniformly optimal under the conditional belief vector $b_i$ will also be optimal under $b_i$ with respect to the limiting average reward criterion, an optimality criterion which is widely used in the literature on stochastic games. This result follows from Theorem 2.8.3 in Filar and Vrieze (1997).
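For a profile of stationary strategies, the expected discounted utility can be approximated by propagating the distribution over states forward in time. The sketch below assumes an unnormalised discounted sum of expected per-period utilities, starting at the current period, and truncates it after a fixed number of periods; the function name, the dictionary layout and the label "stay" are hypothetical. The example uses only player 1's Big Match utilities for (S, L), (S, R) and (C, R).

```python
def discounted_utility(start, policy, utility, transition, delta, horizon=2000):
    """Approximate sum over m >= 0 of delta**m * E[u(x_m, a_m)].

    policy[x]        : action profile played at state x (stationary profile)
    utility[x, a]    : instantaneous utility of the player under study
    transition[x, a] : dict mapping next states y to p(y | x, a)
    """
    dist = {start: 1.0}            # distribution over the current state
    total, weight = 0.0, 1.0       # weight equals delta**m in period m
    for _ in range(horizon):
        total += weight * sum(p * utility[x, policy[x]] for x, p in dist.items())
        nxt = {}
        for x, p in dist.items():
            for y, q in transition[x, policy[x]].items():
                nxt[y] = nxt.get(y, 0.0) + p * q
        dist, weight = nxt, weight * delta
    return total


# Player 1's utilities and the transitions of the Big Match (entries used below).
U1 = {
    (0, ("S", "L")): 1, (0, ("S", "R")): 0, (0, ("C", "R")): 1,
    (1, ("stay", "stay")): 1, (2, ("stay", "stay")): 0,
}
P = {
    (0, ("S", "L")): {1: 1.0}, (0, ("S", "R")): {2: 1.0}, (0, ("C", "R")): {0: 1.0},
    (1, ("stay", "stay")): {1: 1.0}, (2, ("stay", "stay")): {2: 1.0},
}


def stationary_profile(a1, a2):
    """Stationary profile that plays (a1, a2) at state 0."""
    return {0: (a1, a2), 1: ("stay", "stay"), 2: ("stay", "stay")}


if __name__ == "__main__":
    for pair in (("S", "L"), ("S", "R"), ("C", "R")):
        print(pair, discounted_utility(0, stationary_profile(*pair), U1, P, 0.9))
```

With δ = 0.9, the profiles (S, L) and (C, R) both give player 1 a utility of 1 in every period and hence a discounted sum of approximately 1/(1 − δ) = 10, whereas (S, R) gives 0.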
A finite Markov decision problem can be identified with a finite stochastic game with only one player, say player i. In that case, the conditional belief vectors for player i become redundant, but δ-optimal strategies and uniformly optimal strategies for player i can be defined in the same way as above.
The following classical results state that for every finite Markov decision problem, we can always find a stationary strategy that is optimal, both for the δ-discounted and the uniform optimality case.
(a) For every δ ∈ (0, 1), there is a δ-optimal strategy which is stationary.
(b) There is a uniformly optimal strategy which is stationary.
Part (a) follows from Shapley (1953) and has later been shown in Howard (1960), but Blackwell (1962) provides a simpler proof. The proof for part (b) can be found in Blackwell (1962).
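For part (a), a stationary δ-optimal strategy can be computed in small examples by value iteration. This is a standard numerical technique and is offered here only as a sketch; it is not claimed to be the argument used in the cited proofs. The function name and data layout are hypothetical, and the example is player 1's decision problem in the Big Match when player 2 always plays R (state 1 is unreachable in that case and is left out).

```python
def value_iteration(states, actions, reward, transition, delta, sweeps=2000):
    """Return (values, stationary_strategy) for a finite discounted problem.

    actions[x]       : actions available at state x
    reward[x, a]     : instantaneous utility
    transition[x, a] : dict mapping next states y to p(y | x, a)
    """
    v = {x: 0.0 for x in states}

    def q(x, a):
        # Utility now plus discounted continuation value under the current v.
        return reward[x, a] + delta * sum(p * v[y] for y, p in transition[x, a].items())

    for _ in range(sweeps):
        v = {x: max(q(x, a) for a in actions[x]) for x in states}
    # A strategy that is greedy with respect to the (approximately) optimal values.
    strategy = {x: max(actions[x], key=lambda a: q(x, a)) for x in states}
    return v, strategy


# Player 1's problem in the Big Match when player 2 always plays R.
MDP_STATES = (0, 2)
MDP_ACTIONS = {0: ("C", "S"), 2: ("stay",)}
MDP_REWARD = {(0, "C"): 1, (0, "S"): 0, (2, "stay"): 0}
MDP_TRANSITION = {(0, "C"): {0: 1.0}, (0, "S"): {2: 1.0}, (2, "stay"): {2: 1.0}}

if __name__ == "__main__":
    print(value_iteration(MDP_STATES, MDP_ACTIONS, MDP_REWARD, MDP_TRANSITION, 0.9))
```

In this example the computed strategy plays C at state 0 for every δ ∈ (0, 1), so it is also uniformly optimal in the sense of part (b).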

Common belief in future rationality
In this section we define the central notion in this paper: common belief in future rationality. In words, the concept states that a player always believes, after every history, that his opponents will choose rationally in the future, that his opponents always believe that their opponents will choose rationally in the future, and so on. Before we define this concept formally, we first introduce epistemic models with types à la Harsanyi (1967, 1968a) as a possible way to encode belief hierarchies.

Epistemic model
We do not only wish to model the beliefs of players about the opponents' strategy choices, but also the beliefs about the opponents' beliefs about the other players' strategy choices, and so on. One way to do so is by means of an epistemic model with types à la Harsanyi (1967, 1968a). Formally, a finite epistemic model for a finite stochastic game is a tuple $M = (T_i, \beta_i)_{i \in I}$, where $T_i$ is a finite set of types for player $i$, and $\beta_i$ assigns to every type $t_i \in T_i$ and every history $h \in H$ a probability distribution $\beta_i(t_i, h) \in \Delta(S_{-i}(h) \times T_{-i})$. Moreover, these conditional beliefs $(\beta_i(t_i, h))_{h \in H}$ are assumed to satisfy Bayesian updating, that is, for every history $h$, and every history $h'$ following $h$ with $\beta_i(t_i, h)(S_{-i}(h') \times T_{-i}) > 0$, we have $\beta_i(t_i, h')(E_{-i} \times \{t_{-i}\}) = \beta_i(t_i, h)(E_{-i} \times \{t_{-i}\}) / \beta_i(t_i, h)(S_{-i}(h') \times T_{-i})$ for every set $E_{-i} \in \Sigma_{-i}(h')$ and every $t_{-i} \in T_{-i}$. Here, the σ-algebra on $S_{-i}(h) \times T_{-i}$ that we use is the product σ-algebra generated by the σ-algebra $\Sigma_{-i}(h)$ on $S_{-i}(h)$, and the discrete σ-algebra on the finite set $T_{-i}$, containing all subsets. Moreover, $\Sigma_{-i}(h')$ is the σ-algebra on $S_{-i}(h')$. The probability distribution $\beta_i(t_i, h)$ encodes the belief that type $t_i$ holds, after history $h$, about the opponents' strategies and the opponents' conditional beliefs. In particular, by taking the marginal of $\beta_i(t_i, h)$ on $S_{-i}(h)$, we obtain the first-order belief $b_i(t_i, h) \in \Delta(S_{-i}(h))$ of type $t_i$ about the opponents' strategies. As $\beta_i(t_i, h)$ also specifies a belief about the opponents' types, and every opponent's type holds conditional beliefs about his opponents' strategies, we can also derive, for every type $t_i$ and history $h$, the second-order belief that type $t_i$ holds, after history $h$, about the opponents' conditional first-order beliefs.
By continuing in this fashion, we can derive for every type $t_i$ in the epistemic model his first-order beliefs, second-order beliefs, third-order beliefs, and so on. That is, we can derive for every type $t_i$ a complete belief hierarchy. The epistemic model just represents a very easy and compact way to encode such belief hierarchies. The epistemic model above is very similar to models used in Ben-Porath (1997), Siniscalchi (1999, 2002) and Perea (2012, 2014) for finite dynamic games. Note that we automatically assume Bayesian updating whenever we talk about types in an epistemic model.
The reader may wonder why we restrict attention to finitely many types in the epistemic model. The reason is purely pragmatic: it is easier to work with finitely many types, since we do not need additional topological or measure-theoretic machinery. At the same time, our analysis and results in this paper would not change if we allowed for infinitely many types. For instance, in order to prove the existence of common belief in future rationality in both the discounted and the uniform case, it is sufficient to build one epistemic model in which all types express common belief in future rationality, and we show that we can always build an epistemic model with finitely many types that has this property.

Belief in future rationality
Consider a type $t_i$, and let $b_i(t_i)$ be the induced first-order belief vector. That is, $b_i(t_i)$ specifies for every history $h$ the first-order belief $b_i(t_i, h) \in \Delta(S_{-i}(h))$ that $t_i$ holds about the opponents' strategies. Note that $b_i(t_i)$ is a conditional belief vector as defined in the previous section. We say that strategy $s_i$ is δ-optimal for type $t_i$ at history $h$ if $s_i$ is δ-optimal at $h$ for the conditional belief $b_i(t_i, h)$. More precisely, writing $U_i^\delta(h, s_i, b_i(t_i, h))$ for the expected δ-discounted utility of $s_i$ at $h$ under the belief $b_i(t_i, h)$, strategy $s_i$ is δ-optimal for type $t_i$ at history $h$ if $U_i^\delta(h, s_i, b_i(t_i, h)) \geq U_i^\delta(h, s_i', b_i(t_i, h))$ for every $s_i' \in S_i(h)$. We say that $s_i$ is δ-optimal for type $t_i$ if $s_i$ is δ-optimal for type $t_i$ at every history $h$ with $s_i \in S_i(h)$. We say that type $t_i$ believes in his opponents' future δ-rationality if at every stage of the game, type $t_i$ assigns probability 1 to the set of those opponents' strategy-type pairs where the opponent's strategy is δ-optimal for the opponent's type at all future stages. To formally define this, let $(S_i \times T_i)^{h,\delta\text{-opt}}$ be the set of pairs $(s_i, t_i) \in S_i(h) \times T_i$ such that $s_i$ is δ-optimal for $t_i$ at every $h'$ that weakly follows $h$. Here, we say that $h'$ weakly follows $h$ if $h'$ follows $h$, or $h' = h$. Moreover, let $(S_{-i} \times T_{-i})^{h,\delta\text{-opt}} := \times_{j \neq i} (S_j \times T_j)^{h,\delta\text{-opt}}$ be the set of opponents' strategy-type combinations where the strategies are δ-optimal for the types at all stages weakly following $h$.
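Similar definitions can be given for the case of uniform optimality. We define $(S_i \times T_i)^{h,\text{u-opt}}$ as the set of pairs $(s_i, t_i) \in S_i(h) \times T_i$ for which there is some $\bar{\delta} \in (0, 1)$ such that, for every $\delta \in [\bar{\delta}, 1)$, $s_i$ is δ-optimal for $t_i$ at every $h'$ that weakly follows $h$, and we let $(S_{-i} \times T_{-i})^{h,\text{u-opt}} := \times_{j \neq i} (S_j \times T_j)^{h,\text{u-opt}}$.

Definition 4.2 (Belief in future rationality) Consider a finite epistemic model $M = (T_i, \beta_i)_{i \in I}$, and a type $t_i \in T_i$.

(a) Type $t_i$ believes in the opponents' future δ-rationality if for every history $h$ we have $\beta_i(t_i, h)((S_{-i} \times T_{-i})^{h,\delta\text{-opt}}) = 1$.
(b) Type $t_i$ believes in the opponents' future uniform rationality if for every history $h$ we have $\beta_i(t_i, h)((S_{-i} \times T_{-i})^{h,\text{u-opt}}) = 1$.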
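With this definition at hand, we can now define "common belief in future δ-rationality", which means that players do not only believe in their opponents' future δ-rationality, but also always believe that the other players believe in their opponents' future δ-rationality, and so on. We do so by recursively defining, for every player $i$, smaller and smaller sets of types $T_i^1, T_i^2, T_i^3, \ldots$ Let $T_i^1$ be the set of types $t_i \in T_i$ that believe in the opponents' future δ-rationality and, for every $m \geq 2$, let $T_i^m := \{t_i \in T_i^{m-1} \mid \beta_i(t_i, h)(S_{-i}(h) \times T_{-i}^{m-1}) = 1 \text{ for all } h \in H\}$, where $T_{-i}^{m-1} := \times_{j \neq i} T_j^{m-1}$. A type $t_i$ expresses common belief in future δ-rationality if $t_i \in T_i^m$ for all $m$.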
That is, $T_i^2$ contains those types that believe in the opponents' future δ-rationality, and which only deem possible opponents' types that believe in their opponents' future δ-rationality. Similarly for $T_i^3$, $T_i^4$, and so on. This definition is based on the notion of "common belief in future rationality" as presented in Perea (2014), which has been designed for dynamic games of finite duration. Baltag et al. (2009) and Penta (2015) present concepts that are very similar to "common belief in future rationality". In the same way, we can define "common belief in future uniform rationality" for stochastic games.
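In a finite epistemic model the nested sets of types stabilise after finitely many steps, so the recursion can be run as a simple elimination procedure. The sketch below assumes that two pieces of information have already been extracted from the model, both of which are our own hypothetical inputs: a flag telling whether a type believes in the opponents' future rationality, and the set of opponents' types to which a type assigns positive probability at some history. Types of all players are pooled in one dictionary and are assumed to carry distinct names.

```python
def common_belief_types(believes, support):
    """Types that survive every round of the recursive elimination.

    believes[t] : True if type t believes in the opponents' future rationality
    support[t]  : set of opponents' types to which t assigns positive
                  probability at some history
    """
    current = {t for t, ok in believes.items() if ok}   # first round
    while True:
        # Keep the types that only deem possible opponents' types that survived.
        smaller = {t for t in current if support[t] <= current}
        if smaller == current:
            return current
        current = smaller


if __name__ == "__main__":
    # Hypothetical two-player model: t2b fails belief in future rationality,
    # and t1b deems t2b possible, so both are eliminated.
    believes = {"t1a": True, "t1b": True, "t2a": True, "t2b": False}
    support = {"t1a": {"t2a"}, "t1b": {"t2b"}, "t2a": {"t1a"}, "t2b": {"t1b"}}
    print(sorted(common_belief_types(believes, support)))
```

Because probability 1 on a finite set of types is the same as having support inside that set, the subset test captures the probability-1 condition in the recursion for finite models.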

Existence result
In this section we will show that "common belief in future δ-rationality" and "common belief in future uniform rationality" are possible in every finite stochastic game, and that they always allow for stationary optimal strategies. The proof will be constructive, as we will explicitly construct an epistemic model in which all types express common belief in future δ- (or uniform) rationality, allowing for stationary optimal strategies.

Common belief in future rationality is always possible
We first show the following important result, for which we need some new notation. For a given strategy $s_i$ and history $h$, let $S_i[s_i, h]$ be the set of strategies in $S_i(h)$ that coincide with $s_i$ on histories that weakly follow $h$. Similarly, for a given combination of strategies $s_{-i} \in S_{-i}$ and history $h$, we denote by $S_{-i}[s_{-i}, h] := \times_{j \neq i} S_j[s_j, h]$ the set of opponents' strategy combinations in $S_{-i}(h)$ that coincide with $s_{-i}$ on histories that weakly follow $h$.

Lemma 5.1 (Stationary strategies are optimal under stationary beliefs) Consider a finite stochastic game. Let $s_{-i}$ be a profile of stationary strategies for $i$'s opponents. Let $b_i$ be a conditional belief vector that assigns, at every history $h$, probability 1 to $S_{-i}[s_{-i}, h]$. Then,

(a) for every δ ∈ (0, 1) there is a stationary strategy for player $i$ that is δ-optimal under $b_i$, and
(b) there is a stationary strategy for player $i$ that is uniformly optimal under $b_i$.
That is, if we always assign full probability to the same stationary continuation strategy for each of our opponents, then there will be a stationary strategy for us that is optimal after every history.
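One way to see why such a result is plausible is that, once the opponents' stationary continuation strategies are fixed, player i effectively faces a one-player decision problem, to which the classical results on Markov decision problems apply. The sketch below illustrates this reduction for a two-player game; it is our own illustration rather than the proof of the lemma, and the function name, the data layout and the action labels for the absorbing states are hypothetical. Only the R column of the Big Match is needed for the example.

```python
def induced_mdp(actions_i, utility_i, transition, opponent_strategy):
    """One-player problem faced by player i in a two-player game when the
    opponent follows a fixed stationary strategy.

    actions_i[x]            : player i's actions at state x
    utility_i[x, a_i, a_j]  : player i's instantaneous utility
    transition[x, a_i, a_j] : dict mapping next states y to p(y | x, a_i, a_j)
    opponent_strategy[x]    : the opponent's action at state x
    """
    reward, move = {}, {}
    for x, acts in actions_i.items():
        a_j = opponent_strategy[x]          # the opponent's action is pinned down
        for a_i in acts:
            reward[x, a_i] = utility_i[x, a_i, a_j]
            move[x, a_i] = transition[x, a_i, a_j]
    return reward, move


# Player 1 in the Big Match against the stationary strategy R of player 2.
ACTIONS_P1 = {0: ("C", "S"), 1: ("stay",), 2: ("stay",)}
UTILITY_P1 = {
    (0, "C", "R"): 1, (0, "S", "R"): 0,
    (1, "stay", "stay"): 1, (2, "stay", "stay"): 0,
}
TRANSITION_R = {
    (0, "C", "R"): {0: 1.0}, (0, "S", "R"): {2: 1.0},
    (1, "stay", "stay"): {1: 1.0}, (2, "stay", "stay"): {2: 1.0},
}
ALWAYS_R = {0: "R", 1: "stay", 2: "stay"}

if __name__ == "__main__":
    print(induced_mdp(ACTIONS_P1, UTILITY_P1, TRANSITION_R, ALWAYS_R))
```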
We are now in a position to prove that common belief in future δ-rationality is always possible in every finite stochastic game, and that it always allows for stationary δ-optimal strategies for every player.
Theorem 5.1 (Common belief in future δ-rationality is always possible) Consider a finite stochastic game, and some δ ∈ (0, 1). Then, there is a finite epistemic model $M = (T_i, \beta_i)_{i \in I}$ for this game such that (a) every type in $M$ expresses common belief in future δ-rationality, and (b) every type in $M$ has a stationary δ-optimal strategy.
The proof for this theorem is constructive. We show how, on the basis of Lemma 5.1, part (a), we can construct special belief hierarchies that express common belief in future δ-rationality, and assign at every history probability 1 to the same stationary continuation strategies of the opponents. By Lemma 5.1, part (a), such belief hierarchies allow for stationary δ-optimal strategies. For this construction we heavily rely on the fact that the number of (pure) stationary strategies is finite for every player.
Similarly, we can prove that common belief in future uniform rationality is always possible as well, and allows for stationary uniformly optimal strategies.
Theorem 5.2 (Common belief in future uniform rationality is always possible) Consider a finite stochastic game. Then, there is a finite epistemic model $M = (T_i, \beta_i)_{i \in I}$ for this game such that (a) every type in $M$ expresses common belief in future uniform rationality, and (b) every type in $M$ has a stationary uniformly optimal strategy.
The proof for this theorem is almost identical to the proof of Theorem 5.1. The only difference is that we must use part (b), instead of part (a), in Lemma 5.1. For that reason, this proof is omitted.
In particular, it follows from the two theorems above that stationary optimal strategies are always possible under common belief in future rationality, both in the discounted and the uniform case. As explained before, this is relevant from a conceptual and applied point of view, since stationary strategies are cognitively attractive, easy to describe and rather simple to compute in concrete applications.
Suppose that, instead of restricting attention to finitely many types, we start from a terminal epistemic model (Friedenberg 2010) in which all possible belief hierarchies are present. Then, Theorems 5.1 and 5.2 imply that within this terminal epistemic model we can always find belief-closed submodels with finitely many types in which every type expresses common belief in future rationality. Hence, the message of these two theorems would not change if we considered such terminal epistemic models with infinitely many types.

Big Match revisited
We will now illustrate the existence result by means of the Big Match game we discussed in Sect. 2. For this game, it has been shown that subgame perfect equilibria fail to exist if we use the uniform optimality criterion. Nevertheless, our Theorem 5.2 guarantees that common belief in future uniform rationality is possible for this game. In fact, we will explicitly construct epistemic models where all types express common belief in future uniform rationality.
Recall the Big Match from Fig. 1. With a slight abuse of notation we write C to denote player 1's stationary strategy in which he always plays action C in state 0, and similarly for S, L, and R. Now consider the chain of stationary strategy pairs:
(C, L) → (S, L) → (S, R) → (C, R) → (C, L).
In this chain, each stationary strategy is δ-optimal, for every δ ∈ (0, 1), under the belief that the opponent will play the preceding strategy in the chain at the present and future histories in the game. For instance, "(S, R) → (C, R)" indicates that for player 1 it is optimal to play C if he believes that player 2 will play R now and in the future, and for player 2 it is optimal to play R if he believes that player 1 will play S now. Similarly for the other arrows in the chain. In particular, each of these strategies is uniformly optimal as well for these beliefs.
This chain leads to the following epistemic model with types T_1 = {t_1^C, t_1^S} and T_2 = {t_2^L, t_2^R}, and beliefs
b_1(t_1^C, h) = (R, t_2^R), b_1(t_1^S, h) = (L, t_2^L), b_2(t_2^L, h) = (C, t_1^C), b_2(t_2^R, h) = (S, t_1^S)
for every history h. Here, b_1(t_1^S, h) = (L, t_2^L) means that type t_1^S, after every possible history h, assigns probability 1 to player 2 choosing the stationary strategy L in the remainder of the game, and to player 2 having type t_2^L. Similarly for the other types.
Note that type t_2^R always believes that player 1 will choose S in the current stage, even though it is evident that player 1 has always chosen C in the past. This degree of stubbornness is typical for backward induction concepts such as common belief in future rationality or subgame perfect equilibrium. Think, for instance, of Rosenthal's (1981) centipede game, where in a subgame perfect equilibrium a player always believes that his opponent will opt out in the next round, whereas it is evident that the opponent has not opted out at any point in the past.
It may be verified that every type in the epistemic model above believes in the opponent's future δ- (and uniform) rationality. As a consequence, every type expresses common belief in future δ- (and uniform) rationality. Moreover, every type admits a stationary δ- (and uniformly) optimal strategy.
Note that the type t_1^S for player 1 induces exactly the belief hierarchy we have described verbally in Sect. 2.

Relation to subgame perfect equilibrium
In the literature on stochastic games, the concepts which are most commonly used are Nash equilibrium (Nash 1950, 1951) and subgame perfect equilibrium (Selten 1965). In this section we will explore the precise relation between (common) belief in future rationality on the one hand, and subgame perfect equilibrium on the other hand. We will show that in two-person stochastic games, subgame perfect equilibrium can be characterized by mutual belief in future rationality, together with some "correct beliefs condition". Since these two conditions together imply common belief in future rationality, it follows that subgame perfect equilibrium can be viewed as a refinement of common belief in future rationality.
In Sect. 5 we have seen that common belief in future rationality is always possible in every finite stochastic game, even if we use the uniform optimality criterion. Hence, the reason that subgame perfect equilibrium fails to exist in some of these games is that mutual belief in future rationality is logically inconsistent with the "correct beliefs condition" in those games. In this section we first explain what we mean by the correct beliefs condition and mutual belief in future rationality. Subsequently, we show how types that meet the correct beliefs condition naturally induce behavioral strategies. We use all this to finally state our epistemic characterization of subgame perfect equilibrium in two-player stochastic games.

Correct beliefs condition
Intuitively, the correct beliefs condition states that player 1 always believes that player 2 is always correct about his beliefs, and that player 2 always believes that player 1 is always correct about his beliefs. Since the players' conditional belief hierarchies can be encoded by means of types in an epistemic model, it can formally be defined as follows.
Definition 6.1 (Correct beliefs condition) Consider a finite epistemic model M = (T_i, β_i)_{i∈I} for a two-player stochastic game. A pair of types (t_1, t_2) ∈ T_1 × T_2 satisfies the correct beliefs condition if
b_1(t_1, h)(S_2(h) × {t_2}) = 1 and b_2(t_2, h)(S_1(h) × {t_1}) = 1 for every history h.
That is, type t_1 always believes that player 2 always assigns probability 1 to his true type t_1, and hence believes that player 2 is always correct about each of his conditional beliefs. Similarly for player 2.
Mutual belief in future rationality simply means that both types t_1 and t_2 believe in the opponent's future rationality.
Definition 6.2 (Mutual belief in future rationality) Consider a finite epistemic model M = (T_i, β_i)_{i∈I} for a two-player stochastic game. A pair of types (t_1, t_2) expresses mutual belief in future δ-rationality if both t_1 and t_2 believe in the opponent's future δ-rationality.
Mutual belief in future uniform rationality can be defined in a similar fashion. Note that, if (t_1, t_2) satisfies the correct beliefs condition, then mutual belief in future rationality implies common belief in future rationality. We will see, later in this section, that subgame perfect equilibrium can be characterized by the correct beliefs condition in combination with mutual belief in future rationality.

From types to behavioral strategies
The concepts of mutual belief in future rationality and subgame perfect equilibrium are defined within two different languages: the former is defined within an epistemic model with types, whereas the latter is defined by the use of behavioral strategies. How can we then formally relate these two concepts? We will see that, under the correct beliefs condition, a type within an epistemic model will naturally induce a behavioral strategy for the opponent.
Formally, a behavioral strategy for player i is a function σ_i that assigns to every history h some probability distribution σ_i(h) ∈ Δ(A_i(x(h))) on the set of actions available at state x(h). Now, consider an epistemic model M = (T_i, β_i)_{i∈I}, and a pair of types (t_1, t_2) ∈ T_1 × T_2. Fix a player i and his opponent j ≠ i. For every history h and every action a_j ∈ A_j(x(h)) for opponent j at h, let S_j(h, a_j) denote the set of strategies s_j ∈ S_j(h) with s_j(h) = a_j. We define the behavioral strategy σ_j^{t_i} induced by type t_i for opponent j by
σ_j^{t_i}(h)(a_j) := b_i(t_i, h)(S_j(h, a_j) × T_j)
for every history h and every action a_j ∈ A_j(x(h)). Hence, σ_j^{t_i}(h)(a_j) is the probability that type t_i assigns, after history h, to the event that player j will choose action a_j after h. In this way, type t_i naturally induces a behavioral strategy σ_j^{t_i} for his opponent j, where σ_j^{t_i} represents t_i's conditional beliefs about j's future behavior. Hence, every pair of types (t_1, t_2) induces a pair of behavioral strategies (σ_1, σ_2) where σ_1 = σ_1^{t_2} and σ_2 = σ_2^{t_1}.
With this definition at hand it is now clear what it means for a pair of types (t_1, t_2) to induce a subgame perfect equilibrium, since a subgame perfect equilibrium is just a behavioral strategy pair satisfying some special conditions. In order to define a subgame perfect equilibrium formally, we need some additional notation first. Take some behavioral strategy pair (σ_i, σ_j), and some history h. We denote by U_i^δ(h, σ_i, σ_j) the δ-discounted expected utility for player i, if the game would start after history h, and if the players choose according to (σ_i, σ_j) in the subgame that starts after history h.
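The induced behavioral strategy can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the conditional belief of a type at a given history has finite support and is stored as a list of (probability, strategy, opponent type) triples, where a strategy is a dictionary from histories to actions. The function name, the data layout, and the labels "h", "t2_a" and "t2_b" are hypothetical.

```python
def induced_action_probabilities(belief, history):
    """Probability the belief assigns to each opponent action at `history`.

    belief: list of (probability, strategy, opponent_type) triples
            (assumed finite support; strategy maps histories to actions).
    The opponent's type is summed out, as in the definition of the
    induced behavioral strategy.
    """
    probs = {}
    for prob, strategy, _opponent_type in belief:
        action = strategy[history]
        probs[action] = probs.get(action, 0.0) + prob
    return probs


# Hypothetical belief of a type t_i at a history labelled "h".
belief_at_h = [
    (0.25, {"h": "L"}, "t2_a"),
    (0.50, {"h": "R"}, "t2_a"),
    (0.25, {"h": "R"}, "t2_b"),
]
sigma_h = induced_action_probabilities(belief_at_h, "h")
```

In this made-up example the induced mixed action at h puts probability 0.25 on L and 0.75 on R, regardless of which opponent type each strategy is paired with.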

Definition 6.3 (Subgame perfect equilibrium)
(a) A behavioral strategy pair (σ_1, σ_2) is a δ-subgame perfect equilibrium if after every history h, and for both players i, we have that
U_i^δ(h, σ_i, σ_j) ≥ U_i^δ(h, σ'_i, σ_j) for every behavioral strategy σ'_i.
(b) A behavioral strategy pair (σ_1, σ_2) is a uniform subgame perfect equilibrium if there is some δ̄ ∈ (0, 1) such that for every δ ∈ [δ̄, 1), for every history h, and for both players i, we have that
U_i^δ(h, σ_i, σ_j) ≥ U_i^δ(h, σ'_i, σ_j) for every behavioral strategy σ'_i.
Hence, a δ-subgame perfect equilibrium constitutes a δ-Nash equilibrium in each of the subgames. A behavioral strategy pair is thus a uniform subgame perfect equilibrium if it is a subgame perfect equilibrium under a discounted evaluation for all sufficiently high values of the discount factor. The concept of uniform ε-equilibrium (e.g. Jaśkiewicz and Nowak 2017) features prominently in the literature on stochastic games. While uniform subgame perfect equilibrium is not logically related to the uniform ε-equilibrium, it is somewhat similar in spirit. Both concepts entail a requirement of robustness of the solution within a small range of the parameters of the game.

Epistemic characterization of subgame perfect equilibrium
We are now ready to state our epistemic characterization of δ-subgame perfect equilibrium in two-player stochastic games.
Theorem 6.1 (Characterization of δ-subgame perfect equilibrium) Consider a finite two-player stochastic game Γ, and a behavioral strategy pair (σ_1, σ_2) in Γ. Then, (σ_1, σ_2) is a δ-subgame perfect equilibrium if and only if there is a finite epistemic model M = (T_i, β_i)_{i∈I} and a pair of types (t_1, t_2) ∈ T_1 × T_2 that (1) satisfies the correct beliefs condition, (2) expresses mutual belief in future δ-rationality, and (3) induces (σ_1, σ_2).
In a similar way we can prove the following characterization of uniform subgame perfect equilibrium.
Theorem 6.2 (Characterization of uniform subgame perfect equilibrium) Consider a finite two-player stochastic game Γ, and a behavioral strategy pair (σ_1, σ_2) in Γ. Then, (σ_1, σ_2) is a uniform subgame perfect equilibrium if and only if there is a finite epistemic model M = (T_i, β_i)_{i∈I} and a pair of types (t_1, t_2) ∈ T_1 × T_2 that (1) satisfies the correct beliefs condition, (2) expresses mutual belief in future uniform rationality, and (3) induces (σ_1, σ_2).
The proof is almost identical to the proof of Theorem 6.1, and is therefore omitted. Note that the two theorems above would not change if we allowed for epistemic models with infinitely many types. For instance, if we start from a terminal epistemic model in which all belief hierarchies are present, then the two theorems above state that (σ_1, σ_2) is a subgame perfect equilibrium exactly when we can find a pair of types within that model which satisfies conditions (1)-(3).
The epistemic conditions above are rather similar to those used in Aumann and Brandenburger (1995) to characterize Nash equilibrium in two-player games. Indeed, in their Theorem A they show that in such games, Nash equilibrium can be characterized by mutual knowledge of the players' first-order beliefs and mutual knowledge of the players' rationality. In our setting, mutual knowledge of rationality corresponds to mutual belief in future rationality, whereas mutual knowledge of the players' first-order beliefs is implied by the correct beliefs condition.

Proofs
Proof of Lemma 5.1
We construct the following Markov decision problem MDP for player i. The set of states X in MDP is simply the set of states in the stochastic game Γ, and for every state x the set of actions A(x) in MDP is simply the set of actions A_i(x) for player i in Γ. For every state x and action a ∈ A(x), let the utility u(x, a) in MDP be the utility that player i would obtain in Γ if the game reaches x, player i chooses a at x, and the opponents choose according to s_{-i} at x. Note that s_{-i} is a profile of stationary strategies, and hence the behavior induced by s_{-i} at x is independent of the history. So, u(x, a) is well-defined. Finally, we define the transition probabilities q(y|x, a) in MDP. For every two states x, y and every action a ∈ A(x), let q(y|x, a) be the probability that state y will be reached in the next period if the game is at x, player i chooses a at x, and i's opponents choose according to s_{-i} at x. Again, q(y|x, a) is well-defined since, by stationarity of s_{-i}, the behavior of s_{-i} at x is independent of the history. This completes the construction of MDP.
We will now prove part (a) of the lemma. Take some δ ∈ (0, 1). By part (a) in Theorem 3.1, we know that player i has a δ-optimal strategy ŝ_i in MDP which is stationary. So, we can write ŝ_i = (ŝ_i(x))_{x∈X}. Now, let s_i be the stationary strategy for player i in the game Γ which prescribes, after every history h, the action ŝ_i(x(h)). Then, it may easily be verified that the stationary strategy s_i is δ-optimal for player i in Γ, given the conditional belief vector b_i. Part (b) of the lemma can be shown in a similar way, by relying on part (b) in Theorem 3.1.
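As a rough illustration of this construction, the following Python sketch builds the induced decision problem for a two-player game in which the single opponent plays a pure stationary strategy, and then computes a stationary δ-optimal strategy numerically. Value iteration is a standard technique swapped in here for illustration; the proof itself only invokes the existence result of Theorem 3.1. The function names, the dictionary layout, and the toy game are all hypothetical.

```python
def induced_mdp(states, actions_i, utility_i, transition, s_opp):
    """Return (u, q) of the decision problem player i faces against s_opp.

    Assumed (hypothetical) data layout:
      utility_i[(x, a, b)]  -> stage utility of player i at state x
      transition[(x, a, b)] -> dict {next_state: probability}
      s_opp[x]              -> opponent's action at x (pure, stationary)
    """
    u, q = {}, {}
    for x in states:
        b = s_opp[x]  # history-independent, so u and q are well-defined
        for a in actions_i[x]:
            u[(x, a)] = utility_i[(x, a, b)]
            q[(x, a)] = transition[(x, a, b)]
    return u, q


def stationary_optimal(states, actions_i, u, q, delta, tol=1e-9):
    """Value iteration; returns a stationary policy {state: action}."""
    def action_value(x, a, v):
        return u[(x, a)] + delta * sum(p * v[y] for y, p in q[(x, a)].items())

    v = {x: 0.0 for x in states}
    while True:
        new_v = {x: max(action_value(x, a, v) for a in actions_i[x])
                 for x in states}
        gap = max(abs(new_v[x] - v[x]) for x in states)
        v = new_v
        if gap < tol:
            break
    # Greedy actions with respect to the numerically converged values.
    return {x: max(actions_i[x], key=lambda a: action_value(x, a, v))
            for x in states}


# Toy game (made up): at "x" player i may stay, or go to the absorbing state "y".
states = ["x", "y"]
actions_i = {"x": ["stay", "go"], "y": ["wait"]}
s_opp = {"x": "b", "y": "b"}
utility_i = {("x", "stay", "b"): 1.0, ("x", "go", "b"): 0.0,
             ("y", "wait", "b"): 3.0}
transition = {("x", "stay", "b"): {"x": 1.0}, ("x", "go", "b"): {"y": 1.0},
              ("y", "wait", "b"): {"y": 1.0}}

u, q = induced_mdp(states, actions_i, utility_i, transition, s_opp)
patient = stationary_optimal(states, actions_i, u, q, delta=0.9)
impatient = stationary_optimal(states, actions_i, u, q, delta=0.2)
```

In this made-up game the computed strategy at state x is "go" for δ = 0.9 and "stay" for δ = 0.2, which illustrates that δ-optimality depends on the discount factor.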

Proof of Theorem 5.1
We start by recursively defining profiles of stationary strategies, as follows. Let s^1 = (s_i^1)_{i∈I} be an arbitrary profile of stationary strategies for the players. Let b_i[s_{-i}^1] be a conditional belief vector for player i that assigns, after every history h, probability 1 to some strategy combination s*_{-i} that coincides with s_{-i}^1 from h onwards. We know from Lemma 5.1 that for every player i there is a stationary strategy s_i^2 which is δ-optimal, given the conditional belief vector b_i[s_{-i}^1]. Let s^2 := (s_i^2)_{i∈I} be the new profile of stationary strategies thus obtained. By recursively applying this step, we obtain an infinite sequence s^1, s^2, s^3, ... of profiles of stationary strategies.
As there are only finitely many states in Γ, and finitely many actions at every state, there are also only finitely many stationary strategies for the players in the game. Hence, there are also only finitely many profiles of stationary strategies. Therefore, the infinite sequence s^1, s^2, s^3, ... must go through a cycle s^m → s^{m+1} → s^{m+2} → ··· → s^{m+R} → s^{m+R+1} where s^{m+R+1} = s^m. We will now transform this cycle into an epistemic model where all types express common belief in future δ-rationality.
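The finiteness argument above can be sketched in code: iterate any map on a finite set of profiles and stop at the first repetition. In the sketch below, the function find_cycle and the best-reply tables br1 and br2 are hypothetical; the tables are assumed for illustration, in the spirit of the Big Match example, and are not taken from the formal construction.

```python
def find_cycle(start, step):
    """Iterate start, step(start), ... and return the cycle eventually reached."""
    first_seen = {}   # profile -> index of its first occurrence
    sequence = []
    current = start
    while current not in first_seen:
        first_seen[current] = len(sequence)
        sequence.append(current)
        current = step(current)
    # `current` repeats the profile stored at index first_seen[current].
    return sequence[first_seen[current]:]


# Hypothetical pure best replies to a stationary opponent strategy.
br1 = {"L": "S", "R": "C"}   # player 1's reply to player 2's strategy
br2 = {"C": "L", "S": "R"}   # player 2's reply to player 1's strategy


def best_reply_step(profile):
    """Map a profile to the profile of best replies against it."""
    s1, s2 = profile
    return (br1[s2], br2[s1])


cycle = find_cycle(("C", "L"), best_reply_step)
```

Because the set of profiles is finite, the loop always terminates; each profile on the returned cycle would then be turned into one type per player.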
For every player i, we define the set of types … , h] is δ-optimal for type t_j^{m+r−1} at all histories weakly following h. That is, … for every strategy s_j ∈ S_j(h) and all m ≥ k. So, type t_i believes, after every history h, that player j is of type t_j, and that player j will choose according to σ_j in the game that lies ahead. This completes the construction of the epistemic model M = (T_i, β_i)_{i∈I}.
(2) Choose a player i, with opponent j. We show that type t_i believes in j's future δ-rationality. Consider an arbitrary history h. We must show that … As (σ_1, σ_2) is a subgame perfect equilibrium, we have at every history h' weakly following h that … for every behavioral strategy σ'_j. This implies that … for all s_j ∈ S_j(h'). By (1), this is equivalent to stating that … for every history h' weakly following h, and every s_j ∈ S_j(h'). Let … Since the conditional belief of type t_j at h' about i's strategy is given by b_j^{σ_i}(h'), it follows that S_j^{h,opt} contains exactly those strategies s_j ∈ S_j(h) that are δ-optimal for type t_j at all histories weakly following h. Moreover, the conditional belief that type t_i has at h about j's strategy is given by b_i^{σ_j}(h), for which we have seen that b_i^{σ_j}(h)(S_j^{h,opt}) = 1. By combining these two insights, we obtain that … As this holds for every history h, we conclude that t_i believes in j's future δ-rationality. Since player i was chosen arbitrarily, the pair (t_1, t_2) expresses mutual belief in future δ-rationality.
(3) Consider a player i with opponent j. We show that σ_j^{t_i} = σ_j. Take some history h = ((x^1, a^1), ..., (x^{k−1}, a^{k−1}), x^k) of length k, and some action a_j ∈ A_j(x^k). Let [S_j(h, a_j)]_k := {[s_j]_k | s_j ∈ S_j(h, a_j)} be the finite collection of equivalence classes that partitions S_j(h, a_j). Then, … which implies that σ_j^{t_i} = σ_j. Here, the first equality follows from the definition of σ_j^{t_i}. The second equality follows from (2). The third equality follows from the observation that [S_j(h, a_j)]_k constitutes a finite partition of the set S_j(h, a_j), and that each member of [S_j(h, a_j)]_k is in the σ-algebra Σ_j(h). The fourth equality follows from (1). The fifth equality follows from two observations. The first is that s_j ∈ S_j(h, a_j) if and only if s_j(h^m) = a_j^m for all m ≤ k − 1 and s_j(h) = a_j, where h^m = ((x^1, a^1), ..., (x^{m−1}, a^{m−1}), x^m) for all m ≤ k − 1. The second observation is that σ_j^h(h^m)(a_j^m) = 1 for all m ≤ k − 1. The sixth equality follows from the fact that σ_j^h coincides with σ_j on histories that weakly follow h. In particular, this implies that σ_j^h(h) = σ_j(h).
Since σ_j^{t_i} = σ_j for both players i and j, we conclude that (t_1, t_2) induces the behavioral strategy pair (σ_1, σ_2).
Take a player i and a history h. We must show that
U_i^δ(h, σ_i, σ_j) ≥ U_i^δ(h, σ'_i, σ_j) (4)
for every behavioral strategy σ'_i. By (1) this is equivalent to showing that … (5) for all s_i ∈ S_i(h). Let … Then, (5) is equivalent to showing that … (6) As σ_j^{t_i} = σ_j and t_i satisfies Bayesian updating, it follows that the conditional belief of type t_i at h about j's continuation strategy is given by b_i^{σ_j}(h). But then, S_i^{opt}(h) = {s_i ∈ S_i(h) | s_i is δ-optimal for t_i at history h}.
As (t_1, t_2) expresses mutual belief in future δ-rationality, it must be that t_j believes in i's future δ-rationality. In particular, … As t_j assigns probability 1 to t_i, and every strategy s_i which is δ-optimal for t_i at all histories weakly following h must be in S_i^{opt}(h), it follows that … (7) Since σ_i^{t_j} = σ_i and t_j satisfies Bayesian updating, it follows that the conditional belief of type t_j at h about i's continuation strategy is given by b_j^{σ_i}(h). So, (7) implies that b_j^{σ_i}(h)(S_i^{opt}(h)) = 1, which establishes (6). This, as we have seen, implies (4), stating that U_i^δ(h, σ_i, σ_j) ≥ U_i^δ(h, σ'_i, σ_j) for every behavioral strategy σ'_i. Since this holds for both players i and every history h, it follows that (σ_i, σ_j) is a δ-subgame perfect equilibrium. This completes the proof of the theorem.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.