Positive Zero-Sum Stochastic Games with Countable State and Action Spaces

A positive zero-sum stochastic game with countable state and action spaces is shown to have a value if, at every state, at least one player has a finite action space. The proof uses transfinite algorithms to calculate the upper and lower values of the game. We also investigate the existence of (ϵ-)optimal strategies in the classes of stationary and Markov strategies.


Introduction
A positive zero-sum stochastic game is a two-person dynamic game played in stages t = 0, 1, . . . At each stage t of the game the players simultaneously select actions from their action sets at the current state. These actions together with the current state determine a nonnegative reward and the distribution of the next state. The payoff from player 2 to player 1 is the infinite sum of the rewards taken over all the stages. We assume that the state space and the action sets of the two players at each state are countable.
As is well-known, such a game does not always admit a value. A typical example is the following: There is a state with action set N = {0, 1, . . .} for each player in which the reward is 1 if player 1's action is greater than player 2's action and is 0 otherwise. From this state, regardless of the chosen actions, the transition is to an absorbing state where the reward is zero. Thus player 1 wins in this game if he chooses the larger action and loses if not.
To find conditions for the existence of the value, we construct transfinite algorithms that output the upper value and the lower value of the game (cf. Theorems 6 and 9). If at every state at least one player has a finite action set, then the algorithms for the upper and lower values give the same answer. So, in this case, the game has a value. Also, in this case, player 2 has arbitrarily good Markov strategies (cf. Theorem 11).
If the action set for player 2 is finite at every state, then the algorithm for the value simplifies and becomes the limit of the sequence of values of the n-stage games. In this simpler case, player 2 has an optimal stationary strategy (cf. Theorem 12).
Related Literature For an overview on the literature of positive stochastic games, we refer to the recent survey by Jaśkiewicz and Nowak [7]. In particular, we refer to Maitra and Parthasarathy [9], Parthasarathy [15,16], Frid [6], and Nowak [13]. Transfinite algorithms were used previously for example by Blackwell [1] for G δ games and by Maitra and Sudderth [10,11] for stochastic games with limsup payoff.
Organization of the Paper The next section presents the basic definitions needed for our model. Section 3 has some preliminary results for one-shot games with unbounded payoffs and infinite action sets. Section 4 introduces the one-shot games associated with a stochastic game, defines operators corresponding to the upper and lower values of the one-shot games, and shows that, by iterating these operators up to some countable ordinal, the upper and lower values of the stochastic game are attained. The results of Sect. 4 are applied in Sect. 5 to obtain our main theorems on the value and (ϵ-)optimal strategies for the stochastic game.
A positive zero-sum stochastic game consists of: (1) a nonempty countable state space S, (2) for each state s ∈ S, a nonempty countable action set A(s) for player 1 and a nonempty countable action set B(s) for player 2, (3) for each state s ∈ S and actions a ∈ A(s), b ∈ B(s), a probability measure p(s, a, b) = (p(s′|s, a, b)) s′∈S on S, and (4) a non-negative reward function r : Z → [0, ∞), where Z = {(s, a, b)|s ∈ S, a ∈ A(s), b ∈ B(s)}. A positive zero-sum stochastic game is denoted by Γ. Whenever we need to emphasize the initial state s 0 of the game, we write Γ(s 0 ).
The game is played at stages in N = {0, 1, . . .} and begins in the initial state s 0 ∈ S. At every stage t ∈ N, the play is in a state s t ∈ S. In this state, player 1 chooses an action a t ∈ A(s t ) and simultaneously player 2 chooses an action b t ∈ B(s t ), yielding a triple z t = (s t , a t , b t ) ∈ Z . Then, player 1 receives reward r(z t ) from player 2, and state s t+1 is drawn in accordance with the probability measure p(z t ). Thus, play of the game induces an infinite sequence (z 0 , z 1 , . . .) in Z . The payoff is u(z 0 , z 1 , . . .) = Σ ∞ t=0 r(z t ), which is an element of [0, ∞]. The payoff is paid by player 2 to player 1. Player 1's objective is to maximize the expected payoff given by u, and player 2's objective is to minimize it.
Strategies The set of histories at stage t is denoted by H t . Thus, H 0 = S and H t = Z t × S for every stage t ≥ 1. Let H = ∪ t∈N H t denote the set of all histories. For each history h, let s h denote the final state in h.
A mixed action for player 1 in state s ∈ S is a probability measure x(s) on A(s). Similarly, a mixed action for player 2 in state s ∈ S is a probability measure y(s) on B(s). The respective sets of mixed actions in state s are denoted by Δ(A(s)) and Δ(B(s)).
A strategy for player 1 is a map π that to each history h ∈ H assigns a mixed action π(h) ∈ Δ(A(s h )). Similarly, a strategy for player 2 is a map σ that to each history h ∈ H assigns a mixed action σ(h) ∈ Δ(B(s h )). The set of strategies is denoted by Π for player 1 and by Σ for player 2. A strategy is called pure if it places probability 1 on one action after each history.
A strategy is called stationary if the assigned mixed actions only depend on the history through its final state. Thus, a stationary strategy for player 1 can be seen as an element x of × s∈S Δ(A(s)). Similarly, a stationary strategy for player 2 can be seen as an element y of × s∈S Δ(B(s)). A strategy is called Markov if the assigned mixed actions only depend on the history through its final state and the current stage.
An initial state s ∈ S and a pair of strategies (π, σ) ∈ Π × Σ determine a distribution P s,π,σ for the stochastic process z 0 , z 1 , . . .. The corresponding expectation operator is written E s,π,σ and the expected payoff is denoted by u(s, π, σ) = E s,π,σ [Σ ∞ t=0 r(z t )].
Value and Optimality The lower value of the game Γ(s) with the initial state s ∈ S, denoted by α(s), is defined as α(s) = sup π∈Π inf σ∈Σ u(s, π, σ). Similarly, the upper value of the game Γ(s) with the initial state s ∈ S, denoted by β(s), is defined as β(s) = inf σ∈Σ sup π∈Π u(s, π, σ). The inequality α(s) ≤ β(s) always holds. If α(s) = β(s), then this quantity is called the value of the game for initial state s and it is denoted by v(s). Then, for ϵ ≥ 0, a strategy π ∈ Π for player 1 is called ϵ-optimal for initial state s if u(s, π, σ) ≥ v(s) − ϵ for every strategy σ ∈ Σ for player 2. Similarly, a strategy σ ∈ Σ for player 2 is called ϵ-optimal for initial state s if u(s, π, σ) ≤ v(s) + ϵ for every strategy π ∈ Π for player 1.
If the value exists for every initial state, then for every ϵ > 0, each player has a strategy that is ϵ-optimal for every initial state. We call these strategies ϵ-optimal. A 0-optimal strategy is simply called optimal.
As we observed in the introduction, the value in general does not exist and a major objective here is to find interesting conditions for its existence.
Example 1 It is known that in a positive zero-sum stochastic game, even if the state and action spaces are finite, player 1 may have no optimal strategy, see Kumar and Shiau [8], Maitra and Sudderth [12], and Jaśkiewicz and Nowak [7]. Consider the following game.
In this game, there is only one non-trivial state, shown in the figure. In this state, player 1's actions are T and B, and player 2's actions are L and R. The rewards for the corresponding action combinations are given in the matrix. The transitions are as follows: if action combination (T , L) is chosen then the state remains the same, but after any other action combination transition occurs to an absorbing state where the reward is equal to 0. Clearly, player 1 can guarantee an expected payoff of 1 − ϵ, for any ϵ ∈ (0, 1), by playing the stationary strategy (1 − ϵ, ϵ). Thus the value is equal to 1 for the non-trivial state. Yet, player 1 has no optimal strategy.
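As a numerical sanity check, the following sketch evaluates player 1's stationary strategy (1 − ϵ, ϵ) under an assumed reading of the omitted figure: entry (T, L) gives reward 0 and play stays in the non-trivial state, entries (T, R) and (B, L) give reward 1 followed by absorption, and (B, R) gives reward 0 followed by absorption. These entries are an assumption made for this sketch only.

```python
# Example 1 under an ASSUMED payoff/transition matrix (the figure is not
# reproduced here): (T, L) pays 0 and play stays in the non-trivial state;
# (T, R) and (B, L) pay 1 and then play absorbs; (B, R) pays 0 and absorbs.
# Player 1 plays the stationary strategy (1 - eps on T, eps on B).

def payoff_vs_switch_at(eps, k):
    """Expected payoff when player 2 plays L before stage k and R at stage k."""
    # Before stage k: each stage absorbs in (B, L) with probability eps, reward 1.
    absorbed_with_reward_1 = 1 - (1 - eps) ** k
    # At stage k (if not yet absorbed): entry (T, R) occurs with prob 1 - eps.
    still_alive = (1 - eps) ** k
    return absorbed_with_reward_1 + still_alive * (1 - eps)  # = 1 - eps*(1-eps)**k

eps = 0.1
for k in range(100):
    assert payoff_vs_switch_at(eps, k) >= 1 - eps
# Playing L forever absorbs in (B, L) with probability 1, for a payoff of 1,
# so the strategy (1 - eps, eps) indeed guarantees at least 1 - eps.
```

Under these assumed entries, the worst response for player 2 is to switch to R immediately, which yields exactly 1 − ϵ.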

One-Shot Games
In this section we present some results for one-shot games that are crucial for our study of positive zero-sum stochastic games.
We consider positive zero-sum one-shot games G = (A, B, f ) where A is a nonempty and countable action set for player 1, B is a nonempty and countable action set for player 2, and f : A × B → [0, ∞] is the payoff function.
The notation x and y will be used for mixed actions for players 1 and 2 respectively. Also we set u(x, y) = Σ a∈A Σ b∈B f (a, b)x(a)y(b) for the expected payoff under a pair (x, y) of mixed actions. When calculating expected payoffs, we make use of the usual convention in measure theory that ∞ · 0 = 0. So, if an action combination with payoff ∞ is chosen with probability 0, then it makes no contribution to the expected payoff.
Here is a slight generalization of the usual von Neumann theorem (cf. Tijs [20], or exercise 18.3 on p. 201 in Maitra and Sudderth [12]).

Theorem 1 Consider a positive zero-sum one-shot game G = (A, B, f ). If f is bounded and either A or B is finite, then G has a value v.
Moreover, if A, respectively B, is finite, then player 1, respectively player 2, has an optimal strategy.
The following example shows that player 1 may have no optimal strategy, even if both action sets are finite.

Example 2
This game has value 3. Indeed, for every ϵ ∈ (0, 1), player 1 can guarantee an expected payoff of 3 − ϵ by playing action T with probability ϵ and action B with probability 1 − ϵ, and player 2 can also guarantee that the expected payoff is not more than 3 by playing action L. Clearly, player 1 has no optimal strategy. Andrés Perea pointed out that this game could also be seen as a game with an infinitesimal payoff δ, in the sense of nonstandard analysis.
If both action sets are infinite, the value does not always exist. Probably the best-known example is the following game, which is essentially the game mentioned in the introduction.

Example 3
Let the action sets be A = B = N and the payoff for actions a ∈ A and b ∈ B be equal to 1 if a > b and 0 otherwise. That is, in this game player 1 wins if his action is greater than the action chosen by player 2. This game has no value. The lower value of this game is 0 and the upper value is 1.
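The lower and upper values of this game can be illustrated numerically; the geometric mixed actions below are hypothetical choices used only for illustration.

```python
# Example 3: f(a, b) = 1 if a > b else 0, with A = B = N.

def f(a, b):
    return 1.0 if a > b else 0.0

# Against a (hypothetical) geometric mixed action y of player 2 supported on
# {0, ..., n}, player 1 responds with the pure action n + 1 and gets 1:
n = 50
y = [0.5 ** (b + 1) for b in range(n + 1)]
y[-1] += 1 - sum(y)  # push the remaining tail mass onto the last action
u1 = sum(f(n + 1, b) * y[b] for b in range(n + 1))
assert abs(u1 - 1.0) < 1e-9  # player 1 gets (close to) 1: upper value is 1

# Symmetrically, against any mixed action x of player 1 supported on
# {0, ..., n}, player 2 responds with the pure action n and pays 0:
x = [0.5 ** (a + 1) for a in range(n + 1)]
x[-1] += 1 - sum(x)
u2 = sum(f(a, n) * x[a] for a in range(n + 1))
assert u2 == 0.0  # player 2 pays 0: lower value is 0
```

The same best-response reasoning applies to arbitrary mixed actions after truncating their tails, which is why the lower and upper values are 0 and 1 respectively.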
Perhaps the next two theorems are also known, but we do not have references for them.

Theorem 2 Consider a positive zero-sum one-shot game G = (A, B, f ). If B is finite, then G has a value v ∈ [0, ∞] and player 2 has an optimal strategy.
Proof Let v n be the value of the game G n = (A, B, f n ), where f n = min{ f, n} is the payoff function truncated at n; this value exists by Theorem 1, since f n is bounded and B is finite. Set v = lim v n , and write u n (x, y) for the expected payoff from x, y in G n . Let α and β be the lower value and respectively the upper value of the game G.
We prove the following two claims: Claim 1: α ≥ v. Claim 2: player 2 has a strategy y * ∈ Δ(B) such that for every x ∈ Δ(A) we have u(x, y * ) ≤ v.
Once these two claims are proven, the statement of the theorem follows. Indeed, claim 2 implies that β ≤ v, and hence by claim 1 we have v = α = β. So, the value of the game is v, and y * is an optimal strategy for player 2.
First we prove claim 1. Notice that α ≥ v n for every n ∈ N: since f ≥ f n , in the game G player 1 can guarantee at least as much as in G n . By taking the limit when n tends to infinity, we obtain α ≥ v. Now we prove claim 2. There is no harm in assuming that v < ∞. Then, for every n, v n ≤ v < ∞.
By Theorem 1, player 2 has an optimal strategy y n ∈ Δ(B) in the game G n for every n. Since Δ(B) is a compact subset of a finite dimensional Euclidean space, the sequence y n has a subsequence that converges to some y * ∈ Δ(B). With a slight abuse of notation we write y n for this convergent subsequence.
Fix a strategy x ∈ Δ(A) for player 1. For every b ∈ B and n ∈ N, let g n (b) = Σ a∈A f n (a, b)x(a) and g(b) = Σ a∈A f (a, b)x(a). Notice that, by the monotone convergence theorem, for every b we have g(b) = lim n g n (b).
Let B̄ denote the set of those actions b ∈ B for which g(b) = ∞. Since for every n it holds that u n (x, y n ) ≤ v n ≤ v, and since g n (b) tends to infinity for every b ∈ B̄, the limit y * places probability 0 on every action in B̄. Therefore u(x, y * ) = Σ b∈B\B̄ g(b)y * (b) = Σ b∈B\B̄ lim n g n (b)y n (b) = lim n Σ b∈B\B̄ g n (b)y n (b) ≤ lim n u n (x, y n ) ≤ lim n v n = v. For the third equality, we used the assumption that B is finite. This completes the proof of claim 2.
Theorem 2 is related to Theorem 5.1 of Nowak [13]. Nowak's result is for real-valued payoff functions in a Borel measurable setting.
Theorem 3 Consider a positive zero-sum one-shot game G = (A, B, f ). If A is finite, then G has a value v ∈ [0, ∞].
Proof We can assume without loss of generality that B = N.
Let G n denote the game where player 1's action set is A, player 2's action set is B n = {0, . . . , n}, and the payoff function is the restriction of f to A × B n . By Theorem 2 the game G n has a value, say v n . Of course, v 0 ≥ v 1 ≥ · · · , since the action sets B n increase with n, so the limit v = lim n v n exists. We argue that v is the value of G.
Let α and β be the lower value and respectively the upper value of the game G. It is obvious that β ≤ v n for every n ∈ N, since player 2 can always ignore all but finitely many of his actions. Hence β ≤ v. We show that v ≤ α.
Take an R < v. As R < v n , player 1 has a strategy x n that guarantees him the payoff of R in the game G n , that is R ≤ u(x n , y) for each y ∈ Δ(B n ). Since x n lies in the compact set Δ(A), it has a subsequence converging to a point in Δ(A). By passing to the subsequence, we can assume that x n converges to some x * ∈ Δ(A). Fix ρ ∈ (0, 1) and define player 1's strategy x as follows: with probability ρ, randomize uniformly on A; with probability 1 − ρ, randomize according to the distribution x * . More precisely, we let x(a) = ρ/|A| + (1 − ρ)x * (a) for every a ∈ A. We argue that (1 − ρ) · R ≤ u(x, y) for each y ∈ Δ(B).
If u(x, y) = ∞, there is nothing to prove, so we will assume that u(x, y) < ∞. Since x places positive probability on each action in A, this implies that u(a, y) < ∞ for each a ∈ A.
Suppose first that the distribution y is finitely supported, i.e. that y is an element of Δ(B n ) for some natural n. Since u(a, y) < ∞ for each a ∈ A, the sum u(x m , y) = Σ a∈A x m (a)u(a, y) converges to u(x * , y) as m → ∞. For each m ≥ n, since y is a feasible strategy for player 2 in the game G m , we have R ≤ u(x m , y). Taking the limit we obtain R ≤ u(x * , y), and hence (1 − ρ) · R ≤ (1 − ρ) · u(x * , y) ≤ u(x, y). Now take an arbitrary y ∈ Δ(B). Let N ∈ B be any action such that y(N ) > 0. For each n ≥ N define y n ∈ Δ(B n ) by letting y n (b) = y(b)/ Σ b′≤n y(b′) for every b ≤ n. Since y n is finitely supported, it follows by the previous paragraph that (1 − ρ) · R ≤ u(x, y n ). On the other hand the payoff u(x, y n ) = (1/ Σ b′≤n y(b′)) Σ b≤n y(b)u(x, b) converges to u(x, y) as n → ∞. This is because the fraction in the above expression converges to 1, while the sum converges to u(x, y) by the monotone convergence theorem. We conclude that (1 − ρ) · R ≤ u(x, y), as desired.

One-Shot Operators and Their Fixed Points
We now return to our study of positive zero-sum stochastic games. Let Γ be a game as in Sect. 2. Recall that α(s) and β(s) are the lower and the upper values of the game Γ(s) with the initial state s ∈ S. Let L + be the space of all functions from S to [0, ∞]. We write α and β to denote the functions s → α(s) and s → β(s), respectively. Both of these are elements of L + . The functions α and β, referred to simply as the lower and the upper values of Γ, will be at the center of our attention in this section.
For a function f ∈ L + and a state s ∈ S, we define the one-shot game M f (s) in which the action sets are A(s) for player 1 and B(s) for player 2, and the payoff for actions a ∈ A(s) and b ∈ B(s) is r(s, a, b) + Σ t∈S f (t) p(t|s, a, b).
The game M f (s) does not have a value in general, but if either A(s) or B(s) is finite, then Theorems 2 and 3 guarantee that M f (s) admits a value.
We define the one-shot operator A : L + → L + by letting A f (s) be the lower value of the one-shot game M f (s), for a function f ∈ L + and a state s ∈ S. That is, A f (s) = sup x∈Δ(A(s)) inf y∈Δ(B(s)) Σ a∈A(s) Σ b∈B(s) ( r(s, a, b) + Σ s′∈S f (s′) p(s′|s, a, b) ) x(a)y(b). Similarly, we define the one-shot operator B : L + → L + by letting B f (s) be the upper value of the one-shot game M f (s), for each function f ∈ L + and state s ∈ S. That is, B f (s) = inf y∈Δ(B(s)) sup x∈Δ(A(s)) Σ a∈A(s) Σ b∈B(s) ( r(s, a, b) + Σ s′∈S f (s′) p(s′|s, a, b) ) x(a)y(b).
It is clear that the operators A and B are monotone. That is, if for some f , g ∈ L + we have f ≤ g then we also have A f ≤ Ag and B f ≤ Bg.
We call a function f ∈ L + a fixed point of the operator A if A f = f . We define fixed points similarly for B. Note that if f is a fixed point of either operator, then f + c is also a fixed point for every c ∈ R as long as f + c is nonnegative (so that it belongs to L + ).

Lemma 4 The lower value α of Γ is a fixed point of the operator A.
Proof The proof that Aα = α is presented in two steps. Fix s ∈ S.
Step 1 We prove α(s) ≥ Aα(s). Let R < Aα(s) and ϵ > 0. Since Aα(s) is the lower value of the one-shot game M α (s), there exists x * ∈ Δ(A(s)) such that for all y ∈ Δ(B(s)) we have Σ a∈A(s) Σ b∈B(s) ( r(s, a, b) + Σ s′∈S α(s′) p(s′|s, a, b) ) x * (a)y(b) ≥ R. It could happen that some of the terms α(s′) are infinite. By the monotone convergence theorem, there exists n > 0 so that for all y ∈ Δ(B(s)) we have Σ a∈A(s) Σ b∈B(s) ( r(s, a, b) + Σ s′∈S min{α(s′), n} p(s′|s, a, b) ) x * (a)y(b) ≥ R − ϵ. Use the definition of α to choose a strategy π for player 1 such that, for all states s′ and all strategies σ for player 2, u(s′, π, σ) ≥ min{α(s′), n} − ϵ. Let π * be a strategy for player 1 that begins at state s with the mixed action x * and continues from the next state with the strategy π. It is easy to check that u(s, π * , σ) ≥ R − 2ϵ for every strategy σ of player 2. This implies that α(s) ≥ R − 2ϵ. Since this holds for all R < Aα(s) and ϵ > 0, we have shown that α(s) ≥ Aα(s).
Step 2 We prove Aα(s) ≥ α(s). Let R < α(s). Since α(s) is the lower value of the stochastic game Γ(s), there exists a strategy π * for player 1 such that u(s, π * , σ) ≥ R for all strategies σ for player 2. Let x * = π * (s), that is, the initial mixed action prescribed by π * when starting in state s.
Let y ∈ Δ(B(s)) and ϵ > 0 be arbitrary. By the definition of α, for each state s′ there is a strategy σ s′ for player 2 such that u(s′, π * , σ s′ ) ≤ α(s′) + ϵ. Let σ be the strategy that plays y in stage 0, and as of stage 1 coincides with the strategy σ s 1 . Then R ≤ u(s, π * , σ) ≤ Σ a∈A(s) Σ b∈B(s) ( r(s, a, b) + Σ s′∈S α(s′) p(s′|s, a, b) ) x * (a)y(b) + ϵ. (4.1) Since y ∈ Δ(B(s)) and ϵ > 0 are arbitrary, it follows that Aα(s) ≥ R. As this holds for all R < α(s), we obtain Aα(s) ≥ α(s).
The lemma now follows from Steps 1 and 2.
Next we define an algorithm that iterates the operator A. First let α 0 (s) = 0 for all s. Then, for each ordinal ξ > 0, define α ξ = Aα ξ−1 if ξ is a successor ordinal, and α ξ = sup ξ′<ξ α ξ′ if ξ is a limit ordinal. Note that α 1 = A0 ≥ 0 = α 0 . As A is a monotone operator, it follows by induction that the sequence (α ξ ) ξ is nondecreasing in ξ .
The next result is strongly related to Tarski's [19] fixed point theorem. Note the following: (1) The set L + of functions from S to [0, ∞] is a complete lattice with the usual ≤ relation between functions. (2) The operator A is monotone, as we mentioned already. (3) The operator A has a fixed point: the function f such that f (s) = ∞ for every state s ∈ S. In view of these three properties, Tarski's fixed point theorem implies that A has a least fixed point too. The next lemma identifies the least fixed point of A.

Lemma 5
There exists a countable ordinal ξ * such that sup ξ α ξ = α ξ * . The function α ξ * is the least fixed point of A. Moreover, α ξ * is the least A-excessive function, that is, the least function f ∈ L + such that A f ≤ f .

Proof
The existence of ξ * follows from a cardinality argument. Indeed, for every state s ∈ S, the sequence (α ξ (s)) ξ is a nondecreasing sequence in [0, ∞]. Hence, the sequence (α ξ (s)) ξ can only have countably many different elements. Thus, for state s, there exists a countable ordinal ξ s such that α ξ (s) = α ξ s (s) for every ξ ≥ ξ s . Now take ξ * = sup s∈S ξ s . Because S is countable, ξ * is a countable ordinal.
It also follows that α ξ * = α ξ * +1 = Aα ξ * , so α ξ * is a fixed point of A. Now we show that α ξ * is the least fixed point of A. Let f ∈ L + be any fixed point of A. As f ∈ L + , we have 0 ≤ f . As A is monotone, it follows by induction that α ξ ≤ f for every ξ . Hence, α ξ * ≤ f . As α ξ * is a fixed point of A, clearly α ξ * is A-excessive. Suppose that f ∈ L + is A-excessive too. Then, α 0 = 0 ≤ f implies α 1 = Aα 0 ≤ A f ≤ f . Continuing by induction, we find that α ξ * ≤ f . Thus, α ξ * is the least A-excessive function.

Theorem 6 The lower value α of Γ is equal to α ξ * .
Proof It follows from Lemmas 4 and 5 that α ≥ α ξ * .
To complete the proof of the theorem, take any strategy π for player 1 and let ϵ > 0. We show that there exists a strategy σ for player 2 such that u(s, π, σ) ≤ α ξ * (s) + 2ϵ for every state s ∈ S. This will imply that α ≤ α ξ * .
If at any stage t, the current state is s and the history is h ∈ H t , then let σ(h) be a mixed action in Δ(B(s)) such that Σ a∈A(s) Σ b∈B(s) ( r(s, a, b) + Σ s′∈S α ξ * (s′) p(s′|s, a, b) ) π(h)(a)σ(h)(b) ≤ α ξ * (s) + ϵ · 2 −t . A mixed action σ(h) with the required property does exist, because the lower value of the matrix game M α ξ * (s) is Aα ξ * (s) by the definition of the operator A, and because Aα ξ * (s) = α ξ * (s) by Lemma 5.
Consider the process Q n = Σ n−1 t=0 r(z t ) + α ξ * (s n ), n ∈ N. For every n and every history h n ∈ H n we have E s,π,σ [Q n+1 | h n ] ≤ Q n + ϵ · 2 −n , where the inequality follows by the definition of σ(h n ). Hence we have E s,π,σ [Q n+1 ] ≤ E s,π,σ [Q n ] + ϵ · 2 −n . Combining the inequalities we obtain E s,π,σ [Q n ] ≤ Q 0 + 2ϵ = α ξ * (s) + 2ϵ. Thus, for all n ∈ N, E s,π,σ [ Σ n−1 t=0 r(z t ) ] ≤ E s,π,σ [Q n ] ≤ α ξ * (s) + 2ϵ. Using the monotone convergence theorem yields u(s, π, σ) ≤ α ξ * (s) + 2ϵ, as desired.

Lemma 7 The upper value β of Γ is a fixed point of the operator B.
Proof The proof that Bβ = β is in two steps. Fix s ∈ S.
Step 1 We prove Bβ(s) ≤ β(s). We can assume that β(s) < ∞, otherwise the statement is obvious. Let R > β(s). It can be proven similarly to (4.1) in the proof of Step 2 of Lemma 4 that there is a mixed action y * ∈ Δ(B(s)) such that for all x ∈ Δ(A(s)) we have Σ a∈A(s) Σ b∈B(s) ( r(s, a, b) + Σ s′∈S β(s′) p(s′|s, a, b) ) x(a)y * (b) ≤ R. Hence, Bβ(s) ≤ R. Since this holds for all R > β(s), we have shown that Bβ(s) ≤ β(s).

Step 2 We prove Bβ(s) ≥ β(s).
We can assume that Bβ(s) < ∞, otherwise the statement is obvious. Let R > Bβ(s) and ϵ > 0. It can be proven similarly to the proof of Step 1 of Lemma 4 that player 2 has a strategy σ * such that u(s, π, σ * ) ≤ R + 2ϵ for every π . This implies that β(s) ≤ R + 2ϵ. Since this holds for all R > Bβ(s) and ϵ > 0, we have shown that β(s) ≤ Bβ(s).
We also define an algorithm that iterates the operator B. First let β 0 (s) = 0 for all s. Then, for each ordinal ξ > 0, define β ξ = Bβ ξ−1 if ξ is a successor ordinal, and β ξ = sup ξ′<ξ β ξ′ if ξ is a limit ordinal. Note that β 1 = B0 ≥ 0 = β 0 . As B is a monotone operator, it follows by induction that the sequence (β ξ ) ξ is nondecreasing in ξ .
Similarly to Lemma 5, we have the following.

Lemma 8
There exists a countable ordinal ξ * such that sup ξ β ξ = β ξ * . The function β ξ * is the least fixed point of B. Moreover, β ξ * is the least B-excessive function, that is, the least function f ∈ L + such that B f ≤ f .
Similarly to Theorem 6, we have the following.

Theorem 9
The upper value β of Γ is equal to β ξ * .

The Value and Optimal Strategies
The next theorem is a consequence of Theorems 6 and 9.
Theorem 10 Consider a positive stochastic game. The game has a value if and only if α ξ * = β ξ * . In that case, the value is equal to α ξ * = β ξ * .
The following example illustrates that a game can admit a value even when the one-shot games fail to have values at some stages of the recursive algorithms. This illustrates the advantage of our approach that uses separate algorithms to find the upper value and the lower value, rather than having only one algorithm that iterates the value of the one-shot games.

Example 4
Consider the following game with state space S = {1, 2, 3, 4}, with state 1 being the initial state. In state 1, the action spaces are N for both players, the reward is 0, and if actions a and b are chosen, then the next state is state 3 if a > b and it is state 2 if a ≤ b. (The transitions depend on which player chooses the higher number.) In state 2, the reward is 0, and the next state is state 3. In state 3, the reward is 1, and the next state is state 4. State 4 is absorbing with reward 0.
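A small sketch of the two algorithms on this example, using closed-form expressions for the lower and upper values of the one-shot game at state 1 (derived by hand for this sketch, along the lines of Example 3):

```python
# Sketch of the two algorithms on Example 4. At state 1 the one-shot game has
# payoff f(a, b) = v[3] if a > b, else v[2], with a, b ranging over N. By the
# argument of Example 3, its lower value is v[2] (player 1 secures v[2] with
# a = 0; player 2 caps him at v[2] with a large b) and its upper value is
# max(v[2], v[3]) (player 1 can outbid any mixed action of player 2).
# States: 1, 2, 3, 4, where state 4 is absorbing.

def step(v, upper):
    w = {}
    w[1] = max(v[2], v[3]) if upper else v[2]  # state 1, reward 0
    w[2] = v[3]                                # state 2, reward 0, to state 3
    w[3] = 1 + v[4]                            # state 3, reward 1, to state 4
    w[4] = v[4]                                # state 4, absorbing, reward 0
    return w

lo = hi = {1: 0, 2: 0, 3: 0, 4: 0}
for _ in range(4):
    lo, hi = step(lo, upper=False), step(hi, upper=True)

assert lo == hi == {1: 1, 2: 1, 3: 1, 4: 0}
# Both algorithms stabilize at v(1) = 1, even though at the second iteration
# the one-shot game at state 1 (payoff 1 if a > b, else 0) has no value.
```

The lower-value iteration needs one extra step (through state 2) to reach 1 at state 1, while the upper-value iteration reaches it directly; both fixed points coincide, as Theorem 10 requires.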
If in each state s ∈ S, either A(s) or B(s) is finite, then Theorems 2 and 3 guarantee that the one-shot game M f (s), for every f ∈ L + and every state s ∈ S, admits a value. Hence, in that case, the operators A and B are equal, α ξ = β ξ for all ordinals ξ , and the ordinals of Lemmas 5 and 8 can be taken to be the same countable ordinal ξ * . To emphasize this, in that case, we will use the notations T := A = B, v ξ := α ξ = β ξ for all ordinals ξ , and v * := v ξ * .

Theorem 11 Assume that in every state s, either A(s) or B(s) is finite. Then the following hold:
1. The value of the stochastic game exists and is equal to v * .
2. The value function v * is the least T -excessive function, that is, the least function f ∈ L + such that T f ≤ f .
3. For every ϵ > 0, player 2 has an ϵ-optimal Markov strategy.
4. If expected payoffs are uniformly bounded in the sense that sup s,π,σ u(s, π, σ) < ∞, then, for every positive integer n and ϵ > 0, there is a stationary strategy π for player 1 such that, for all s and all strategies σ for player 2, u(s, π, σ) ≥ v n (s) − ϵ.

Proof
The first assertion follows from Theorem 10; the second is by Lemma 5 or by Lemma 8. To prove the third assertion, let ϵ > 0. Recall that T v * (s) = v * (s) is the value of the one-shot game M v * (s). Now for each state s ∈ S and every n ∈ N, choose a mixed action y n (s) in Δ(B(s)) to be ϵ · 2 −n -optimal in the one-shot game M v * (s); that is, for all x ∈ Δ(A(s)), Σ a∈A(s) Σ b∈B(s) ( r(s, a, b) + Σ s′∈S v * (s′) p(s′|s, a, b) ) x(a)y n (s)(b) ≤ v * (s) + ϵ · 2 −n . Let σ be the Markov strategy for player 2 that uses the action y n (s) when in state s at stage n, for all s and n. One can show that for any strategy π for player 1 and any state s we have u(s, π, σ) ≤ v * (s) + 2ϵ, i.e. that σ is a 2ϵ-optimal strategy. The argument is similar to the proof of Theorem 6, and is omitted.
We also omit the proof of the final assertion, because it is only a slight variation on the proof of Theorem 3.2 in Secchi [17].
The characterization of the value function as the least T -excessive function mirrors a similar characterization of the optimal reward function in the (one-person) gambling theory of Dubins and Savage [3]. It also appears as the réduite of a function in the potential theoretic treatment of gambling theory by Dellacherie and Meyer [2]. See also Lemma 7.10, p. 187 in Maitra and Sudderth [12] for a similar characterization of the value of a leavable game.
The following simple example shows that if player 2's action space B(s) is infinite in some state s, then player 2 may not have a 0-optimal strategy or an ϵ-optimal stationary strategy for small ϵ > 0.
Example 5 Consider a game in which there is only one state, player 1 has only one action and player 2's action space is N. If player 2 chooses action n then the reward is 2 −n . Clearly, the value of the game is 0, and player 2 has no 0-optimal strategy. Moreover, any stationary strategy of player 2 induces an expected payoff of infinity, thus player 2 has no ϵ-optimal stationary strategy either, for any ϵ > 0.
However, for any ϵ > 0, player 2 has an ϵ-optimal Markov strategy, as stated in Theorem 11. Indeed, let m ∈ N be such that m ≥ 1 and 2 −(m−1) ≤ ϵ. Consider the Markov strategy that chooses action m + n at stage n. This induces a payoff of Σ ∞ n=0 2 −(m+n) = 2 −(m−1) . By the choice of m, this Markov strategy is ϵ-optimal.
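A quick numerical check of this example (the truncation of the infinite sums at 200 terms is introduced only for the sketch):

```python
# Example 5: one state, player 2's action n yields reward 2**(-n), forever.

def markov_total(m, horizon=200):
    # Total payoff of the Markov strategy "play action m + n at stage n";
    # the tail beyond the horizon is negligible for this sketch.
    return sum(2.0 ** -(m + n) for n in range(horizon))

eps = 0.01
m = 8  # smallest m >= 1 with 2**-(m - 1) <= eps
assert 2.0 ** -(m - 1) <= eps < 2.0 ** -(m - 2)
assert abs(markov_total(m) - 2.0 ** -(m - 1)) < 1e-12

# A stationary strategy repeats a fixed mixed action y, so it earns the same
# expected reward c = sum_n y(n) * 2**(-n) > 0 at every stage: the total
# payoff grows linearly and diverges. E.g. with the geometric y(n) = 2**-(n+1):
c = sum(2.0 ** -(n + 1) * 2.0 ** -n for n in range(200))
assert abs(c - 2.0 / 3.0) < 1e-12 and 10_000 * c > 1_000
```

The design point is visible in the numbers: only by shifting the chosen action with the stage can player 2 make the stage rewards summable.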
The next theorem states that, when the action space of player 2 is finite in each state, it suffices to execute the algorithm on the natural numbers to find the value of the stochastic game. As usual, ω denotes the first limit ordinal.

Theorem 12 Assume that B(s) is finite for all states s ∈ S. Then:
1. The value of the positive stochastic game is equal to v ω = lim n→∞ v n .
2. Player 2 has an optimal stationary strategy.
3. If expected payoffs are uniformly bounded in the sense that sup s,π,σ u(s, π, σ) < ∞, then, for every s ∈ S and ϵ > 0, player 1 has a stationary strategy that is ϵ-optimal at state s.
Proof The Proof of Part 1 It suffices to show that v * = v ω . It will then follow from Theorem 11 that the value of the positive stochastic game is equal to v ω .
We already know that v ω ≤ v * . So, in view of Lemma 5 and Theorem 6 (or by Lemma 8 and Theorem 9), it suffices to show that v ω is T -excessive: T v ω ≤ v ω . To this end, fix a state s ∈ S and take any R < T v ω (s). Since v n increases to v ω and B(s) is finite, the argument in the proof of Theorem 2 shows that T v n (s) increases to T v ω (s). Hence, the value of the one-shot game M v n (s), namely T v n (s), is at least R for n sufficiently large. Thus, v ω (s) ≥ v n+1 (s) = T v n (s) ≥ R. As this holds for all R < T v ω (s), we have shown that v ω (s) ≥ T v ω (s).

The Proof of Part 2
The proof is based on using Theorem 2, which allows us to choose, for each state s ∈ S, a mixed action y(s) ∈ Δ(B(s)) that is optimal for player 2 in the one-shot game M v * (s). Let σ be the stationary strategy for player 2 that uses y(s) at each state s ∈ S. The proof that σ is optimal is similar to the proof of part 3 of Theorem 11.
The Proof of Part 3 This follows from part 1 together with part 4 of Theorem 11.
Theorem 12 is almost a special case of Theorem 5.4 in Nowak [13]. Nowak's result is for a much more general setting with Borel state and action spaces, but uses a condition on n-stage games called FA that is not needed for our simpler result.
It need not be true that v ω is the value of the stochastic game if player 2 has an infinite action set.

Example 6 Here is an example in which player 1 is a dummy. The state space is N. In state 0, player 2 chooses an action b from the action set {2, 3, . . .}, and the play then moves to state b. Motion is deterministic from state n to n − 1 for n ≥ 2 and state 1 is absorbing. The reward r depends only on the state and is equal to 0 except at state 2 where it equals 1.
By choosing an initial action b > n, player 2 can guarantee a reward of 0 for the first n stages. Hence, v n (0) = T n 0(0) = 0. However, the value of the stochastic game with initial state 0 is 1. In this example, the value function is equal to v ω+1 = T v ω .
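This computation can be checked numerically. The sketch below hard-codes the n-stage values derived from the deterministic dynamics; the assumption that player 2's action b moves the play from state 0 to state b (b ≥ 2) is taken from the surrounding argument.

```python
# Player 1 is a dummy; from state 0, action b (assumed to move play to
# state b, with b >= 2) delays the single reward of 1, collected at state 2.

def v_stage(n, k):
    """n-stage value from state k, computed by hand from the dynamics."""
    if k == 1:
        return 0.0                            # absorbing, reward 0
    if k >= 2:
        return 1.0 if n >= k - 1 else 0.0     # reward 1 reached after k - 2 steps
    # k == 0: player 2 minimizes over actions b; b = n + 2 already yields 0
    return min(v_stage(n - 1, b) for b in range(2, n + 3))

for n in range(1, 30):
    assert v_stage(n, 0) == 0.0   # v_n(0) = 0 for every n, hence v_omega(0) = 0

# The pointwise limit: v_omega(k) = 1 for k >= 2, v_omega(1) = v_omega(0) = 0.
v_omega = lambda k: 1.0 if k >= 2 else 0.0
# One more application of T at state 0: every action b now yields v_omega(b) = 1.
Tv0 = min(v_omega(b) for b in range(2, 100))
assert Tv0 == 1.0   # T v_omega(0) = 1, the true value at state 0
```

This makes the gap concrete: every n-stage value at state 0 is 0, yet a single further application of T already produces the value 1.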
The next example shows that player 1 may have no ϵ-optimal stationary strategy, even if the state and action spaces are finite, and the reward function has only finite values.

Example 7
Consider the following game. The value for the non-trivial state is infinity, and player 1 has an optimal strategy. To show this we construct a strategy for player 1 that guarantees that with positive probability either entry (B, L) is played infinitely often or entry (T , R) is played infinitely often. Let {q n } n∈N be a sequence in (0, 1) such that q * := Π ∞ n=0 q n > 0. Let H denote the set of histories in which the final state is still the non-trivial one (so in which no absorption has taken place). For each h ∈ H let R h denote the number of times that player 2 has played action R in h. Consider the following strategy for player 1: at any h ∈ H , play the mixed action (q R h , 1 − q R h ). It is not difficult to verify that if player 1 uses this strategy then, regardless of the strategy of player 2, with probability at least q * , either entry (B, L) is played infinitely often or entry (T , R) is played infinitely often. To see this, notice that (a) the probability of absorption in entry (B, R) is at most 1 − q * , and (b) the probability that there is an N ∈ N such that at all stages n > N entry (T , L) is played is zero.
Any stationary strategy of player 1 only guarantees a finite expected payoff. Indeed, this is clear if the stationary strategy is pure. So assume that the stationary strategy places probability z ∈ (0, 1) on action T . Then, if player 2 plays action R at every stage, the expected payoff is Σ ∞ n=0 z n · z = z/(1 − z), which is finite, as claimed. Hence, player 1 has no ϵ-optimal stationary strategy, for any ϵ > 0. Yet, there is a Markov strategy for player 1 that guarantees that the expected payoff is infinite. Let the Markov strategy π place probability z n = (n + 1)/(n + 2) on action T at stage n. So, the sequence of probabilities on action T is 1/2, 2/3, 3/4, . . .. Consider a pure Markov strategy σ for player 2. Since π is Markov, it is sufficient to consider such responses from player 2. We distinguish two cases.

Case 1: Suppose that σ chooses action R at infinitely many stages, say at stages n 0 < n 1 < · · · . Absorption can only occur at these stages, and play survives the k-th of them with probability z n k ≥ z k , since the sequence (z n ) is nondecreasing. Hence the probability that no absorption has occurred before stage n k is at least Π j<k z j = 1/(k + 1), and at stage n k entry (T , R) yields reward 1 with probability z n k ≥ 1/2. Therefore u(s, π, σ) ≥ Σ ∞ k=0 1/(2(k + 1)) = ∞.
Case 2: Suppose that σ chooses action L at all stages from stage N onwards. Note that no absorption occurs before stage N with probability at least Π N−1 n=0 z n = 1/(N + 1). Since Σ ∞ n=N (1 − z n ) = Σ ∞ n=N 1/(n + 2) = ∞ and player 1's choices at different stages are independent, the second Borel-Cantelli lemma implies that, if no absorption occurs before stage N , then with probability 1, player 1 will choose action B infinitely often from stage N onwards, so that entry (B, L) is played infinitely often. Therefore, u(s, π, σ) ≥ 1/(N + 1) · ∞ = ∞.
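The two payoff computations of this example can be checked numerically. The reward entries used below (entries (T , R) and (B, L) pay 1, entry (B, R) absorbs with reward 0, entry (T , L) pays 0) are reconstructed from the surrounding argument, since the payoff matrix itself is not reproduced here.

```python
# Example 7 with reconstructed rewards: (T, R) and (B, L) pay 1, (T, L)
# pays 0, and only (B, R) is absorbing (with reward 0).
# Player 2 plays R at every stage.

# Stationary strategy: T with fixed probability z. The payoff collected at
# stage n requires surviving n times (prob z**n) and playing T (prob z):
z = 0.9
total = sum(z ** n * z for n in range(10_000))
assert abs(total - z / (1 - z)) < 1e-6   # finite total: z / (1 - z)

# Markov strategy: T with probability z_n = (n+1)/(n+2) at stage n. The
# payoff truncated at stage N is sum_{n<N} prod_{j<=n} z_j = sum_{n<N} 1/(n+2),
# a harmonic sum, so the expected payoff diverges as N grows.
def truncated_payoff(N):
    acc, survive = 0.0, 1.0
    for n in range(N):
        z_n = (n + 1) / (n + 2)
        acc += survive * z_n   # reward 1 collected at entry (T, R)
        survive *= z_n         # avoid absorption at entry (B, R)
    return acc

assert truncated_payoff(10_000) > truncated_payoff(100) + 4  # grows like log N
```

The contrast is exactly the one the example exploits: a constant absorption risk makes the payoff a convergent geometric series, while the vanishing risk 1/(n + 2) leaves a divergent harmonic series.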