We begin this section by introducing the notions of (almost sure) fair strategy and stopping games under fairness. From now on, we assume that Player 2 represents the environment, which tries to minimize the amount of rewards obtained by the system, thus fairness restrictions will be applied to this player.
Definition 1
Given a stochastic game \(\mathcal {G}= (V, (V_1, V_2, V_\mathsf {P}), \delta )\), the set of fair plays for Player 2 (denoted \( FP ^2\)) is defined as follows:
$$ FP ^2 = \{ \omega \in Paths _{\mathcal {G}} \mid \forall v' \in V_2: v' \in \inf (\omega ) \Rightarrow post (v') \subseteq \inf (\omega ) \} $$
Alternatively, if we consider each vertex as a proposition, \( FP ^2\) can be written using LTL notation as: \(\bigwedge _{v \in V_2} \bigwedge _{v' \in post (v)}(\Box \Diamond v \Rightarrow \Box \Diamond v')\). This property is \(\omega \)-regular, thus it is measurable in the \(\sigma \)-algebra generated by the cones of \( Paths _{\mathcal {G}}\) (see e.g., [5, p.804]). This is a state-based notion of fairness, but it can be straightforwardly extended to settings where transitions are considered. For the sake of simplicity we do not do so in this paper.
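For concreteness, the following Python sketch (our own illustration, not part of the formal development) evaluates the \( FP ^2\) condition on an ultimately periodic play \(\omega = \hat{\omega }\cdot \rho ^\omega \), for which \(\inf (\omega )\) is exactly the set of vertices occurring in the cycle \(\rho \). The representation of the game by a set `V2` and a successor map `post` is an assumption made for the example.

```python
def is_fair_play(prefix, cycle, V2, post):
    """Check the FP^2 condition on the ultimately periodic play prefix . cycle^omega.

    For such a play, inf(omega) is exactly the set of vertices on the cycle, so
    fairness amounts to: every Player 2 vertex on the cycle has all of its
    successors on the cycle as well.
    """
    inf_omega = set(cycle)  # vertices visited infinitely often
    for v in inf_omega:
        if v in V2 and not set(post[v]) <= inf_omega:
            return False     # some successor of v is neglected forever
    return True

# Tiny example: Player 2 vertex 'b' loops on itself and never takes its edge to 't'.
V2 = {"b"}
post = {"a": ["b"], "b": ["b", "t"], "t": ["t"]}
print(is_fair_play(["a"], ["b"], V2, post))  # False: 't' in post('b') is never visited
```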
Next, we introduce the notion of (almost-sure) fair strategies for Player 2.
Definition 2
Given a stochastic game \(\mathcal {G}= (V, (V_1, V_2, V_\mathsf {P}), \delta )\), a strategy \(\pi _{2} \in \varPi _{2}\) is said to be almost-sure fair (or simply fair) iff it holds that: \( Prob ^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}( FP ^2) = 1\), for every \(\pi _{1} \in \varPi _{1}\) and \(v \in V\).
The set of all the fair strategies for Player 2 is denoted by \(\varPi ^{\mathcal {F}}_{2}\). We combine this notation with the notation introduced in Sect. 3, e.g., \(\varPi ^{\textit{M}\mathcal {F}}_{2}\) refers to the set of all memoryless and fair strategies for Player 2. The previous definition is based on the notion of fair scheduler as introduced for Markov decision processes [5, 7].
Note that, for stopping games, every strategy is fair, because the probability of visiting a non-terminal vertex infinitely often is 0. Also notice that there are games which are not stopping but become stopping if Player 2 uses only fair strategies. This is the main idea behind the notion of stopping under fairness, introduced in the following definition.
Definition 3
A stochastic game \(\mathcal {G}= (V, (V_1, V_2, V_\mathsf {P}), \delta )\) is said to be stopping under fairness iff for all strategies \(\pi _{1} \in \varPi _{1}, \pi _{2} \in \varPi ^{\mathcal {F}}_{2}\) and vertex \(v \in V\), it holds that \( Prob ^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}(\Diamond T)=1\), where T is the set of terminal vertices of \(\mathcal {G}\).
Checking stopping criteria. This section is devoted to the effective characterization of games that are stopping under fairness. The following lemma states that, for every game that is not stopping under fairness, there is a memoryless deterministic strategy for Player 1 and a fair strategy for Player 2 witnessing this fact.
Lemma 1
Let \(\mathcal {G}= (V, (V_1, V_2, V_\mathsf {P}), \delta )\) be a stochastic game, \(v \in V\), and T the set of terminal states of \(\mathcal {G}\). If \( Prob ^{\pi _{1}, \pi _{2}}_{\mathcal {G}, v}(\Diamond T) < 1\) for some \(\pi _{1} \in \varPi _{1}\) and \(\pi _{2} \in \varPi ^{\mathcal {F}}_{2}\), then, for some memoryless and deterministic strategy \(\pi _{1}' \in \varPi ^{\textit{MD}}_{1}\) and fair strategy \(\pi _{2}' \in \varPi ^{\mathcal {F}}_{2}\), \( Prob ^{\pi _{1}', \pi _{2}'}_{\mathcal {G}, v}(\Diamond T) < 1\).
The proof of this lemma follows by noticing that, if \( Prob ^{\pi _{1}, \pi _{2}}_{\mathcal {G}, v}(\Diamond T) < 1\), there must be a finite path that leads with some probability to an end component not containing a terminal state and which is a trap for the fair strategy \(\pi _{2}\). This part of the game enables the construction of a memoryless deterministic strategy for Player 1 by ensuring that it follows the same finite path (but skipping loops) and that it traps Player 2 in the same end component.
The next theorem states that checking stopping under fairness in a stochastic game \(\mathcal {G}\) can be reduced to checking the stopping criterion on an MDP, which is obtained from \(\mathcal {G}\) by fixing a strategy for Player 2 that selects among the outgoing transitions according to a uniform distribution. Thus, this theorem enables a graph-based procedure to determine stopping under fairness.
Theorem 1
Let \(\mathcal {G}= (V, (V_1, V_2, V_\mathsf {P}), \delta )\) be a stochastic game and T its set of terminal states. Consider the Player 2 (memoryless) strategy \(\pi ^{\mathrm {u}}_{2} : V_2 \rightarrow \mathcal {D}(V)\) defined by \(\pi ^{\mathrm {u}}_{2}(v)(v') = \frac{1}{\# post (v)}\), for all \(v \in V_2\) and \(v' \in post(v)\). Then, \(\mathcal {G}\) is stopping under fairness iff \( Prob ^{\pi _{1}, \pi ^{\mathrm {u}}_{2}}_{\mathcal {G},v}(\Diamond T)=1\) for every \(v\in V\) and \(\pi _{1} \in \varPi _{1}\).
While the “only if” part of the theorem is direct, the “if” part is proved by contraposition using Lemma 1.
Theorem 1 introduces an algorithm to check if the stochastic game \(\mathcal {G}\) is stopping under fairness: transform \(\mathcal {G}\) into the MDP \(\mathcal {G}^{\pi ^{\mathrm {u}}_{2}}\) by fixing \(\pi ^{\mathrm {u}}_{2}\) in \(\mathcal {G}\) and check whether \( Prob ^{\pi _{1}}_{\mathcal {G}^{\pi ^{\mathrm {u}}_{2}},v}(\Diamond T)=1\) for all \(v\in V\). As a consequence, we have the following theorem.
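As an illustration of this reduction, the sketch below builds the MDP \(\mathcal {G}^{\pi ^{\mathrm {u}}_{2}}\) by turning every Player 2 vertex into a probabilistic vertex that chooses uniformly among its successors. The dictionary-based representation (with `delta[v]` mapping each successor of v to its weight, successors of non-probabilistic vertices having weight 1) is an assumption for illustration; the almost-sure reachability check on the resulting MDP can then be performed with standard graph-based MDP analysis.

```python
from fractions import Fraction

def fix_uniform_strategy(V1, V2, VP, delta):
    """Build the MDP obtained by fixing the uniform strategy for Player 2 (Theorem 1).

    Player 2 vertices become probabilistic vertices whose outgoing distribution is
    uniform over post(v); every non-terminal vertex is assumed to have a successor.
    """
    mdp_V1, mdp_VP = set(V1), set(VP) | set(V2)
    mdp_delta = {}
    for v, succs in delta.items():
        if v in V2:
            mdp_delta[v] = {w: Fraction(1, len(succs)) for w in succs}
        else:
            mdp_delta[v] = dict(succs)
    return mdp_V1, mdp_VP, mdp_delta
```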
Theorem 2
Checking whether the stochastic game \(\mathcal {G}\) is stopping under fairness or not is in \(O( poly ( size (\mathcal {G})))\).
Alternatively, we can use Theorem 1 to provide a direct algorithm on \(\mathcal {G}\), avoiding the construction of the intermediate MDP. The main idea is to use a modification of the standard \( pre \) operator, as shown in the following definition:
$$\begin{aligned} \exists Pre _f(C) = {}&\{ v \in V \mid \delta (v,C)> 0\} \\ \forall Pre _f(C) = {}&\{ v \in V_2{\cup } V_\mathsf {P}\mid \delta (v,C)>0\} \cup \{ v \in V_1 \mid \forall v' {\in } V : \delta (v,v') > 0 \Rightarrow v' {\in } C \} \end{aligned}$$
As usual we consider the transitive closures of these operators denoted \(\exists Pre _f^*\) and \(\forall Pre _f^*\), respectively.
Theorem 3
Let \(\mathcal {G}= (V, (V_1, V_2, V_\mathsf {P}), \delta )\) be a stochastic game and let T be the set of its terminal states. Then, (1) \( Prob ^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}(\Diamond T) = 1\) for every \(\pi _{1} \in \varPi _{1}\) and \(\pi _{2} \in \varPi ^{\mathcal {F}}_{2}\) iff \(v \in V\setminus \exists Pre _f^*(V \setminus \forall Pre _f^*(T))\), and (2) \(\mathcal {G}\) is stopping under fairness iff \(\exists Pre _f^*(V \setminus \forall Pre _f^*(T)) = \emptyset \).
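The fixpoint characterization of Theorem 3 translates directly into a simple graph algorithm. The following Python sketch (again over a dictionary-based representation assumed only for illustration) computes the closures \(\forall Pre _f^*(T)\) and \(\exists Pre _f^*(\cdot )\) as least fixpoints and then applies item (2) of the theorem.

```python
def pre_star(V, V1, delta, target, universal):
    """Least fixpoint of X -> target ∪ Pre_f(X) over the game graph.

    universal=False computes ∃Pre_f*: a vertex enters X as soon as one successor is in X.
    universal=True computes ∀Pre_f*: Player 2 and probabilistic vertices enter X as soon
    as one successor is in X; Player 1 vertices only once all their successors are in X.
    """
    X = set(target)
    changed = True
    while changed:
        changed = False
        for v in V:
            succs = [w for w, p in delta.get(v, {}).items() if p > 0]
            if v in X or not succs:
                continue
            if universal and v in V1:
                add = all(w in X for w in succs)
            else:
                add = any(w in X for w in succs)
            if add:
                X.add(v)
                changed = True
    return X

def is_stopping_under_fairness(V, V1, delta, T):
    """Theorem 3(2): the game is stopping under fairness iff ∃Pre_f*(V minus ∀Pre_f*(T)) is empty."""
    bad = set(V) - pre_star(V, V1, delta, T, universal=True)
    return not pre_star(V, V1, delta, bad, universal=False)
```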
Determinacy of Stopping Games under Fairness. The determinacy of stochastic games with Borel and bounded payoff functions follows from Martin’s results [24]. The function \(\textit{rew}\) is unbounded, so Martin’s theorems do not apply to it. In [18], the determinacy of a general class of stopping stochastic games (called transient) with total rewards is proven. However, note that we restrict Player 2 to only play with fair strategies and hence, the last result does not apply either. In [26] the authors classify Player 2’s strategies into proper (those ensuring termination) and improper (those prolonging the game indefinitely). For proving determinacy, the authors assume that the value of the game for Player 2’s improper strategies is \(\infty \). It is worth noting that, for proving the results below, we do not make any assumption about unfair strategies. In the following we prove that the restriction to fair plays does not affect the determinacy of the games.
Figure 3 shows the dependencies of the lemmas that eventually lead to our main results, namely, Theorem 4, which states that the general problem can be restricted to memoryless and deterministic strategies only, and Theorem 5, which establishes determinacy and the correctness of the algorithmic solution through the Bellman equations. To prove Theorem 4 we use the intermediate notion of semi-Markov strategies [18]; a first step towards this reduction is presented in Lemma 2. Lemmas 3 and 4 ensure the transient characteristics of games that are stopping under fairness. They are essential to prove that the expected total reward payoff is well defined (Lemma 5). Approaching Theorem 4, Lemma 6 states that there is always a minimizing fair strategy that is memoryless and deterministic, and Lemma 7 helps to reduce the problem from the domain of semi-Markov strategies to the domain of memoryless deterministic strategies. Using Theorem 4 and Proposition 1, which states that the Bellman equations are well behaved in the lattice of solutions, Theorem 5 is finally proved.
Intuitively, a semi-Markov strategy only takes into account the length of a play, the initial state, and the current state to select the next step in the play.
Definition 4
Let \(\mathcal {G}= (V, (V_1, V_2, V_\mathsf {P}), \delta )\) be a stochastic game. A strategy \(\pi _{i} \in \varPi _{i}\) is called semi-Markov if: \(\pi _{i}(v \hat{\omega }v') = \pi _{i}(v \hat{\omega }'v')\), for every \(v \in V\) and \(\hat{\omega }, \hat{\omega }' \in V^*\) such that \(|\hat{\omega }|=|\hat{\omega }'|\).
Notice that, by fixing an initial state v, a semi-Markov strategy \(\pi _{i}\) can be thought of as a sequence of memoryless strategies \(\pi _{i}^{0,v}\pi _{i}^{1,v}\pi _{i}^{2,v}\ldots \) where \(\pi _{i}(v)=\pi _{i}^{0,v}(v)\) and \(\pi _{i}(v\hat{\omega }v')=\pi _{i}^{|\hat{\omega }|+1,v}(v')\). The set of all semi-Markov (resp. semi-Markov fair) strategies for player i is denoted \(\varPi ^{\textit{S}}_{i}\) (resp. \(\varPi ^{\textit{S}\mathcal {F}}_{i}\)).
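For intuition, this view can be realized concretely as a small data structure; the following sketch (names and representation are our own choice) stores, for each initial vertex, a sequence of memoryless strategies and resolves a history by looking only at its first vertex, its length, and its last vertex.

```python
class SemiMarkovStrategy:
    """A semi-Markov strategy seen as a sequence of memoryless strategies per initial vertex.

    levels[v0][k] is the memoryless strategy used at step k of plays starting in v0;
    each memoryless strategy maps a vertex to a distribution over its successors.
    """
    def __init__(self, levels):
        self.levels = levels

    def __call__(self, history):
        v0, k, last = history[0], len(history) - 1, history[-1]
        return self.levels[v0][k](last)

# Histories of equal length that start in 'a' and end in 'c' get the same decision:
strat = SemiMarkovStrategy({"a": [lambda v: {"b": 1.0},
                                  lambda v: {"c": 1.0},
                                  lambda v: {"t": 1.0}]})
assert strat(["a", "b", "c"]) == strat(["a", "x", "c"])
```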
The importance of semi-Markov strategies lies in the fact that, when Player 2 plays a semi-Markov strategy, any strategy of Player 1 can be mimicked by a semi-Markov strategy, as stated in the following lemma.
Lemma 2
Let \(\mathcal {G}\) be a stopping under fairness stochastic game, and let \(\pi _{2} \in \varPi ^{\textit{S}\mathcal {F}}_{2}\) be a fair and semi-Markov strategy. Then, for any \(\pi _{1} \in \varPi _{1}\), there is a semi-Markov strategy \(\pi ^*_{1} \in \varPi ^{\textit{S}}_{1}\) such that \(\mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G}, v}[\textit{rew}] = \mathbb {E}^{\pi ^*_{1}, \pi _{2}}_{\mathcal {G}, v}[\textit{rew}]\).
Proof (Sketch)
The proof follows the arguments of Theorem 4.2.7 in [18] adapted to our setting.
Consider the event \(\Diamond ^{k} v' = \{ \omega \in Paths _\mathcal {G}\mid \omega _k = v'\}\), for \(k\ge 0\). That is, the set of runs in which \(v'\) is reached after exactly k steps. We define \(\pi ^*_{1}\) as follows. For \(v'\) with \( Prob ^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}(\Diamond ^k v') > 0\) and \(|\hat{\omega }v'| = k\),
$$ \pi ^*_{1}(\hat{\omega }v')(v'') = Prob ^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}(\Diamond ^{k+1} v'' \mid \Diamond ^k v'). $$
For \(v'\) with \( Prob ^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}(\Diamond ^k v') = 0\) and \(|\hat{\omega }v'| = k\) we define \(\pi ^*_{1}(\hat{\omega }v')\) to be the uniform distribution on \( post (v')\). Notice that \(\pi ^*_{1}\) is a semi-Markov strategy. We prove that \(\pi ^*_{1}\) satisfies the conclusion of the lemma. For this, we first show that \( Prob ^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}(\Diamond ^{k} v') = Prob ^{\pi ^*_{1}, \pi _{2}}_{\mathcal {G},v}(\Diamond ^{k} v')\) by induction on k, and use this equality to conclude that \(\mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G}, v}[\textit{rew}] = \mathbb {E}^{\pi ^*_{1}, \pi _{2}}_{\mathcal {G}, v}[\textit{rew}]\). \(\square \)
In a stopping game, all non-terminal states are transient (a state is transient if the expected time that both players spend in it is finite). In fact, [18] defines a stopping game with terminal states in T as a transient game, i.e., a game in which \(\sum ^\infty _{N=1} \sum _{\hat{\omega } \in (V \setminus T)^{N}} Prob ^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}(\hat{\omega }) < \infty \) for all strategies \(\pi _{1}\in \varPi _{1}\) and \(\pi _{2}\in \varPi _{2}\). Obviously, this generality does not hold in our case, since unfair strategies may make the game dwell forever in a set of non-terminal states. Therefore, we prove a weaker property in our setting. Roughly speaking, the next lemma states that, in games that stop under fairness, non-terminal states are transient, provided that both players play memoryless strategies and, in particular, that Player 2 plays only fair strategies.
Lemma 3
Let \(\mathcal {G}= (V, (V_1, V_2, V_\mathsf {P}), \delta )\) be a stochastic game that is stopping under fairness with T being the set of terminal states. Let \(\pi _{1} \in \varPi ^{\textit{M}}_{1}\) be a memoryless strategy for Player 1 and \(\pi _{2} \in \varPi ^{\textit{M}\mathcal {F}}_{2}\) a memoryless fair strategy for Player 2. Then \(\sum ^\infty _{N=1} \sum _{\hat{\omega } \in (V \setminus T)^{N}} Prob ^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}(\hat{\omega }) < \infty \).
This result can be extended to all the strategies of Player 1. The main idea behind the proof is to fix a stationary fair strategy for Player 2 (e.g., a uniformly distributed strategy). This yields an MDP that stops for every strategy of Player 1, and furthermore, it can be seen as a one-player transient game (as defined in [18]). Hence, the result follows from Lemma 3 and Theorem 4.2.12 in [18].
Lemma 4
Let \(\mathcal {G}\) be a stochastic game that is stopping under fairness and let T be the set of terminal states. In addition, let \(\pi _{1} \in \varPi _{1}\) be a strategy for Player 1 and \(\pi _{2} \in \varPi ^{\textit{M}\mathcal {F}}_{2}\) be a fair and memoryless strategy for Player 2. Then \(\sum ^\infty _{N=0} \sum _{\hat{\omega } \in v(V \setminus T)^N} Prob ^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}(\hat{\omega }) < \infty \).
Using the previous lemma, some fairly simple calculations show that the value of the total accumulated reward payoff game is well defined for any strategies of the players. As a consequence, the value of the game is bounded from above for any strategy of Player 1. This is stated in the next lemma.
Lemma 5
Let \(\mathcal {G}= (V, (V_1, V_2, V_\mathsf {P}), \delta , r )\) be a stochastic game that is stopping under fairness and \(\pi _{1} \in \varPi _{1}\) a strategy for Player 1. Then, for every memoryless fair strategy \(\pi _{2} \in \varPi ^{\textit{M}\mathcal {F}}_{2}\) for Player 2 and every \(v \in V\), \(\mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] < \infty \). Moreover, for every vertex \(v \in V\), \(\inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] < \infty \).
The following lemma is crucial and plays an important role in the rest of the paper. Intuitively, it states that, when Player 1 plays a memoryless strategy, Player 2 has an optimal deterministic memoryless fair strategy. This lemma guarantees the existence of a minimizing memoryless deterministic fair strategy for Player 2 in the general case.
Lemma 6
Let \(\mathcal {G}= (V, (V_1, V_2, V_\mathsf {P}), \delta , r )\) be a stochastic game that is stopping under fairness and let \(\pi _{1} \in \varPi ^{\textit{M}}_{1}\) be a memoryless strategy for Player 1. There exists a deterministic memoryless fair strategy \(\pi ^*_{2}\in \varPi ^{\textit{MD}\mathcal {F}}_{2}\) such that \(\inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] = \mathbb {E}^{\pi _{1}, \pi ^*_{2}}_{\mathcal {G},v}[\textit{rew}]\), for every \(v \in V\).
Proof (Sketch)
Though it differs in the details, the proof strategy is inspired by the proof of Lemma 10.102 in [5]. We first construct a reduced MDP \(\mathcal {G}^{\pi _{1}}_{\min }\) which preserves exactly the optimizing part of the MDP \(\mathcal {G}^{\pi _{1}}\). Thus \(\delta ^{\pi _{1}}_{\min }(v,v')=\delta ^{\pi _{1}}(v,v')\) if \(v\in V_1\cup V_\mathsf {P}\), or \(v\in V_2\) and \(x_v= r (v)+x_{v'}\), where, for every \(v\in V\), \(x_v = \inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]\) (which exists due to Lemma 5). Otherwise, \(\delta ^{\pi _{1}}_{\min }(v,v')=0\). \(\mathcal {G}^{\pi _{1}}_{\min }\) can be proved to be stopping under fairness.
Then, the strategy \(\pi ^*_{2}\) for \(\mathcal {G}^{\pi _{1}}_{\min }\) is constructed as follows. For every \(v\in V\), let \({\Vert {v}\Vert }\) be the length of the shortest path fragment to some terminal vertex in T in the MDP \(\mathcal {G}^{\pi _{1}}_{\min }\). Define \(\pi ^*_{2}(v)(v')=1\) for some \(v'\) such that \(\delta ^{\pi _{1}}_{\min }(v,v')=1\) and \({\Vert {v}\Vert }={\Vert {v'}\Vert }+1\). By definition, \(\pi ^*_{2}\) is memoryless. We first prove that \(\pi ^*_{2}\) yields the optimal solution of \(\mathcal {G}^{\pi _{1}}\) by showing that the vector \((x_v)_{v\in V}\) (i.e., the optimal values of \(\mathcal {G}^{\pi _{1}}\)) is a solution to the set of equations for expected rewards of the Markov chain \(\mathcal {G}^{\pi _{1},\pi ^*_{2}}\). Since this solution is unique, we have that \(x_v=\mathbb {E}^{}_{\mathcal {G}^{\pi _{1},\pi ^*_{2}},v}[\textit{rew}]\) for all \(v\in V\), and hence the optimality of \(\pi ^*_{2}\). To conclude the proof, we show by contradiction that \(\pi ^*_{2}\) is fair. \(\square \)
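The construction of \(\pi ^*_{2}\) can be phrased as a small graph procedure: compute \({\Vert {v}\Vert }\) by a backward breadth-first search from T in the reduced MDP and let Player 2 deterministically move to a successor that decreases this distance. The following Python sketch is our own illustration of this step, assuming the reduced transition function `delta_min` is given as a dictionary of successor distributions.

```python
from collections import deque

def distance_to_terminal(V, delta_min, T):
    """Backward BFS on the reduced MDP: ||v|| = length of a shortest path from v to T."""
    preds = {v: set() for v in V}
    for v, succs in delta_min.items():
        for w, p in succs.items():
            if p > 0:
                preds[w].add(v)
    dist, queue = {t: 0 for t in T}, deque(T)
    while queue:
        v = queue.popleft()
        for u in preds[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def optimal_fair_choice(V2, delta_min, dist):
    """For each Player 2 vertex v, deterministically pick a successor v' with ||v|| = ||v'|| + 1."""
    return {v: min((w for w, p in delta_min[v].items() if p > 0),
                   key=lambda w: dist.get(w, float("inf")))
            for v in V2}
```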
As already noted, semi-Markov strategies can be thought of as sequences of memoryless strategies. The next lemma uses this fact to show that, when Player 2 plays a memoryless and fair strategy, semi-Markov strategies do not improve the value that Player 1 can obtain via memoryless deterministic strategies. The proof of the following lemma adapts the ideas of Theorem 4.2.9 in [18] to our games.
Lemma 7
For any stochastic game \(\mathcal {G}\) that is stopping under fairness, and vertex v, it holds that:
$$\sup _{\pi _{1} \in \varPi ^{\textit{S}}_{1}} \inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] = \sup _{\pi _{1} \in \varPi ^{\textit{MD}}_{1}} \inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] $$
Using the previous lemma, we can conclude that the problem of finding \(\sup _{\pi _{1} \in \varPi _{1}} \inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]\), for any vertex v, can be solved by focusing only on memoryless deterministic strategies, as stated and proved in the following theorem.
Theorem 4
For any stochastic game \(\mathcal {G}\) that is stopping under fairness we have:
$$ \sup _{\pi _{1} \in \varPi _{1}} \inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] = \sup _{\pi _{1} \in \varPi ^{\textit{MD}}_{1}} \inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] $$
Proof
First, we prove that the left-hand term is less than or equal to the right-hand one:
$$\begin{aligned} \sup _{\pi _{1} \in \varPi _{1}} \inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]&{} \le \sup _{\pi _{1} \in \varPi _{1}} \inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]\\&{} \le \sup _{\pi _{1} \in \varPi ^{\textit{S}}_{1}} \inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]\\&{}\le \sup _{\pi _{1} \in \varPi ^{\textit{MD}}_{1}} \inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]. \end{aligned}$$
The first inequality follows from \(\varPi ^{\textit{MD}\mathcal {F}}_{2} \subseteq \varPi ^{\mathcal {F}}_{2}\), the second inequality is due to Lemma 2 and the fact that memoryless strategies are semi-Markov, and the last inequality is obtained by applying Lemma 7.
To prove the other inequality, we calculate:
$$\begin{aligned} \sup _{\pi _{1} \in \varPi ^{\textit{MD}}_{1}} \inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]&{} = \sup _{\pi _{1} \in \varPi ^{\textit{MD}}_{1}} \inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] \\&{} \le \sup _{\pi _{1} \in \varPi _{1}} \inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]. \end{aligned}$$
The first equality is a consequence of Lemma 6 and the second inequality is due to properties of suprema. \(\square \)
The standard technique to prove the determinacy of stopping games is by showing that the Bellman operator
$$ \varGamma (f)(v) = {\left\{ \begin{array}{ll} r (v) + \mathop {\sum }\nolimits _{v' \in post (v)} \delta (v,v') f(v') &{} \text { if } v \in V_\mathsf {P}\setminus T \\ \max \{ r (v) + f(v') \mid v' \in post (v) \}&{} \text { if } v \in V_1 \setminus T, \\ \min \{ r (v) + f(v') \mid v' \in post (v) \} &{} \text { if } v \in V_2 \setminus T, \\ 0 &{} \text { if } v \in T. \end{array}\right. } $$
has a unique fixpoint. However, in the case of games that are stopping under fairness, \(\varGamma \) may have several fixpoints, as shown by the next example.
Example 1. Consider the (one-player) game in Fig. 4, where Player 1’s vertices are drawn as boxes, Player 2’s vertices are drawn as diamonds, and probabilistic vertices are depicted as circles. Note that, in that game, the greatest fixpoint is (1, 1, 1, 0). Yet, (0.5, 0.5, 1, 0) is also a fixpoint, as \(\varGamma (0.5,0.5,1,0) = (0.5,0.5,1,0)\). In fact, the Bellman operator for this game has infinitely many fixpoints: any f of the form (x, x, 1, 0) with \(x\in [0,1]\) is a fixpoint.
Thus, the standard approach cannot be used here. Instead, we use the greatest fixpoint to prove determinacy, but this cannot be done directly on \(\varGamma \). A main difficulty is that the Knaster-Tarski theorem does not apply to \(\varGamma \), since \((\mathbb {R}^V, \le )\) is not a complete lattice. Using the extended reals (\((\mathbb {R} \cup \{\infty \})^V\)) instead is not a solution, as in some cases the greatest fixpoint would assign \(\infty \) to some vertices (e.g., \((\infty ,\infty ,0)\) would be the greatest fixpoint in the Markov chain of Fig. 5). One possible approach is to approximate the greatest fixpoint from an estimated upper bound via value iteration. Unfortunately, there may be no order relation between f and \(\varGamma (f)\): it may turn out that, for some vertex v, \(\varGamma (f)(v)>f(v)\) before the iteration converges to the fixpoint. This is shown in the next example.
Example 2. Consider the game depicted in Fig. 5. The (unique) fixpoint in this case is (100, 90, 0). Observe that \(\varGamma (120,100,0) = (110, 108, 0)\); thus the value at \(v_1\) increases after one iteration. Several iterations are then needed to reach the greatest fixpoint. Hence, in general, starting value iteration from an estimated upper bound does not guarantee monotone convergence to the greatest fixpoint.
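To make the arithmetic explicit, one Markov chain consistent with these numbers (our own reconstruction for illustration only, since Fig. 5 is not reproduced here; the actual example may differ) lets \(v_0\) move to \(v_1\) with probability 1 and \( r (v_0)=10\), while \(v_1\) moves back to \(v_0\) with probability 0.9 and to the terminal vertex \(v_2\) with probability 0.1, with \( r (v_1)= r (v_2)=0\). On this chain,
$$ \varGamma (120,100,0)(v_0) = 10 + 100 = 110, \qquad \varGamma (120,100,0)(v_1) = 0.9 \cdot 120 + 0.1 \cdot 0 = 108, $$
while the fixpoint equations \(x_0 = 10 + x_1\) and \(x_1 = 0.9\, x_0\) have the unique finite solution \((x_0, x_1) = (100, 90)\); over the extended reals, \((\infty ,\infty ,0)\) also satisfies them, in line with the discussion above.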
We overcome the aforementioned issues by using a modified version of \(\varGamma \). Roughly speaking, we modify the Bellman operator in such a way that it operates over a complete lattice.
Notice that, by Lemma 5, the value \(\mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]\) is finite for every stopping game under fairness \(\mathcal {G}\) and strategies \(\pi _{1} \in \varPi ^{\textit{MD}}_{1}\), \(\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}\). Furthermore, because the number of deterministic memoryless strategies is finite, we also have that the number \(\max \{ \inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \sup _{\pi _{1} \in \varPi ^{\textit{MD}}_{1}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] \mid v \in V \}\) is well defined. From now on, fix a number \(\mathbf {U}\ge \max \{ \inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \sup _{\pi _{1} \in \varPi ^{\textit{MD}}_{1}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] \mid v \in V \}\). We define a modified Bellman operator \(\varGamma ^*: [0,\mathbf {U}]^V \rightarrow [0,\mathbf {U}]^V\) as follows.
$$ \varGamma ^*(f)(v) = {\left\{ \begin{array}{ll} \min \big ( r (v) + \mathop {\sum }\nolimits _{v' \in post (v)} \delta (v,v') f(v'), \ \mathbf {U}\ \big ) &{} \text { if } v \in V_\mathsf {P}\setminus T \\ \min \big ( \max \{ r (v) + f(v') \mid v' \in post (v) \}, \ \mathbf {U}\ \big )&{} \text { if } v \in V_1 \setminus T, \\ \min \big ( \min \{ r (v) + f(v') \mid v' \in post (v) \}, \ \mathbf {U}\ \big ) &{} \text { if } v \in V_2 \setminus T, \\ 0 &{} \text { if } v \in T. \end{array}\right. } $$
Note that \(\varGamma ^*\) is monotone, which can be proven by observing that maxima, minima and convex combinations are all monotone operators. Furthermore, \(\varGamma ^*\) is also Scott continuous (it preserves suprema of directed sets); this can be proven similarly to [10]. The following proposition formalizes these properties.
Proposition 1
\(\varGamma ^*\) is monotone and Scott-continuous.
Note that \(([0,\mathbf {U}]^V, \le )\) is a complete lattice. Thus by Proposition 1 and the Knaster-Tarski theorem [15], the (non-empty) set of fixed points of \(\varGamma ^*\) forms a complete lattice, and the greatest fixpoint of the operator can be approximated by successive applications of \(\varGamma ^*\) to the top element (i.e., \(\mathbf {U}\)) [15]. In the following we denote by \(\nu \varGamma ^*\) the greatest fixed point of \(\varGamma ^*\).
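A direct way to exploit this is value iteration from above: start from the constant vector \(\mathbf {U}\) and apply \(\varGamma ^*\) repeatedly. The following Python sketch (same hypothetical dictionary-based representation as in the previous snippets; the simple absolute-difference stopping criterion is chosen for illustration and yields only an approximation of \(\nu \varGamma ^*\)) implements one application of \(\varGamma ^*\) and the iteration.

```python
def gamma_star(f, V1, V2, VP, T, delta, r, U):
    """One application of the modified Bellman operator Γ*, with values clamped to [0, U]."""
    g = {}
    for v in f:
        if v in T:
            g[v] = 0.0
        elif v in VP:
            g[v] = min(r[v] + sum(p * f[w] for w, p in delta[v].items()), U)
        elif v in V1:
            g[v] = min(max(r[v] + f[w] for w in delta[v]), U)
        else:  # Player 2 vertex
            g[v] = min(min(r[v] + f[w] for w in delta[v]), U)
    return g

def greatest_fixpoint(V, V1, V2, VP, T, delta, r, U, eps=1e-8):
    """Approximate νΓ* by iterating Γ* from the top element, i.e. the constant vector U."""
    f = {v: float(U) for v in V}
    while True:
        g = gamma_star(f, V1, V2, VP, T, delta, r, U)
        if max(abs(g[v] - f[v]) for v in V) <= eps:
            return g
        f = g
```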
The following theorem states that games restricted to fair strategies for Player 2 are determined. Furthermore, the value of the game is given by the greatest fixpoint of \(\varGamma ^*\).
Theorem 5
Let \(\mathcal {G}\) be a stochastic game that is stopping under fairness. It holds that:
$$\inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \sup _{\pi _{1} \in \varPi _{1}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] = \sup _{\pi _{1} \in \varPi _{1}} \inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] = \nu \varGamma ^*(v) $$
Proof
First, note that \(\inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \sup _{\pi _{1} \in \varPi ^{\textit{MD}}_{1}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]\) is a fixed point of \(\varGamma ^*\). Thus we have:
$$\begin{aligned} \sup _{\pi _{1} \in \varPi _{1}} \inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]&\le \inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \sup _{\pi _{1} \in \varPi _{1}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] \\&\le \inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \sup _{\pi _{1} \in \varPi ^{\textit{MD}}_{1}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] \le \nu \varGamma ^*(v) \end{aligned}$$
for any v. The first inequality is a standard property of suprema and infima [21]. The second inequality holds because \(\varPi ^{\textit{MD}\mathcal {F}}_{2} \subseteq \varPi ^{\mathcal {F}}_{2}\) and because of standard properties of MDPs: by fixing a deterministic memoryless fair strategy for Player 2 we obtain a transient MDP, and the optimal strategy for Player 1 in this MDP is attained by a deterministic memoryless strategy [20]. The last inequality holds because \(\inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \sup _{\pi _{1} \in \varPi ^{\textit{MD}}_{1}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]\) is a fixpoint of \(\varGamma ^*\).
It remains to prove that \(\sup _{\pi _{1} \in \varPi _{1}} \inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] \ge \nu \varGamma ^*(v)\). Note that, if there is \(\pi _{1} \in \varPi _{1}\) such that \(\inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}] \ge \nu \varGamma ^*(v)\), the property above follows by properties of suprema. Consider the strategy \(\pi ^*_{1}\) defined as follows: \(\pi ^*_{1}(v) \in \mathop {\mathrm {argmax}}\limits \{\nu \varGamma ^*(v') + r (v) \mid v' \in post (v) \}\). Note that \(\pi ^*_{1}\) is a memoryless and deterministic strategy. For any memoryless, deterministic and fair strategy \(\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}\) we have \(\nu \varGamma ^*(v) \le \mathbb {E}^{\pi ^*_{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]\) (by definition of \(\varGamma ^*\)). Thus, \(\nu \varGamma ^*(v) \le \inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \mathbb {E}^{\pi ^*_{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]\) and then \(\nu \varGamma ^*(v) \le \sup _{\pi _{1} \in \varPi ^{\textit{MD}}_{1}} \inf _{\pi _{2} \in \varPi ^{\textit{MD}\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]\). Finally, by Theorem 4 we get \(\nu \varGamma ^*(v) \le \sup _{\pi _{1} \in \varPi _{1}} \inf _{\pi _{2} \in \varPi ^{\mathcal {F}}_{2}} \mathbb {E}^{\pi _{1}, \pi _{2}}_{\mathcal {G},v}[\textit{rew}]\). \(\square \)
Considerations for an algorithmic solution. Value iteration [9] has been used to compute maximum/minimum expected accumulated rewards in MDPs, e.g., in the PRISM model checker. Usually, the value is computed by approximating the least fixpoint from below using the Bellman equations [9]. In [6], the authors propose to approach these values from both a lower and an upper bound (known as interval iteration [19]). To do so, [6] shows a technique for computing upper bounds on the expected total rewards of MDPs. This approach is based on the fact that, given a stopping MDP \(\mathcal {G}\), \(\mathbb {E}^{\pi _{1}}_{\mathcal {G},v}[\textit{rew}] = \sum _{v' \in R(v)} \zeta _v^{\pi _{1}}(v')* r (v')\), where R(v) denotes the set of states reachable from v, and \(\zeta _v^{\pi _{1}}(v')\) denotes the expected number of visits to \(v'\) in the Markov chain induced by \(\pi _{1}\) when starting at v. [6] describes how to compute a value \(\zeta _{v}^*(v')\) such that \(\zeta _v^*(v') \ge \sup _{\pi _{1} \in \varPi _{1}} \zeta _v^{\pi _{1}}(v')\). Thus, \(\sum _{v' \in R(v)} \zeta _v^{*}(v')* r (v')\) gives an upper bound for \(\sup _{\pi _{1}} \mathbb {E}^{\pi _{1}}_{\mathcal {G},v}[\textit{rew}]\). Our algorithm uses these ideas to provide an upper bound for two-player games. Roughly speaking, the functional \(\varGamma ^*\) defined above presents a form of Bellman equations that enables a value iteration algorithm for solving these games. The iteration needs to start from a value vector that is greater than or equal to the greatest fixpoint. Given a game that is stopping under fairness, we fix a (memoryless) fair strategy for the environment, thus obtaining an MDP. We then use the techniques described above to find an
upper bound for this MDP, which in turn is an upper bound for the original game. The obvious fair strategy to use is the one based on the uniform distribution (as in Theorem 1). This idea is described in Algorithm 1. It is worth noting that, instead of using a single upper bound for every vertex (as in the definition of \(\varGamma ^*\)), the algorithm may use a different upper bound for each component of the value vector; this reduces the number of iterations performed by the algorithm. We have implemented Algorithm 1 as a prototype embedded in the PRISM-games toolset [22], as described in the next section.