Simple Stochastic Games with Almost-Sure Energy-Parity Objectives are in NP and coNP

We study stochastic games with energy-parity objectives, which combine quantitative rewards with a qualitative ω-regular condition: The maximizer aims to avoid running out of energy while simultaneously satisfying a parity condition. We show that the corresponding almost-sure problem, i.e., checking whether there exists a maximizer strategy that achieves the energy-parity objective with probability 1 when starting at a given energy level k, is decidable and in NP ∩ coNP. The same holds for checking if such a k exists and if a given k is minimal.


Introduction
Simple stochastic games (SSGs), also called competitive Markov decision processes [30] or 2½-player games [23,22], are turn-based games of perfect information played on finite graphs. Each state is either random or belongs to one of the players (maximizer or minimizer). A game is played by successively moving a pebble along the game graph, where the next state is chosen by the player who owns the current one or, in the case of random states, according to a predefined distribution. This way, an infinite run is produced. The maximizer tries to achieve an objective (in our case almost surely), while the minimizer tries to prevent this. The maximizer can be seen as a controller trying to ensure an objective in the face of both known random failure modes (encoded by the random states) and an unknown or hostile environment (encoded by the minimizer player).
Stochastic games were first introduced in Shapley's seminal work [48] in 1953 and have since then played a central role in the solution of many problems in computer science, including synthesis of reactive systems [46,42]; checking interface compatibility [27]; well-formedness of specifications [28]; verification of open systems [4]; and many others.
A huge variety of objectives for such games has already been studied in the literature. We will mainly focus on three of them in this paper: parity, mean-payoff, and energy objectives. In order to define them, we assume that numeric rewards are assigned to transitions, and priorities (encoded by bounded non-negative numbers) are assigned to states.
The parity objective simply asks that the minimal priority that appears infinitely often in a run is even. Such a condition is a canonical way to define desired behaviors of systems, such as safety, liveness, fairness, etc.; it subsumes all ω-regular objectives. The algorithmic problem of deciding the winner in non-stochastic parity games is polynomial-time equivalent to the model checking of the modal µ-calculus [51] and is at the center of the algorithmic solutions to Church's synthesis problem [45]. But the impact of parity games goes well beyond automata theory and logic: they facilitated the solution of two long-standing open problems in stochastic planning [29] and in linear programming [32], which was done by careful adaptation of the parity game examples on which the strategy improvement algorithm [31] requires exponentially many iterations.
The parity objective can be seen as a special case of the mean-payoff objective, which asks for the limit average reward per transition along the run to be non-negative. Mean-payoff objectives are among the first objectives studied for stochastic games and go back to a 1957 paper by Gillette [33]. They allow for reasoning about the efficiency of a system, e.g., how fast it operates once optimally controlled.
The energy objective [14] can be seen as a refinement of the mean-payoff objective. It asks for the accumulated reward at any point of a run not to be lower than some finite threshold. As the name suggests, it is useful when reasoning about systems with a finite initial energy level that should never become depleted. Note that the accumulated reward is not bounded a priori, which essentially turns a finite-state game into an infinite-state one.
In this paper we consider SSGs with energy-parity objectives, which require runs to satisfy both an energy and a parity objective. It is natural to consider such an objective for systems that should not only be correct, but also energy efficient. For instance, consider a robot maintaining a nuclear power plant. We not only require the robot to correctly react to all possible chains of events (parity objective for functional correctness), but also never to run out of energy, as charging it manually would be risky (energy objective).
While the complexity of games with single objectives is often in NP ∩ coNP, asking for multiple objectives often makes solving games harder. Parity games are commonly viewed as the simplest of these objectives, and some traditional solutions for non-stochastic games go through simple reductions to mean-payoff or energy conditions (which are quite similar in non-stochastic games) and then to discounted payoff games, which establishes the membership of those problems in UP and coUP [36]. However, asking for two parity objectives to be satisfied at the same time leads to coNP-completeness [21].
We study the almost-sure satisfaction of the energy-parity objective, i.e., satisfaction with probability 1. Such qualitative analysis is important as there are many applications where we need to know whether the correct behavior arises almost surely, e.g., in the analysis of randomized distributed algorithms (see, e.g., [43,49]) and safety-critical examples like the one from above. Moreover, the algorithms for quantitative analysis, i.e., computing the optimal probability of satisfaction, typically start by performing the qualitative analysis first and then solving a game with a simpler objective (see, e.g., [23,15]). Finally, there are stochastic models for which qualitative analysis is decidable but quantitative analysis is not (e.g., probabilistic finite automata [6]). This may also be the case for our model.
Our contributions. We consider stochastic games with energy-parity winning conditions and show that deciding whether maximizer can win almost surely for a given initial energy level k is in NP ∩ coNP. We show the same for checking if such a k exists at all and checking if a given k is the smallest possible for which this holds. The proofs are considerably harder than the corresponding result for MDPs [41] (on which they are partly based), because the attainable mean-payoff value is no longer a valid criterion in the analysis (via combinations of sub-objectives). For example, even though the stored energy might be inexorably drifting towards +∞ (resp. −∞), the mean-payoff value might still be zero because the minimizer (resp. maximizer) can delay payoffs for longer and longer (though not indefinitely, due to the parity condition). Moreover, the minimizer might be able to choose between different ways of losing and never commit to any particular way after any finite prefix of the play (see Example 1).
Our proof characterizes almost-sure energy-parity via a recursive combination of complex sub-objectives called Gain and Bailout, which can each eventually be solved in NP ∩ coNP.
Our proof of the coNP membership is based on a result on the strategy complexity of a natural class of objectives, which is of independent interest. We show (cf. Theorem 6; based on previous work in [35]) that, if an objective O is such that its complement is both shift-invariant and submixing, and every MDP admits optimal finite-memory deterministic maximizer strategies for O, then the same is true in turn-based stochastic games.
Fig. 1: A SSG with two maximizer states, one minimizer state and one probabilistic state. Each state is annotated with its priority and each edge with a reward by which it increases the energy level (respectively, decreases if the reward is negative). The maximizer wins if the lowest priority visited infinitely often is even and the energy level never drops below 0.

Example 1. Figure 1 shows an energy-parity game that the maximizer can win almost surely when starting with an energy level of ≥ 2 from the middle left node. Whenever the game is at that node with an energy level ≥ 3, the maximizer can turn left and has at least a 1/2 chance that the energy level will never drop to 2 while winning the game with priority 2. This is because we can view this process as a random walk on a half line. If $x_n$ is the probability of reaching energy level 2 when starting at n, then these probabilities are the least point-wise positive solution of the following system of linear equations: […] $x_{n-1}$ for all n ≥ 3. We then get that $x_n = 1/2^{n-2}$, so the probability of not reaching energy 2 is ≥ 1/2 for all n ≥ 3. Always turning left guarantees that, almost surely, the parity condition holds and the limit inferior of the energy level is not −∞. We call this condition Gain. Strategies for Gain can be used when the energy level is sufficiently high (at least 3 in our example) to win with a positive probability.
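The random-walk argument can be checked numerically. The following is only a sketch under stated assumptions: the figure is not reproduced here, so the walk is assumed to move by +1 or −1 per round, and the up-probability `p_up = 2/3` is an assumption chosen because, for unit steps, it is the value consistent with the stated solution $x_n = 1/2^{n-2}$. The function name and the truncation level `top` are hypothetical.

```python
def hitting_probabilities(p_up, floor=2, top=60, sweeps=3000):
    """Approximate x_n = P(a +1/-1 walk started at n ever reaches `floor`).

    Iterates x_n = p_up * x_{n+1} + (1 - p_up) * x_{n-1} starting from the
    all-zero vector, which converges to the least non-negative solution.
    Levels above `top` are truncated to 0, so the results are lower bounds.
    """
    q_down = 1.0 - p_up
    x = [0.0] * (top + 2)      # x[n] for n = 0..top+1; x[top+1] stays 0
    x[floor] = 1.0             # boundary condition: already at the floor
    for _ in range(sweeps):
        for n in range(floor + 1, top + 1):
            x[n] = p_up * x[n + 1] + q_down * x[n - 1]
    return x


# Assumed up-probability 2/3; compare x[n] with 1 / 2**(n - 2).
x = hitting_probabilities(p_up=2 / 3)
```

Under these assumptions the computed values agree with the closed form, so from every level n ≥ 3 the walk avoids level 2 forever with probability at least 1/2.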
However, if maximizer plays for Gain and always moves left, then for every initial energy level the chance of eventually dropping the energy down to level 2 is positive, due to the negative cycle. When that happens, the only other option for the maximizer is to move right. There, minimizer can 'choose how to lose', via a disjunction of two conditions that we later formalize as Bailout. Either minimizer goes back to the start state without changing the energy level (thus maximizer wins as the energy stays at level 2 and only the good priority 2 is seen), or minimizer turns right. In the latter case, the play visits a dominating odd priority (which is bad for maximizer) but also increases the energy by 1, which allows maximizer to switch back to playing left for the Gain condition until energy level 2 is reached again.
Our maximizer strategies are a complex interplay between Bailout and Gain. In the example, it is easy to see that the probability of seeing priority 1 infinitely often is zero if maximizer follows the strategy just described (the probability of having to go right more than n times is at most $(1/2)^n$), so maximizer wins this energy-parity game almost surely. Note that maximizer does not win almost surely when the initial energy level is 0 or 1.
Previous work on combined objectives. Non-stochastic energy-parity games have been studied in [16]. They can be solved in NP ∩ coNP and maximizer strategies require only finite (but exponential) memory, a property that also made it possible to show P-time inter-reducibility with mean-payoff parity games. More recently they were also shown to be solvable in pseudo-quasi-polynomial time [26]. Related results on non-stochastic games (e.g., mean-payoff parity) are summarized in [18].
Most existing work on combined objectives for stochastic systems [17,18,9,41] is restricted to Markov decision processes (MDPs; aka 1½-player games). Almost-sure energy-parity objectives for MDPs were first considered in [17,18], where a direct reduction to ordinary energy games was proposed. This reduction relies on the assumption that maximizer can win using finite memory if at all. Unfortunately, this assumption does not necessarily hold: it was shown in [41] that an almost-sure winning strategy for energy-parity in finite MDPs may require infinite memory. Nevertheless, it was possible to recover the original result, that deciding the existence of a.s. winning strategies is in NP ∩ coNP (and pseudo-polynomial time), by showing that the existence of an a.s. winning strategy can be witnessed by the existence of two compatible, finite-memory winning strategies for two simpler objectives. We generalize this approach from MDPs to full stochastic games.
Stochastic mean-payoff parity games were studied in [20], where it was shown that they can be solved in NP ∩ coNP. However, this does not imply a solution for stochastic energy-parity games, since, unlike in the non-stochastic case [16], there is no known reduction from energy-parity to mean-payoff parity in stochastic games. (The reduction in [16] relies on the fact that maximizer has a winning finite-memory strategy for energy-parity, which does not generally hold for stochastic games or MDPs; see above.) A related model is that of 1-counter MDPs (and stochastic games) studied in [12,11,8], since the value of the counter can be interpreted as the stored energy. These papers consider the objective of reaching counter value zero (which is dual to the energy objective of staying above zero), thus the roles of minimizer and maximizer are swapped. However, unlike in this paper, these works do not combine termination objectives with extra parity conditions.
Structure of the paper. The rest of the paper is organized as follows. We start by introducing the notation and formal definitions of games and objectives in the next section. In Section 3 we show how checking almost-sure energy-parity objectives can be characterized in terms of two newly defined auxiliary objectives: Gain and Bailout. In Sections 4 and 5, we show that almost-sure Bailout and Gain objectives, respectively, can be checked in NP and coNP. Section 6 contains our main result: NP and coNP algorithms for checking almost-sure energy-parity games with a known and unknown initial energy, as well as checking if a given initial energy is the minimal one. We conclude and point out some open problems in Section 7. Due to page restrictions, most proofs in the main body of the paper were replaced by sketches. The detailed proofs can be found in the appendix.

Preliminaries
A probability distribution over a set X is a function f : X → [0, 1] such that $\sum_{x \in X} f(x) = 1$. We write D(X) for the set of distributions over X.

Games, Strategies, Measures
A Simple Stochastic Game (SSG) is a finite directed graph $G \stackrel{\text{def}}{=} (V, E, \lambda)$, where all states have an outgoing edge and the set of states is partitioned into states owned by maximizer ($V_\Box$), minimizer ($V_\Diamond$) and probabilistic states ($V_\bigcirc$). The set of edges is $E \subseteq V \times V$ and $\lambda : V_\bigcirc \to D(E)$ assigns each probabilistic state a probability distribution over its outgoing edges. W.l.o.g., we assume that each probabilistic state has at most two successors, because one can introduce a new probabilistic state for each excess successor. We let $\lambda(ws) \stackrel{\text{def}}{=} \lambda(s)$ for all $ws \in (VE)^* V_\bigcirc$. A path is a finite or infinite sequence $\rho \stackrel{\text{def}}{=} s_0 e_0 s_1 e_1 \ldots$ such that $e_i = (s_i, s_{i+1}) \in E$ holds for all indices i. A run is an infinite path and we write $Runs \stackrel{\text{def}}{=} (VE)^\omega$ for the set of all runs.
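As an illustration of the definition, here is a minimal sketch of a container for such a game together with a check of its structural requirements (every state has an outgoing edge, and every probabilistic state carries a distribution over its own outgoing edges). The class name `SSG`, the owner labels and the toy instance are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from fractions import Fraction

MAX, MIN, PROB = "max", "min", "prob"   # hypothetical owner labels


@dataclass
class SSG:
    owner: dict      # state -> MAX | MIN | PROB
    edges: set       # set of (source, target) pairs
    lam: dict        # probabilistic state -> {edge: probability}

    def successors(self, s):
        return sorted(t for (u, t) in self.edges if u == s)

    def validate(self):
        """Check the structural requirements stated in the definition."""
        for s in self.owner:
            if not self.successors(s):
                raise ValueError(f"state {s!r} has no outgoing edge")
        for s, dist in self.lam.items():
            if self.owner[s] != PROB:
                raise ValueError(f"{s!r} is not probabilistic")
            if any(e not in self.edges or e[0] != s for e in dist):
                raise ValueError(f"distribution of {s!r} uses a foreign edge")
            if sum(dist.values()) != 1:
                raise ValueError(f"distribution of {s!r} does not sum to 1")
        missing = [s for s, o in self.owner.items()
                   if o == PROB and s not in self.lam]
        if missing:
            raise ValueError(f"no distribution for {missing}")
        return True


def is_path(game, states):
    """True iff consecutive states are joined by edges of the game."""
    return all((a, b) in game.edges for a, b in zip(states, states[1:]))


# A made-up three-state game, used only to exercise the checks.
toy = SSG(
    owner={"a": MAX, "b": MIN, "c": PROB},
    edges={("a", "b"), ("a", "c"), ("b", "a"), ("c", "a"), ("c", "c")},
    lam={"c": {("c", "a"): Fraction(1, 3), ("c", "c"): Fraction(2, 3)}},
)
```

Exact fractions are used so that the "sums to 1" check is not affected by floating-point rounding.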
A strategy for maximizer is a function $\sigma : (VE)^* V_\Box \to D(E)$ that assigns to each path $ws \in (VE)^* V_\Box$ a probability distribution over the outgoing edges of its target node s. That is, σ(ws)(e) > 0 implies e = (s, t) ∈ E for some t ∈ V. A strategy is called memoryless if σ(xs) = σ(ys) for all $x, y \in (VE)^*$ and $s \in V_\Box$, deterministic if σ(w) is Dirac for all $w \in (VE)^* V_\Box$, and finite-state if there exists an equivalence relation ∼ on $(VE)^* V_\Box$ with a finite index, such that $\sigma(\rho_1) = \sigma(\rho_2)$ if $\rho_1 \sim \rho_2$. Of particular interest to us will be the class of memoryless deterministic strategies (MD) and the class of finite-memory deterministic strategies (FD). Strategies for minimizer are defined analogously and will usually be denoted by $\tau : (VE)^* V_\Diamond \to D(E)$.
A maximizing (minimizing) Markov Decision Process (MDP) is a game in which minimizer (maximizer) has no choices, i.e., all her states have exactly one successor. We will write G[τ] for the MDP resulting from fixing the strategy τ. A Markov chain is a game where neither player has a choice. In particular, G[σ, τ] is a Markov chain obtained by setting, in the game G, the strategies for maximizer and minimizer to σ and τ, respectively.
Given an initial state s ∈ V and strategies σ and τ for maximizer and minimizer, respectively, the set of runs starting in s naturally extends to a probability space as follows. We write $Runs^G_w$ for the w-cylinder, i.e., the set of all runs with prefix $w \in (VE)^* V$. We let $\mathcal{F}^G$ be the σ-algebra generated by all these cylinders. We inductively define a probability function $P^{G,\sigma,\tau}_s$ on all cylinders, which then uniquely extends to $\mathcal{F}^G$ by Carathéodory's extension theorem [5], by setting
$$P^{G,\sigma,\tau}_s(Runs^G_w) \stackrel{\text{def}}{=} \prod_{i=0}^{n-1} dist_i(s_0 e_0 \ldots s_i)(e_i)$$
for $w = s_0 e_0 s_1 e_1 \ldots e_{n-1} s_n$, where $s_0 = s$, $e_i = (s_i, s_{i+1})$ and $dist_i$ is σ(·), τ(·) or λ(·), for $s_i \in V_\Box$, $V_\Diamond$ or $V_\bigcirc$, respectively.
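The probability of a cylinder is a product with one factor per step of the prefix. The sketch below computes it for the special case of memoryless strategies, which is a simplifying assumption: general strategies may depend on the whole history. The function name, the owner labels and the dictionary encoding (state to successor to probability) are hypothetical.

```python
from fractions import Fraction


def cylinder_probability(path, owner, sigma, tau, lam):
    """Probability of the cylinder of `path` = [s0, s1, ..., sn].

    owner[s] is "max", "min" or "prob"; sigma, tau and lam map a state to a
    dict {successor: probability}. Memoryless strategies are assumed here,
    so each factor depends only on the current state s_i.
    """
    dist_for = {"max": sigma, "min": tau, "prob": lam}
    prob = Fraction(1)
    for s, t in zip(path, path[1:]):
        prob *= dist_for[owner[s]][s].get(t, Fraction(0))
    return prob


# Made-up example data.
owner = {"a": "max", "b": "min", "c": "prob"}
sigma = {"a": {"c": Fraction(1)}}                       # maximizer: a -> c
tau = {"b": {"a": Fraction(1)}}                         # minimizer: b -> a
lam = {"c": {"a": Fraction(1, 3), "c": Fraction(2, 3)}}
```

A path consisting of a single state has an empty product, so its cylinder (all runs from that state) gets probability 1.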
Objective Functions. A (Borel) objective is a set $Obj \in \mathcal{F}^G$ of runs. We write $\overline{Obj} \stackrel{\text{def}}{=} Runs \setminus Obj$ for its complement. Borel objectives Obj are weakly determined [40,39], which means that
$$\sup_\sigma \inf_\tau P^{G,\sigma,\tau}_s(Obj) = \inf_\tau \sup_\sigma P^{G,\sigma,\tau}_s(Obj).$$
This quantity is called the value of Obj in state s, and written as $Val^G_s(Obj)$. We say that Obj holds almost surely (abbreviated as a.s.) at state s iff there exists σ such that ∀τ. $P^{G,\sigma,\tau}_s(Obj) = 1$. Let $AS^G(Obj)$ denote the set of states at which Obj holds almost surely. We will drop the superscript G and simply write Runs, $P^{\sigma,\tau}_s$ and AS(Obj), if the game is clear from the context. We use the syntax and semantics of operators F (eventually) and G (always) from the temporal logic LTL [25] to specify some conditions on runs.
A reachability condition is defined by a set of target states T ⊆ V. A run $\rho = s_0 e_0 s_1 \ldots$ satisfies the reachability condition iff there exists an i ∈ N s.t. $s_i \in T$. We write FT ⊆ Runs for the set of runs that satisfy this reachability condition. Given a set of states W ⊆ V, we lift this to a safety condition on runs and write GW ⊆ Runs for the set of runs $\rho = s_0 e_0 s_1 \ldots$ where ∀i. $s_i \in W$.
A parity condition is given by a bounded function parity : V → N that assigns a priority (a non-negative integer) to each state. A run ρ ∈ Runs satisfies the parity condition iff the minimal priority that appears infinitely often on the run is even. The parity objective is the subset PAR ⊆ Runs of runs that satisfy the parity condition.
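For runs that are ultimately periodic (a finite prefix followed by a cycle repeated forever), the parity condition can be checked directly, because exactly the states on the cycle occur infinitely often. The following sketch covers only this special case; the function name and the example priorities are hypothetical.

```python
def satisfies_parity(cycle, parity):
    """Parity condition for the run  prefix . cycle^omega.

    Exactly the states on `cycle` occur infinitely often, so the finite
    prefix is irrelevant and only the cycle has to be inspected.
    """
    if not cycle:
        raise ValueError("the repeated cycle must be non-empty")
    return min(parity[s] for s in cycle) % 2 == 0


# Made-up priorities for three states.
parity = {"a": 2, "b": 1, "c": 0}
```

With these priorities, repeating only `a` is winning, a cycle through `a` and `b` is losing (the minimal recurring priority is 1), and adding `c` makes it winning again.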
Energy conditions are given by a function r : E → Z that assigns a reward value to each edge. For a given initial energy value k ∈ N, a run $s_0 e_0 s_1 e_1 \ldots$ satisfies the k-energy condition if, for every finite prefix of length n, the energy level $k + \sum_{i=0}^{n} r(e_i)$ is greater than or equal to 0. Let EN(k) ⊆ Runs denote the k-energy objective, consisting of those runs that satisfy the k-energy condition.
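On a finite prefix, the k-energy condition is a statement about the running sums of the edge rewards. The sketch below checks it and computes the smallest initial energy that keeps a given prefix non-negative. It only inspects finite prefixes, whereas the objective EN(k) itself is a property of infinite runs; the function names are hypothetical.

```python
from itertools import accumulate


def satisfies_energy(k, rewards):
    """k-energy condition on a finite prefix with edge rewards `rewards`."""
    return all(k + total >= 0 for total in accumulate(rewards))


def minimal_initial_energy(rewards):
    """Smallest k in N for which the prefix satisfies the k-energy condition."""
    lowest = min(accumulate(rewards), default=0)
    return max(0, -lowest)
```

For the reward sequence `[1, -3, 2, -1]` the running sums are 1, −2, 0, −1, so an initial energy of 2 suffices and 1 does not.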
The l-storage condition holds for a run $s_0 e_0 s_1 e_1 \ldots$ if $l + \sum_{i=m}^{n-1} r(s_i, s_{i+1}) \ge 0$ holds for every infix $s_m e_m s_{m+1} \ldots s_n$. Let ST(k, l) ⊆ Runs denote the k-energy l-storage objective, consisting of those runs that satisfy both the k-energy and the l-storage condition. We write ST(k) for $\bigcup_l ST(k, l)$. Clearly, ST(k) ⊆ EN(k).
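The l-storage condition bounds the energy loss over every infix, not just over every prefix. On a finite prefix this can be checked in one pass by tracking the highest energy level seen so far, as in the following sketch (function names are hypothetical).

```python
def maximal_drop(rewards):
    """Largest energy loss over any infix of the reward sequence.

    Equals the maximum, over all positions, of (highest level seen so far)
    minus (current level); the empty infix gives a drop of 0.
    """
    level = peak = drop = 0
    for r in rewards:
        level += r
        peak = max(peak, level)
        drop = max(drop, peak - level)
    return drop


def satisfies_storage(l, rewards):
    """l-storage condition on a finite prefix: no infix loses more than l."""
    return maximal_drop(rewards) <= l
```

For the rewards `[3, -1, -2, 5, -4]` the accumulated reward never becomes negative, so the 0-energy condition holds on this prefix, yet the final step loses 4 units from the peak, so the l-storage condition needs l ≥ 4. This illustrates why storage is the more restrictive requirement.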
Mean-payoff and limit-payoff conditions are defined w.r.t. the same reward function as the energy conditions. The mean-payoff value of a run $\rho = s_0 e_0 s_1 e_1 \ldots$ […]

The combined energy-parity objective EN(k) ∩ PAR is Borel and therefore weakly determined, meaning that it has a well-defined (inf sup = sup inf) value for every game [40,39]. Moreover, the almost-sure energy-parity objective (asking to win with probability 1) is even strongly determined [38]: either maximizer has a strategy to enforce the condition with probability 1 or minimizer has a strategy to prevent this.

Characterizing Energy-Parity via Gain and Bailout
The main theorem of this section (Theorem 5) characterizes almost-sure energy-parity objectives in terms of two intermediate objectives called Gain and k-Bailout for parameters k ≥ 0. This will form the basis of all computability results: we will show (as Theorems 14, 17 and 18) how to compute almost-sure sets for these intermediate objectives.
Definition 2. Consider a finite SSG G = (V, E, λ), as well as reward and parity functions defining the objectives PAR, LimInf(> −∞), LimSup(= ∞) as well as ST(k, l) and EN(k) for every k, l ∈ N. We define combined objectives Gain and k-Bailout as
$$Gain \stackrel{\text{def}}{=} LimInf(> -\infty) \cap PAR, \qquad k\text{-}Bailout \stackrel{\text{def}}{=} \bigcup_{l \in \mathbb{N}} \big( (ST(k, l) \cap PAR) \cup (EN(k) \cap LimSup(= \infty)) \big).$$

The main idea behind these two objectives is a special witness property for energy-parity. We argue that, if maximizer has an almost-sure winning strategy for energy-parity, then he also has one that combines two almost-sure winning strategies, one for Gain and one for k-Bailout.
Notice that playing an almost-sure winning strategy for Gain implies a uniformly lower-bounded strictly positive chance that the energy level never drops below zero (assuming it is sufficiently high to begin with). This fact uses the finiteness of the set of control-states and does not hold for infinite-state MDPs. In the unlikely event that the energy level does get close to zero, maximizer switches to playing an almost-sure winning strategy for k-Bailout. This is a disjunction of two scenarios, and the balance might be influenced by minimizer's choices. In the first scenario (ST(k, l) ∩ PAR), the energy never drops much and stays above zero (thus satisfying energy-parity). In the second scenario (EN(k) ∩ LimSup(= ∞)), the parity objective is temporarily suspended in favor of boosting the energy (while always staying above zero) to a sufficiently high level to switch back to the strategy for Gain and thus try again from the beginning. The probability of infinitely often switching between these modes is zero due to the lower-bounded chance of success in the Gain phase. Therefore, maximizer eventually wins by playing for Gain. Note that maximizer needs to remember the current energy level in order to know when to switch and consequently, this strategy uses infinite memory.
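The bookkeeping behind this mode switching can be sketched as follows. This is only an illustration of why the current energy level has to be remembered: the two thresholds `low` and `high` are hypothetical parameters, not the constants used in the proof, and the function only replays a given sequence of edge rewards rather than playing a game.

```python
def mode_trace(initial_energy, rewards, low, high):
    """Replay the Gain/Bailout bookkeeping along a sequence of edge rewards.

    `low` and `high` are hypothetical thresholds (low < high): the controller
    plays the bailout strategy until the energy reaches `high`, then the gain
    strategy until the energy falls to `low` or below, and so on.
    Returns the mode used at each step and the number of Gain->Bailout switches.
    """
    if not low < high:
        raise ValueError("need low < high")
    energy, mode, switches, trace = initial_energy, "bailout", 0, []
    for r in rewards:
        if mode == "bailout" and energy >= high:
            mode = "gain"
        elif mode == "gain" and energy <= low:
            mode = "bailout"
            switches += 1
        trace.append(mode)
        energy += r
        if energy < 0:
            raise ValueError("energy condition violated")
    return trace, switches
```

Because the decision at each step depends on the unbounded integer `energy`, a controller of this shape cannot in general be implemented with finite memory.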
Example 3. Consider again the game in Fig. 1. The middle left state satisfies both Gain and k-Bailout objectives for all k ≥ 2 almost surely. The respective winning strategies are to always go left for Gain or always go right for k-Bailout when at that state. Note that it satisfies neither the 0-Bailout nor the 1-Bailout objective.
We define the subset W ⊆ V of states from which maximizer can almost surely win both Gain and k-Bailout (assuming sufficiently high initial energy), while at the same time ensuring that the play remains within this set of states. These are the states from which maximizer can win by freely combining individual strategies for the Gain and Bailout objectives.

Definition 4. Given a finite SSG G = (V, E, λ), let W ⊆ V be the largest subset of states satisfying the following condition:
$$W \subseteq AS(Gain \cap GW) \cap \bigcup_{k} AS(k\text{-}Bailout \cap GW).$$

This condition describes a fixed-point, and it is easy to see that if two sets $W_1$ and $W_2$ are such fixed-points, then so is $W_1 \cup W_2$. Thus, the maximal fixed-point W is well-defined.
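The notion of a largest set satisfying a monotone condition can be made concrete with the standard greatest-fixed-point iteration: start from all states and repeatedly discard states that violate the condition. The sketch below assumes a hypothetical oracle `winning_inside`, standing in for the almost-sure sets mentioned in the definition; it says nothing about how those sets are computed, and the toy oracle at the end is made up purely for illustration.

```python
def largest_safe_set(states, winning_inside):
    """Greatest fixed point: the largest W with W <= winning_inside(W).

    `winning_inside(W)` is a hypothetical oracle returning the states from
    which maximizer wins both sub-objectives while keeping the play in W.
    It must be monotone (a larger W never yields a smaller answer).
    """
    w = frozenset(states)
    while True:
        shrunk = w & frozenset(winning_inside(w))
        if shrunk == w:
            return w
        w = shrunk


# Toy oracle: a state is "good inside W" if it has a successor inside W.
toy_edges = {1: [2], 2: [1], 3: [4], 4: [5], 5: []}


def toy_oracle(w):
    return {s for s in w if any(t in w for t in toy_edges[s])}
```

On a finite state set the iteration terminates after at most |V| rounds, since every round that does not stop removes at least one state.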
Our main characterization of almost-sure energy-parity objectives is the following theorem.

Theorem 5. For every k ∈ N, $AS(EN(k) \cap PAR) = AS(k\text{-}Bailout \cap GW)$.

It states that maximizer can almost surely win an EN(k) ∩ PAR objective if, and only if, he can win the easier k-Bailout objective while always staying in the safe set W.
Our proof of this characterization theorem relies on the following claim, which allows us to lift the existence of finite-memory deterministic optimal strategies from MDPs to SSGs. It applies to a fairly general class of objectives and, we believe, is of independent interest.

An objective Obj is called shift-invariant if, for every run $s_1 e_1 s_2 e_2 \ldots$, it holds that $s_1 e_1 s_2 e_2 \ldots \in Obj \iff s_2 e_2 \ldots \in Obj$. Shift-invariance slightly generalizes the better-known tail condition (see [35] for a discussion).

Theorem 6. Let O be an objective such that $\overline{O}$ is both shift-invariant and submixing. If maximizer has optimal FD strategies (from any state s) for O for every finite MDP then maximizer has optimal FD strategies (from any state s) for O for every finite SSG. (Recall that $\overline{Obj}$ denotes the complement of Obj.)

This applies in particular to the Gain objective, but not to k-Bailout objectives, as these are not shift-invariant. A proof of Theorem 6 can be found in Appendix A. It uses a recursive argument based on the notion of reset strategies from [35].
The remainder of this section is dedicated to proving Theorem 5. We will first collect the remaining technical claims about Gain, Bailout, and reachability objectives. Most notably, as Lemma 8, we show that if maximizer can almost surely win Gain in a SSG, then he can do so using a FD strategy which moreover satisfies an energy-parity objective with strictly positive (and lower-bounded) probability. This is shown in part based on Theorem 6 applied to the Gain objective. We will also need the following fact about reachability objectives in finite MDPs.
Lemma 7 ([8, Lemma 3.9]). Let M be a finite MDP and $Reach_T$ be the reachability objective with target […]. One can compute a rational constant c < 1 and an integer h ≥ 0 such that for all states s and i ≥ h we have ∀τ. […]

Lemma 8. Let G = (V, E, λ) be a finite SSG where Gain holds a.s. for every state s ∈ V. Then, for every δ ∈ [0, 1) and s ∈ V, there exists a k ∈ N and an FD strategy σ s.t.
1. ∀τ. $P^{\sigma,\tau}_s(Gain) = 1$, and
2. ∀τ. $P^{\sigma,\tau}_s(EN(k) \cap PAR) \ge \delta$.
Proof. Fix a δ ∈ [0, 1) and a state s ∈ V. Both LimInf(= −∞) and PAR objectives are shift-invariant and submixing, and therefore their union also has both these properties. It follows that $\overline{Gain} = \overline{LimInf(> -\infty) \cap PAR} = LimInf(= -\infty) \cup \overline{PAR}$ is both shift-invariant and submixing, since the complement of a parity objective is also a parity objective. By Lemma 16 and Theorem 6, there exists an almost-sure winning FD strategy σ for maximizer for the objective Gain from s, i.e., ∀τ. $P^{\sigma,\tau}_s(Gain) = 1$, thus yielding Item 1.

Let M be the MDP obtained from G by fixing the strategy σ for maximizer from s. Since G is finite and σ is FD, M is also finite. In M we have ∀τ. $P^{\tau}_s(Gain) = 1$. In particular, in M, the set […] is not reachable, i.e., ∀τ. $P^{\tau}_s(Reach_T) = 0$. By Lemma 7, in M there exist a horizon h ∈ N and a constant c < 1 such that for all i ≥ h we have ∀τ. […]

Since T cannot be reached in M, the condition $\overline{Reach_T}$ evaluates to true and we have ∀τ. […]
Proof. Let ρ be a run in EN(k) ∩ PAR. There are two cases. In the first case we have […], it follows that ρ does not satisfy the l-storage condition for any l ∈ N. So, for every l ∈ N, there exists an infix ρ' of ρ s.t. l + r(ρ') < 0. Let ρ'' be the prefix of ρ before ρ'. Since ρ ∈ EN(k) we have k + r(ρ''ρ') ≥ 0 and thus r(ρ'') > l − k. […]

We now define W' as the set of states that are almost-sure winning for energy-parity with some sufficiently high initial energy level. (W' is also called the winning set for the unknown initial credit problem.) […] and σ a strategy that witnesses this property. Except for a null-set, all runs $\rho = s e_0 s_1 e_1 \ldots e_{n-1} s_n \ldots$ from s induced by σ satisfy EN(k) ∩ PAR.

Let $\rho' = s e_0 s_1 e_1 \ldots s_m$ be a finite prefix of ρ. For every n ≥ 0 we have $k + \sum_{i=0}^{n-1} r(e_i) \ge 0$, since ρ ∈ EN(k). In particular this holds for all n ≥ m. So, for every n ≥ m, we have k […]

Proof. It suffices to show that W' satisfies the monotone condition imposed on W (cf. Definition 4), since W is defined as the largest set satisfying this condition.
Proof of Theorem 5. Towards the ⊆ inclusion, we have […] by Lemma 11(2) and Lemma 12.
Towards the ⊇ inclusion, let s ∈ AS(k-Bailout ∩ GW) and $\sigma_1$ be a strategy that witnesses this. We show that s ∈ AS(EN(k) ∩ PAR). We now consider the modified SSG G' = (W, E, λ) with the state set restricted to W. In particular, s ∈ W and $\sigma_1$ witnesses s ∈ AS(k-Bailout) in G'. We now construct a strategy σ that witnesses s ∈ AS(EN(k) ∩ PAR) in G', and thus also in G. The strategy σ will use infinite memory to keep track of the current energy level of the run.
Apart from σ 1 , we require several more strategies as building blocks for the construction of σ.
First, in G we had ∀s ∈ W. s ∈ AS(Gain ∩ GW), and thus in G' we have ∀s ∈ W. s ∈ AS(Gain). For every s ∈ W we instantiate Lemma 8 for G' with δ = 1/2 and obtain a number $\hat{k}_s$ and a strategy $\hat{\sigma}_s$ with
1. ∀τ. $P^{\hat{\sigma}_s,\tau}_s(Gain) = 1$, and
2. ∀τ. $P^{\hat{\sigma}_s,\tau}_s(EN(\hat{k}_s) \cap PAR) \ge 1/2$.
The strategies $\hat{\sigma}_s$ are called gain strategies.

Second, by the finiteness of V, there is a minimal number $k_2$ such that […] Thus in G' for every s ∈ W there exists a strategy $\tilde{\sigma}_s$ with ∀τ. $P^{\tilde{\sigma}_s,\tau}_s(k_2\text{-}Bailout) = 1$. The strategies $\tilde{\sigma}_s$ are called bailout strategies. Let […] We now define the strategy σ.
Start: First σ plays like $\sigma_1$ from s. Since $\sigma_1$ witnesses s ∈ AS(k-Bailout) against every minimizer strategy τ, almost all induced runs $\rho = s e_0 s_1 e_1 \ldots$ satisfy either (A) ST(k, l) ∩ PAR for some l ∈ N, or (B) EN(k) ∩ LimSup(= ∞). Almost all runs ρ of the latter type (B) (and potentially also some runs of type (A)) satisfy EN(k) and $\sum_{i=0}^{l} r(e_i) \ge k'$ eventually for some l. If we observe $\sum_{i=0}^{l} r(e_i) \ge k'$ for some prefix $s e_0 s_1 e_1 \ldots e_l s'$ of the run ρ, then our strategy σ plays from s' as described in the Gain part below. Otherwise, if we never observe this condition, then our run ρ is of type (A) and σ continues playing like $\sigma_1$. Since property (A) implies EN(k) ∩ PAR, this is sufficient.
Gain: In this case we are in the situation where we have reached some state s' after some finite prefix ρ' of the run, where r(ρ') ≥ k'. Our strategy σ now plays like the gain strategy $\hat{\sigma}_{s'}$, as long as $r(\rho'') \ge k' - k_1$ holds for the current prefix ρ'' of the run. By Item 2, this will satisfy ∀τ. $P^{\hat{\sigma}_{s'},\tau}_{s'}(EN(\hat{k}_{s'}) \cap PAR) \ge 1/2$ and thus ∀τ. $P^{\hat{\sigma}_{s'},\tau}_{s'}(EN(k_1) \cap PAR) \ge 1/2$. It follows that with probability ≥ 1/2 we will keep playing $\hat{\sigma}_{s'}$ forever and satisfy PAR and always r(ρ'') […] In the remaining case (which happens with probability ≤ 1/2) we continue playing as described in the Bailout part below.
Bailout: In this case we are in the situation where we have reached some state s' ∈ W after some finite prefix ρ' of the run, where $k + r(\rho') = k_2$. Since s' ∈ W, we can now let our strategy σ play like the bailout strategy $\tilde{\sigma}_{s'}$ and obtain ∀τ. $P^{\tilde{\sigma}_{s'},\tau}_{s'}(k_2\text{-}Bailout) = 1$.

Thus almost all induced runs $\rho'' = s' e_0 s_1 e_1 \ldots$ from s' satisfy either […] As long as r(ρ) < k' holds for the current prefix ρ of the run, we keep playing $\tilde{\sigma}_{s'}$. Otherwise, if eventually r(ρ) ≥ k' holds, then we switch back to playing the Gain strategy above. All the runs that never switch back to playing the Gain strategy must be of type (A) and thus satisfy PAR. Since we have $k_2\text{-}Bailout \subseteq EN(k_2)$, it follows that, for every prefix ρ''' of the run from s' according to $\tilde{\sigma}_{s'}$, we have $k_2 + r(\rho''') \ge 0$. Thus, for every prefix of ρ, we have […]

As shown above, almost all runs induced by σ that eventually stop switching between the three modes satisfy EN(k) ∩ PAR. Switching from Gain/Bailout to Start is impossible, but switching from Gain to Bailout and back is possible. However, the set of runs that infinitely often switch between Gain and Bailout is a null-set, because the probability of switching from Gain to Bailout is ≤ 1/2. Thus, σ witnesses s ∈ AS(EN(k) ∩ PAR).
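The null-set claim in the last step can be spelled out as a short calculation. The events $S_n$ are notation introduced here for illustration only; the bound of 1/2 per Gain phase is the one established above and holds regardless of the history of the play.

```latex
% S_n: the run switches from Gain to Bailout at least n times
% (notation assumed here, not taken from the paper).
\[
  P^{\sigma,\tau}_s(S_1) \le \tfrac12
  \quad\text{and}\quad
  P^{\sigma,\tau}_s(S_{n+1} \mid S_n) \le \tfrac12
  \;\Longrightarrow\;
  P^{\sigma,\tau}_s(S_n) \le 2^{-n}
  \;\Longrightarrow\;
  P^{\sigma,\tau}_s\Big(\bigcap_{n \ge 1} S_n\Big)
  = \lim_{n \to \infty} P^{\sigma,\tau}_s(S_n) = 0 .
\]
```

The last equality uses that the events $S_n$ are decreasing, so the probability of their intersection is the limit of their probabilities.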

Bailout
In this section we will argue that it is possible to decide, in NP and coNP, whether the bailout objective can be satisfied almost surely. More precisely, we show the existence of procedures to decide if, for a given k ∈ N and state s, there exists an l ∈ N such that s almost surely satisfies the Bailout(k, l) objective
$$Bailout(k, l) \stackrel{\text{def}}{=} (ST(k, l) \cap PAR) \cup (EN(k) \cap LimSup(= \infty)).$$

Recall that the idea behind the Bailout objective is that, during a game for energy-parity, maximizer temporarily abandons the parity (but not the energy) condition in order to increase the energy to a sufficient level (which will then allow him to try an a.s. strategy for Gain once more). However, in a stochastic game (as opposed to an MDP [41]) an opponent could possibly prevent this increase in energy level at the expense of satisfying the original energy-parity objective in the first place (cf. Example 1). The Bailout objective is designed to capture the disjunction of both outcomes, as both are favorable for the maximizer. The parameter k is the acceptable total energy drop (i.e., the initial value), and the parameter l is the acceptable energy drop on any infix of a play, which translates to the upper bound on the energy level in the second outcome.
The question can be phrased equivalently as membership of a control state s in the almost-sure set for the k-Bailout objective for a given game G and energy level k ∈ N. Moreover, there are K, L ∈ N, polynomial in |V| and the largest absolute transition reward, so that $\bigcup_{k \ge 0} AS^G(k\text{-}Bailout) = AS^G(Bailout(K, L))$. And so, checking whether state s belongs to $\bigcup_{k \ge 0} AS^G(k\text{-}Bailout)$ is in NP and coNP.
Proof (sketch). This is shown by a sequence of transformations of the game and ultimately reduced to finding the winner of a non-stochastic game with an energy-parity objective, which is known to be solvable in NP, coNP and pseudo-polynomial time [19]. One important observation is that it is possible to replace, without changing the outcome, the energy condition EN(k) in the Bailout(k, l) objective by the more restrictive energy-storage condition ST(k, l). See Appendix B for further details.

Gain
In this section we will argue that it is possible to decide, in NP and coNP, whether the Gain objective (i.e., LimInf(> −∞) ∩ PAR) can be satisfied almost surely.
We start by investigating the strategy complexity of winning strategies for the Gain objective.
Lemma 15. In every finite SSG, minimizer has optimal MD strategies for objective Gain.
Proof. We show that maximizer has MD optimal strategies for LimInf(= −∞) ∪ \overline{PAR}, where \overline{PAR} denotes the complement of PAR. This is equivalent to the claim of the lemma because the complement of LimInf(> −∞) ∩ PAR is LimInf(= −∞) ∪ \overline{PAR}, and the complement of a parity condition is itself a parity condition (with all priorities incremented by one).
We note that both LimInf(= −∞) and parity objectives are shift-invariant and submixing, and therefore the union LimInf(= −∞) ∪ \overline{PAR} has both these properties as well. The claim now follows from the fact that SSGs with objectives that are both submixing and shift-invariant admit MD optimal strategies for maximizer [35, Theorem 5.2].
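The complement computation used in the proof of Lemma 15 is an instance of De Morgan's law. Written out, with an overline denoting the complement of an objective (the overline notation is our reconstruction):

```latex
\overline{\mathrm{LimInf}(>-\infty)\cap\mathrm{PAR}}
  \;=\; \overline{\mathrm{LimInf}(>-\infty)}\;\cup\;\overline{\mathrm{PAR}}
  \;=\; \mathrm{LimInf}(=-\infty)\;\cup\;\overline{\mathrm{PAR}}
```

The second step uses that the limit inferior of the energy level along a run is either −∞ or strictly greater than −∞, so the two LimInf objectives partition the set of runs.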
Based on the results in [41] one can show a similar claim for maximizer strategies in MDPs.
Lemma 16. For finite MDPs, almost-sure winning maximizer strategies for Gain can be chosen FD.
Using the existence of MD optimal minimizer strategies (Lemma 15) and a coNP upper bound for checking almost sure Gain in MDPs established in [41], we can derive a coNP procedure.See Appendix C.2 for full details.
Theorem 17. Checking whether a state s ∈ V of an SSG satisfies Gain almost-surely is in coNP.
The rest of this section deals with the NP upper bound, which is the most challenging part of this paper. The crux of our proof is the observation that if maximizer has a strategy that wins almost surely against all MD minimizer strategies, then he wins almost surely. This is because one of these MD strategies is optimal due to Lemma 15. We show that, in order to witness such an almost-sure winning strategy for maximizer in an SSG G, it suffices to provide a polynomially larger SSG G_3, together with an almost-sure winning strategy for the storage-parity objective (see Theorem 21 in Section 6) in G_3. This gives us an NP algorithm, because G_3, along with its winning strategy, can be guessed and verified in polynomial time. Formally, we claim the following.
Theorem 18. Checking whether a state s ∈ V of G satisfies Gain almost-surely is in NP.
Proof (sketch). For technical convenience, we assume w.l.o.g. that every SSG henceforth is in a normal form, where every random state has only one predecessor, which is owned by the maximizer. To show the existence of G_3, we introduce two intermediate games, G_1 and G_2. These games are never constructed by our NP algorithm, but are just defined to break down the complex construction of G_3 into more manageable steps.
Intuitively, G_1 is just G where all rewards on edges are multiplied by a large enough factor f, to turn strategies with a mean-payoff > 0 into ones with mean-payoff > 2. G_2 is an extension of G_1 where the maximizer is given a choice before every visit to a probabilistic node. He can either let the game proceed as before, or sacrifice part of his one-step reward in exchange for a more evenly balanced reward outcome, so the energy can no longer drop arbitrarily low when a probabilistic cycle is reached. As a result, in G_2 it suffices to consider a storage-parity objective (see Theorem 21 in Section 6) instead of Gain. The number of choices maximizer is given is the number of MD minimizer strategies, which can clearly be exponential. That would not suffice for an NP algorithm. Therefore, we show that most of these choices are redundant and can be removed without impairing the almost-sure winning region. As the result of that pruning, we obtain G_3 of polynomial size.
For the technical details of the G → G_1 → G_2 → G_3 constructions, please see Appendix C.3. Figure 2 shows what these transformations may look like.

The Main Results
In this section, we prove the main results of the paper, namely that almost-sure energy-parity stochastic games can be decided in NP and coNP. The proofs are straightforward and follow from the much more involved characterization of almost-sure energy-parity objectives in terms of the Bailout and Gain objectives established in Section 3 and their computational complexity analysis in Sections 4 and 5, respectively.
Proof. Recall that we can compute the set W from Definition 4 by iterating, starting with W_0 ≝ V, until we reach the greatest fixed point W. Note that at step i we need to compute the almost-sure sets AS(Gain) and ⋃_k AS(k-Bailout), where the states of the game are restricted to W_{i−1}. There can be at most |V| steps, because at least one state is removed in each iteration. It then suffices to check AS(k-Bailout ∩ GW) (i.e., AS(k-Bailout) for the subgame that consists only of the states of the fixed point W) for k = k*. Note that this step can be skipped if k* ≥ K, the bound from Theorem 14.
Before we discuss how to use NP and coNP procedures to construct these sets and to conduct the final test on the fixed point W, we note that the '∩ GW_{i−1}' does not add anything substantial, as these are simply the same tests and procedures conducted on the subgame that consists only of the states of W_{i−1}.
To obtain an NP procedure for constructing AS(Gain) (or, as remarked above, AS(Gain ∩ GW_{i−1})), we can guess and validate membership for each state s in this set, using the NP result from Theorem 18, and we can guess and validate non-membership for each state s not in this set in NP, using the coNP result from Theorem 17. Similarly, we can guess and validate both membership and non-membership in ⋃_k AS(k-Bailout ∩ GW_{i−1}), by analysing the subgame with only the states in W_{i−1}, using the NP and coNP results, respectively, from Theorem 14.
Once we can construct these sets, we can also intersect them and check if a fixed point has been reached. (One can, of course, stop when s ∉ W_i.) We can now conduct the final check in NP using Theorem 18.
A coNP algorithm that constructs W can be designed analogously: once W_{i−1} is known, membership and non-membership of a state s in AS(Gain ∩ GW_{i−1}) can be guessed and validated in coNP by Theorem 17 and by Theorem 18, respectively; and membership or non-membership of a state in ⋃_k AS(k-Bailout ∩ GW_{i−1}) can be guessed and validated in coNP using the coNP and NP part, respectively, of Theorem 14.
Once W is constructed, we can conduct the final check in coNP using Theorem 17.
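The iteration described in the proof can be sketched in Python as follows. This is only an illustration of the fixed-point loop: the two callbacks stand in for the procedures that compute AS(Gain) and ⋃_k AS(k-Bailout) on the subgame restricted to the current set, and both the callbacks and the toy oracles in the usage lines are hypothetical. We assume, following the remark that the two sets are intersected, that each step takes their intersection.

```python
def greatest_fixed_point(states, as_gain, as_bailout):
    """Iterate W_0 = V, W_i = as_gain(W_{i-1}) & as_bailout(W_{i-1}) until nothing changes.

    as_gain(w) and as_bailout(w) are assumed to return the almost-sure winning
    sets of the subgame restricted to the states in w.
    """
    current = frozenset(states)
    while True:
        following = frozenset(as_gain(current)) & frozenset(as_bailout(current)) & current
        if following == current:
            return current  # fixed point reached
        current = following  # at least one state was removed, so at most |V| rounds


# Toy oracles (made up): state "d" never wins Gain, and "c" only wins Bailout while "d" is present.
toy_gain = lambda w: {s for s in w if s != "d"}
toy_bailout = lambda w: {s for s in w if s != "c" or "d" in w}
print(sorted(greatest_fixed_point({"a", "b", "c", "d"}, toy_gain, toy_bailout)))
```

With the toy oracles, the first round removes "d", the second removes "c", and the third confirms that the remaining set is stable.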
This result, together with the upper bound on the energy needed to win energy-parity objective, allows us to solve the "unknown initial energy problem" [7], which is to compute the minimal initial energy level required.
Corollary 20. For any state s, checking if there is k such that AS(EN(k) ∩ PAR) holds is in NP ∩ coNP. Also, for a given k*, checking if k* is the minimal energy level required to win almost surely is in NP ∩ coNP as well.
Proof. Due to Theorem 14, if there is an energy level k for which AS(EN(k) ∩ PAR) holds, then it also holds for the bound K whose size is polynomial in the size of the game. We can then simply calculate K and then use the NP and coNP algorithms from Theorem 19 for AS(EN(K) ∩ PAR).
As for the second claim, note that checking whether maximizer cannot win almost surely EN(k) ∩ PAR is also in NP and coNP as a complement of a coNP and an NP set, respectively.Therefore, for an NP/coNP upper bound it suffices to simultaneously guess certificates for almost surely EN(k * ) ∩ PAR and not almost surely EN(k * − 1) ∩ PAR and verify them in polynomial time.
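The minimality test in the second claim, and the related task of computing the minimal initial energy level, can be illustrated with a small Python sketch. The callback `wins` is a hypothetical oracle for "maximizer wins EN(k) ∩ PAR almost surely from s"; the binary search is our addition and relies on the monotonicity of energy conditions in k and on an upper bound such as K.

```python
def is_minimal_energy(k_star, wins):
    """k_star is minimal iff maximizer wins with k_star but not with k_star - 1."""
    if not wins(k_star):
        return False
    return k_star == 0 or not wins(k_star - 1)


def minimal_energy(upper_bound, wins):
    """Smallest k in [0, upper_bound] with wins(k), or None if even upper_bound fails.

    Assumes wins is monotone: wins(k) implies wins(k + 1).
    """
    if not wins(upper_bound):
        return None
    low, high = 0, upper_bound
    while low < high:
        middle = (low + high) // 2
        if wins(middle):
            high = middle      # middle suffices, so the minimum is at most middle
        else:
            low = middle + 1   # middle fails, so the minimum is larger
    return low


# Toy oracle (made up): maximizer wins exactly when the initial energy is at least 5.
toy_wins = lambda k: k >= 5
print(minimal_energy(16, toy_wins), is_minimal_energy(5, toy_wins))
```

Each call to the oracle corresponds to one NP ∩ coNP query, and the search needs only logarithmically many of them in the bound.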
Finally, let us mention that the slightly more restrictive storage-parity objectives can also be solved in NP ∩ coNP.These are almost identical to energy-parity except that, in addition, there must exist some bound l ∈ N such that the energy level never drops by more than l during a run.This extra condition ensures that, if the storage-parity objective holds almost-surely, then there must exist a finite-memory winning strategy for maximizer.
Theorem 21. One can check in NP, coNP and pseudo-polynomial time if, for a given SSG H ≝ (V, E, λ), k ∈ N and control state s ∈ V, maximizer can almost-surely satisfy ST(k) ∩ PAR from s.
Moreover, there is a bound L ∈ N, polynomial in the number of states and the largest absolute transition reward, so that ST(k) ∩ PAR = ST(k, L) ∩ PAR.
Proof (sketch). This result follows by a simple adaptation of the proofs showing the same computational complexity of the Bailout objective (Section 4). See the end of Appendix B for further details.
Example 22. In the game in Fig. 1, maximizer cannot ensure the storage-parity condition ST(k) ∩ PAR for any initial energy level k. This is because it would imply the existence of a finite-memory almost-surely winning strategy, which, as we have already argued, cannot be true. More intuitively, to prevent an intermediate energy drop by l units, a winning maximizer strategy for storage-parity would need to stop moving left after observing the negative cycle in the leftmost state l successive times. However, when maximizer moves right, this gives minimizer the chance to visit the rightmost bad state (with dominating odd priority 1). The chance of that happening is (1/3)^l > 0. In particular, this probability is > 0 for any value of the intermediate energy drop l. Therefore, for any fixed l, maximizer would need to move right infinitely often to satisfy storage and lose (against an optimal minimizer strategy that moves to the rightmost state).

Conclusion and Outlook
We showed that several almost-sure problems for combined energy-parity objectives in simple stochastic games are in NP ∩ coNP. No pseudo-polynomial algorithm is known (just like for stochastic mean-payoff parity games [20]). All these problems subsume (stochastic) parity games, by setting all rewards to 0. Thus the existence of a pseudo-polynomial algorithm would imply that (stochastic and non-stochastic) parity games are in P, which is a long-standing open problem.
It is known that maximizer already needs infinite memory to win almost-surely a combined energy-parity objective in MDPs [41]. Our results do not imply anything about the memory requirement for optimal minimizer strategies in SSGs for this objective. We conjecture that memoryless minimizer strategies suffice. If this conjecture holds (and is proven), this would greatly simplify the coNP upper bound that we established for this problem.
A natural question is whether results on mean-payoff/energy/parity games can be generalized to a setting with multi-dimensional payoffs. Non-stochastic multi-mean-payoff and multi-energy games have been studied in [50,37,1]. To the best of our knowledge, the techniques used there, e.g. upper bounds on the necessary energy levels as in [37], do not generalize to stochastic games (or MDPs).
Multiple mean-payoff objectives in MDPs have been studied in [10,24], but the corresponding multi-energy (resp. multi-energy-parity) objective has extra difficulties due to the 0-boundary condition on the energy. I.e., even on Markov chains, and without any parity condition, it subsumes problems about multi-dimensional random walks. Some partial results on Markov chains and MDPs have been obtained in [13,2,3], but the decidability of the almost-sure problem for stochastic multi-energy-parity games (and MDPs) remains open.
For the induction step, we will use Definition 24 and Lemma 25, instantiated with \overline{O} instead of O. Since \overline{O} is both shift-invariant and submixing, this satisfies the conditions of Definition 24, but (relative to O) the roles of the players minimizer/maximizer are swapped.
Pick some initial state s and a minimizer's state π for O (i.e., a maximizer's state for \overline{O}) and let G_l, G_r be defined as in Definition 24. By induction hypothesis, in both these games G_l and G_r, maximizer has an FD optimal strategy for objective O from s. Call these strategies σ_l and σ_r, respectively. In particular, since σ_l and σ_r are optimal and O and \overline{O} are shift-invariant, the strategies σ_l and σ_r are subgame-perfect, and thus ε-subgame-perfect for ε = 0. Thus we can instantiate Definition 24 with objective \overline{O} and reversed roles of players minimizer/maximizer. I.e., we take σ_l for τ_l and σ_r for τ_r, which are subgame-perfect for player minimizer for objective \overline{O}. We obtain the trigger-strategy σ_lr for maximizer for O (i.e., the τ_lr for minimizer for \overline{O} from Definition 24). Since σ_l and σ_r are FD, so is σ_lr.
We now argue that this trigger strategy σ_lr must be optimal. The shift-invariance and submixing conditions on \overline{O} imply ([35], Theorem 5.2) that minimizer has MD optimal strategies in every SSG with winning condition O. Let τ* be some MD optimal strategy for minimizer in G from s. W.l.o.g. assume that τ*(π) = l (otherwise rename left/right).
We show that σ lr and τ * are best responses to each other, and thus both are optimal.That is, in order to finish the induction step, we prove that the following two claims hold for the game G. Together these imply the claim that σ lr is optimal, and hence the induction step, because = inf where the second equation uses the optimality of τ * .It remains to prove the two claims above.Item 1).Since τ * (π) = l we have P σ lr ,τ * G,s , where the equalities hold by τ * (π) = l and the inequality holds by the assumed optimality of σ l in G l .
Item 2).From Lemma 25, instantiated with O, we obtain that Since in our case = 0 we obtain and thus ∀τ.
In particular, for τ = τ* we obtain P^{σ_lr,τ*}_s(O) ≥ min{Val^{G_l}_s(O), Val^{G_r}_s(O)}. However, since τ* is an MD optimal strategy for minimizer, we also have

B Bailout
We will proceed in several reduction steps, ultimately reducing to checking the winner of a non-stochastic game for energy-parity objectives.
Assume from now on a fixed SSG G with associated reward and parity functions.
the number of control states in the game times the largest absolute reward R times the largest priority c used in the parity condition.
We claim that every a.s. winning strategy can be turned into one that avoids sub-runs of the form s −π_1→ s −π_2→ s where 1) both π_1 and π_2 have strictly negative total effect on the energy level, 2) neither π_1 nor π_2 visits state s internally, and 3) the dominant priority on π_1 and π_2 is the same. If a strategy allows such a path, then one can safely "cut out" π_2 and the resulting strategy will still be a.s. winning. Taken to the limit, such transformations will result in a strategy that is a.s. winning for Bailout(k, L).
The idea of the next step is to allow maximizer to witness the LimInf(= ∞) condition by occasionally trading in energy for a good priority, thereby satisfying a parity condition instead. This results in a stochastic game for a ST(k, l) ∩ PAR objective.
Let G′ be the SSG derived from G, where maximizer can always trade energy-increase for visiting the best possible priority 0. That is, G′ results from G by replacing every edge s −(+a)→ t, with a > 0, by a gadget below, where s′ ∈ V , parity(s′) = parity(s) and parity(t′) = 0.
Proof. Assume that R ∈ N is the largest absolute transition reward in G (and hence also G′). Every a.s. winning strategy σ for Bailout(k, l) = ST(k, l) ∩ (PAR ∪ LimInf(= ∞)) in G can be turned into an a.s. winning strategy σ′ for ST(k, l) ∩ PAR in G′ as follows.
The new strategy σ′ behaves just as σ but additionally keeps track of the energy levels up to the bound l · R. If in G, σ chooses to increase the energy level above this bound, σ′ will opt to visit a good priority instead, and continue from the current energy level. Since σ ensures the l-storage condition on (almost) all runs, so does σ′. Moreover, plays in G that do not satisfy PAR must instead satisfy LimInf(= ∞). The corresponding runs in G′ according to σ′ will therefore infinitely often visit the best priority and hence satisfy the parity condition.
For the other direction, notice that one can just as well transform an a.s. winning strategy σ′ for storage-parity in G′ into a winning strategy σ for Bailout(k, l) in G. The strategy σ just increments the energy level whenever σ′ would visit a newly introduced priority-0 state. Suppose ρ is a play in G that corresponds to a play ρ′ in G′. If ρ′ visits new states only finitely often, then after some finite prefix, the sequences of states visited by ρ and ρ′ are the same. Since ρ′ satisfies the parity condition, so must ρ. Otherwise, if ρ′ visits new states infinitely often, then the difference of energy levels on ρ and ρ′ must grow unboundedly. Since ρ′ satisfies the l-storage condition, this means that ρ satisfies the LimInf(= ∞) condition, and hence Bailout(k, l).
Finally, we use a construction similar to that in [23] for parity objectives, to replace random states by small "negotiation gadgets", resulting in a non-stochastic energy-parity game. Let G″ be the non-stochastic game derived from G′, where random states are replaced by gadgets as in [23].
Lemma 29. For every state s of G′ and every k, l ∈ N it holds that s ∈ AS^{G′}(ST(k, l) ∩ PAR) if, and only if, s ∈ AS^{G″}(ST(k, l) ∩ PAR).
Proof. The construction in [23] does not affect the transition rewards. Thus the ST(k, l) condition is trivially preserved. The a.s. PAR condition is preserved by exactly the same argument as in [23].
Theorem 14. One can check in NP, coNP and pseudo-polynomial time if, for a given SSG G ≝ (V, E, λ), k ∈ N and control state s ∈ V, maximizer can almost-surely satisfy k-Bailout from s.
Moreover, there are K, L ∈ N, polynomial in |V| and the largest absolute transition reward, so that ⋃_{k≥0} AS^G(k-Bailout) = AS^G(Bailout(K, L)). And so, checking whether state s belongs to ⋃_{k≥0} AS^G(k-Bailout) is in NP and coNP.
Proof. By Lemmas 26 to 29, for every k ∈ N it holds that Since G″ is a two-player non-stochastic game, the claim now follows from [19] (Theorem 2 and Lemma 5). For the existence of polynomially bounded numbers K, L, just notice that G″ has the same largest absolute transition reward, and only a polynomially larger set of states compared to G. For non-stochastic energy-parity games such as G″ it holds that k≥0 AS (ST(k, L) , if K denotes the product of the number of states, the largest priority and absolute transition rewards in G″. Now, to check if a state s belongs to ⋃_{k≥0} AS^G(k-Bailout), we can calculate K and then simply follow the NP or coNP procedure to check if s belongs to AS^G(K-Bailout) instead. This shows that this problem is in NP and coNP as well.
As a side result, note that neither Lemma 29 nor the complexity argument in Theorem 14 makes use of the structure of G′: they hold for all SSGs with a storage-parity condition.
Theorem 21. One can check in NP, coNP and pseudo-polynomial time if, for a given SSG H ≝ (V, E, λ), k ∈ N and control state s ∈ V, maximizer can almost-surely satisfy ST(k) ∩ PAR from s.
Moreover, there is a bound L ∈ N, polynomial in the number of states and the largest absolute transition reward, so that ST(k) ∩ PAR = ST(k, L) ∩ PAR.

C.1 Strategy Complexity for Gain
We prove Lemma 16, i.e., if maximizer can almost-surely win Gain in an MDP, then he can do so using a finite-memory deterministic strategy.
To do this, we will utilize some results from [41], where we showed how to compute winning regions for energy-parity objectives in MDPs based on a similar combination of "gain" and "bailout" objectives as in this paper.
Consider a state s of a finite MDP with energy-parity objective and define the limit value of state s as LVal_s ≝ sup_k Val_s(EN(k) ∩ PAR). This is well defined, because energy conditions are monotone increasing in the initial energy level k.
Lemma 30. For any state s of a finite MDP, we have Val_s(Gain) = LVal_s.
Proof. It follows directly from the definitions that for every k ∈ N Towards the reverse inclusion, consider a run ρ ∈ LimInf(≥ −j) ∩ PAR for some j ∈ N. Then, except in a finite prefix ρ′, the energy along ρ stays above −j. Let k be the minimal energy reached in ρ′, which is finite because ρ′ is finite, and let and thus From Eq. (2) and Eq. (3) we obtain Therefore,
Lemma 16. For finite MDPs, almost-sure winning maximizer strategies for Gain can be chosen FD.
Proof. By Lemma 30 we have Val_s(Gain) = LVal_s. Moreover, the objective Gain is shift-invariant and therefore there exist optimal strategies [34]. Thus it follows from [41, Theorem 18] that AS(Gain) = AS(Reach(A ∪ B)), for the following sets of states . This means that if an a.s. winning strategy for Gain exists, then there also exists one that operates in two phases: 1) a.s. reach A ∪ B. This can be done with memoryless deterministic strategies. 2a) Once in A, proceed along an a.s. winning strategy for ST(k) ∩ PAR, which can be done deterministically with memory O(k · |G|). Or, 2b) once in B, proceed along an a.s. winning strategy for LimInf(= ∞) ∩ PAR. For MDPs, a strategy is almost-sure winning for LimInf(= ∞) ∩ PAR iff it is almost-sure winning for MP(> 0) ∩ PAR, the combination of a parity condition together with a strictly positive mean-payoff condition. Such strategies can be chosen FD [17].

C.2 Gain is in coNP
Theorem 17. Checking whether a state s ∈ V of an SSG satisfies Gain almost-surely is in coNP.
Proof.By Lemma 15, it suffices to show coNP membership only for the MDP case, as a witnessing MD strategy for minimizer can be guessed as part of the certificate.To check if maximizer can almost surely win from state s in an MDP with Gain objective, we can equivalently check if Val s (Gain) = 1.This is because the objective is shift-invariant and therefore there exist optimal strategies [34].By Lemma 30, we can alternatively check if LVal s = 1, which can be done in coNP by [41,Lemma 26].
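The first step of this coNP argument, fixing a guessed MD minimizer strategy τ to obtain the MDP G[τ], can be sketched in Python as follows. The graph encoding (a successor dictionary plus an owner map with the labels "max", "min" and "rand") and the function name are our own illustrative choices, not notation from the paper.

```python
def induce_mdp(edges, owner, tau):
    """Return the successor map of G[tau]: minimizer states keep only the edge chosen by tau.

    edges: dict mapping each state to a list of successor states
    owner: dict mapping each state to "max", "min" or "rand"
    tau:   dict mapping each minimizer state to its chosen successor (an MD strategy)
    """
    induced = {}
    for state, successors in edges.items():
        if owner[state] == "min":
            choice = tau[state]
            if choice not in successors:
                raise ValueError(f"tau picks {choice!r}, which is not a successor of {state!r}")
            induced[state] = [choice]          # minimizer's choice is now hard-wired
        else:
            induced[state] = list(successors)  # maximizer and random states are unchanged
    return induced


# Toy game (made up): one minimizer state "m" with two choices.
toy_edges = {"m": ["x", "y"], "x": ["r"], "y": ["m"], "r": ["m", "x"]}
toy_owner = {"m": "min", "x": "max", "y": "max", "r": "rand"}
print(induce_mdp(toy_edges, toy_owner, {"m": "y"}))
```

After this step only maximizer and random states have a choice or a distribution left, which is why the remaining check can be delegated to a procedure for MDPs.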

C.3 Gain is in NP
Before we can proceed with the technical details of the G → G_1 → G_2 → G_3 constructions, we first need to introduce the following standard definitions.
Definition 31. Let M = G[τ] be an MDP induced by game G ≝ (V = (V , V , V ), E, λ) and an MD strategy τ for minimizer. An end-component is a strongly connected set of states C ⊆ V such that, for every state A leaf-component is storage-parity-safe if the dominating priority is even and it satisfies the storage condition ⋃_{k≥0} ST(k), and mean-positive if its mean-payoff is positive.
An end-component C of G[τ ] is gain-safe if (1) the dominating priority of C (the smallest priority of any state in C) is even and C contains a mean-positive leaf-component or (2) there is an MD strategy σ for the maximizer, such that C is a storage-parity-safe leaf-component in G[σ, τ ].
Note that any end-component that satisfies a.s. Gain is gain-safe, which justifies its name. This is because either (1) holds, or else maximizer can reach again and again a state with a dominating even priority without the need to pump up the energy level first, for which an MD strategy suffices, so (2) would hold then.
The G → G 1 construction just multiplies all the rewards by a "large enough" factor.Formally, we need G 1 to have the following property.
Lemma 32. Let τ be an MD strategy for the minimizer and s ∈ V. If there exists a strategy σ for the maximizer such that P^{G,σ,τ}_s(MP(> 0) ∩ PAR) = 1 then there exists an FD strategy σ′ for the maximizer such that P^{G_1,σ′,τ}_s(MP(> 2) ∩ PAR) = 1.
We construct a game G_1, based on G, in which all edge rewards are multiplied by a large factor so that if the maximizer can originally ensure the parity condition and a positive expected mean-payoff in G, then he can ensure the parity condition and an expected mean-payoff higher than 2 in G_1. It is intuitively clear that such a factor exists, because multiplying all transition rewards by a positive factor has no effect on the outcome of the Gain objective. What is less clear is that such a factor can be of polynomial size, so that G_1 is only polynomially larger than G. Before we can proceed with the proof of Lemma 32, we need to show an auxiliary result below.
Recalling that, for the Gain objective, the minimizer has MD optimal strategies, we consider the effect of multiplying the rewards of all edges by factor f against every such strategy τ: We show that if maximizer can a.p is exponential in the size of G. Moreover, a factor f > 2/p, with a representation polynomial in |G|, can be computed independent of τ, E, σ, or L.
Proof. For any fixed MD strategies σ and τ, we can write a linear program for the so-called gain-bias relations in L, which is a standard way to solve MDPs with a mean-payoff objective (see, e.g., [44, Theorem 8.2.6(a), p. 343]). In any solution, the gain of a state equals its mean-payoff value while, broadly speaking, the bias compensates for the fluctuation of the payoff, where the gain is only the expected long-term average. Notice that for a fixed L, we only need a single gain variable g, because all nodes in a leaf component have the same mean-payoff. For each node u ∈ L, we introduce a bias variable, b_u.
The constraints of the gain-bias linear program for L are: and its objective is Maximize g. It follows from the proof of Corollary 10.2a in [47] that the size of an optimal finite solution to such a linear program is at most 4m^2(m + 1)(S + 1), where m is the number of variables and S is the maximum size of any coefficient used. In our case we can easily estimate that m ≤ |V| + 1 and S ≤ |G|, so the optimal solution, p, is of size polynomial in |G|. And, since p > 0, the same holds for 2/p.
Note that the loose upper bound given above on the size of 2/p does not really depend on τ , σ, E nor L, so if we take the maximum of the size of 2/p over all possible τ , σ, E and L, we would still get the same upper bound.
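As a quick illustration of why this bound is polynomial, the following Python sketch evaluates 4m^2(m + 1)(S + 1) with the estimates m ≤ |V| + 1 and S ≤ |G| quoted above. The function names are ours, and the numbers in the usage line are made up.

```python
def lp_solution_size_bound(num_variables, max_coefficient_size):
    """Upper bound 4 m^2 (m + 1) (S + 1) on the size of an optimal finite LP solution."""
    m, s = num_variables, max_coefficient_size
    return 4 * m**2 * (m + 1) * (s + 1)


def gain_bias_size_bound(num_states, game_size):
    """Instantiate the bound with m = |V| + 1 (one gain variable plus one bias per state) and S = |G|."""
    return lp_solution_size_bound(num_states + 1, game_size)


# Made-up sizes: a game with 2 states whose encoding has size 2.
print(gain_bias_size_bound(2, 2))
```

Since the expression is a fixed polynomial in |V| and |G| and does not mention τ, σ, E or L, the same bound applies uniformly to all of them.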
Such an f will serve as our sufficiently large (yet sufficiently small) blow-up factor: G_1 is obtained from G by changing the reward function to r_1(e) = f · r(e) for all e ∈ E, i.e., by multiplying all rewards by f. We are now finally ready to prove Lemma 32.
Proof of Lemma 32. The existence of an FD strategy σ′ that achieves P^{G,σ′,τ}_s(MP(> 0) ∩ PAR) = 1 follows from [17]. Moreover, σ′ achieves the same mean-payoff, denoted by p, as the original almost-sure winning strategy σ. By Lemma 33, the mean payoff of σ in
Example 34 (running example). Consider the game G in Fig. 3 (left). Maximizer can almost-surely guarantee the Gain condition. The strategy that always loops in the right-most state ensures a mean-payoff of 3. As this is the only MD strategy for maximizer that ensures a positive mean-payoff, picking any factor f > 2/3 is sufficient. In particular, we can pick f = 1, which results in G_1 = G.
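The G → G_1 step for the running example can be sketched in Python. Note the simplifying assumption: the sketch takes a known positive mean-payoff p and returns the smallest integer f with f > 2/p, whereas the construction above obtains f from a size bound that is independent of any particular strategy. The function names and the toy reward map are ours.

```python
from fractions import Fraction
from math import floor


def smallest_integer_factor(mean_payoff):
    """Smallest integer f with f > 2 / mean_payoff, for a strictly positive mean-payoff."""
    p = Fraction(mean_payoff)
    if p <= 0:
        raise ValueError("mean-payoff must be strictly positive")
    return floor(Fraction(2) / p) + 1


def scale_rewards(rewards, factor):
    """Reward function of G_1: r_1(e) = factor * r(e) for every edge e."""
    return {edge: factor * reward for edge, reward in rewards.items()}


# Running example: mean-payoff 3, so any f > 2/3 works and the smallest integer choice is 1.
f = smallest_integer_factor(3)
toy_rewards = {("u", "u"): 3, ("u", "v"): -1}  # made-up edge rewards
print(f, scale_rewards(toy_rewards, f) == toy_rewards)
```

With f = 1 the scaled reward map equals the original one, matching the observation that G_1 = G in the example.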
We are now going to modify the game G 1 into the game G 2 , where maximizer can sacrifice part of the reward he would normally get while visiting a probabilistic node in exchange for rebalancing the values of these rewards.
During the construction of G 2 we are going to fix an optimal MD strategy, τ * , for minimizer in G 1 .Game G 2 will be the same no matter which optimal strategy is picked as τ * .
We start the construction of G_2 by identifying the union, U, of all gain-safe end-components of G_1[τ*] for which there is no maximizer strategy that ensures MP(> 0) ∩ PAR. Condition (2) of gain-safeness has to hold instead, i.e., there are MD maximizer strategies that a.s. satisfy storage and parity; note that the mean-payoff then has to be 0. We can compose all these strategies into a single winning maximizer MD strategy σ for all states in U. We now collapse all states in U into a single gain-safe state u_a with an even priority and a self-loop with payoff 3, resulting in the SSG G_U. Now, if the maximizer can a.s. reach All the remaining gain-safe end-components in G_U satisfy MP(> 0) ∩ PAR and so MP(> 2) ∩ PAR due to Lemma 32.
We therefore fix a winning maximizer MD strategy σ for MP(> 2) and, for each MD strategy τ, write a linear program consisting of the gain-bias inequations for gain of at least 2 in G_U[σ, τ], forcing all biases to be non-negative and of polynomial size. This is a straightforward adaptation of the gain-bias relations for solving mean-payoff MDPs (see, e.g., [44, Theorem 8.2.6(a), p. 343]). In particular, we have and we pick as the objective It follows from the proof of Corollary 10.2a in [47] that the size of an optimal finite solution to such a linear program is at most 4m^2(m + 1)(S + 1), where m is the number of variables and S is the maximum size of any coefficient used. In
(Figure: The reduction from G_1 to G_2, in which maximizer can choose to rebalance the rewards of edges out of probabilistic states at the cost of a reduced expected payoff.)
where , where G_2 ⊇ G_U, and the associated reward function r_2 that we derive from G_U by allowing maximizer to redistribute the rewards of random edges. More precisely, let s be a random state with two outgoing edges (s, t_1), (s, t_2) ∈ E and a unique predecessor p ∈ V. Then, for every MD minimizer strategy τ, G_2 contains an extra random state s_τ and edges (p, s_τ), (s_τ, t_1), (s_τ, t_2), with the same probabilities p_1 and p_2 for taking (s_τ, t_1) and (s_τ, t_2) as for taking (s, t_1) and (s, t_2), respectively, and rewards r_2(p, s_τ) ≝ r_1(p, s), r_2(s_τ, t_1) ≝ 1 + b_{τ,s} − b_{τ,t_1} and r_2(s_τ, t_2) ≝ 1 + b_{τ,s} − b_{τ,t_2}. See Fig. 4 for an example. Notice that, due to the inequalities defining the biases b_{τ,u}, we have p_1 r_2(s_τ, t_1) + p_2 r_2(s_τ, t_2) + 1 < p_1 r_1(s, t_1) + p_2 r_1(s, t_2), so maximizer sacrifices expected reward of at least 1 at s.
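The reward redistribution at a single trade-in state can be sketched in Python as follows. The bias values, probabilities and original rewards in the usage lines are hypothetical and chosen only for illustration; they are not the values from the paper's figures. The helper names are ours.

```python
from fractions import Fraction


def trade_in_rewards(bias_s, bias_t1, bias_t2):
    """Rewards on the two edges out of the trade-in state: r_2(s_tau, t_i) = 1 + b_s - b_{t_i}."""
    return (1 + bias_s - bias_t1, 1 + bias_s - bias_t2)


def expected_reward(p1, p2, reward_t1, reward_t2):
    """Expected one-step reward of a random state with two successors."""
    return p1 * reward_t1 + p2 * reward_t2


def sacrificed_reward(p1, p2, original, traded):
    """Expected reward given up by taking the trade-in instead of the original random state."""
    return expected_reward(p1, p2, *original) - expected_reward(p1, p2, *traded)


# Hypothetical biases b_s = 5, b_t1 = 2, b_t2 = 16 and a fair coin.
half = Fraction(1, 2)
traded = trade_in_rewards(5, 2, 16)
print(traded, sacrificed_reward(half, half, (2, 0), traded))
```

The trade-in rewards depend only on the bias differences, so their spread is controlled by the biases rather than by the original, possibly very unbalanced, edge rewards.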
This extended arena has the following property for every state u ∈ V 2 of G 2 .
Proof. (⇐) Pick k such that u ∈ AS^{G_2[τ]}(ST(k) ∩ PAR) holds, and let σ be an a.s. winning FD strategy for the maximizer. Now in G_1, we simply follow σ, but whenever σ picks a trade-in edge to s_τ, we pick the original edge to s instead. Notice that such a strategy ensures parity and the energy level at any point can only increase. If such a strategy reaches a node in U then it switches to an optimal strategy for ST(k′) ∩ PAR, where k′ is the minimum energy for which ST(k′) ∩ PAR holds for all states in U. It is easy to see that while using such a strategy the energy can never drop by more than k + k′, so it has to satisfy Gain a.s.
(⇒) First of all, note that due to the definition of biases b_{τ,u} we have that r Now pick any a.s. winning σ for Gain in G_1[τ]. Let σ′ be σ modified to always pick trade-ins s_τ when possible. Such a strategy still satisfies parity a.s. Consider any play ρ = s_0 e_0 s_1 e_1 s_2 e_2 . . . of σ′. If ρ reaches a state in U then we switch at that point to an optimal strategy for ST(k′) ∩ PAR as defined above. Otherwise, we have that for any infix s_l e_l . . . e_{h−1} s_h of ρ, the change in the energy level is This shows that ST(B + k′) ∩ PAR is satisfied a.s. by such a strategy.
Using the existence of MD optimal minimizer strategies for their respective objectives in both games, we get the following.
Proof. First of all, by the way G_1 is defined, we have u ∈ AS^G(Gain) ⇐⇒ u ∈ AS^{G_1}(Gain).
(⇒) For all MD strategies τ, u ∈ AS^{G_1[τ]}(Gain) has to hold. Due to Lemma 35 we get u ∈ ⋃_{k≥0} AS^{G_2[τ]}(ST(k) ∩ PAR), so there exists k such that u ∈ AS^{G_2[τ]}(ST(k) ∩ PAR). As there are only finitely many MD strategies, we let k* be the maximum value of k corresponding to one of them. Note that u ∈ AS^{G_2}(ST(k*) ∩ PAR) has to hold, because u ∈ AS^{G_2[τ]}(ST(k*) ∩ PAR) for all MD strategies τ (as the ST(k) ∩ PAR objective is upward-closed) and one of them has to be an optimal strategy for minimizer.

(left).
In its derived game G_2 there are as many trade-in options for the random state as there are MD minimizer strategies (just two in this example). The blue one (top left) corresponds to minimizer going left and the red one (top right) to going up. Example biases that satisfy the inequalities presented in Section C.3.2 are drawn next to the nodes inside colored boxes. They result in the rewards 4 and −10 for the blue trade-in and 4 and −9 for the red one.

C.3.3 Concise Witnesses Construction
The final step is to show that we can clean up G_2 by removing all but a small number of the new trade-in options for maximizer when entering a random state, preserving the fact that maximizer wins the ST(•) ∩ PAR objective. Formally, this whole subsection is dedicated to a proof of the following crucial lemma.
Lemma 38. There exists a game G_3 ⊇ G_1 that results from G_2 by keeping, for any random state, at most twice as many trade-in options as there are states in G_1, and such that for any state s ∈ V maximizer wins the almost-sure k-storage-parity game in G_3 iff he does in G_2.
Most of the properties in this subsection hold for an arbitrary energy-parity game, so we will use H instead of G_2 in order to avoid the use of double subscripts.
The main idea of the proof of Lemma 38 is to use the monotonicity of the ST(k) ∩ PAR objective with respect to the initial energy level k. If maximizer a.s. wins ST(•) ∩ PAR from state p, then there is a least k_p ∈ N such that (for some l) ST(k_p, l) ∩ PAR holds a.s. Fix l large enough to work for the minimal k_p of every state p, and for all purposes of the proofs below.
Consider a configuration (p, k_p) ∈ AS(ST(k_p, l) ∩ PAR) where p has newly introduced outgoing edges that allow for trade-ins (it has a random successor node). Let σ be a winning maximizer strategy for this game that depends only on the state and the energy level in the energy store⁴, and let σ_min denote the maximizer strategy that maps each maximizer state p to the successor that σ assigns to (p, k_p). Note that this strategy is positional, and therefore uses only one possible trade-in option.
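The definition of σ_min can be spelled out as a short Python sketch. The encoding is an assumption made for illustration: `sigma` is a dict from (state, energy level) pairs to successors, and `k_min` maps each state p to its minimal sufficient level k_p. The function name `sigma_min` is hypothetical.

```python
def sigma_min(sigma, k_min, maximizer_states):
    """Build the positional strategy that plays, in every maximizer state p,
    the move sigma prescribes at the minimal sufficient energy level k_p.

    sigma:            dict (state, energy) -> successor   (assumed encoding)
    k_min:            dict state -> minimal energy level k_p
    maximizer_states: iterable of states owned by the maximizer
    """
    return {p: sigma[(p, k_min[p])] for p in maximizer_states}
```

Since the result depends on the state only, each maximizer state commits to a single successor and hence to at most one trade-in option, which is the observation used in the text.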
We first observe that, by using this strategy, maximizer can ensure that he can only gain energy distance relative to the minimal energy level of the state (except where the energy is limited by the capacity of his energy store): for every run (s_0, k_0), (s_1, k_1), (s_2, k_2), (s_3, k_3), ... of H consistent with σ_min and all i ∈ ω it holds that k_{i+1} − k_{s_{i+1}} ≥ k_i − k_{s_i}. The following lemma is a direct consequence.
Lemma 39. The strategy σ_min almost-surely guarantees that 1) the cumulative rewards tend to infinity or 2) the parity condition holds. That is, for every minimizer strategy τ and initial state s of H it holds that P^{H,σ_min,τ}_s(LimInf(= ∞) ∪ PAR) = 1.
Proof. Assume for contradiction that minimizer has a strategy that ensures that, with positive probability, runs (1) contain only finitely many transitions that lead to a true gain in energy (relative to the minimal energy level) and (2) do not satisfy the parity condition. (1) is a co-Büchi objective and (2) a parity objective, so (1) and (2) together are a parity objective. Thus, minimizer has a memoryless strategy τ to obtain this. Thus H[σ_min, τ] has a leaf-component where this holds. Thus, H[σ, τ] is not winning on the states of this leaf-component on the minimal energy level. (contradiction)
We call the property (LimInf(= ∞) ∪ PAR) established by this lemma the lift or win property and will use it for a separation of concerns. For this, we first show that, when the dominating priority is odd, the maximizer can win on a smaller set that he can ensure is never left while winning the energy storage condition almost surely.
For a set S of states, we write atr^H_i(S) for the set of states from which player i (maximizer or minimizer) can force the game to a state in S. In particular, atr^H(S) = AS(F S) is the set of states for which maximizer can ensure to almost-surely reach S. We call a set S of states a (minimizer) trap if all minimizer states and all random states in S have only successors in S. Naturally, the union of two traps is also a trap, so there exists a unique ⊆-maximal trap.
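The trap definition and the existence of a ⊆-maximal trap can be made concrete with a small Python sketch. It follows the definition exactly as stated above (only minimizer and random states are constrained). The graph encoding (`succ`, `owner` with labels "max", "min", "random") and the function names are assumptions chosen for illustration.

```python
def is_trap(S, succ, owner):
    """Check the trap condition: every minimizer or random state in S
    has all of its successors in S."""
    S = set(S)
    return all(set(succ[s]) <= S
               for s in S if owner[s] in ("min", "random"))

def maximal_trap(X, succ, owner):
    """Return the largest trap contained in X.

    Repeatedly discard minimizer/random states with a successor outside
    the current set. A discarded state cannot belong to any trap inside X,
    so the fixed point is the union of all traps contained in X.
    """
    T = set(X)
    changed = True
    while changed:
        changed = False
        for s in list(T):
            if owner[s] in ("min", "random") and not set(succ[s]) <= T:
                T.remove(s)
                changed = True
    return T
```

The result may be empty, and maximizer states in it are not required to have a successor inside the set; a caller that needs a proper subgame would have to check this separately.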
Lemma 40. Let H be a game with minimal odd priority o, where the maximizer wins storage parity from all positions, and let S_o be the states with priority o. Then there is a trap S_t in H \ atr^H(S_o), such that the maximizer wins storage parity from all positions in the subgame H ∩ S_t, that is, without exiting S_t.
Proof. Assume for contradiction that no such trap exists. Then the minimizer has an almost-sure winning strategy, and thus a positional winning strategy τ, for all positions in H \ atr^H(S_o). Then the minimizer can win almost surely in H by a positional winning strategy that fixes an arbitrary strategy for her positions in S_o, uses her attractor strategy in all her other positions in atr^H(S_o), and τ elsewhere. (contradiction)
The minimal energy level for winning from a state in S_t can, of course, differ from the minimal sufficient energy level for the same state in the full game H. We now partition the winning regions using divide and conquer.
Lemma 41. Let H be a game where the maximizer wins storage parity from all positions. Let o be the minimal odd priority that occurs in H. If o is the minimal priority in H, then let S_t be the trap guaranteed by Lemma 40; otherwise let S_t be the set of states with smaller priority than o. The following holds.
1. Maximizer wins storage parity from all positions in the subgame H' = H \ atr^H(S_t).
2. Fix maximizer strategies σ_1, σ_2 and σ_3 that are almost-sure winning for 1) storage-parity in H', 2) storage-parity in S_t, and 3) reachability (F S_t), respectively, and let I ⊆ H be the game in which all new trade-in states that are can be chosen MD.
In any further decomposition, any given state will either belong to a smaller game (H' or S_t), in which case the number of necessary trade-in options is unchanged, or is in atr^H(S_t) \ S_t, in which case the combined strategy may need to choose between σ_min and an attractor strategy. But notice that the choice of trade-in state is meaningless for the attractor strategy, because all such states have the same (distributions over) successors.
We can prune G_2 into a game where all but one new alternative state is removed.
In this game G 3 , depicted on the right, maximizer can almost-surely guarantee the Gain condition while simultaneously ensuring that no negative cycle is closed.This means that ST(k) ∩ PAR holds almost-surely in G 3 , and hence EN(k) ∩ PAR in G.

C.3.4 Proof of Theorem 18
We are now ready to prove the main theorem of Section 5.
Theorem 18. Checking whether a state s ∈ V of G satisfies Gain almost-surely is in NP.
Proof. Guess a game G_3 that uses only the given bound on the number of choices, i.e., without constructing the exponentially large game G_2. Prune the unreachable random states and verify that maximizer can almost-surely ensure the storage-parity objective in G_3. The correctness of this procedure follows from Lemma 38 and Corollary 36.

def=
Runs \ Obj denotes the complement of objective Obj. For runs a, b, c ∈ Runs we say that a is a shuffle of b and c if there exist factorizations b = b_0 b_1 ... and c = c_0 c_1 ... such that a = b_0 c_0 b_1 c_1 ... . An objective Obj is called submixing if, for every run a ∈ Obj that is a shuffle of runs b and c, either b ∈ Obj or c ∈ Obj. Obj is shift-invariant if, for every run s_1 e_1 s_2 e_2 .

Remark 13. It follows from the results above that W = W . The ⊆ inclusion holds by Lemma 12. For the reverse inclusion we have W ⊆ ⋃_k AS(k-Bailout ∩ GW ) (by Definition 4) = ⋃_k AS(EN(k) ∩ PAR) (by Theorem 5) = W (by Definition 10).

Theorem 14. One can check in NP, coNP and pseudo-polynomial time if, for a given SSG G := (V, E, λ), k ∈ N and control state s ∈ V, maximizer can almost-surely satisfy k-Bailout from s.

Fig. 2: An example game G (left) and the derived games. The strategy that always loops in the right-most state of G ensures a mean-payoff of 3. As this is the only MD strategy for maximizer that ensures a positive mean-payoff, a factor f = 1 is sufficient here and we have G_1 = G. In the derived game G_2 in Fig. 2b there are as many trade-in options for the random state as there are MD minimizer strategies in G_1 (just two in this example). The blue one (top left) corresponds to minimizer going left and the red one (top right) to going up in G_1. Maximizer almost-surely wins Gain in G iff he almost-surely wins a storage-parity condition (see Theorem 21) in G_3.

Lemma 28. For every state s of G, and every k, l ∈ N it holds that s ∈ AS^G(Bailout(k, l)) if, and only if, s ∈ AS^G(ST(k, l) ∩ PAR).
s. obtain MP(> 0) ∩ PAR from a state s in the MDP G[τ], then he can a.s. obtain MP(> 2) ∩ PAR from s in the MDP G_1[τ].
Lemma 33. Let (1) τ be an MD strategy for minimizer, (2) E be an end component in the MDP G[τ] with even minimal priority, (3) σ an MD strategy for maximizer, and (4) L ⊆ E a leaf component in G[σ, τ] with expected payoff p > 0. Then 2
The original game G.
The derived game G_1 happens to be equal to G.

Fig. 3: An example game G (left) and its example derived game G_1.
and k is the number of MD minimizer strategies. In our case m ≤ |V| + 1 and S ≤ |G|, so any b_{τ,u} in an optimal finite solution to such a linear program is of size polynomial in |G_1|. Note that this loose upper bound, B, depends neither on τ, nor on σ, nor on U. We now build the SSG G_2