Stochastic Games with Lexicographic Reachability-Safety Objectives

We study turn-based stochastic zero-sum games with lexicographic preferences over reachability and safety objectives. Stochastic games are standard models in control, verification, and synthesis of stochastic reactive systems that exhibit both randomness as well as angelic and demonic non-determinism. Lexicographic order allows to consider multiple objectives with a strict preference order over the satisfaction of the objectives. To the best of our knowledge, stochastic games with lexicographic objectives have not been studied before. We establish determinacy of such games and present strategy and computational complexity results. For strategy complexity, we show that lexicographically optimal strategies exist that are deterministic and memory is only required to remember the already satisfied and violated objectives. For a constant number of objectives, we show that the relevant decision problem is in \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathsf {NP}\cap \mathsf {coNP}$$\end{document}, matching the current known bound for single objectives; and in general the decision problem is \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathsf {PSPACE}$$\end{document}-hard and can be solved in \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathsf {NEXPTIME}\cap \mathsf {coNEXPTIME}$$\end{document}. We present an algorithm that computes the lexicographically optimal strategies via a reduction to computation of optimal strategies in a sequence of single-objectives games. We have implemented our algorithm and report experimental results on various case studies.


Introduction
Simple stochastic games (SGs) [26] are zero-sum turn-based stochastic games played over a finite state space by two adversarial players, the Maximizer and Minimizer, along with randomness in the transition function. These games allow the interaction of angelic and demonic non-determinism as well as stochastic uncertainty. They generalize classical models such as Markov decision processes (MDPs) [39] which have only one player and stochastic uncertainty. An objective specifies a desired set of trajectories of the game, and the goal of the Maximizer is to maximize the probability of satisfying the objective against all choices of the Minimizer. The basic decision problem is to determine whether the Maximizer can ensure satisfaction of the objective with a given probability threshold. This problem is among the rare and intriguing combinatorial problems that are NP ∩ coNP, and whether it belongs to P is a major and long-standing open problem. Besides the theoretical interest, SGs are a standard model in control and verification of stochastic reactive systems [4,18,31,39], as well as they provide robust versions of MDPs when precise transition probabilities are not known [22,45].
The multi-objective optimization problem is relevant in the analysis of systems with multiple, potentially conflicting goals, and a trade-off must be considered for the objectives. While the multi-objective optimization has been extensively studied for MDPs with various classes of objectives [1,28,39], the problem is notoriously hard for SGs. Even for multiple reachability objectives, such games are not determined [23] and their decidability is still open.
This work considers SGs with multiple reachability and safety objectives with lexicographic preference order over the objectives. That is, we consider SGs with several objectives where each objective is either reachability or safety, and there is a total preference order over the objectives. The motivation to study such lexicographic objectives is twofold. First, they provide an important special case of general multiple objectives. Second, lexicographic objectives are useful in many scenarios. For example, (i) an autonomus vehicle might have a primary objective to avoid clashes and a secondary objective to optimize performance; and (b) a robot saving lives during fire in a building might have a primary objective to save as many lives as possible, and a secondary objective to minimize energy consumption. Thus studying reactive systems with lexicographic objectives is a very relevant problem which has been considered in many different contexts [7,33]. In particular non-stochastic games with lexicographic objectives [6,25] and MDPs with lexicographic objectives [47] have been considered, but to the best of our knowledge SGs with lexicographic objectives have not been studied.
In this work we present several contributions for SGs with lexicographic reachability and safety objectives. The main contributions are as follows.
-Determinacy. In contrast to SGs with multiple objectives that are not determined, we establish determinacy of SGs with lexicographic combination of reachability and safety objectives. -Computational complexity. For the associated decision problem we establish the following: (a) if the number of objectives is constant, then the decision problem lies in NP ∩ coNP, matching the current known bound for SGs with a single objective; (b) in general the decision problem is PSPACE-hard and can be solved in NEXPTIME ∩ coNEXPTIME. -Strategy complexity. We show that lexicographically optimal strategies exist that are deterministic but require finite memory. We also show that memory is only needed in order to remember the already satisfied and violated objectives.
-Algorithm. We present an algorithm that computes the unique lexicographic value and the witness lexicographically optimal strategies via a reduction to computation of optimal strategies in a sequence of single-objectives games. -Experimental results. We have implemented the algorithm and present experimental results on several case studies.
Technical Contribution. The key idea is that, given the lexicographic order of the objectives, we can consider them sequentially. After every objective, we remove all actions that are not optimal, thereby forcing all following computation to consider only locally optimal actions. The main complication is that local optimality of actions does not imply global optimality when interleaving reachability and safety, as the latter objective can use locally optimal actions to stay in the safe region without reaching the more important target. We introduce quantified reachability objectives as a means to solve this problem.
Related Work. We present related works on: (a) MDPs with multiple objectives; (b) SGs with multiple objectives; (c) lexicographic objectives in related models; and (d) existing tool support.
(a) MDPs with multiple objectives have been widely studied over a long time [1,39]. In the context of verifying MDPs with multiple objectives, both qualitative objectives such as reachability and LTL [29], as well as quantitative objectives, such as mean payoff [8,13], discounted sum [17], or total reward [34] have been considered. Besides multiple objectives with expectation criterion, other criteria have also been considered, such as, combination with variance [9], or multiple percentile (threshold) queries [8,20,32,41]. Practical applications of MDPs with multiple objectives are described in [2,3,42]. (b) More recently, SGs with multiple objectives have been considered, but the results are more limited [43]. Multiple mean-payoff objectives were first examined in [5] and the qualitative problems are coNP-complete [16]. Some special classes of SGs (namely stopping SGs) have been solved for totalreward objectives [23] and applied to autonomous driving [24]. However, even for the most basic question of solving SGs with multiple reachability objectives, decidability remains open. (c) The study of lexicographic objectives has been considered in many different contexts [7,33]. Non-stochastic games with lexicographic mean-payoff objectives and parity conditions have been studied in [6] for the synthesis of reactive systems with performance guarantees. Non-stochastic games with multiple ω-regular objectives equipped with a monotonic preorder, which subsumes lexicographic order, have been studied in [12]. Moreover, the beyond worst-case analysis problems studied in [11] also considers primary and secondary objectives, which has a lexicographic flavor. MDPs with lexicographic discounted-sum objectives have been studied in [47], and have been extended with partial-observability in [46]. However, SGs with lexicographic reachability and safety objectives have not been considered so far.
(d) PRISM-Games [37] provides tool support for several multi-player multiobjective settings. MultiGain [10] is limited to generalized mean-payoff MDPs. Storm [27] can, among numerous single-objective problems, solve Markov automata with multiple timed reachability or expected cost objectives [40], multi-cost bounded reachability MDPs [35], and it can provide simple strategies for multiple expected reward objectives in MDPs [28].
Structure of this Paper. After recalling preliminaries and defining the problem in Sect. 2, we first consider games where all target sets are absorbing in Sect. 3. Then, in Sect. 4 we extend our insights to general games, yielding the full algorithm and the theoretical results. Finally, Sect. 5 describes the implementation and experimental evaluation. Section 6 concludes.

Preliminaries
such that x∈A f (x) = 1. We denote the set of all probability distributions on A by D(A). Vector-like objects x are denoted in a bold font and we use the notation x i for the i-th component of x. We use x <n as a shorthand for (x 1 , . . . , x n−1 ).

Basic Definitions
Probabilistic Models. In this paper, we consider (simple) stochastic games [26], which are defined as follows. Let L = {a, b, . . .} be a finite set of actions labels.

Definition 1 (SG).
A stochastic game (SG) is a tuple G = (S , S ♦ , Act, P ) with S := S S ♦ = ∅ a finite set of states, Act : S → 2 L \ {∅} defines finitely many actions available at every state, and P : S × L → D(S) is the transition probability function. P (s, a) is undefined if a / ∈ Act(s).
We abbreviate P (s, a)(s ) to P (s, a, s ). We refer to the two players of the game as Max and Min and the sets S and S ♦ are the Maxand Min-states, respectively. As the game is turn based, these sets partition the state space S such that in each state it is either Max's or Min's turn. The intuitive semantics of an SG is as follows: In every turn, the corresponding player picks one of the finitely many available actions a ∈ Act(s) in the current state s. The game then transitions to the next state according to the probability distribution P (s, a). The winning conditions are not part of the game itself and need to be further specified.

Strategies.
We define the formal semantics of games by means of paths and strategies. An infinite path π is an infinite sequence π = s 0 a 0 s 1 a 1 · · · ∈ (S × L) ω , such that for every i ∈ N, a i ∈ Act(s i ) and s i+1 ∈ {s | P (s i , a i , s ) > 0}.
Finite paths are defined analogously as elements of (S × L) * × S. Note that when considering MCs, every state just has a single action, so an infinite path can be identified with an element of S ω . A strategy of player Max is a function σ : (S × L) * × S → D(L) where σ(πs)(s ) > 0 only if s ∈ Act(s). It is memoryless if σ(πs) = σ(π s) for all π, π ∈ (S × L) * . More generally, σ has memory of class-size at most m if the set (S × L) * can be partitioned in m classes M 1 , . . . , M m ⊆ (S × L) * such that σ(πs) = σ(π s) for all 1 ≤ i ≤ m, π, π ∈ M i and s ∈ S . A memory of class-size m can be represented with log(m) bits.
A strategy is deterministic if σ(πs) is Dirac for all πs. Strategies that are both memoryless and deterministic are called MD and can be identified as functions σ : S → L. Notice that there are at most |L| S different MD strategies, that is, exponentially many in S ; in general, there can be uncountably many strategies.
Strategies τ of player Min are defined analogously, with S replaced by S ♦ . The set of all strategies of player Max is denoted with Σ Max , the set of all MD strategies with Σ MD Max , and similarly Σ Min and Σ MD Min for player Min. Fixing a strategy σ of one player in a game G yields the induced MDP G σ . Fixing a strategy τ of the second player too, yields the induced MC G σ,τ . Notice that the induced models are finite if and only if the respective strategies use finite memory.
Given an (induced) MC G σ,τ , we let P σ,τ s be its associated probability measure on the Borel-measurable sets of infinite paths obtained from the standard cylinder construction where s is the initial state [39].
Reachability and Safety. In our setting, a property is a Borel-measurable set Ω ⊆ S ω of infinite paths in an SG. The reachability property Reach (T ) where T ⊆ S is the set Reach (T ) = {s 0 s 1 . . . ∈ S ω | ∃i ≥ 0: s i ∈ T }. The set Safe (T ) = S ω \ Reach (T ) is called a safety property. Further, for sets T 1 , T 2 ⊆ S we define the until property T 1 U T 2 = {s 0 s 1 . . . ∈ S ω | ∃i ≥ 0: s i ∈ T 2 ∧ ∀j < i : s j ∈ T 1 }. These properties are measurable (e.g. [4]). A reachability or safety property where the set T satisfies T ⊆ Sinks(G) is called absorbing. For the safety probabilities in an (induced) MC, it holds that P s (Safe (T )) = 1−P s (Reach (T )). We highlight that an objective Safe (T ) is specified by the set of paths to avoid, i.e. paths satisfying the objective remain forever in S \ T .

Stochastic Lexicographic Reachability-Safety Games
SGs with lexicographic preferences are a straightforward adaptation of the ideas of e.g. [46] to the game setting. The lexicographic order on R n is defined as x ≤ lex y iff x i ≤ y i where i ≤ n is the greatest position such that for all j < i it holds that x j = y j . The position i thus acts like a tiebreaker. Notice that for arbitrary sets X ⊆ [0, 1] n , suprema and infima exist in the lexicographic order.
Thus the lex-value at state s is the lexicographically supremal vector of probabilities that Max can ensure against all possible behaviors of Min. We will prove in Sect. 4.3 that the supremum and infimum in (1) can be exchanged; this property is called determinacy. We omit the superscript Ω in Ω v lex if it is clear from the context. We also omit the sets Σ Max and Σ Min in the suprema in (1), e.g. we will just write sup σ . Fig. 1a with the lex-objective Ω = {Reach (S 1 ) , Safe (S 2 )}. Player Max must thus maximize the probability to reach S 1 and, moreover, among all possible strategies that do so, it must choose one that maximizes the probability to avoid S 2 forever.

Lex-Value of Actions and Lex-Optimal Actions.
We extend the notion of value to actions. Let s ∈ S be a state. The lex-value of an action a ∈ Act(s) is . Lex-optimal actions are defined analogously for states s ∈ S ♦ by considering the minimum instead of the maximum. Notice that there is always at least one optimal action because Act(s) is finite by definition.
Example 2 (Lex-value of actions). We now intuitively explain the lex-values of all states in Fig. 1a. The lex-value of sink states s, t, u and w is determined by their membership in the sets S 1 and S 2 . E.g., v lex (s) = (1, 1), as it is part of the set S 1 that should be reached and not part of the set S 2 that should be avoided. Similarly we get the lex-values of t, u and w as (1, 0), (0, 0) and (0, 1) respectively. State v has a single action that yields (0, 0) or (0, 1) each with State p has one action going to s, which would yield (1, 1). However, as p is a Min-state, its best strategy is to avoid giving such a high value. Thus, it uses the action going downwards and v lex (p) = v lex (q). State q only has a single action going to r, so v lex (q) = v lex (r).
State r has three choices: (i) Going back to q, which results in an infinite loop between q and r, and thus never reaches S 1 . So a strategy that commits to this action will not achieve the optimal value. (ii) Going to t or u each with probability 1 /2. In this case, the safety objective is definitely violated, but the reachability objective achieved with 1 /2. (iii) Going to t or v each with probability 1 /2. Similarly to (ii), the probability to reach S 1 is 1 /2, but additionally, there is a 1 /2 · 1 /2 chance to avoid S 2 . Thus, since r is a Max-state, its lex-optimal choice is the action leading to t or v and we get v lex (r) = ( 1 /2, 1 /4).
Notice that with the kind of objectives considered, we can easily swap the roles of Max and Min by exchanging safety objectives with reachability and vice versa. It is thus no loss of generality to consider subsequently introduced notions such as optimal strategies only from the perspective of Max.

Definition 3 (Lex-Optimal Strategies). A strategy
We stress that counter-strategies of Min depend on the strategy chosen by Max.
Locally Lex-Optimal Strategies. An MD strategy σ of Max (Min, resp.) is called locally lex-optimal if for all s ∈ S (s ∈ S ♦ , resp.) and a ∈ Act(s), we have σ(s)(a) > 0 implies that action a is lex-optimal. Thus, locally lex-optimal strategies only assign positive probability to lex-optimal actions.
Convention. For the rest of the paper, unless stated otherwise, we use G = (S , S ♦ , Act, P ) to denote an SG and Ω = (Ω 1 , . . . , Ω n ) is a suitable (not necessarily absorbing) lex-objective, that is

Lexicographic SGs with Absorbing Targets
In this section, we show how to compute the lexicographic value for SGs where all target sets are absorbing. We first show various theoretical results in Sect. 3.1 upon which the algorithm for computing the values and optimal strategies presented in Sect. 3.2 is then built. The main technical difficulty arises from interleaving reachability and safety objectives. In Sect. 4, we will reduce solving general (not necessarily absorbing) SGs to the case with absorbing targets.

Characterizing Optimal Strategies
This first subsection derives a characterization of lex-optimal strategies in terms of local optimality and an additional reachability condition (Lemma 2 further below). It is one of the key ingredients for the correctness of the algorithm presented later and also gives rise to a (non-constructive) proof of existence of MD lex-optimal strategies in the absorbing case. We begin with the following lemma that summarizes some straightforward facts we will frequently use. Recall that a strategy is locally lex-optimal if it only selects actions with optimal lex-value.

Lemma 1. The following statements hold for any absorbing lex-objective Ω:
Max is lex-optimal and τ ∈ Σ MD Min is a lex-optimal counter strategy against σ, then σ and τ are both locally lex-optimal. (We do not yet claim that such strategies σ, τ always exist.) (b) Let G be obtained from G by removing all actions (of both players) that are not locally lex-optimal. Let v lex be the lex-values in G.
Proof (Sketch). Both claims follow from the definitions of lex-value and lexoptimal strategy. For (b) in particular, we show that a strategy using actions which are not lex-optimal can be transformed into a strategy that achieves a greater (lower, resp.) value. Thus removing the non lex-optimal actions does not affect the lex-value. See [19, Appendix A.1] for more technical details.

Example 3 (Modified game G).
Consider again the SG from Fig. 1a. Recall the lex-values from Example 1. Now we remove the actions that are not locally lexoptimal. This means we drop the action that leads from p to s and the action that leads from r to t or u (Fig. 1b). Since these actions were not used by the lex-optimal strategies, the value in the modified SG is the same as that of the original game.
Example 4 (Locally lex-optimal does not imply globally lex-optimal). Note that we do not drop the action that leads from r to q, because v lex (r) = v lex (q), so this action is locally lex-optimal. In fact, a lex-optimal strategy can use it arbitrarily many times without reducing the lex-value, as long as eventually it picks the action leading to t or v. However, if we only played the action leading to q, the lex-value would be reduced to (0, 1) as we would not reach S 1 , but would also avoid S 2 . We stress the following consequence of this: Playing a locally lex-optimal strategy is not necessarily globally lex-optimal. It is not sufficient to just restrict the game to locally lex-optimal actions of the previous objectives and then solve the current one. Note that in fact the optimal strategy for the second objective Safe (S 2 ) would be to remain in {p, q}; however, we must not pick this safety strategy, before we have not "tried everything" for all previous reachability objectives, in this case reaching S 1 .
This idea of "trying everything" for an objective Reach (S i ) is equivalent to the following: either reach the target set S i , or reach a set of states from which S i cannot be reached anymore. Formally, let Zero i = {s ∈ S | v lex i (s) = 0} be the set of states that cannot reach the target set S i anymore. Note that it depends on the lex-value, not the single-objective value. This is important, as the singleobjective value could be greater than 0, but a more important objective has to be sacrificed to achieve it.
We define the set of states where we have "tried everything" for all reachability objectives as follows:

Definition 4 (Final Set). For absorbing Ω, let
The final set contains all target states as well as the states that have lex-value 0 for all reachability objectives; we need the intersection of the sets Zero k , because as long as a state still has a positive probability to reach any target set, its optimal behaviour is to try that. Fig. 1, we have Zero 1 = {u, v, w} and thus F = Zero 1 ∪ S 1 = {s, t, u, v, w}. An MD lex-optimal strategy of Max must almost-surely reach this set against any strategy of Min; only then it has "tried everything".

Example 5 (Final set). For the game in
The following lemma characterizes MD lex-optimal strategies in terms of local lex-optimality and the final set.

Lemma 2. Let Ω be an absorbing lex-objective and σ ∈ Σ MD
Max . Then σ is lexoptimal for Ω if and only if σ is locally lex-optimal and for all s ∈ S we have ∀τ ∈ Σ MD Min : P σ,τ s (Reach (F )) = 1.

( )
Proof (Sketch). The "if "-direction is shown by induction on the number n of targets. We make a case distinction according to the type of Ω n : If it is safety, then we prove that local lex-optimality is already sufficient for global lex-optimality. Else if Ω n is reachability, then intuitively, the additional condition ( ) ensures that the strategy σ indeed "tries everything" and either reaches the target S n or eventually a state in Zero n where the opponent Min can make sure that Max cannot escape. The technical details of these assertions rely on a fixpoint characterization of the reachability probabilities combined with the classic Knaster-Tarski Fixpoint Theorem [44] and are given in [19,Appendix A.2]. For the "only if "-direction recall that lex-optimal strategies are necessarily locally lex-optimal by Lemma 1 (a). Further let i be such that Ω i = Reach (S i ) and assume for contradiction that σ remains forever within S \ (S i ∪ Zero i ) with positive probability against some strategy of Min. But then σ visits states with positive lex-value for Ω i infinitely often without ever reaching S i . Thus σ is not lex-optimal, contradiction.
Finally, this characterization allows us to prove that MD lex-optimal strategies exist for absorbing objectives.

Theorem 1.
For an absorbing lex-objective Ω, there exist MD lex-optimal strategies for both players.
Proof (Sketch). We consider the subgame G obtained by removing lex-suboptimal actions for both players and then show that the (single-objective) value of Reach (F ) in G equals 1. An optimal MD strategy for Reach (F ) exists [26]; further, it is locally lex-optimal, because we are in G, and it reaches F almost surely. Thus, it is lex-optimal for Ω by the "if "-direction of Lemma 2. See [19,Appendix A.3] for more details on the proof.

Algorithm for SGs with Absorbing Targets
Theorem 1 is not constructive because it relies on the values v lex without showing how to compute them. Computing the values and constructing an optimal strategy for Max in the case of an absorbing lex-objective is the topic of this subsection.

Definition 5 (QRO).
A quantified reachability objective (QRO) is determined by a function q : S → [0, 1] where S ⊆ S. For all strategies σ and τ , we define: Intuitively, a QRO generalizes its standard Boolean counterpart by additionally assigning a weight to the states in the target set S . Thus the probability of a QRO is obtained by computing the sum of the q(t), t ∈ S , weighted by the probability to avoid S until reaching t. Note that this probability does not depend on what happens after reaching S ; so it is unaffected by making all states in S absorbing. In Sect. 4, we need the dual notion of a quantified safety property, defined as P σ,τ s (Safe (q)) = 1 − P σ,τ s (Reach (q)); intuitively, this amounts to minimizing the reachability probability.

Remark 1.
A usual reachability property Reach (S ) is a special case of a quantified one with q(s) = 1 for all s ∈ S . Vice versa, quantified properties can be easily reduced to usual ones defined only by the set S : Convert all states t ∈ S into sinks, then for each such t prepend a new state t with a single action a and P (t , a, t) = q(t) and P (t , a, ⊥) = 1 − q(t) where ⊥ is a sink state. Finally, redirect all transitions leading into t to t . Despite this equivalence, it turns out to be convenient and natural to use QROs.
Example 6 (QRO). Example 4 illustrated that solving a safety objective after a reachability objective can lead to problems, as the optimal strategy for Safe (S 2 ) did not use the action that actually reached S 1 . In Example 5 we indicated that the final set F = {s, t, u, v, w} has to be reached almost surely, and among those states the ones with the highest safety values should be preferred. This can be encoded in a QRO as follows: Compute the values for the Safe (S 2 ) objective for the states in F . Then construct the function q 2 : F → [0, 1] that maps all states in F to their safety value, i.e., q 2 : Thus using QROs, we can effectively reduce (interleaved) safety objectives to quantified reachability objectives:

Lemma 3 (Reduction Safe → Reach). Let Ω be an absorbing lex-objective
where F is the final set (Definition 4), and Ω = (Ω 1 , . . . , Ω n−1 , Reach (q n )). Then: Proof (Sketch). By definition, Ω v lex (s) = Ω v lex (s) for all s ∈ F , so we only need to consider the states in S \ F . Since any lex-optimal strategy for Ω or Ω must also be lex-optimal for Ω <n , we know by Lemma 2 that such a strategy reaches F <n almost-surely. Note that we have F <n = F , as the n-th objective, either the QRO or the safety objective, does not add any new states to F . The reachability objective Reach (q n ) weighs the states in F with their lexicographic safety values v lex n . Thus we additionally ensure that in order to reach F , we use those actions that give us the best safety probability afterwards. In this way we obtain the correct lex-values v lex n even for states in S \ F . See [19, Appendix A.4] for the full technical proof. → Reach). Recall Example 6. By the preceding Lemma 3, computing sup σ inf τ P σ,τ s (Reach (S 1 ) , Reach (q 2 )) yields the correct lex-value v lex (s) for all s ∈ S. Consider for instance state r in the running example: The action leading to q is clearly suboptimal for Reach (q 2 ) as it does not reach F . Both other actions surely reach F . However, since q 2 (t) = q 2 (u) = 0 while q 2 (v) = 1 /2, the action leading to u and v is preferred over that leading to t and u, as it ensures the higher safety probability after reaching F .

Example 7 (Reduction Safe
We now explain the basic structure of Algorithm 1. More technical details are explained in the proof sketch of Theorem 2 and the full proof is in [19,Appendix A.5]. The idea of Algorithm 1 is, as sketched in Sect. 3.1, to consider the objectives sequentially in the order of importance, i.e., starting with Ω 1 .

G ← G
Consider whole game in the beginning.

7:
F<i ← final set with respect to G and Ω <i see Def. 4

21: end procedure
The i-th objective is solved (Lines 5-10) and the game is restricted to only the locally optimal actions (Line 11). This way, in the i-th iteration of the main loop, only actions that are locally lex-optimal for objectives 1 through (i−1) are considered. Finally, we construct the optimal strategy and update the result variables (Lines 12-19).

Theorem 2.
Given an SG G and an absorbing lex-objective Ω = (Ω 1 , . . . , Ω n ), Algorithm 1 correctly computes the vector of lex-values v lex and an MD lexoptimal strategy σ for player Max. It needs n calls to a single objective solver.
-G-invariant: For i > 1, in the i-th iteration of the loop, G is the original SG restricted to only those actions that are locally lex-optimal for the targets 1 to (i−1); this is the case because Line 11 was executed for all previous targets. -Single-objective case: The single-objective that is solved in Line 5 can be either reachability or safety. We can use any (precise) single-objective solver as a black box, e.g. strategy iteration [36]. Recall that by Remark 1, it is no problem to call a single-objective solver with a QRO since there is a trivial reduction. -QRO for safety: If an objective is of type reachability, no further steps need to be taken; if on the other hand it is safety, we need to ensure that the problem explained in Example 4 does not occur. Thus we compute the final set F <i for the i-th target and then construct and solve the QRO as in Lemma 3.
-Resulting strategy: When storing the resulting strategy, we again need to avoid errors induced by the fact that locally lex-optimal actions need not be globally lex-optimal. This is why for a reachability objective, we only update the strategy in states that have a positive value for the current objective; if the value is 0, the current strategy does not have any preference, and we need to keep the old strategy. For safety objectives, we need to update the strategy in two ways: for all states in the final set F <i , we set it to the safety strategy σ (from Line 5) as within F <i we do not have to consider the previous reachability objectives and therefore must follow an optimal safety strategy. For all states in S \ F <i , we set it to the reachability strategy from the QRO σ Q (from Line9). This is correct, as σ Q ensures almost-sure reachability of F <i which is necessary to satisfy all preceding reachability objectives; moreover σ Q prefers those states in F <i that have a higher safety value (cf. Lemma 3). -Termination: The main loop of the algorithm invokes SolveSingleObj for each of the n objectives.

General Lexicographic SGs
We now consider Ω where S i ⊆ Sinks(G) does not necessarily hold. Section 4.1 describes how we can reduce these general lex-objectives to the absorbing case. The resulting algorithm is given in Sect. 4.2 and the theoretical implications in Sect. 4.3.

Reducing General Lexicographic SGs to SGs with Absorbing Targets
In general lexicographic SG, strategies need memory, because they need to remember which of the S i have already been visited and behave accordingly. We formalize the solution of such games by means of stages. Intuitively, one can think of a stage as a copy of the game with less objectives, or as the sub-game that is played after visiting some previously unseen set S i .  Fig. 2a. As there are two objectives, there are four possible stages: The one where we consider both objectives (the region denoted with Ω in Fig. 2b), the simple ones where we consider only one of the objectives (regions Ω({1}) and Ω({2})), and the one where both objectives have been visited. The last stage is trivial since there are no more objectives, hence we do not depict it and do not have to consider it. The actions of q and r are omitted in the Ω-stage, as upon visiting these states, a new stage begins. Consider the simple stages: in stage Ω({1}), q has value 0, as it is a Min-state and will use the self-loop to avoid reaching r ∈ S 2 . In stage Ω({2}), both p and r have value 1, as they can just go to the target state q ∈ S 1 . Combining this knowledge, we can get an optimal strategy for every state. In particular, note that an optimal strategy for state p needs memory: First go to r and thereby reach stage Ω({2}). Afterwards, go from r to p and now, on the second visit in a different stage, use the other action in p to reach q. In this example, we observe another interesting fact about lexicographic games: it can be optimal to first satisfy less important objectives.

Example 8 (Stages). Consider the SG in
In the example, we combined our knowledge of the sub-stages to find the lex-values for the whole lex-objective. In general, the values for the stages are numbers in [0, 1]. Thus we reuse the idea of quantified reachability and safety objectives, see Definition 5.
For all 1 ≤ i ≤ n, let q i : j≤n S j → [0, 1] by defined by: To keep the correct type of every objective, we let qΩ = (type 1 (q 1 ), . . . , type n (q n )) where for all 1 ≤ i ≤ n, type i = Reach if Ω i = Reach (S i ) and else type i = Safe if Ω i = Safe (S i ). So we have now reduced a general lexicographic objective Ω to a vector of quantitative objectives qΩ. Lemma 4 shows that this reduction preserves the values.

Lemma 4. For arbitrary lex-objectives Ω it holds that
Proof (Sketch). We write S = j≤n S j for the sake of readability in this sketch. By induction on the length n of the lex-objective Ω, it is easy to show that the equation holds in states s ∈ S, i.e., Ω v lex (s) = qΩ v lex (s). For a state s which is not contained in any of the S j , and for any strategies σ, τ we have the following equation where P aths fin (S) = {πt ∈ ((S \ S) × L) * × S | t ∈ S} denotes the set of all finite paths to a state in S in the Markov chain G σ,τ and P σ,τ s (πt) is the probability of such a path when G σ,τ starts in s. From this we deduce that in order to maximize the left hand size of the equation in the lexicographic order, we should play such that we prefer reaching states in S where q i has a higher value; that is, we should maximize the QRO Reach (q i ). The argument for safety is similar and detailed in [19,Appendix A.6].
The functions q i involved in qΩ all have the same domain j≤n S j . Hence we can, as mentioned below Definition 5, consider qΩ on the game where all states in j≤n S j are sinks without changing the lex-value. This is precisely the definition of an absorbing game, and hence we can compute qΩ v lex using Algorithm 1 from Sect. 3.2.

Algorithm for General SG
Algorithm 2 computes the lex-value Ω v lex for a given lexicographic objective Ω and an arbitrary SG G. We highlight the following technical details: -Reduction to absorbing case: We just have seen, that once we have the quantitative objective vector qΩ, we can use the algorithm for absorbing SG (Line 12). -Computing the quantitative objective vector: To compute qΩ, the algorithm calls itself recursively on all states in the union of all target sets (Line 5-7). We annotated this recursive call "With dynamic programming", as we can reuse the results of the computations. In the worst case, we have to solve all 2 n − 1 possible non-empty stages. Finally, given the values Ω (s) v lex for all s ∈ j≤n S j , we can construct the quantitative objective (Line 9 and 11) that is used for the call to SolveAbsorbing. -Termination: Since there are finitely many objectives in Ω and in every recursive call at least one objective is removed from consideration, eventually we have a simple objective that can be solved by SolveSingleObj (Line 3).
14: return ( qΩ v lex , σ) 15: end procedure -Resulting strategy: The resulting strategy is composed in Line 13: It adheres to the strategy for the quantitative query qΩ σ until some s ∈ j≤n S j is reached. Then, to achieve the values promised by q i (s) for all i with s / ∈ S i , it adheres to Ω (s) σ, the optimal strategy for stage Ω(s) obtained by the recursive call. Proof. Correctness of the algorithm and termination follows from the discussion of the algorithm, Lemma 4 and Theorem 2.

Theoretical Implications: Determinacy and Complexity
Theorem 3 below states that lexicographic games are determined for arbitrary lex-objectives Ω. Intuitively, this means that the lex-value is independent from the player who fixes their strategy first. Recall that this property does not hold for non-lexicographic multi-reachability/safety objectives [23]. Proof. This statement follows because single-objective games are determined [26] and Algorithm 2 obtains all values by either solving single-objective instances directly (Line 3) or calling Algorithm 1, which also reduces everything to the single-objective case (Line 5 of Algorithm 1). Thus the sup-inf values v lex returned by the algorithm are in fact equal to the inf-sup values.
By analyzing Algorithm 2, we also get the following complexity results: Theorem 4 (Complexity). For any SG G and lex-objective Ω = (Ω 1 , . . . , Ω n ): 1. Strategy complexity: Deterministic strategies with 2 n − 1 memory-classes (i.e., bit-size n) are sufficient and necessary for lex-optimal strategies. 2. Computational complexity: is PSPACE-hard and can be solved in NEXPTIME ∩ coNEXPTIME. If n is a constant or Ω is absorbing, then it is contained in NP ∩ coNP.
Proof. 1. For each stage, Algorithm 2 computes an MD strategy for the quantitative objective. These strategies are then concatenated whenever a new stage is entered. Equivalently, every stage has an MD strategy for every state, so as there are at most 2 n − 1 stages (since there are n objectives), the strategy needs at most 2 n − 1 states of memory; these can be represented with n bits. Intuitively, we save for every target set whether it has been visited. The memory lower bound already holds in non-stochastic reachability games where all n targets have to be visited with certainty [30]. 2. The work of [41] shows that in MDPs, it is PSPACE-hard to decide if n targets can be visited almost-surely. This problem trivially reduces to ours. For the NP upper bound, observe that there are at most 2 n − 1 stages, i.e., a constant amount if n is assumed to be constant (or even just one stage if Ω is absorbing). Thus we can guess an MD strategy for player Max in every stage. The guessed overall strategy can then be checked by analyzing the induced MDP in polynomial time [29]. The same procedure works for player Min and since the game is determined, we have membership in coNP. In the same way we obtain the NEXPTIME ∩ coNEXPTIME upper bound in the general case where n is arbitrary.
We leave the question whether PSPACE is also an upper bound open. The main obstacle towards proving PSPACE-membership is that it is unclear if the lex-value -being dependent on the value of exponentially many stages in the worst-case -may actually have exponential bit-complexity.

Experimental Evaluation
In this section, we report the results of a series of experiments made with a prototypical implementation of our algorithm.
Case Studies. We have considered the following case studies for our experiments: Dice. This example is shipped with PRISM-games [37] and models a simple dice game between two players. The number of throws in this game is a configurable parameter, which we instantiate with 10, 20 and 50. The game has three possible outcomes: Player Max wins, Player Min wins or draw. A natural lex-objective is thus to maximize the winning probability and then the probability of a draw. Charlton. This case study [24] is also included in PRISM-games. It models an autonomous car navigating through a road network. A natural lex-objective is to minimize the probability of an accident (possibly damaging human life) and then maximize the probability to reach the destination. Hallway (HW). This instance is based on the Hallway example standard in the AI literature [15,38]. A robot can move north, east, south or west in a known environment, but each move only succeeds with a certain probability and otherwise rotates or moves the robot in an undesired direction. We extend the example by a target wandering around based on a mixture of probabilistic and demonic non-deterministic behavior, thereby obtaining a stochastic game modeling for instance a panicking human in a building on fire. Moreover, we assume a 0.01 probability of damaging the robot when executing certain movements; the damaged robot's actions succeed with even smaller probability. The primary objective is to save the human and the secondary objective is to avoid damaging the robot. We use square grid-worlds of sizes 5 × 5, 8 × 8 and 10 × 10. Avoid the Observer (AV). This case study is inspired by a similar example in [14]. It models a game between an intruder and an observer in a grid-world. The grid can have different sizes as in HW, and we use 10 × 10, 15 × 15 and 20 × 20. The most important objective of the intruder is to avoid the observer, its secondary objective is to exit the grid. We assume that the observer can only detect the intruder within a certain distance and otherwise makes random moves. At every position, the intruder moreover has the option to stay and search to find a precious item. In our example, this occurs with probability 0.1 and is assumed to be the third objective.
Implementation and Experimental Results. We have implemented our algorithm within PRISM-games [37]. Since PRISM-games does not provide an exact algorithm to solve SGs, we used the available value iteration to implement our single-objective blackbox. Note that since this value iteration is not exact for single-objective SGs, we cannot compute the exact lex-values. Nevertheless, we can still measure the overhead introduced by our algorithm compared to a single-objective solver. In our implementation, value iteration stops if the values do not change by more than 10 −8 per iteration, which is PRISM's default configuration. The experiments were conducted on a 2.4 GHz Quad-Core Intel c Core TM i5 processor, with 4 GB of RAM available to the Java VM. The results are reported in Table 1. We only recorded the run time of the actual algorithms; the time needed to parse and build the model is excluded. All numbers are rounded to full seconds. All instances (even those with state spaces of order 10 6 ) could be solved within a few minutes. The case studies are grouped by the type of lex-objective, where R indicates reachability, S safety. For each combination of case study and scaling parameters, we report the state size in column |S|, three different model checking runtimes, the average number of actions in the original and all considered restricted games, and the fraction of stages considered, i.e. the stages solved by the algorithm compared to the theoretically maximal possible number of stages (2 n − 1).
We compare the time of our algorithm on the lexicographic objective (Lex.) to the time for checking the first single objective (First) and the sum of checking all single objectives (All). We see that the runtimes of our algorithm and checking all single objectives are always in the same order of magnitude. This shows that our algorithm works well in practice and that the overhead is often small. Even on SGs of non-trivial size (HW [10 × 10] and AV [20 × 20]), our algorithm returns the result within a few minutes.
Regarding the average number of actions, we see that the decrease in the number of actions in the sub-games G obtained by restricting the input game to optimal actions varies: For example, very few actions are removed in the Dice instances, in AV we have a moderate decrease and in HW a significant decrease, almost eliminating all non-determinism after the first objective. It is our intuition that the less actions are removed, the higher is the overhead compared to the individual single-objective solutions. Consider the AV and HW examples: While for AV [20 × 20], computing the lexicographic solution takes 1.7 times as long as all the single-objective solutions, it took only about 25% longer for HW [10 × 10]; this could be because in HW, after the first objective only little nondeterminism remains, while in AV also for the second and third objectives lots of choices have to be considered. Note that the first objective sometimes (HW), but not always (AV) needs the majority of the runtime.
We also see that the algorithm does not have to explore all possible stages. For example, for Dice we always just need a single stage, because the SG is absorbing. For charlton and HW all stages are relevant for the lex-objective, while for AV 4 of 7 need to be considered.

Conclusion and Future Work
In this work we considered simple stochastic games with lexicographic reachability and safety objectives. Simple stochastic games are a standard model in reactive synthesis of stochastic systems, and lexicographic objectives let one consider multiple objectives with an order of preference. We focused on the most basic objectives: safety and reachability. While simple stochastic games with lexicographic objectives have not been studied before, we have presented (a) determinacy; (b) strategy complexity; (c) computational complexity; and (d) algorithms; for these games. Moreover, we showed how these games can model many different case studies and we present experimental results for them.
There are several directions for future work. First, for the general case closing the complexity gap (NEXPTIME∩coNEXPTIME upper bound and PSPACE lower bound) is an open question. Second, the study of lexicographic simple stochastic games with more general objectives, e.g., quantitative or parity objectives poses interesting questions. In particular, in the case of parity objectives, there are some indications that the problem is significantly harder: Consider the case of a reachability-safety lex-objective. If the lex-value is (1, 1) then both objectives can be guaranteed almost surely. Since almost-sure safety is sure safety, our results imply that sure safety and almost-sure reachability can be achieved with constant memory. In contrast, for parity objectives the combination of sure and almost-sure requires infinite-memory (e.g, see [21,Appendix A.1]).