Abstract
We study turn-based stochastic zero-sum games with lexicographic preferences over objectives. Stochastic games are standard models in control, verification, and synthesis of stochastic reactive systems that exhibit both randomness as well as controllable and adversarial nondeterminism. Lexicographic order allows one to consider multiple objectives with a strict preference order. To the best of our knowledge, stochastic games with lexicographic objectives have not been studied before. For a mixture of reachability and safety objectives, we show that deterministic lexicographically optimal strategies exist and memory is only required to remember the already satisfied and violated objectives. For a constant number of objectives, we show that the relevant decision problem is in \(\textsf{NP}\cap \textsf{coNP}\), matching the current known bound for single objectives; and in general the decision problem is \(\textsf{PSPACE}\)-hard and can be solved in \(\textsf{NEXPTIME}\cap \textsf{coNEXPTIME}\). We present an algorithm that computes the lexicographically optimal strategies via a reduction to the computation of optimal strategies in a sequence of single-objective games. For omega-regular objectives, we restrict our analysis to one-player games, also known as Markov decision processes. We show that lexicographically optimal strategies exist and need either randomization or finite memory. We present an algorithm that solves the relevant decision problem in polynomial time. We have implemented our algorithms and report experimental results on various case studies.
1 Introduction
Simple stochastic games (SG) [1] are zero-sum turn-based stochastic games played over a finite state space by two adversarial players, Maximizer and Minimizer, along with randomness in the transition function. These games allow the interaction of angelic and demonic nondeterminism as well as stochastic uncertainty. They generalize classical models such as Markov decision processes (MDP) [2] which have only one player and stochastic uncertainty. An objective specifies the desired set of trajectories of the game, and the goal of the Maximizer is to maximize the probability of satisfying the objective against all choices of the Minimizer. The basic decision problem is to determine whether the Maximizer can ensure the satisfaction of the objective with a given probability threshold. This is among the rare and intriguing problems that are in \(\textsf{NP}\cap \textsf{coNP}\), and whether it belongs to \(\mathsf P\) is a major and long-standing open problem. Besides the theoretical interest, SG are a standard model in control and verification of stochastic reactive systems [2,3,4,5], and they serve as robust abstractions of MDP when precise transition probabilities are unknown [6, 7].
Multiobjective optimization problems are relevant in the analysis of systems with multiple – potentially conflicting – goals where tradeoffs must be considered. While such tradeoff analyses have been extensively studied for MDP with various classes of objectives, see e.g. [2, 8,9,10,11], the problem is notoriously hard for SG. In fact, even for multiple reachability objectives, such games are not determined [12] and their decidability is still open.
This work considers SG with multiple objectives equipped with a total lexicographic preference order. That is, if there are, say, two objectives \(\Omega _1\) and \(\Omega _2\) where \(\Omega _1\) comes first in the preference order, then Maximizer should pick a strategy that is (i) optimal for \(\Omega _1\), and (ii) optimal for \(\Omega _2\) among all strategies that satisfy condition (i). This lexicographic optimization scheme easily generalizes to an arbitrary number of objectives. Lexicographic objectives are useful in many scenarios. For example, an autonomous vehicle might have a primary objective to avoid clashes and a secondary objective to optimize performance; or a robot saving lives during a fire in a building might have a primary objective to save as many lives as possible, and a secondary objective to minimize energy consumption. Thus studying reactive systems with lexicographic objectives is a very relevant problem that has been considered in many different contexts [13, 14]. In particular, nonstochastic games with lexicographic objectives [15, 16] and MDP with lexicographic objectives [17, 18] have been considered, but to the best of our knowledge SG with lexicographic objectives have not been studied.
1.1 Our contribution
The contribution of this paper is twofold.
(I) We present theoretical and practical results for SG with lexicographic reachability and safety objectives. These are the same as in the conference version of this paper [19], but we improved the presentation and included the full proofs. The main contributions are as follows.

Determinacy In contrast to SG with multiple objectives that are not determined, we establish determinacy of SG with a lexicographic combination of reachability and safety objectives.

Computational complexity For the associated decision problem we establish the following: (a) if the number of objectives is constant, then the decision problem lies in \(\textsf{NP}\cap \textsf{coNP}\), matching the current known bound for SG with a single objective; (b) in general, the decision problem is \(\textsf{PSPACE}\)-hard and can be solved in \(\textsf{NEXPTIME}\cap \textsf{coNEXPTIME}\).

Strategy complexity We show that there exist lexicographically optimal strategies that are deterministic and use finite memory. We also show that memory is only needed in order to remember the already satisfied and violated objectives, i.e., the memory is arena-independent [20].

Algorithm We present an algorithm that computes the unique lexicographic value and the witness lexicographically optimal strategies via a reduction to the computation of optimal strategies in a sequence of single-objective games.

Experimental results We have implemented the algorithm and present experimental results on several case studies.
Technical contribution The key idea is that, given the lexicographic order of the objectives, we can consider them sequentially. After every objective, we remove all actions that are not optimal, thereby forcing all following computations to consider only locally optimal actions. The main complication is that local optimality of actions does not imply global optimality when interleaving reachability and safety, as the latter objective can use locally optimal actions to stay in the safe region without reaching the more important target. We introduce quantified reachability objectives as a means to solve this problem.
(II) We present new results for MDP with lexicographic \(\omega \)-regular objectives. These results extend the conference version [19] of this paper. We discuss the reason for only addressing MDP and not SG in Sect. 4.3, also sketching the possible directions for dealing with the more general model of games, as well as a solution for a subclass of SG. The main contributions are as follows.

Algorithm We present an algorithm that computes the unique lexicographic value and witness lexicographically optimal strategies for MDP with a tuple of lexicographic Streett objectives in polynomial time.

Computational complexity We establish that the associated decision problem is in \(\textsf{P}\). This result relies on the objectives being given as Streett conditions. If they are in a different form, e.g. LTL formulae, we get the corresponding doubly exponential blow-up of transforming LTL to Streett.

Strategy complexity We show that there exist lexicographically optimal strategies. They need either randomization or finite memory, but only in a limited part of the state space, the so-called end components. In the rest of the MDP, memoryless deterministic strategies suffice.

Experimental results We have implemented the algorithm and present experimental results on several case studies, comparing our deterministic algorithm to the one from [18], which is based on reinforcement learning. Surprisingly, our algorithm often outperforms the learning-based approach.
Technical contribution The key problem is that satisfaction of general \(\omega \)-regular objectives depends on the game’s infinite runs. This is different from reachability (and safety) where satisfaction (violation, resp.) can be witnessed after a finite number of steps. Hence, we have to analyze the part of the state space where a play can remain forever, i.e., the end components. In MDP, we can compute the lexicographic value vector for all states in end components and then reduce the problem to reachability. In SG, the analysis of the end components is more complicated and we leave it for future work. A more detailed discussion of the complications and the possible solutions is in Sect. 4.3.
1.2 Related work
We discuss related works on (a) MDP with multiple objectives, (b) SG with multiple objectives, (c) lexicographic objectives in related models, and (d) existing tool support.
(a) MDP with multiple objectives have been widely studied over a long time [2, 8]. In the context of verifying MDP with multiple objectives, both qualitative objectives such as reachability and LTL [21], as well as quantitative objectives, such as mean payoff [9, 22], discounted sum [23], total reward [24] and combinations thereof [11, 25] have been considered. Besides multiple objectives with expectation criteria, other criteria have also been considered, e.g. combinations with variance [26], or multiple percentile (threshold) queries [22, 27,28,29]. Practical applications of MDP with multiple objectives are described in [30,31,32].
(b) More recently, SG with multiple objectives have been considered, but the results are more limited [33]. Multiple mean-payoff objectives were first examined in [34] and the qualitative problems are coNP-complete [35]. Some special classes of SG (namely stopping SG) have been solved for total-reward objectives [12, 36] and applied to autonomous driving [37]. However, in the general case, decidability of SG with multiple objectives remains open, even in the simple case of reachability objectives. On the positive side, it is known that the Pareto curve of achievable thresholds can be approximated to arbitrary precision [38].
(c) The study of lexicographic objectives has been considered in many different contexts [13, 14]. Non-stochastic games with lexicographic mean-payoff objectives and parity conditions have been studied in [15] for the synthesis of reactive systems with performance guarantees. Non-stochastic games with multiple \(\omega \)-regular objectives equipped with a monotonic preorder, which subsumes lexicographic order, have been studied in [39]. Moreover, the beyond worst-case analysis problems studied in [40] also consider primary and secondary objectives, which gives them a lexicographic flavor. MDP with lexicographic discounted-sum objectives have been studied in [17], and have been extended with partial observability in [41]. Very recently, a reinforcement-learning-based algorithm for MDP with lexicographic \(\omega \)-regular objectives was presented in [18]. This solution relies on the choice of suitable parameters for the learning and, in the absence of mechanisms to guess them, does not provide a formal guarantee about the correctness of the result. In contrast, our algorithm is deterministic and guarantees to find the lexicographic value and optimal strategies.
(d) PRISM-games [42] provides tool support for several multi-player multi-objective settings. MultiGain [43] is limited to generalized mean-payoff MDP. Storm [44] can, among numerous single-objective problems, solve Markov automata with multiple timed reachability or expected cost objectives [45], multi-cost bounded reachability MDP [46], and it can provide simple strategies for multiple expected reward objectives in MDP [10]. The recent tool TEMPEST [47] includes a solver for SG with safety and mean-payoff objectives. It can also solve multiple mean-payoff objectives by mixing them into a single one.
1.3 Organization of this paper
After recalling preliminaries and defining the problem in Sect. 2, we first, in Sect. 3.1, consider SG with reachability and safety objectives with the following restriction: all target sets are absorbing. Then, in Sect. 3.2 we extend our insights to general games, yielding the full algorithm and the theoretical results. Section 4 considers the lexicographic \(\omega \)-regular objectives for MDP. After providing a solution for end components in Sect. 4.1, the general algorithm is given in Sect. 4.2. In Sect. 4.3 we discuss the complications for SG. Finally, Sect. 5 describes the prototypical implementation and experimental evaluation of both algorithms, and Sect. 6 concludes.
2 Preliminaries
We fix some general notation first. A probability distribution on a finite set A is a function \(f: A \rightarrow [0,1]\) such that \(\sum _{x\in A}f(x) =1\). We denote the set of all probability distributions on A by \(\mathcal {D}(A)\). A distribution f is Dirac if \(f(a) = 1\) for some \(a \in A\). Vectors \(\textbf{x} \in B^n\) where B is an arbitrary set are denoted in a bold font. For \(1 \le i \le n\) we write \(\textbf{x}_i\) to refer to the i-th component of \(\textbf{x}\). Moreover, we use \(\textbf{x}_{<i}\) to denote the (possibly empty) vector \((\textbf{x}_1,\ldots ,\textbf{x}_{i-1})\).
2.1 Games, strategies, and basic objectives
In this section, we formally introduce the stochastic models used in this paper, most importantly stochastic games and Markov decision processes. We also define strategies and basic objectives.
2.1.1 Stochastic games and Markov decision processes
In this paper, we consider (simple) stochastic games [1], which are defined as follows (see Fig. 1 on page 1 for examples):
Definition 1
(SG: Stochastic game) An SG is a tuple \(\mathcal {G}= (S_{\square },S_{\lozenge }, L, \textsf{Act}, P)\) with \(S := S_{\square }\uplus S_{\lozenge }\ne \emptyset \) a finite set of states, \(L\) a finite set of action labels, \(\textsf{Act}: S \rightarrow 2^L\setminus \{\emptyset \}\) a set of actions available at every state, and \(P: S \times L\dashrightarrow \mathcal {D}(S)\) a (partial) probabilistic transition function. P(s, a) is defined for \(s \in S\) and \(a \in L\) iff \(a \in \textsf{Act}(s)\).
We write \(P(s,a,s')\) instead of \(P(s,a)(s')\) for all \(s,s' \in S\), \(a \in \textsf{Act}(s)\). A state \(s \in S\) is called absorbing (or sink) if \(P(s,a,s) = 1\) for all \(a \in \textsf{Act}(s)\). \(\textsf{Sinks}(\mathcal {G})\) denotes the set of all absorbing states of \(\mathcal {G}\). We refer to the two players of the game as \(\textsf{Max}\) and \(\textsf{Min}\), and the sets \(S_{\square }\) and \(S_{\lozenge }\) as \(\textsf{Max}\)- and \(\textsf{Min}\)-states, respectively. As the game is turn-based, these sets partition the state space S: in each state, it is either \(\textsf{Max}\)’s or \(\textsf{Min}\)’s turn.
Intuitively, the semantics of an SG is as follows: In every turn, the corresponding player picks one of the finitely many available actions \(a \in \textsf{Act}(s)\) in the current state s. The game then transitions to the next state according to the probability distribution P(s, a). The winning conditions and the initial state are not part of the game and need to be specified externally.
We also consider Markov decision processes in this paper:
Definition 2
(MDP: Markov decision process) An MDP is the special case of an SG where either \(S_{\lozenge }= \emptyset \) or \(S_{\square }= \emptyset \). In other words, an MDP is a 1-player SG.
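Definitions 1 and 2 can be mirrored in a small data structure. The following Python sketch is only an illustration: the class name `StochasticGame`, its field names, and the toy game at the end are assumptions made here and are not taken from the paper or its implementation.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class StochasticGame:
    """Hypothetical container for an SG (S_max, S_min, L, Act, P)."""
    max_states: frozenset
    min_states: frozenset
    act: dict    # state -> non-empty set of action labels
    trans: dict  # (state, action) -> {successor: probability}

    @property
    def states(self):
        return self.max_states | self.min_states

    def validate(self):
        # The two players' state sets partition a non-empty state space.
        assert self.states and not (self.max_states & self.min_states)
        for s in self.states:
            assert self.act[s], f"no action available in {s}"
            for a in self.act[s]:
                dist = self.trans[(s, a)]
                assert set(dist) <= self.states
                assert sum(dist.values()) == 1
        # P(s, a) is defined iff a is available in s.
        assert set(self.trans) == {(s, a) for s in self.states for a in self.act[s]}

    def sinks(self):
        """Absorbing states: every available action loops back with probability 1."""
        return {s for s in self.states
                if all(self.trans[(s, a)].get(s, 0) == 1 for a in self.act[s])}

    def is_mdp(self):
        """An MDP is an SG in which one of the players owns no state."""
        return not self.max_states or not self.min_states


# Toy game (invented for this sketch): q belongs to Min, all other states to Max.
half = Fraction(1, 2)
game = StochasticGame(
    max_states=frozenset({"r", "t", "u"}),
    min_states=frozenset({"q"}),
    act={"q": {"a"}, "r": {"b", "c"}, "t": {"loop"}, "u": {"loop"}},
    trans={
        ("q", "a"): {"r": Fraction(1)},
        ("r", "b"): {"q": Fraction(1)},
        ("r", "c"): {"t": half, "u": half},
        ("t", "loop"): {"t": Fraction(1)},
        ("u", "loop"): {"u": Fraction(1)},
    },
)
game.validate()
```

Exact rationals (`Fraction`) are used so that the check that each distribution sums to 1 is not affected by floating-point rounding.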
2.1.2 Strategies
We define the formal semantics of SG (and MDP) by means of paths and strategies. Let \(\mathcal {G}= (S_{\square },S_{\lozenge },L,\textsf{Act}, P)\) be an SG. An infinite path \(\pi \) is an infinite sequence \(\pi = s_0 a_0 s_1 a_1 \dots \in (S \times L)^\omega \), such that for every \(i \ge 0\), we have \(a_i\in \textsf{Act}(s_i)\) and \(s_{i+1} \in \{s' \mid P(s_i,a_i,s')>0\}\). Finite paths are defined analogously as elements of \((S \times L)^*\times S\). Omitting the action labels in a path yields a sequence of states. We call such state sequences trajectories.
A strategy of player \(\textsf{Max}\) is a function \(\sigma :(S \times L)^* \times S_{\square }\rightarrow \mathcal {D}(L)\) where \(\sigma (\pi s)(a)>0\) only if \(a \in \textsf{Act}(s)\). The strategy \(\sigma \) is memoryless if \(\sigma (\pi s) = \sigma (\pi ' s)\) for all \(\pi , \pi ' \in (S \times L)^*\) and \(s \in S_{\square }\). More generally, \(\sigma \) has memory of class-size at most m if the set \((S \times L)^*\) can be partitioned into m classes \(M_1,\ldots ,M_m \subseteq (S \times L)^*\) such that \(\sigma (\pi s) = \sigma (\pi ' s)\) for all \(1 \le i \le m\), \(\pi , \pi ' \in M_i\) and \(s \in S_{\square }\). A memory of class-size m can be represented with \(\lceil \log m \rceil \) bits.
A strategy \(\sigma \) of \(\textsf{Max}\) is deterministic if \(\sigma (\pi s)\) is Dirac for all \(\pi s\). Strategies that are both memoryless and deterministic are called MD and can be identified as functions \(\sigma :S_{\square }\rightarrow L\). Note that there are at most \(|L|^{|S_{\square }|}\) different MD strategies, i.e., exponentially many in \(|S_{\square }|\). MD strategies play an important role in this paper because they are sufficient in order to play optimally w.r.t. many basic objectives, see below.
Strategies \(\tau \) of player \(\textsf{Min}\) are defined analogously, with \(S_{\square }\) replaced by \(S_{\lozenge }\) in the above definitions. The set of all strategies of player \(\textsf{Max}\) is denoted with \(\Sigma _{\textsf{Max}}\), and its set of all MD strategies with \(\Sigma _{\textsf{Max}}^{\textsf{MD}}\). Similarly, we use the notation \(\Sigma _{\textsf{Min}}\) and \(\Sigma _{\textsf{Min}}^{\textsf{MD}}\) for the corresponding strategy sets of player \(\textsf{Min}\).
2.1.3 Markov chains and probability measures
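A Markov chain (MC) is the special case of an SG with \(|\textsf{Act}(s)| = 1\) for all game states s. Fixing strategies \({\sigma ,\tau }\) of both players in an SG \(\mathcal {G}\) yields the induced MC \(\mathcal {G}^{\sigma ,\tau }\). Formally, \(\mathcal {G}^{\sigma ,\tau }\) has infinitely many states \(S^{\sigma ,\tau }= (S \times L) ^* \times S\) and the transition probabilities are given as
\[P^{\sigma ,\tau }(\pi s, \pi s a s') = {\left\{ \begin{array}{ll} P(s,a,s') \cdot \sigma (\pi s)(a) &{} \text {if } s \in S_{\square }, \\ P(s,a,s') \cdot \tau (\pi s)(a) &{} \text {if } s \in S_{\lozenge } \end{array}\right. }\]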
for all \(\pi \in (S \times L) ^*\), states \(s, s' \in S\) and \(a\in \textsf{Act}(s)\). Note that the term \(\sigma (\pi s)(a)\) is the probability assigned by \(\textsf{Max}\) to action a when the game reaches state s with history \(\pi \). It is also possible to fix just one player’s strategy \(\sigma \) which results in the induced MDP \(\mathcal {G}^\sigma \).
Even though induced Markov chains (and MDP) are generally infinite, it is well-known that if the strategies \({\sigma ,\tau }\) are finite-memory, then \(\mathcal {G}^{\sigma ,\tau }\) is equivalent (more precisely: bisimilar) to a finite MC. In particular, if the strategies are MD, then the induced MC is simply obtained from keeping only the actions selected by the MD strategies at each state.
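For MD strategies, the finite induced MC can be written down directly. The sketch below assumes a dictionary encoding of the transition function and of the two strategies; the function name `induced_mc` and the toy game are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def induced_mc(trans, sigma, tau):
    """Finite MC induced by two MD strategies.

    trans: (state, action) -> {successor: probability}
    sigma: Max-state -> chosen action, tau: Min-state -> chosen action
    Returns state -> {successor: probability}, keeping only the chosen actions.
    """
    choice = {**sigma, **tau}  # every state is owned by exactly one player
    return {s: dict(trans[(s, a)]) for s, a in choice.items()}


# Toy game (invented for this sketch): q is a Min-state, r, t, u are Max-states.
half = Fraction(1, 2)
trans = {
    ("q", "a"): {"r": Fraction(1)},
    ("r", "b"): {"q": Fraction(1)},
    ("r", "c"): {"t": half, "u": half},
    ("t", "loop"): {"t": Fraction(1)},
    ("u", "loop"): {"u": Fraction(1)},
}
sigma = {"r": "c", "t": "loop", "u": "loop"}  # MD strategy of Max
tau = {"q": "a"}                              # MD strategy of Min
mc = induced_mc(trans, sigma, tau)
```

Here action b of state r disappears from the chain because the chosen MD strategy of Max never plays it.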
Unlike SG and MDP, Markov chains are purely probabilistic systems, and it is in fact possible to define a probability measure over certain events that may occur in an MC. Ultimately, it is this probability measure that assigns meaning to statements like “\(\textsf{Max}\) can win the game with probability 1/2”. The construction of such a probability measure is standard, see e.g. [3, Ch. 10]. Given a Markov chain with (countable) state space S, we define the cylinder set of a finite trajectory \(\pi \in S^*\) as \(\pi S^\omega = \{\pi \pi ' \mid \pi ' \in S^\omega \}\). We obtain a sigma-algebra \((S^\omega , \mathcal {F})\) where \(\mathcal {F} \subseteq 2^{S^\omega }\) is the smallest set that contains all the cylinder sets and that is closed under complement and countable union. The sets in \(\mathcal {F}\) are called measurable events. Each measurable event \(E \in \mathcal {F}\) can be assigned a probability \(\mathbb {P}_{{s_0}}(E) \in [0,1]\) given some MC state \({s_0}\in S\). This probability measure is constructed as follows: First, the probability of the cylinder set of a finite trajectory \(\pi = \pi _0 \pi _1 \ldots \pi _{|\pi |}\) is defined as
\[\mathbb {P}_{{s_0}}(\pi S^\omega ) = {\left\{ \begin{array}{ll} \prod _{i=0}^{|\pi |-1} P(\pi _i, \pi _{i+1}) &{} \text {if } \pi _0 = {s_0}, \\ 0 &{} \text {otherwise,} \end{array}\right. }\]
where \(|\pi |\) is the length of the finite trajectory \(\pi \), counted in transitions, and \(P(\pi _i, \pi _{i+1})\) is the one-step probability of moving from \(\pi _i\) to \(\pi _{i+1}\) in the MC. Second, the probabilities of countably many pairwise disjoint events \((E_i)_{i \ge 0}\) satisfy
\[\mathbb {P}_{{s_0}}\Big (\bigcup _{i \ge 0} E_i\Big ) = \sum _{i \ge 0} \mathbb {P}_{{s_0}}(E_i).\]
It turns out that these axioms induce a unique probability measure \(\mathbb {P}_{{s_0}} :\mathcal {F} \rightarrow [0,1]\). The resulting structure \((S^\omega , \mathcal {F}, \mathbb {P}_{{s_0}})\) is a probability space. The probability measure of an induced Markov chain \(\mathcal {G}^{\sigma ,\tau }\) with initial state \({s_0}\) is denoted \(\mathbb {P}_{{s_0}}^{\sigma ,\tau }\).
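As a small illustration of the cylinder-set construction, the probability of a cylinder in a finite MC is the product of the one-step probabilities along the finite trajectory, provided the trajectory starts in the initial state. The following sketch assumes a dictionary encoding of the MC; the function name and the two-state chain are hypothetical.

```python
from fractions import Fraction


def cylinder_probability(mc, s0, trajectory):
    """Probability, from initial state s0, of all infinite trajectories
    that extend the given finite trajectory (a list of states)."""
    if not trajectory:
        return Fraction(1)   # the empty prefix is extended by every trajectory
    if trajectory[0] != s0:
        return Fraction(0)   # the trajectory does not start in s0
    prob = Fraction(1)
    for s, t in zip(trajectory, trajectory[1:]):
        prob *= mc[s].get(t, Fraction(0))
    return prob


# Two-state chain (invented for this sketch): a loops or moves to the sink b.
half = Fraction(1, 2)
mc = {"a": {"a": half, "b": half}, "b": {"b": Fraction(1)}}

p_ab = cylinder_probability(mc, "a", ["a", "b"])        # 1/2
p_aab = cylinder_probability(mc, "a", ["a", "a", "b"])  # 1/2 * 1/2

# The two cylinders are disjoint, so the probability of their union is the sum.
p_union = p_ab + p_aab
```

The last line uses additivity for disjoint events: the union describes reaching b within two steps.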
2.1.4 Reachability and safety objectives
In our setting, an objective is a measurable event \(\Omega \subseteq S^\omega \) of infinite trajectories in an SG \(\mathcal {G}\). Note that on the level of an induced MC \(\mathcal {G}^{\sigma ,\tau }\), where \({\sigma ,\tau }\) are arbitrary strategies, the event \(\Omega \) must be identified with the (measurable) event
\[\Omega ' = \{(\pi _0 s_0)(\pi _1 s_1)\ldots \in (S^{\sigma ,\tau })^\omega \mid s_0 s_1 \ldots \in \Omega \},\]
i.e., the trajectories in \(\Omega '\) are like the ones in \(\Omega \) but each state also carries a possible history. To simplify the notation, we do not distinguish between \(\Omega \) and \(\Omega '\).
The reachability objective \(\textsf{Reach}\left( T\right) \) with target set \(T \subseteq S\) is the objective
\[\textsf{Reach}\left( T\right) = \{s_0 s_1 \ldots \in S^\omega \mid \exists i \ge 0 :s_i \in T\}.\]
The set \(\textsf{Safe}\left( T\right) = S^\omega \setminus \textsf{Reach}\left( T\right) \) is called a safety objective; alternatively,
\[\textsf{Safe}\left( T\right) = \{s_0 s_1 \ldots \in S^\omega \mid \forall i \ge 0 :s_i \notin T\}.\]
In other words, a safety objective consists of avoiding the unsafe set T forever (we remark that some other authors specify safety objectives in terms of a safe set that should never be left). Further, for sets \(T_1,T_2 \subseteq S\) we define the until objective
\[T_1 \mathbin {\textsf{U}} T_2 = \{s_0 s_1 \ldots \in S^\omega \mid \exists i \ge 0 :s_i \in T_2 \wedge \forall j < i :s_j \in T_1\}.\]
Reachability, safety, and until objectives are among the simplest examples of measurable events [3]. A reachability or safety objective where the set T satisfies \(T \subseteq \textsf{Sinks}(\mathcal {G})\) is called absorbing. For the safety probabilities in an (induced) MC, it holds that \(\mathbb {P}_{{s_0}}(\textsf{Safe}\left( T\right) )= 1 - \mathbb {P}_{{s_0}}(\textsf{Reach}\left( T\right) )\).
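The relation between reachability and safety probabilities can be checked numerically on a finite MC. The sketch below approximates reachability probabilities by value iteration from below, a standard technique that is used here purely for illustration and is not claimed to be the method of the paper. The function name, the iteration count, and the three-state chain are assumptions.

```python
def reach_probabilities(mc, targets, iterations=200):
    """Approximate P_s(Reach(T)) for every state s of a finite MC.

    mc: state -> {successor: probability}; targets: the set T.
    Starts from 0 outside T and repeatedly applies the one-step equations,
    so the iterates approach the true probabilities from below.
    """
    x = {s: 1.0 if s in targets else 0.0 for s in mc}
    for _ in range(iterations):
        x = {s: 1.0 if s in targets else sum(p * x[t] for t, p in mc[s].items())
             for s in mc}
    return x


# Chain (invented for this sketch): from a, stay with 1/2, or move to one of
# the sinks t and z with probability 1/4 each.
mc = {
    "a": {"a": 0.5, "t": 0.25, "z": 0.25},
    "t": {"t": 1.0},
    "z": {"z": 1.0},
}
reach = reach_probabilities(mc, {"t"})
# Safe(T) is the complement of Reach(T), so its probability is 1 - P(Reach(T)).
safe = {s: 1.0 - p for s, p in reach.items()}
```

For state a the exact value solves x = x/2 + 1/4, i.e. x = 1/2, and the iteration converges to it geometrically.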
2.1.5 \(\omega \)-regular objectives and Streett conditions
Reachability, safety, and until objectives as discussed above are all special cases of \(\omega \)-regular objectives. In general, the class of \(\omega \)-regular objectives can be characterized as follows: An objective \(\Omega \subseteq S^\omega \) is \(\omega \)-regular iff there exists a deterministic Streett automaton (DSA, see Footnote 1) with input alphabet S that accepts exactly the trajectories in \(\Omega \). In order to reason about optimal strategies for \(\Omega \), one can construct the automata-theoretic product of the game with the DSA defining \(\Omega \). This product is an SG with a Streett objective, and optimal strategies for \(\Omega \) can be found by optimizing the probability to satisfy the Streett condition in the product. More formally, DSA and products of DSA and SG are defined as follows:
Definition 3
(DSA and product construction) A deterministic Streett automaton (DSA) is a finite automaton \(\mathcal {A} = (Q,\Sigma ,\delta ,q_0,(\textsf{F}_j,\textsf{I}_j)_{1\le j\le m})\), where \(Q \ne \emptyset \) is a finite set of control states, \(\Sigma \ne \emptyset \) is a finite input alphabet, \(\delta :Q \times \Sigma \rightarrow Q\) is a deterministic transition function, \(q_0 \in Q\) is an initial state, and for all \(1 \le j \le m\), \((\textsf{F}_j, \textsf{I}_j)\) with \(\textsf{F}_j, \textsf{I}_j \subseteq Q\) is a pair of sets of states called Streett pair.
Consider an SG \(\mathcal {G}= (S_{\square },S_{\lozenge }, L, \textsf{Act}, P)\) and \(\mathcal {A} = (Q,S,\delta ,q_0,(\textsf{F}^{}_j,\textsf{I}^{}_j)_{1\le j\le m_{}})\).
We define the product of \(\mathcal {G}\) and \(\mathcal {A}\) as the SG \(\mathcal {G}\times \mathcal {A} = (S_{\square }\times Q, S_{\lozenge }\times Q, L, \textsf{Act}', P')\) where for all \(s \in S\) and \(q \in Q\), \(\textsf{Act}'(s,q) = \textsf{Act}(s)\), and \(P' :(S \times Q) \times L \dashrightarrow \mathcal {D}(S \times Q)\) such that for all \(a \in \textsf{Act}(s)\), \(P'((s,q), a)\) is a distribution \(d_{s,q,a}\) over \(S \times Q\) with \(d_{s,q,a}(s',q') = P(s,a,s')\) if \(q' = \delta (q,s)\), and \(d_{s,q,a}(s',q') = 0\) otherwise.
The tuples \((S \times \textsf{F}_j, S \times \textsf{I}_j)_{1 \le j \le m}\) are a Streett condition for \(\mathcal {G}\times \mathcal {A}\).
Such product constructions are standard in probabilistic model checking, see e.g. [3, Sec. 10.6.4] for further details. In the rest of the paper we assume that we are already given a (product) SG equipped with a Streett condition when considering \(\omega \)regular objectives, i.e., we do not consider the product explicitly.
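The product construction of Definition 3 can be sketched in a few lines. The code follows the definition as stated, with the automaton reading the current game state via \(\delta (q,s)\). The dictionary encoding, the function names, and the two-state game and automaton are assumptions made for illustration.

```python
from fractions import Fraction


def product(max_states, min_states, act, trans, aut_states, delta):
    """Product of an SG with a deterministic automaton.

    act:   state -> set of actions
    trans: (state, action) -> {successor: probability}
    delta: (q, s) -> q', the automaton transition on input letter s
    Returns (Max-states, Min-states, Act', P') of the product SG.
    """
    p_max = {(s, q) for s in max_states for q in aut_states}
    p_min = {(s, q) for s in min_states for q in aut_states}
    p_act = {(s, q): set(act[s]) for (s, q) in p_max | p_min}
    p_trans = {}
    for (s, q) in p_max | p_min:
        q_next = delta[(q, s)]  # deterministic automaton step
        for a in act[s]:
            p_trans[((s, q), a)] = {(t, q_next): p for t, p in trans[(s, a)].items()}
    return p_max, p_min, p_act, p_trans


def lift_streett(states, pairs):
    """Lift Streett pairs (F_j, I_j) over Q to pairs (S x F_j, S x I_j)."""
    return [({(s, q) for s in states for q in f}, {(s, q) for s in states for q in i})
            for f, i in pairs]


# Toy instance (invented): the automaton moves to state 1 once it reads y.
half = Fraction(1, 2)
act = {"x": {"a"}, "y": {"b"}}
trans = {("x", "a"): {"y": Fraction(1)}, ("y", "b"): {"x": half, "y": half}}
delta = {(q, s): (1 if s == "y" else q) for q in (0, 1) for s in ("x", "y")}
p_max, p_min, p_act, p_trans = product({"x"}, {"y"}, act, trans, {0, 1}, delta)
pairs = lift_streett({"x", "y"}, [({0}, {1})])
```

Since the automaton is deterministic, each product distribution has the same probabilities as the corresponding game distribution; only the automaton component of the successors changes.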
Let \(\mathcal {G}\) be an SG with state space S. The semantics of a Streett condition \((\textsf{F}^{}_j,\textsf{I}^{}_j)_{1\le j\le m_{}}\), \( \forall 1 \le j \le m:\textsf{F}_j, \textsf{I}_j \subseteq S\), for \(\mathcal {G}\) is defined in terms of the objective \(\Omega = \textsf{Streett}((\textsf{F}^{}_j,\textsf{I}^{}_j)_{1\le j\le m_{}})\). An infinite path \(s_0 s_1\ldots \in S ^\omega \) is contained in \(\Omega \) iff
It is well-known that \(\omega \)-regular sets, and thus, in particular, Streett objectives, are measurable events, see e.g. [3, Sec. 10.3].
2.2 Lexicographic objectives
The lexicographic order on \(\mathbb {R}^n\) is defined as \(\textbf{x} \le _{\textsf{lex}}\textbf{y}\) iff \(\textbf{x}_i \le \textbf{y}_i\) where \(i \le n\) is the greatest position such that for all \(j < i\) it holds that \(\textbf{x}_j = \textbf{y}_j\). The position i is called tiebreaker. Notice that for arbitrary sets \(X \subseteq [0,1]^n\), suprema and infima exist in the lexicographic order.
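The definition of \(\le _{\textsf{lex}}\) translates directly into code. The sketch below compares two vectors at the first position where they differ; the function name `lex_leq` is hypothetical. Python's built-in ordering on tuples is lexicographic as well, so it agrees with this function on tuples of equal length.

```python
from fractions import Fraction


def lex_leq(x, y):
    """Return True iff x <=_lex y for two vectors of equal length."""
    assert len(x) == len(y)
    for xi, yi in zip(x, y):
        if xi != yi:
            return xi < yi  # first differing position acts as the tiebreaker
    return True             # all components are equal


half, quarter = Fraction(1, 2), Fraction(1, 4)
a = (half, Fraction(0))
b = (half, quarter)
c = (Fraction(1), Fraction(0))
```

For instance, `a` is below `b` because they agree on the first component and differ on the second, while `b` is below `c` already because of the first component, regardless of the second.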
Definition 4
(Lex-objectives and lex-values) Let \(\mathcal {G}\) be an SG with state space S. A lex(icographic) objective for \(\mathcal {G}\) is a vector \(\varvec{\Omega }= (\Omega _1,\ldots ,\Omega _n)\) where \(\Omega _i \subseteq S^\omega \) is an objective in \(\mathcal {G}\) for all \(1 \le i \le n\). For all \(s\in S\), the lex(icographic) value function \(^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}:S \rightarrow [0,1]^n\) is defined as:
\[{}^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}(s) = \sup _{\sigma \in \Sigma _{\textsf{Max}}} \inf _{\tau \in \Sigma _{\textsf{Min}}} \mathbb {P}^{\sigma ,\tau }_s(\varvec{\Omega }) \tag{1}\]
where \(\mathbb {P}^{\sigma ,\tau }_s(\varvec{\Omega })\) denotes the vector \((\mathbb {P}^{\sigma ,\tau }_s(\Omega _1),\ldots ,\mathbb {P}^{\sigma ,\tau }_s(\Omega _n)) \in [0,1]^n\) and the suprema and infima are taken with respect to the order \(\le _{\textsf{lex}}\) on \([0,1]^n\).
Thus the lex-value at state s is the lexicographically supremal vector of probabilities that \(\textsf{Max}\) can ensure against all possible behaviors of \(\textsf{Min}\) when the game starts in s. We prove in Sect. 3.2.3 that the supremum and infimum in (1) can be exchanged in the case of (possibly) mixed reachability and safety objectives; this property is called determinacy. We omit the superscript \(\varvec{\Omega }\) in \(^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}\) if it is clear from the context.
Example 1
(SG and lex-values) Consider the SG sketched in Fig. 1a with the lex-objective \(\varvec{\Omega }= (\textsf{Reach}\left( S_1\right) , \textsf{Safe}\left( S_2\right) )\). Player \(\textsf{Max}\) must thus maximize the probability to reach \(S_1\) and, moreover, among all optimal strategies for \(\textsf{Reach}\left( S_1\right) \), it must choose one that maximizes the probability to avoid \(S_2\) forever.
2.2.1 Lex-value of actions and lex-optimal actions
We extend the notion of value to actions. Let \(s \in S\). The lex-value of an action \(a \in \textsf{Act}(s)\) is defined as \(\textbf{v}^{\textsf{lex}}(s,a) = \sum _{s'}P(s,a,s')\cdot \textbf{v}^{\textsf{lex}}(s')\). If \(s \in S_{\square }\), then action a is called lex-optimal if \(\textbf{v}^{\textsf{lex}}(s,a) = \max _{b \in \textsf{Act}(s)}\textbf{v}^{\textsf{lex}}(s,b)\). Similarly, if \(s \in S_{\lozenge }\), then a is called lex-optimal if \(\textbf{v}^{\textsf{lex}}(s,a) = \min _{b \in \textsf{Act}(s)}\textbf{v}^{\textsf{lex}}(s,b)\). There exists at least one lex-optimal action because \(\textsf{Act}(s)\) is finite by definition.
Example 2
(Lex-values of actions) We now intuitively explain the lex-values of all states in Fig. 1a. The lex-value of sink states s, t, u and w is determined by their membership in the sets \(S_1\) and \(S_2\). E.g., \(\textbf{v}^{\textsf{lex}}(s) = (1,1)\), as it is part of the set \(S_1\) that should be reached and not part of the set \(S_2\) that should be avoided. Similarly we get the lex-values of t, u and w as (1, 0), (0, 0) and (0, 1) respectively. State v has a single action that yields (0, 0) or (0, 1) each with probability \(1/2\), thus \(\textbf{v}^{\textsf{lex}}(v) = (0,1/2)\).
State p has one action going to s, which would yield (1, 1). However, as p is a \(\textsf{Min}\)-state, its best strategy is to avoid giving such a high value. Thus, it uses the action going downwards and \(\textbf{v}^{\textsf{lex}}(p)=\textbf{v}^{\textsf{lex}}(q)\). State q only has a single action going to r, so \(\textbf{v}^{\textsf{lex}}(q)=\textbf{v}^{\textsf{lex}}(r)\).
State r has three choices: (i) Going back to q, which results in an infinite loop between q and r, and thus never reaches \(S_1\). So a strategy that commits to this action will not achieve the optimal value. (ii) Going to t or u each with probability \(1/2\). In this case, the safety objective is definitely violated, but the reachability objective is achieved with probability \(1/2\). (iii) Going to t or v each with probability \(1/2\). Similarly to (ii), the probability to reach \(S_1\) is \(1/2\), but additionally, there is a \(1/2 \cdot 1/2\) chance to avoid \(S_2\). Thus, since r is a \(\textsf{Max}\)-state, its lex-optimal choice is the action leading to t or v and we get \(\textbf{v}^{\textsf{lex}}(r) = (1/2, 1/4)\).
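The arithmetic of Example 2 can be replayed with exact rationals. The sketch below encodes only what the prose states. Since Fig. 1a is not reproduced here, the successors of v are assumed to be the sinks u and w, matching the values (0, 0) and (0, 1). The helper name `mix` is hypothetical. Choice (i) is left out because its one-step value equals \(\textbf{v}^{\textsf{lex}}(q) = \textbf{v}^{\textsf{lex}}(r)\) and therefore cannot be obtained by mixing already known values.

```python
from fractions import Fraction

half = Fraction(1, 2)


def mix(dist, values):
    """Lex-value of an action: probability-weighted sum of successor lex-values."""
    n = len(next(iter(values.values())))
    return tuple(sum(p * values[t][i] for t, p in dist.items()) for i in range(n))


# Lex-values of the sinks, as (P(Reach(S1)), P(Safe(S2))).
values = {"s": (1, 1), "t": (1, 0), "u": (0, 0), "w": (0, 1)}

# v: single action to u or w (assumed successors), each with probability 1/2.
values["v"] = mix({"u": half, "w": half}, values)

# Choices (ii) and (iii) of the Max-state r.
action_ii = mix({"t": half, "u": half}, values)
action_iii = mix({"t": half, "v": half}, values)

# Tuples are compared lexicographically, so max picks the lex-greater vector.
values["r"] = max(action_ii, action_iii)
```

Both actions reach \(S_1\) with probability 1/2, so the second component decides: (iii) keeps a 1/4 chance of avoiding \(S_2\), whereas (ii) keeps none.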
2.2.2 Lex-optimal strategies
Definition 5
(Lex-optimal strategies) A strategy \(\sigma \in \Sigma _{\textsf{Max}}\) is lex-optimal for \(\varvec{\Omega }\) if for all \(s \in S\), \(\textbf{v}^{\textsf{lex}}(s) = \inf _{\tau ' \in \Sigma _{\textsf{Min}}} \mathbb {P}_s^{\sigma ,\tau '}(\varvec{\Omega })\). A strategy \(\tau \in \Sigma _{\textsf{Min}}\) is a lex-optimal counter-strategy against \(\sigma \) if \(\mathbb {P}_s^{\sigma ,\tau }(\varvec{\Omega }) = \inf _{\tau ' \in \Sigma _{\textsf{Min}}} \mathbb {P}_s^{\sigma ,\tau '}(\varvec{\Omega })\).
Further, a strategy \(\sigma \) of \(\textsf{Max}\) (\(\textsf{Min}\), resp.) is called locally lex-optimal if for all \(\pi \in (S \times L)^*\), \(s \in S_{\square }\) (\(s \in S_{\lozenge }\), resp.) and \(a \in \textsf{Act}(s)\), we have \(\sigma (\pi s)(a) > 0\) implies that action a is lex-optimal. Thus, locally lex-optimal strategies only assign positive probability to lex-optimal actions. Locally lex-optimal strategies are not necessarily (globally) lex-optimal, see Example 4.
We stress that in general, counterstrategies of \(\textsf{Min}\) may depend on the strategy chosen by \(\textsf{Max}\); this is because of the quantification order “\(\sup \inf \)”.
3 Lexicographic 2-player stochastic games
In this section, we derive properties and algorithms for SG with lexicographic reachability and safety objectives. Formally, a lexicographic reachability-safety objective (reach-safe lex-objective, for short) for a game \(\mathcal {G}\) with state space S is a vector \(\varvec{\Omega }= (\Omega _1,\ldots ,\Omega _n)\) such that \(\Omega _i \in \{\textsf{Reach}\left( S_i\right) , \textsf{Safe}\left( S_i\right) \}\) with \(S_i \subseteq S\) for all \(1\le i \le n\). Note that arbitrary alternations of reachability and safety objectives are allowed. We call \(\varvec{\Omega }\) absorbing if \(S_i \subseteq \textsf{Sinks}(\mathcal {G})\) for all \(1 \le i \le n\). Intuitively, games with absorbing objectives terminate once a single reachability or safety objective in \(\varvec{\Omega }\) has been satisfied or violated, respectively, which somewhat simplifies the analysis.
The rest of this section is structured as follows. In Sect. 3.1, we treat absorbing reach-safe lex-objectives, and we reduce the non-absorbing case to the absorbing setting in Sect. 3.2.
3.1 Lexicographic reach-safe SG with absorbing targets
This section deals with computing the values and optimal strategies of SG with absorbing reach-safe lex-objectives. In Sect. 3.1.1, we prove a structural result (Theorem 1) about optimal strategies which implies in particular that MD optimal strategies exist (Theorem 2). The subsequent Sect. 3.1.2 presents our algorithm. The main technical difficulty arises from interleaving reachability and safety objectives.
3.1.1 Characterizing optimal strategies
This first subsection derives a characterization of lexoptimal strategies in terms of local optimality and an additional reachability condition (Theorem 1 further below). It is one of the key ingredients for the correctness of the algorithm presented later and also gives rise to a (nonconstructive) proof of the existence of MD lexoptimal strategies in the absorbing case.
We begin with the following lemma that summarizes some straightforward facts that we will frequently use. Recall that a strategy is locally lexoptimal if it only selects actions with optimal lexvalue.
Lemma 1
The following statements hold for every SG \(\mathcal {G}\) with absorbing reachsafe lexobjective \(\varvec{\Omega }\):

(a)
If \(\sigma \in \Sigma _{\textsf{Max}}^{\textsf{MD}}\) is lexoptimal and \(\tau \in \Sigma _{\textsf{Min}}^{\textsf{MD}}\) is a lexoptimal counter strategy against \(\sigma \), then \(\sigma \) and \(\tau \) are both locally lexoptimal. We do not (yet) claim that MD optimal strategies \({\sigma ,\tau }\) exist in general.

(b)
Let \(\widetilde{\mathcal {G}}\) be the subgame obtained by removing all actions (of both players) that are not locally lexoptimal in \(\mathcal {G}\). Let \(\widetilde{\textbf{v}}^\textsf{lex}\) be the lexvalues in \(\widetilde{\mathcal {G}}\). Then \(\widetilde{\textbf{v}}^\textsf{lex}= \textbf{v}^{\textsf{lex}}\).
Proof
Both claims follow from the definitions of lexvalue and lexoptimal strategy. For (b) in particular, a strategy using actions that are not lexoptimal can be transformed into a strategy that achieves a greater (lower, resp.) value. Thus removing the nonlexoptimal actions does not affect the lexvalue. \(\square \)
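The restriction to \(\widetilde{\mathcal {G}}\) can be illustrated with a minimal Python sketch. It assumes that the lexvalues are already known and that an action counts as lexoptimal when the probability-weighted sum of its successors' lexvalue vectors equals the lexvalue of the state. The function names, the dictionary encoding of the game and the toy instance are hypothetical and are not taken from Fig. 1.

```python
from fractions import Fraction

def action_vector(dist, lex_values):
    """Probability-weighted sum of the successors' lexvalue vectors."""
    n = len(next(iter(lex_values.values())))
    return tuple(sum(p * lex_values[t][i] for t, p in dist.items())
                 for i in range(n))

def restrict_to_lex_optimal(actions, lex_values):
    """Keep in every state only the actions whose vector equals the state's lexvalue."""
    return {
        s: {a: dist for a, dist in acts.items()
            if action_vector(dist, lex_values) == lex_values[s]}
        for s, acts in actions.items()
    }

# Hypothetical toy game with the single objective Reach({win}).
half = Fraction(1, 2)
actions = {
    "a": {"good": {"win": half, "lose": half}, "bad": {"lose": Fraction(1)}},
    "win": {"stay": {"win": Fraction(1)}},
    "lose": {"stay": {"lose": Fraction(1)}},
}
# Assumed lexvalues for this toy game (given as input, not computed here).
lex_values = {"a": (half,), "win": (Fraction(1),), "lose": (Fraction(0),)}

restricted = restrict_to_lex_optimal(actions, lex_values)
```

Exact rationals are used so that the equality test is not affected by rounding; with floating-point values one would need a tolerance instead.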
Example 3
(Modified game \(\widetilde{\mathcal {G}}\)) Consider again the SG from Fig. 1a. Recall the lexvalues from Example 2. Now we remove the actions that are not locally lexoptimal. This means we drop the action that leads from p to s and the action that leads from r to t or u (Fig. 1b). Since these actions were not used by the lexoptimal strategies, the value in the modified SG is the same as that of the original game.
Example 4
(Locally lexoptimal does not imply globally lexoptimal) Note that in the subgame in Fig. 1, we do not drop the action that leads from r to q, because \(\textbf{v}^{\textsf{lex}}(r)=\textbf{v}^{\textsf{lex}}(q)\), so this action is locally lexoptimal. In fact, a lexoptimal strategy for \(\textsf{Max}\) can use it arbitrarily many times without reducing the lexvalue, as long as it eventually picks the action leading from r to t or v. However, if \(\textsf{Max}\) only played the action leading to q, the lexvalue would be reduced to (0, 1) as we would not reach \(S_1\), but would also avoid \(S_2\).
We stress the following consequence of this: Playing a locally lexoptimal strategy is not necessarily globally lexoptimal. It is not sufficient to just restrict the game to locally lexoptimal actions of the previous objectives and then solve the current one. Note that in fact the optimal strategy for the second objective \(\textsf{Safe}\left( S_2\right) \) would be to remain in \(\{q,r\}\); however, \(\textsf{Max}\) must not pick this safety strategy before it has “tried everything” for all previous reachability objectives, in this case reaching \(S_1\).
The final set
This idea of “trying everything” for an objective \(\textsf{Reach}\left( S_i\right) \) is equivalent to the following: either reach the target set \(S_i\), or reach a set of states from which \(S_i\) cannot be reached anymore. Formally, let \(\textsf{Zero}_i = \{ s \in S \mid \textbf{v}^{\textsf{lex}}_i(s) = 0 \}\) be the set of states where \(\textsf{Max}\) cannot enforce reaching \(S_i\) with positive probability if \(\textsf{Max}\) plays optimally w.r.t. the more important targets \(j < i\). Note that \(\textsf{Zero}_i\) indeed depends on the lexvalue, and not on the singleobjective value of reaching \(S_i\). This is important as the singleobjective value could be greater than 0, but a more important objective would have to be sacrificed to achieve it.
We define the set of states where we have “tried everything” for all reachability objectives as follows:
Definition 6
(Final set) For absorbing \(\varvec{\Omega }= (\Omega _1,\ldots ,\Omega _n)\) and \(1 \le i \le n{+}1\), let \(R_{<i} = \{j<i \mid \Omega _j = \textsf{Reach}\left( S_j\right) \}\). We define the Final Set
$$\begin{aligned} F_{<i} = \bigcup _{k \in R_{<i}} S_k \ \cup \bigcap _{k \in R_{<i}} \textsf{Zero}_k \end{aligned}$$
(1)
with the convention that \(F_{<i} = S\) if \(R_{<i} = \emptyset \). We also let \(F = F_{<n+1}\).
In other words, the Final Set F contains all target states as well as the states whose lexvalue vector has zero entries at all positions corresponding to reachability objectives. The latter is necessary because as long as a state still has a positive lexvalue w.r.t. at least one reachability objective, the optimal behaviour of \(\textsf{Max}\) is trying to reach that. Otherwise, \(\textsf{Max}\) would not have “tried everything”.
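As a minimal sketch of this description, the following Python function collects the targets of the reachability objectives together with the states whose lexvalue is zero at every reachability position. The encoding of objectives as `(kind, target_set)` pairs, the function name `final_set` and the toy lexvalues are assumptions made for illustration; the lexvalues are taken as given input.

```python
def final_set(states, objectives, lex_values, i=None):
    """Return F_{<i} (1-based i); the default i = n + 1 yields F.

    objectives: list of ("reach" | "safe", target_set) pairs, most important first.
    lex_values: dict mapping each state to its lexvalue vector (a tuple).
    """
    n = len(objectives)
    if i is None:
        i = n + 1
    # 0-based indices of the reachability objectives among the first i - 1.
    reach_idx = [j for j in range(i - 1) if objectives[j][0] == "reach"]
    if not reach_idx:
        return set(states)  # convention: F_{<i} = S if there is no such objective
    targets = set().union(*(objectives[j][1] for j in reach_idx))
    all_zero = {s for s in states
                if all(lex_values[s][j] == 0 for j in reach_idx)}
    return targets | all_zero

# Hypothetical toy instance: Reach({win}) followed by Safe({bad}).
states = {"a", "win", "lose", "bad"}
objectives = [("reach", {"win"}), ("safe", {"bad"})]
lex_values = {"a": (0.5, 1), "win": (1, 1), "lose": (0, 1), "bad": (0, 0)}

F = final_set(states, objectives, lex_values)
F_before_first = final_set(states, objectives, lex_values, i=1)
```

Here state `a` is not in `F` because its lexvalue for the reachability objective is still positive, which matches the intuition that \(\textsf{Max}\) has not yet “tried everything” there.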
Example 5
(Final set) For the game in Fig. 1 with objectives \(\Omega _1 = \textsf{Reach}\left( S_1\right) , \Omega _2 = \textsf{Safe}\left( S_2\right) \), we have \(\textsf{Zero}_1 = \{u,v,w\}\) and thus \(F = \textsf{Zero}_1 \cup S_1 = \{s,t,u,v,w\}\). An MD lexoptimal strategy of \(\textsf{Max}\) must almost surely reach this set against any strategy of \(\textsf{Min}\); only then has it “tried everything”.
We can now characterize MD lexoptimal strategies in terms of local lexoptimality and the Final Set.
Theorem 1
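Let \(\varvec{\Omega }\) be an absorbing reachsafe lexobjective and \(\sigma \in \Sigma _{\textsf{Max}}^{\textsf{MD}}\). Then \(\sigma \) is lexoptimal for \(\varvec{\Omega }\) if and only if \(\sigma \) is locally lexoptimal and for all \(s \in S\) we have
$$\begin{aligned} \forall \tau \in \Sigma _{\textsf{Min}}^{\textsf{MD}} :\ \mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( F\right) ) = 1 \end{aligned}$$
(2)
where F is the Final Set from Definition 6.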
Proof sketch
The “if” direction is shown by induction on the number n of targets. We make a case distinction according to the type of \(\Omega _n\): If it is safety, then we prove that local lexoptimality is already sufficient for global lexoptimality. If \(\Omega _n\) is reachability, then intuitively, the additional condition (2) ensures that the strategy \(\sigma \) indeed “tries everything” and either reaches the target \(S_n\) or eventually a state in \(\textsf{Zero}_n\) where the opponent \(\textsf{Min}\) can make sure that \(\textsf{Max}\) cannot escape. The technical details of these assertions rely on a fixed point characterization of the reachability probabilities and the classical Knaster-Tarski fixed point theorem [48].
For the “only if” direction recall that lexoptimal strategies are necessarily locally lexoptimal by Lemma 1 (a). Further, let i be such that \(\Omega _i = \textsf{Reach}\left( S_i\right) \) and assume for contradiction that \(\sigma \) remains forever within \(S \setminus (S_i \cup \textsf{Zero}_i)\) with positive probability against some strategy of \(\textsf{Min}\). But then \(\sigma \) visits states with positive lexvalue for \(\Omega _i\) infinitely often without ever reaching \(S_i\). Thus \(\sigma \) is not globally lexoptimal, contradiction. \(\square \)
Full Proof of Theorem 1
We first prove the following lemma about the special case of MDP:
Lemma 2
Let \(\mathcal {G}\) be an MDP (i.e., \(S_{\square }= \emptyset \) or \(S_{\lozenge }= \emptyset \)) and let \(\varvec{\Omega }\) be an absorbing lexobjective. Then there exists an MD lexoptimal strategy for \(\varvec{\Omega }\) for the respective player.
Proof
We assume w.l.o.g. that \(S_{\lozenge }= \emptyset \); otherwise we can exchange all \(\textsf{Reach}\left( S_i\right) \) for \(\textsf{Safe}\left( S_i\right) \) in \(\varvec{\Omega }\) and swap the roles of \(\textsf{Min}\) and \(\textsf{Max}\). Fix a state \(s \in S\). It is known that the set of points \(\textbf{x} \in [0,1]^n\) such that there exists a strategy \(\sigma \in \Sigma _{\textsf{Max}}\) with
$$\begin{aligned} \mathbb {P}_s^{\sigma }(\varvec{\Omega }) \mathrel {\dot{\ge }} \textbf{x}, \end{aligned}$$
where \(\dot{\ge }\) denotes pointwise inequality, is a closed convex polyhedron \(\mathfrak {P}\) [21, 49] which is contained in \([0,1]^n\). Therefore \(\mathfrak {P}\) contains a maximum \(\textbf{x}^*\) in the order \(\le _{\textsf{lex}}\). Moreover, \(\textbf{x}^*\) is a vertex of \(\mathfrak {P}\), i.e., a point contained in \(\mathfrak {P}\) which is not a proper convex combination of two different points of \(\mathfrak {P}\). If not, then \(\textbf{x}^* = \alpha \textbf{y} + (1-\alpha )\textbf{z}\) for \(\textbf{y} \ne \textbf{z} \in \mathfrak {P}\) and \(0< \alpha < 1\). Let i be the tiebreaker position of \(\textbf{y}\) and \(\textbf{z}\). We can assume w.l.o.g. that \(\textbf{y}_i > \textbf{z}_i\). Since \(\textbf{x}^*\) agrees with \(\textbf{y}\) on all positions before i and \(\textbf{x}^*_i < \textbf{y}_i\), it follows immediately that \(\textbf{y} >_{\textsf{lex}}\textbf{x}^*\), contradicting the fact that \(\textbf{x}^*\) was maximal in \(\mathfrak {P}\). The claim follows because in MDP the vertices of \(\mathfrak {P}\) are achieved by MD strategies [49]. \(\square \)
Before proving Theorem 1, we show the following intermediate result, where we use \(\textsf{Reach}\left( S_i \cup \textsf{Zero}_i\right) \) (for all \(1 \le i \le n\) such that \(\Omega _i = \textsf{Reach}\left( S_i\right) \)) instead of \(\textsf{Reach}\left( F\right) \).
Lemma 3
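Let \(\sigma \in \Sigma _{\textsf{Max}}^{\textsf{MD}}\) and let \(\varvec{\Omega }\) be an absorbing reachsafe lexobjective. Then \(\sigma \) is lexoptimal for \(\varvec{\Omega }\) if and only if \(\sigma \) is locally lexoptimal and for all \(1 \le i \le n\) such that \(\Omega _i = \textsf{Reach}\left( S_i\right) \) and all \(s \in S\) it holds that
$$\begin{aligned} \forall \tau \in \Sigma _{\textsf{Min}}^{\textsf{MD}} :\ \mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( S_i \cup \textsf{Zero}_i\right) ) = 1. \end{aligned}$$
(3)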
Proof
We show the two directions of the “if and only if” statement. Recall that an MC can be simplified to a tuple \(\mathcal {M}= (S,P)\) such that \(P: S \rightarrow \mathcal {D}(S)\).
“if”: We use the following characterization of the reachability probabilities in any (not necessarily finite) Markov chain: The probabilities \(\mathbb {P}_s(\textsf{Reach}\left( S'\right) )\) constitute the least fixed point x(s) of the operator
$$\begin{aligned} \mathcal {R}:[0,1]^S \rightarrow [0,1]^S,\ \mathcal {R}(x)(s) = {\left\{ \begin{array}{ll} 1 &\text {if } s \in S'\\ \sum _{s'} P(s,s') \cdot x(s') &\text {else} \end{array}\right. } \end{aligned}$$
(4)
which is monotonic on the complete lattice \([0,1]^S\) (that is, the set of all mappings from S to [0, 1]) [3]. In a finite MC, the fixed point of \(\mathcal {R}\) can be made unique by requiring additionally that \(\mathcal {R}(x)(s) = 0\) if there is no path from s to \(S'\) in the MC.
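The least fixed point can be approximated from below by iterating the operator (value 1 on target states, weighted average of the successors elsewhere) starting from the all-zero vector. The following Python sketch does this for a finite Markov chain given as a dictionary of successor distributions; the function name, the fixed iteration count and the toy chain are illustrative assumptions, and the result is a numerical approximation rather than an exact solution.

```python
def reach_probabilities(P, target, iterations=1000, start=0.0):
    """Iterate the reachability operator on a finite Markov chain.

    P: dict mapping each state to a dict {successor: probability}.
    start=0.0 approaches the least fixed point from below.
    """
    x = {s: start for s in P}
    for _ in range(iterations):
        x = {s: 1.0 if s in target
             else sum(p * x[t] for t, p in P[s].items())
             for s in P}
    return x

# Hypothetical toy chain: from "a" we loop, or move to one of two sinks.
P = {
    "a": {"a": 0.5, "win": 0.25, "lose": 0.25},
    "win": {"win": 1.0},
    "lose": {"lose": 1.0},
}

least = reach_probabilities(P, {"win"})
other = reach_probabilities(P, {"win"}, start=1.0)
```

Starting from the all-one vector instead leaves the value 1 at the sink `lose`, which cannot reach the target. This is a second fixed point and shows why uniqueness requires the extra condition for states without a path to \(S'\).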
We now prove the “if” direction by induction on n. We first show the inductive step and then argue that the base case \(n=1\) follows with a similar, slightly simpler argument. Let \(n > 1\). Moreover, let \(\sigma \in \Sigma _{\textsf{Max}}^{\textsf{MD}}\) be locally lexoptimal and assume that (3) holds. To prove that \(\sigma \) is lexoptimal, we let \(\tau \in \Sigma _{\textsf{Min}}\) be a lexoptimal counterstrategy against \(\sigma \) in the induced MDP \(\mathcal {G}^\sigma \) and show that \(\mathbb {P}_s^{\sigma ,\tau }(\Omega _i) = \textbf{v}^{\textsf{lex}}_i(s)\) for all \(1 \le i \le n\). By Lemma 2, we can assume that \(\tau \) is MD. By the I.H., \(\sigma \) is already lexoptimal for \(\varvec{\Omega }_{<n} = (\Omega _1,\ldots ,\Omega _{n-1})\). Next observe that since \(\tau \) is a lexoptimal counterstrategy against \(\sigma \), it holds that
$$\begin{aligned} \mathbb {P}_s^{\sigma ,\tau }(\Omega _n) \le \textbf{v}^{\textsf{lex}}_n(s). \end{aligned}$$
(5)
Thus we only need to prove the other inequality “\(\ge \)” in (5). To this end, we make a case distinction according to the type of \(\Omega _n\):

\(\Omega _n = \textsf{Safe}\left( S_n\right) \). Consider the MC \(\mathcal {G}^{\sigma ,\tau }\). Since \({\sigma ,\tau }\) are both MD, this MC has the same finite state space S as the game and its transition probability function is defined as \(P^{\sigma ,\tau }(s,s') = P(s,\sigma (s),s')\) if \(s \in S_{\square }\) and \(P^{\sigma ,\tau }(s,s') = P(s,\tau (s),s')\) if \(s \in S_{\lozenge }\). In \(\mathcal {G}^{\sigma ,\tau }\), the safety probabilities \(\mathbb {P}_s(\Omega _n) = \mathbb {P}_s(\textsf{Safe}\left( S_n\right) )\) constitute the greatest fixed point of the operator
$$\begin{aligned} \mathcal {S}:[0,1]^S \rightarrow [0,1]^S,\ \mathcal {S}(x)(s) = {\left\{ \begin{array}{ll} 0 &\text {if } s \in S_n\\ \sum _{s'} P^{\sigma ,\tau }(s,s') \cdot x(s') &\text {else} \end{array}\right. } \end{aligned}$$
which is obtained from the operator \(\mathcal {R}\) for reachability using the relation \(\mathbb {P}_s(\textsf{Safe}\left( S_n\right) ) = 1 - \mathbb {P}_s(\textsf{Reach}\left( S_n\right) )\). Just like \(\mathcal {R}\), the operator \(\mathcal {S}\) is also monotonic on the complete lattice \([0,1]^S\) and we can apply the well-known theorem of Knaster and Tarski: If we can prove that for all \(s \in S\)
$$\begin{aligned} \textbf{v}^{\textsf{lex}}_n(s) \le \mathcal {S}(\textbf{v}^{\textsf{lex}}_n)(s) \end{aligned}$$
(6)
then this implies \(\textbf{v}^{\textsf{lex}}_n(s) \le (\textsf{gfp}\ \mathcal {S})(s) = \mathbb {P}_s^{\sigma ,\tau }(\Omega _n)\), where \(\textsf{gfp}\ \mathcal {S}\) denotes the greatest fixed point of \(\mathcal {S}\). To prove (6), we let \(s \in S\) and make another case distinction:

\(s \in S_n\). In this case, \(\textbf{v}^{\textsf{lex}}_n(s) = 0 \le \mathcal {S}(\textbf{v}^{\textsf{lex}}_n)(s)\) holds trivially.

\(s \in S_{\square }\setminus S_n\). Then
$$\begin{aligned} \textbf{v}^{\textsf{lex}}(s)&= \max _{a \in \textsf{Act}(s)} \sum _{s'} P(s,a,s') \cdot \textbf{v}^{\textsf{lex}}(s')&\quad (\text {by Lemma 1(b)})\\&= \sum _{s'} P(s,\sigma (s),s') \cdot \textbf{v}^{\textsf{lex}}(s')&\quad (\sigma \text { is locally lexoptimal}) \end{aligned}$$
and thus in particular \(\textbf{v}^{\textsf{lex}}_n(s) = \mathcal {S}(\textbf{v}^{\textsf{lex}}_n)(s)\).

\(s \in S_{\lozenge }\setminus S_n\). Let \(\textbf{v}^{\textsf{lex}}_{<n}(s)\) be the lexvalue with respect to the first \(n-1\) objectives \(\varvec{\Omega }_{<n}\). Since \(\sigma \) is lexoptimal for \(\varvec{\Omega }_{<n}\) and \(\tau \) is a lexoptimal counterstrategy against \(\sigma \), we have that
$$\begin{aligned} \textbf{v}^{\textsf{lex}}_{<n}(s)&= \min _{a \in \textsf{Act}(s)} \sum _{s'} P(s,a,s') \cdot \textbf{v}^{\textsf{lex}}_{<n}(s')&\quad (\text {by Lemma 1(a)})\\&= \sum _{s'} P(s,\tau (s),s') \cdot \textbf{v}^{\textsf{lex}}_{<n}(s') \end{aligned}$$
Let \(\textsf{Act}_{<n}(s)\) be the lexoptimal actions available in s with respect to \(\varvec{\Omega }_{<n}\). By the previous equation, \(\tau (s) \in \textsf{Act}_{<n}(s)\). Therefore,
$$\begin{aligned} \textbf{v}^{\textsf{lex}}_n(s)&= \min _{a \in \textsf{Act}_{<n}(s)} \sum _{s'} P(s,a,s') \cdot \textbf{v}^{\textsf{lex}}_n(s')\\&\le \sum _{s'} P(s,\tau (s),s') \cdot \textbf{v}^{\textsf{lex}}_n(s') = \mathcal {S}(\textbf{v}^{\textsf{lex}}_n)(s). \end{aligned}$$
Thus together with (5) we have \(\textbf{v}^{\textsf{lex}}_n(s) = \mathbb {P}_s^{\sigma ,\tau }(\Omega _n)\) and \(\sigma \) is lexoptimal for \(\varvec{\Omega }= (\Omega _1,\ldots ,\Omega _n)\).


\(\Omega _n = \textsf{Reach}\left( S_n\right) \). This case is similar to, but slightly more involved than, the previous case. As mentioned earlier, in \(\mathcal {G}^{\sigma ,\tau }\) the probabilities \(\mathbb {P}_s(\Omega _n)\) constitute the unique fixed point of the following monotonic operator:
$$\begin{aligned} \mathcal {R}:[0,1]^S \rightarrow [0,1]^S,\ \mathcal {R}(x)(s) = {\left\{ \begin{array}{ll} 1 &\text {if } s \in S_n\\ 0 &\text {if } s \text { cannot reach } S_n \\ \sum _{s'} P(s,s') \cdot x(s') &\text {else} \end{array}\right. } \end{aligned}$$
where the transition probability function P of the Markov chain \(\mathcal {G}^{\sigma ,\tau }\) is defined as before. As in the other case, we prove that \(\textbf{v}^{\textsf{lex}}_n(s) \le \mathcal {R}(\textbf{v}^{\textsf{lex}}_n)(s)\) for all \(s \in S\), which implies \(\textbf{v}^{\textsf{lex}}_n(s) \le (\textsf{gfp}\ \mathcal {R})(s) = \mathbb {P}_s^{\sigma ,\tau }(\Omega _n)\). Notice that the greatest fixed point \(\textsf{gfp}\ \mathcal {R}\) is equal to the unique fixed point of \(\mathcal {R}\). Let \(s \in S\) and let us again make a case distinction to prove \(\textbf{v}^{\textsf{lex}}_n(s) \le \mathcal {R}(\textbf{v}^{\textsf{lex}}_n)(s)\):

If \(s \in S_n\), then \(\textbf{v}^{\textsf{lex}}_n(s) = 1 = \mathcal {R}(\textbf{v}^{\textsf{lex}}_n)(s)\).

The cases where s can reach \(S_n\) but \(s \notin S_n\) can be shown exactly as in the previous case where \(\Omega _n\) was safety.

Now suppose s cannot reach \(S_n\) in \(\mathcal {G}^{\sigma ,\tau }\), i.e., \(\mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( S_n\right) ) = 0\). In this case we need to show that \(\textbf{v}^{\textsf{lex}}_n(s) = 0\), or equivalently, \(s \in \textsf{Zero}_n\). By condition (3), we have for all \(t \in S\) that
$$\begin{aligned} 1&= \mathbb {P}_t^{\sigma ,\tau }(\textsf{Reach}\left( S_n \cup \textsf{Zero}_n \right) ) \\&= \mathbb {P}_t^{\sigma ,\tau }(\textsf{Reach}\left( S_n\right) ) + \mathbb {P}_t^{\sigma ,\tau }(\textsf{Reach}\left( \textsf{Zero}_n\right) )\\&= \mathbb {P}_t^{\sigma ,\tau }(\textsf{Reach}\left( S_n\right) ) + 1 - \mathbb {P}_t^{\sigma ,\tau }(\textsf{Safe}\left( \textsf{Zero}_n\right) ) \end{aligned}$$
and thus \(\mathbb {P}_t^{\sigma ,\tau }(\textsf{Reach}\left( S_n\right) ) = \mathbb {P}_t^{\sigma ,\tau }(\textsf{Safe}\left( \textsf{Zero}_n\right) )\). Therefore, \(\sigma \) is also locally lexoptimal for the objective \((\Omega _1,\Omega _2,\ldots ,\textsf{Safe}\left( \textsf{Zero}_n\right) )\). But then we can show exactly as in the previous case that \(\textbf{v}^{\textsf{lex}}_n(t) \le \mathcal {S}(\textbf{v}^{\textsf{lex}}_n)(t)\) where \(\mathcal {S}\) is the fixed point operator for safety probabilities associated to the objective \(\textsf{Safe}\left( \textsf{Zero}_n\right) \). Thus \(\textbf{v}^{\textsf{lex}}_n(s) \le (\textsf{gfp}\ \mathcal {S})(s) = \mathbb {P}_s^{\sigma ,\tau }(\textsf{Safe}\left( \textsf{Zero}_n\right) ) = 0\).

Finally, for the base case \(n=1\) observe that the same reasoning applies with the simplification that we do not need to care about previous targets. In particular, we do not need to apply the I.H.
“only if”: Let \(\sigma \in \Sigma _{\textsf{Max}}^{\textsf{MD}}\) be lexoptimal. First observe that \(\sigma \) is also locally lexoptimal by Lemma 1(a). Now let i be such that \(\Omega _i = \textsf{Reach}\left( S_i\right) \), let \(s \in S\) be any state and let \(\tau \in \Sigma _{\textsf{Min}}^{\textsf{MD}}\). It remains to show (3). Assume for contradiction that \(\mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( S_i \cup \textsf{Zero}_i\right) ) < 1\). This means that in the finite Markov chain \(\mathcal {G}^{\sigma ,\tau }\), there exists a reachable bottom strongly connected component (BSCC, see [3, Ch. 10]) \(B \subseteq S\) such that \(B \cap (S_i \cup \textsf{Zero}_i) = \emptyset \). Thus for every state \(t \in B\), we have \(\textbf{v}^{\textsf{lex}}_i(t) > 0\). Further it holds that \(\mathbb {P}_{t}^{\sigma ,\tau }(\textsf{Reach}\left( S_i\right) ) =0\) because t can only reach states inside B, but \(B \cap S_i = \emptyset \). However, this is a contradiction to the lexoptimality of \(\sigma \). \(\square \)
We now conclude the proof of Theorem 1:
Proof of Theorem 1
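Let \(\sigma \) be locally lexoptimal, let \(s \in S\) and let \(\tau \in \Sigma _{\textsf{Min}}^{\textsf{MD}}\). We show the following equivalence:
$$\begin{aligned} \mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( F\right) ) = 1 \quad \Longleftrightarrow \quad \forall i \in R :\ \mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( S_i \cup \textsf{Zero}_i\right) ) = 1 \end{aligned}$$
where \(R = \{i \le n \mid \Omega _i = \textsf{Reach}\left( S_i\right) \}\). The equivalence states that conditions (3) and (2) are equivalent and thus Lemma 3 is equivalent to Theorem 1. For \(R = \emptyset \) there is nothing to show, so we let \(R \ne \emptyset \).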
To show direction “\(\Rightarrow \)”, assume for contradiction that the left-hand side holds but \(\mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( S_i \cup \textsf{Zero}_i\right) ) < 1\) for some \(i \in R\). Then in the finite MC \(\mathcal {G}^{\sigma ,\tau }\) there exists a BSCC B which is reachable from s with positive probability and \(B \cap (S_i \cup \textsf{Zero}_i) = \emptyset \). Thus if \(t \in B\), then \(t \notin S_i\) and \(t \notin \textsf{Zero}_i\). Thus \(t \notin F\), contradiction because t is reachable from s with positive probability.
For direction “\(\Leftarrow \)”, the argument is similar. Suppose that the right-hand side holds but \(\mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( F\right) ) < 1\). Then in the finite MC \(\mathcal {G}^{\sigma ,\tau }\) there exists a BSCC B which is reachable from s with positive probability and \(B \cap F = \emptyset \). Let \(t \in B\). Then since \(t \notin F\), we have by definition that \(t \notin S_i\) for all \(i \in R\) and \(t \notin \textsf{Zero}_j\) for some \(j \in R\). But this is a contradiction to \(\mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( S_j \cup \textsf{Zero}_j\right) ) = 1\) because t is reachable from s with positive probability. \(\square \)
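The BSCC arguments above rest on a purely graph-theoretic fact about finite Markov chains: a set is reached almost surely from s exactly when every state that can be visited from s before hitting the set can still reach the set. The following Python sketch checks this condition. It is an illustration under stated assumptions (the chain is given as a dictionary of successor distributions; the function name and the two toy chains are hypothetical), not a procedure taken from the paper.

```python
def reaches_almost_surely(P, start, final):
    """Qualitative check of P_start(Reach(final)) = 1 in a finite Markov chain.

    P: dict mapping each state to a dict {successor: probability}.
    """
    # Forward search: states visited from `start` before entering `final`.
    seen, stack = set(), [start]
    while stack:
        s = stack.pop()
        if s in seen or s in final:
            continue
        seen.add(s)
        stack.extend(t for t, p in P[s].items() if p > 0)
    # Backward fixed point: states with some path into `final`.
    can_reach = set(final)
    changed = True
    while changed:
        changed = False
        for s in P:
            if s not in can_reach and any(
                p > 0 and t in can_reach for t, p in P[s].items()
            ):
                can_reach.add(s)
                changed = True
    return seen <= can_reach

# Hypothetical chain 1: q and r form a closed cycle that avoids the set.
stuck = {"q": {"r": 1.0}, "r": {"q": 1.0}, "t": {"t": 1.0}}
# Hypothetical chain 2: from r there is always a chance to leave the cycle.
escaping = {"q": {"r": 1.0}, "r": {"q": 0.5, "t": 0.5}, "t": {"t": 1.0}}

stuck_result = reaches_almost_surely(stuck, "q", {"t"})
escaping_result = reaches_almost_surely(escaping, "q", {"t"})
```

In the first chain the cycle is a BSCC disjoint from the set, so the check fails; in the second chain the only BSCC is the sink `t`, so the set is reached with probability 1 even though the cycle may be traversed arbitrarily often.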
Existence of MD lexoptimal strategies
The characterization from Theorem 1 also allows us to prove that MD lexoptimal strategies exist for absorbing reachsafe lexobjectives.
Theorem 2
In every SG with absorbing reachsafe lexobjective \(\varvec{\Omega }\), there exist MD lexoptimal strategies for both players.
Proof sketch
We consider the subgame \(\widetilde{\mathcal {G}}\) obtained by removing lexsuboptimal actions for both players and then show that the (singleobjective) value of \(\textsf{Reach}\left( F\right) \) in \(\widetilde{\mathcal {G}}\) equals 1. An optimal MD strategy for \(\textsf{Reach}\left( F\right) \) exists [1]; further, it is locally lexoptimal, because we are in \(\widetilde{\mathcal {G}}\), and it reaches F almost surely. Thus, it is lexoptimal for \(\varvec{\Omega }\) by the “if” direction of Theorem 1. \(\square \)
Full proof
Let \(\widetilde{\mathcal {G}}\) be the game obtained by removing lexsuboptimal actions in \(\mathcal {G}\) for both players. Let \(v(s)\) be the value of state \(s \in S\) for the objective \(\textsf{Reach}\left( F\right) \) in the modified game \(\widetilde{\mathcal {G}}\), where F is the Final Set like in Theorem 1 (we can assume that \(R\ne \emptyset \)). We show that \(v(s) = 1\) for all \(s \in S\). Assume towards contradiction that there exists a state s with \(v(s) < 1\).

If \(s \in \textsf{Sinks}(\mathcal {G})\), then either \(s \in S_i\) for some \(i \in R\), or otherwise s is a sink which is not contained in any of the \(S_i\) with \(i \in R\) and thus \(s \in \textsf{Zero}_i\) for all \(i \in R\). Thus \(s \in F\) by definition of F and \(v(s) = 1\), contradiction.

Let \(s \notin \textsf{Sinks}(\mathcal {G})\). Let \(\sigma \) be an MD optimal strategy for \(\textsf{Reach}\left( F\right) \) in \(\widetilde{\mathcal {G}}\) and let \(\tau \) be an MD optimal counterstrategy. Notice that such strategies exist because we are only considering a single objective [1]. As usual, we consider the finite MC \(\widetilde{\mathcal {G}}^{\sigma ,\tau }\). Since \(v(s) < 1\), we have \(\mathbb {P}^{\sigma ,\tau }_s(\textsf{Reach}\left( F\right) ) < 1\) which means that there is a BSCC \(B \subseteq S\) in \(\widetilde{\mathcal {G}}^{\sigma ,\tau }\) such that \(B \cap F = \emptyset \) and \(\mathbb {P}^{\sigma ,\tau }_s(\textsf{Reach}\left( B\right) ) > 0\). Let \(t \in B\) be any state in the BSCC. Then clearly, \(\mathbb {P}^{\sigma ,\tau }_t(\textsf{Reach}\left( F\right) ) = 0\) and thus \(v(t) = 0\) because \(\sigma \) is optimal for \(\textsf{Reach}\left( F\right) \). But since \(t \notin F\), we have by definition of F that \(\exists i \in R :t \notin \textsf{Zero}_i\), which means that \(\textbf{v}^{\textsf{lex}}_i(t) > 0\). Notice that here, \(\textbf{v}^{\textsf{lex}}\) are the lexvalues in the original game \(\mathcal {G}\), however by Lemma 1(b), they coincide with the lexvalues in \(\widetilde{\mathcal {G}}\). Thus since \(\textbf{v}^{\textsf{lex}}_i(t) > 0\), there is a strategy of \(\textsf{Max}\) in \(\widetilde{\mathcal {G}}\) that reaches \(S_i\) with positive probability against all counterstrategies of \(\textsf{Min}\) and thus also reaches F with positive probability because \(S_i \subseteq F\). This is a contradiction to \(v(t) = 0\).
\(\square \)
3.1.2 Algorithm for SG with absorbing targets
Theorem 2 is not constructive because it relies on the values \(\textbf{v}^{\textsf{lex}}\) without showing how to compute them. Computing the values and constructing an optimal strategy for \(\textsf{Max}\) in the case of an absorbing reachsafe lexobjective is the topic of this subsection.
Definition 7
(QRO) Consider an SG with state space S and a function \(q:S' \rightarrow [0,1]\) for some \(S' \subseteq S\). The quantified reachability objective (QRO) \(\textsf{Reach}\left( q\right) \) is defined as follows: For all strategies \(\sigma \) and \(\tau \),
$$\begin{aligned} \mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( q\right) ) = \sum _{t \in S'} \mathbb {P}_s^{\sigma ,\tau }\bigl ( (S \setminus S')\ \textsf{U}\ t \bigr ) \cdot q(t), \end{aligned}$$
where \((S \setminus S')\ \textsf{U}\ t\) denotes the set of paths whose first state in \(S'\) is t.
Intuitively, a QRO generalizes its standard Boolean counterpart by additionally assigning a [0, 1]-valued weight – or reward – to the states in the target set \(S'\). The probability of a QRO in some (induced) MC with a given initial state s is defined as the expected value of the reward received when reaching \(S'\). Clearly, this number is in [0, 1]. Note that it does not depend on whatever happens after reaching \(S'\); in fact, it is unaffected by making all states in \(S'\) absorbing.
In Sect. 3.2, we also need the dual notion of quantified safety objectives, whose probability is defined as \(\mathbb {P}^{{\sigma ,\tau }}_s(\textsf{Safe}\left( q\right) ) = 1 - \mathbb {P}^{{\sigma ,\tau }}_s(\textsf{Reach}\left( q\right) )\). Intuitively, maximizing a quantified safety objective is equivalent to minimizing the probability of the dual QRO.
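In an induced Markov chain, the probability of a QRO can be approximated with the same kind of iteration as ordinary reachability, except that a state t in \(S'\) receives the value q(t) instead of 1. The Python sketch below illustrates this and the dual quantified safety value; the function name, the iteration count and the toy chain are assumptions made for illustration.

```python
def qro_value(P, q, iterations=1000):
    """Approximate the expected reward q(t) of the first state t in S' that is hit.

    P: dict mapping each state to a dict {successor: probability}.
    q: dict mapping each state of S' to its weight in [0, 1].
    """
    x = {s: 0.0 for s in P}
    for _ in range(iterations):
        x = {s: q[s] if s in q
             else sum(p * x[t] for t, p in P[s].items())
             for s in P}
    return x

# Hypothetical toy chain: "a" loops or moves to one of the weighted states.
P = {
    "a": {"a": 0.5, "u": 0.25, "v": 0.25},
    "u": {"u": 1.0},
    "v": {"v": 1.0},
}
q = {"u": 0.0, "v": 0.5}

reach_q = qro_value(P, q)
safe_q = {s: 1.0 - val for s, val in reach_q.items()}
```

From `a`, the states `u` and `v` are each the first state of \(S'\) with probability 1/2, so the QRO value is 0 · 1/2 + 1/2 · 1/2 = 1/4 and the dual quantified safety value is 3/4.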
Remark 1
The standard Boolean reachability objective \(\textsf{Reach}\left( S'\right) \) is a special case of a QRO with \(q(s) = 1\) for all \(s \in S'\). Vice versa, a QRO can be easily reduced to a standard reachability objective \(\textsf{Reach}\left( S'\right) \): Convert all states \(t \in S'\) into sinks, then for each such t prepend a new state \(t'\) with a single action a and \(P(t',a,t)=q(t)\) and \(P(t',a,\bot )=1-q(t)\) where \(\bot \) is a sink state. Finally, redirect all transitions leading into t to \(t'\). Despite this equivalence, it turns out to be convenient and natural to use QROs.
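The reduction from a QRO to a Boolean reachability objective can be sketched for a Markov chain as follows (in a game, the same transformation would be applied per action). The sketch assumes string-valued state names, a fresh sink name `bot` that does not occur in the chain, and primed copies named by appending an apostrophe; all of these are illustrative choices rather than notation from the paper.

```python
def reduce_qro(P, q, sink="bot"):
    """Build a chain in which Reach(S') has the same value as the QRO Reach(q)."""
    prime = {t: t + "'" for t in q}           # the prepended states t'
    R = {}
    for s, dist in P.items():
        if s in q:
            R[s] = {s: 1.0}                    # every target state becomes a sink
            continue
        redirected = {}
        for t, p in dist.items():
            key = prime.get(t, t)              # transitions into t now enter t'
            redirected[key] = redirected.get(key, 0.0) + p
        R[s] = redirected
    for t, t_prime in prime.items():
        R[t_prime] = {t: q[t], sink: 1.0 - q[t]}
    R[sink] = {sink: 1.0}
    return R

def reach_probabilities(P, target, iterations=1000):
    """Approximate reachability probabilities by iteration from the zero vector."""
    x = {s: 0.0 for s in P}
    for _ in range(iterations):
        x = {s: 1.0 if s in target
             else sum(p * x[t] for t, p in P[s].items())
             for s in P}
    return x

# Hypothetical toy chain with weights q(u) = 0 and q(v) = 1/2.
P = {
    "a": {"a": 0.5, "u": 0.25, "v": 0.25},
    "u": {"u": 1.0},
    "v": {"v": 1.0},
}
q = {"u": 0.0, "v": 0.5}

reduced = reduce_qro(P, q)
values = reach_probabilities(reduced, set(q))
```

In the reduced chain, `a` reaches `u'` and `v'` with probability 1/2 each, and only `v'` moves on to a target (with probability 1/2), so the Boolean reachability value from `a` is 1/4, the QRO value of the original chain.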
Example 6
(QRO) Example 4 illustrated that solving a safety objective after a reachability objective can lead to problems, as the optimal strategy for \(\textsf{Safe}\left( S_2\right) \) did not use the action that actually reached \(S_1\). In Example 5, we indicated that the Final Set \(F = \{s,t,u,v,w\}\) has to be reached with probability 1, and among the states of F the ones with the highest safety values should be preferred. This requirement can be encoded in a QRO as follows: Compute the values for the \(\textsf{Safe}\left( S_2\right) \) objective for the states in F. Then construct the function \(q_2 :F \rightarrow [0,1]\) that maps all states in F to their safety value, i.e., \(q_2: \{\, s \mapsto 1,\, t \mapsto 0,\, u \mapsto 0,\, v \mapsto 1/2,\, w \mapsto 1 \,\}\).
With QRO, we can effectively reduce (interleaved) safety objectives to quantified reachability objectives:
Lemma 4
(Reduction Safe \(\rightarrow \) Reach) Let \(\varvec{\Omega }= (\Omega _1,\ldots ,\Omega _n)\) be an absorbing reachsafe lexobjective with \(\Omega _n = \textsf{Safe}\left( S_n\right) \). Define the QRO \(q_n :F \rightarrow [0,1]\) as \(q_n(t) = \textbf{v}^{\textsf{lex}}_n(t)\) for all \(t \in F\) where F is the Final Set (Definition 6), and define \(\varvec{\Omega }'=(\Omega _1,\ldots ,\Omega _{n-1},\textsf{Reach}\left( q_n\right) )\). Then it holds that \(^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}= ~ ^{\varvec{\Omega }'}\textbf{v}^{\textsf{lex}}\).
Proof
(Proof sketch) By definition, \(^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}(s) =~^{\varvec{\Omega }'}\textbf{v}^{\textsf{lex}}(s)\) for all \(s \in F\), so we only need to consider the states in \(S \setminus F\). Since any lexoptimal strategy for \(\varvec{\Omega }\) or \(\varvec{\Omega }'\) must also be lexoptimal for \(\varvec{\Omega }_{<n}\), we know by Theorem 1 that such a strategy reaches \(F_{<n}\) almost surely. Note that we have \(F_{<n} = F\), as the nth objective, either the QRO or the safety objective, does not add any new states to F. The reachability objective \(\textsf{Reach}\left( q_n\right) \) weighs the states in F with their lexicographic safety values \(\textbf{v}^{\textsf{lex}}_n\). Thus we additionally ensure that in order to reach F, we use those actions that give us the best safety probability afterwards. In this way we obtain the correct lexvalues \(\textbf{v}^{\textsf{lex}}_n\) for states in \(S \setminus F\). \(\square \)
Proof
(Full proof) Let \(\sigma \in \Sigma _{\textsf{Max}}^{\textsf{MD}}\) be lexoptimal for \(\varvec{\Omega }\) (such a \(\sigma \) exists by Theorem 2). Clearly, \(\sigma \) is in particular lexoptimal for the first \(n-1\) objectives \(\varvec{\Omega }_{<n}\). Let us denote by \(\Sigma _{\textsf{Max}}^{<n}\) the set of all MD lexoptimal strategies for player \(\textsf{Max}\) with respect to \(\varvec{\Omega }_{<n}\). We have \(\sigma \in \Sigma _{\textsf{Max}}^{<n}\).
We know by Theorem 1 that \(\sigma \) reaches \(F_{<n} = F\) almost surely against all \(\tau \in \Sigma _{\textsf{Min}}^{\textsf{MD}}\). Now fix an optimal counter strategy \(\tau \in \Sigma _{\textsf{Min}}^{\textsf{MD}}\) of \(\textsf{Min}\) against \(\sigma \) w.r.t. the full objective \(\varvec{\Omega }\). For an arbitrary MD strategy \(\sigma ' \in \Sigma _{\textsf{Max}}^{\textsf{MD}}\) of \(\textsf{Max}\), let \(\Sigma _{\textsf{Min}}^{<n}(\sigma ')\) denote the set of all MD lexoptimal counter strategies against \(\sigma '\) w.r.t. \(\varvec{\Omega }_{< n}\). Clearly, \(\tau \in \Sigma _{\textsf{Min}}^{<n}(\sigma )\).
For all \(s \in S\), it holds that
This proves the claim. \(\square \)
Example 7
(Reduction Safe \(\rightarrow \) Reach) Recall Example 6. By Lemma 4, computing \(\sup _{\sigma } \inf _{\tau } \mathbb {P}^{{\sigma ,\tau }}_s(\textsf{Reach}\left( S_1\right) , \textsf{Reach}\left( q_2\right) )\) yields the correct lexvalue \(\textbf{v}^{\textsf{lex}}(s)\) for all \(s \in S\). Consider for instance state r: The action leading to q is clearly suboptimal for \(\textsf{Reach}\left( q_2\right) \) as it does not reach F. Both other actions surely reach F. However, since \(q_2(t) = q_2(u) = 0\) while \(q_2(v) = 1/2\), the action leading to u and v is preferred over that leading to t and u, as it ensures the higher safety probability after reaching F.
We now explain the basic structure of Algorithm 1. More technical details are explained in the proof sketch of Theorem 3. The idea of Algorithm 1 is, as sketched in Sect. 3.1.1, to consider the objectives sequentially in the order of importance, i.e., starting with \(\Omega _1\). The ith objective is solved (Lines 5-10) and the game is restricted to only the locally optimal actions (Line 11). This way, in the ith iteration of the main loop, only actions that are locally lexoptimal for objectives 1 through \((i-1)\) are considered. Finally, we construct the optimal strategy and update the result variables (Lines 12-13).
Theorem 3
Given an SG \(\mathcal {G}\) and an absorbing reachsafe lexobjective \(\varvec{\Omega }= (\Omega _1, \dots , \Omega _n)\), Algorithm 1 computes the vector of lexvalues \(\textbf{v}^{\textsf{lex}}\) and an MD lexoptimal strategy \(\sigma \) for player \(\textsf{Max}\). It needs \(n + m\) calls to a solver for single reachability, where m is the number of safety objectives in \(\varvec{\Omega }\).
Proof sketch
We explain the intuition of the algorithm and highlight the most interesting details.

\(\widetilde{\mathcal {G}}\)-invariant. For \(i\ge 1\), in the i-th iteration of the loop, \(\widetilde{\mathcal {G}}\) is the original SG restricted to only those actions that are locally lex-optimal for the targets 1 to \((i-1)\); this is the case because Line 11 was executed for all previous targets.

Single-objective case. The single objective that is solved in Line 5 can be either reachability or safety. We can use any (precise) single-objective solver as a black box, e.g. strategy iteration [1]. By Remark 1, it is no problem to call a single-objective solver with a QRO since there is a trivial reduction.

QRO for safety. If an objective is of type reachability, no further steps need to be taken; if, on the other hand, it is safety, we need to ensure that the problem explained in Example 4 does not occur. Thus we compute the Final Set \(F_{<i}\) for the i-th target and then construct and solve the QRO as in Lemma 4.

Resulting strategy. When storing the resulting strategy, we again need to avoid errors induced by the fact that locally lex-optimal actions need not be globally lex-optimal. This is why for a reachability objective, we only update the strategy in states that have a positive value for the current objective; if the value is 0, the current strategy does not have any preference, and we need to keep the old strategy. For safety objectives, we need to update the strategy in two ways: for all states in the Final Set \(F_{<i}\), we set it to the safety strategy \(\widetilde{\sigma }\) (from Line 5) as within \(F_{<i}\) we do not have to consider the previous reachability objectives and therefore must follow an optimal safety strategy. For all states in \(S \setminus F_{<i}\), we set it to the reachability strategy from the QRO \(\sigma _Q\) (from Line 9). This is correct, as \(\sigma _Q\) ensures almost-sure reachability of \(F_{<i}\) which is necessary to satisfy all preceding reachability objectives; moreover \(\sigma _Q\) prefers those states in \(F_{<i}\) that have a higher safety value (cf. Lemma 4).
\(\square \)
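The sequential structure of Algorithm 1 can be illustrated on a toy instance. The following Python sketch is not the paper's implementation and makes several simplifying assumptions: it handles only MDPs (no \(\textsf{Min}\) player) with absorbing reachability objectives, it uses plain value iteration over exact fractions as a stand-in for a precise single-objective solver, and the example MDP and all function names are hypothetical. Value iteration is exact here only because the example is acyclic.

```python
from fractions import Fraction as Fr

def expect(dist, v):
    """One-step expectation of v under a successor distribution."""
    return sum(p * v[t] for t, p in dist.items())

def solve_reach(mdp, target, rounds):
    """Maximal reachability values via value iteration from below.
    Stand-in for a precise solver; exact after `rounds` steps on acyclic MDPs."""
    v = {s: Fr(int(s in target)) for s in mdp}
    for _ in range(rounds):
        v = {s: Fr(1) if s in target
             else max((expect(d, v) for d in mdp[s].values()), default=Fr(0))
             for s in mdp}
    return v

def restrict(mdp, v):
    """Keep only the actions that are locally optimal for the values v."""
    return {s: {a: d for a, d in acts.items() if expect(d, v) == v[s]}
            for s, acts in mdp.items()}

def lex_reach_values(mdp, targets):
    """Solve the objectives in order of importance, pruning actions in between."""
    values = []
    for target in targets:
        v = solve_reach(mdp, target, rounds=len(mdp))
        values.append(v)
        mdp = restrict(mdp, v)
    return {s: tuple(v[s] for v in values) for s in mdp}

half = Fr(1, 2)
# Hypothetical MDP: w, x, y, z are sinks (no actions).
# x lies in S_1 and S_2, y only in S_1, w only in S_2, z in neither.
mdp = {
    "s": {"a": {"x": half, "z": half},
          "b": {"y": half, "z": half},
          "c": {"w": Fr(1)}},
    "w": {}, "x": {}, "y": {}, "z": {},
}
lex = lex_reach_values(mdp, [{"x", "y"}, {"x", "w"}])
```

In this example, action c would be best for the second objective alone, but it is pruned after the first objective has been solved, so state s ends up preferring action a.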
Full proof
The proof is by induction on n. For \(n=1\), the algorithm is correct by the assumption that \(\texttt{SolveSingleObj}\) is correct in the single-objective case. Next, we show the inductive step \(n>1\) for reachability and then for safety objectives.

Case 1 \(\Omega _n = \textsf{Reach}\left( S_n\right) \). By the I.H., \(\widetilde{\mathcal {G}}\) is the correct restriction of \(\mathcal {G}\) to lex-optimal actions for both players with respect to the first \(n-1\) objectives \(\varvec{\Omega }_{<n} = (\Omega _1,\ldots ,\Omega _{n-1})\) and \(\sigma \) is a lex-optimal MD strategy in \(\mathcal {G}\) with respect to \(\varvec{\Omega }_{<n}\). The algorithm computes an MD optimal strategy \(\widetilde{\sigma }\) in \(\widetilde{\mathcal {G}}\) with respect to the single objective \(\textsf{Reach}\left( S_n\right) \), and the single-objective values \(v(s)\) of this objective in \(\widetilde{\mathcal {G}}\) by calling \(\texttt{SolveSingleObj}\) (Line 5). The strategy \(\sigma \in \Sigma _{\textsf{Max}}^{\textsf{MD}}\) is then updated as follows:
$$\begin{aligned} \sigma (s) \quad = \quad {\left\{ \begin{array}{ll} \widetilde{\sigma }(s) &{}\text {if } \widetilde{v}(s) > 0\\ \sigma _{old}(s) &{}\text {if } \widetilde{v}(s) = 0. \end{array}\right. } \end{aligned}$$
We claim that \(\sigma \) is lex-optimal for the whole lex-objective \(\varvec{\Omega }= (\Omega _1,\ldots ,\Omega _{n})\).

We first show that \(\sigma \) remains lex-optimal for the first \(n-1\) objectives \(\varvec{\Omega }_{<n}\) by applying the “if”-direction of Lemma 3: First observe that by definition, \(\sigma \) is locally lex-optimal with respect to \(\varvec{\Omega }_{<n}\). Therefore it only remains to show condition (3) in Lemma 3. Let \(i<n\) such that \(\Omega _i = \textsf{Reach}\left( S_i\right) \), let \(s \in S\) and let \(\tau \in \Sigma _{\textsf{Min}}^{\textsf{MD}}\) be a counter-strategy against \(\sigma \). If \(v(s) = 0\), then \(\mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( S_i \cup \textsf{Zero}_i\right) ) = 1\) because from initial state s, \(\sigma \) behaves like \(\sigma _{old}\) which is lex-optimal for \(\varvec{\Omega }_{<n}\). Thus let \(v(s) > 0\). From the “only if”-direction of Lemma 3 applied to \(\widetilde{\sigma }\), we know that \(\mathbb {P}^{\sigma ,\tau }_s(\textsf{Reach}\left( S_n \cup \{s \in S \mid v(s) = 0\}\right) ) = 1\). Thus for a play \(\pi \) that starts in s and is consistent with \({\sigma ,\tau }\), almost surely one of the following two cases occurs:

\(*\) If \(\pi \) reaches a state \(t \in S_n\), then since t is a sink, we have \(t \in S_i \cup \textsf{Zero}_i\).

\(*\) If t with \(v(t) = 0\) is reached in \(\pi \), then since we play according to \(\sigma _{old}\) from t, we either reach \(S_i\) or \(\textsf{Zero}_i\) by the “only if”-direction of Lemma 1 applied to \(\sigma _{old}\).
Thus \(\mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( S_i \cup \textsf{Zero}_i\right) ) = 1\) and \(\sigma \) remains lex-optimal for \(\varvec{\Omega }_{<n}\).


To complete the proof that \(\sigma \) is lex-optimal for \(\varvec{\Omega }\), notice that \(v(s) = \textbf{v}^{\textsf{lex}}_n(s)\) for all \(s \in S\) by Lemma 1.


Case 2 \(\Omega _n = \textsf{Safe}\left( S_n\right) \). Since by the I.H., the values \(\textbf{v}^{\textsf{lex}}_{1},\ldots ,\textbf{v}^{\textsf{lex}}_{n-1}\) are the correct lex-values with respect to \(\varvec{\Omega }_{<n}\), the algorithm computes the final set \(F_{<n} = F_{<n+1} = F\).
Next observe that \(v(s) = \textbf{v}^{\textsf{lex}}_n(s)\) for all \(s \in F\) because of the following:

First, \(\widetilde{\sigma }\) is locally lex-optimal w.r.t. \(\varvec{\Omega }_{<n}\) because it is defined in the subgame \(\widetilde{\mathcal {G}}\). Therefore by Lemma 1, \(\sigma \) is already (globally) lex-optimal for \(\varvec{\Omega }_{<n}\) from all \(s \in F\) because condition 2 is satisfied trivially. Thus \(v(s) \le \textbf{v}^{\textsf{lex}}_n(s)\).

Second, by the same argument, an MD lex-optimal strategy for \(\varvec{\Omega }\) is necessarily locally lex-optimal w.r.t. \(\varvec{\Omega }_{<n}\). The strategy \(\widetilde{\sigma }\) is locally lex-optimal w.r.t. \(\varvec{\Omega }_{<n}\) and moreover optimal for \(\Omega _n\) in the subgame \(\widetilde{\mathcal {G}}\). Thus \({v}(s) \ge \textbf{v}^{\textsf{lex}}_n(s)\).
Notice that it is very well possible that \({v}(s) > \textbf{v}^{\textsf{lex}}_n(s)\) for \(s \notin F\) because the strategy \(\widetilde{\sigma }\) does not necessarily reach F from \(s \notin F\). Applying Lemma 4 concludes the proof: The quantified reachability objective \(q_n\) constructed by the algorithm indeed satisfies \(q_n(s) = {v}(s) = \textbf{v}^{\textsf{lex}}_n(s)\) for all \(s \in F\) as we have just shown. The result strategy \(\sigma \) defined by the algorithm is
$$\begin{aligned} \sigma (s) \quad = \quad {\left\{ \begin{array}{ll} \widetilde{\sigma }(s) &{}\text {if } s \in F\\ \sigma _{Q}(s) &{}\text {if } s\notin F \end{array}\right. } \end{aligned}$$
where \(\sigma _{Q}\) is a lex-optimal strategy for \(\varvec{\Omega }'=(\Omega _1,\ldots ,\Omega _{n-1},\textsf{Reach}\left( q_n\right) )\). Thus with Lemma 4 and the above discussion, \(\sigma \) is lex-optimal from all states.

\(\square \)
3.2 General lexicographic reach-safe SG
We now consider SG with general reach-safe lex-objectives that are not necessarily absorbing. In Section 3.2.1, we describe a reduction to the absorbing case. The resulting algorithm is given in Section 3.2.2. Finally, in Section 3.2.3, we discuss complexity and determinacy results that follow easily from our algorithms.
3.2.1 Reduction to absorbing reach-safe lex-objectives
In general lexicographic SG, strategies need memory. This is because they have to remember which reachability and safety objectives have already been satisfied and violated, respectively, and behave accordingly. We formalize the solution of such games by means of stages. Intuitively, one can think of a stage as a copy of the game with fewer objectives, or as the subgame that is played after visiting one of the targets or unsafe sets for the first time. This approach is standard for dealing with multiple reachability or safety objectives [12, 50, 51].
Definition 8
(Stage) Given a general reach-safe lex-objective \(\varvec{\Omega }= (\varvec{\Omega }_1, \dots , \varvec{\Omega }_n)\) and a set \(I \subseteq \{i \le n\}\), a stage \(\varvec{\Omega }(I)\) is the objective vector where the objectives \(\varvec{\Omega }_i\) are removed for all \(i \in I\). For every state \(s \in S\), let \(\varvec{\Omega }(s)\) denote the stage \(\varvec{\Omega }(\{i \mid s \in S_i\})\). If a stage contains only one objective, we call it simple.
Example 8
(Stages) Consider the SG in Fig. 2a on Page 24. There are two (reachability) objectives and thus \(2^2 = 4\) possible stages: the one where we consider the original objective (the region denoted with \(\varvec{\Omega }\) in Fig. 2b), the simple ones where we consider only one of the objectives (regions \(\varvec{\Omega }(\{1\})\) and \(\varvec{\Omega }(\{2\})\)), and the one where both objectives have been visited. This final stage is trivial since there are no more objectives left; hence we do not depict it and do not have to consider it. Note that the actions of q and r are omitted in the \(\varvec{\Omega }\)-stage because a new stage begins once we visit these states for the first time.
Consider the two simple stages first: In stage \(\varvec{\Omega }(\{1\})\), state q has value 0 as it belongs to player \(\textsf{Min}\) who may use the self-loop to avoid reaching \(r \in S_2\). In stage \(\varvec{\Omega }(\{2\})\), both p and r have value 1 because \(\textsf{Max}\) can easily reach the target state \(q \in S_1\) from both of them. By combining this knowledge, we can assemble an optimal strategy for every possible initial state in the principal stage \(\varvec{\Omega }\). In particular, note that an optimal \(\textsf{Max}\)-strategy for state p needs memory: First go to r and thereby reach stage \(\varvec{\Omega }(\{2\})\). Afterwards, go back from r to p, and then use the other action in p to reach q and the final stage \(\varvec{\Omega }(\{1,2\})\).
Note that this example reveals another interesting fact about lexicographic games: Optimal strategies may try to satisfy less important objectives first.
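The bookkeeping behind stages is simple to state in code. The following Python sketch is only an illustration, with hypothetical function names: it computes the index set removed when a state is visited (the set \(\{i \mid s \in S_i\}\) from Definition 8) and enumerates all index sets that leave at least one objective. The target sets are those of Example 8.

```python
from itertools import combinations

def stage_of(state, target_sets):
    """Indices (1-based) of the objectives removed when `state` is visited."""
    return frozenset(i for i, S_i in enumerate(target_sets, start=1)
                     if state in S_i)

def nontrivial_stages(n):
    """All index sets I that leave at least one objective, i.e. I != {1..n}."""
    idx = range(1, n + 1)
    return [frozenset(c) for k in range(n) for c in combinations(idx, k)]

# Target sets as in Example 8: S_1 = {q}, S_2 = {r}.
targets = [{"q"}, {"r"}]
stages = nontrivial_stages(len(targets))
```

For two objectives this yields the three stages discussed in the example: the principal stage (empty index set) and the two simple stages.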
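In Example 8, we combined our knowledge about the sub-objectives in the corresponding stages to find the lex-values for the overall objective. In general, the lex-values of the stages are vectors in [0, 1]. We will reuse the idea of quantified reachability and safety objectives from Definition 7 (Page 18) in order to find an optimal strategy for some stage \(\varvec{\Omega }\) given the values of its sub-stages: For all \(1 \le i \le n\), let \(q_i :\bigcup _{j\le n} S_j \rightarrow [0,1]\) be defined as follows:
$$\begin{aligned} q_i(s) \quad = \quad {\left\{ \begin{array}{ll} 1 &{}\text {if } s \in S_i \text { and } \Omega _i = \textsf{Reach}\left( S_i\right) \\ 0 &{}\text {if } s \in S_i \text { and } \Omega _i = \textsf{Safe}\left( S_i\right) \\ ^{\varvec{\Omega }(s)}\textbf{v}^{\textsf{lex}}_i(s) &{}\text {if } s \notin S_i. \end{array}\right. } \qquad (7) \end{aligned}$$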
Recall that \(\varvec{\Omega }(s)\) denotes the stage \(\varvec{\Omega }(\{i \mid s \in S_i\})\). Further, we define
$$\begin{aligned} \textsf{q}\varvec{\Omega }\quad = \quad \left( \textsf{type}_1\left( q_1\right) , \dots , \textsf{type}_n\left( q_n\right) \right) \qquad (8) \end{aligned}$$
where for all \(1 \le i \le n\), \(\textsf{type}_i = \textsf{Reach}\) if \(\Omega _i=\textsf{Reach}\left( S_i\right) \), and \(\textsf{type}_i = \textsf{Safe}\) if \(\Omega _i = \textsf{Safe}\left( S_i\right) \). Thus we have reduced a general reach-safe lex-objective \(\varvec{\Omega }\) to a vector of quantitative reachability or safety objectives \(\textsf{q}\varvec{\Omega }\). This reduction preserves the values:
Lemma 5
For every reach-safe lex-objective \(\varvec{\Omega }\), it holds that \(^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}=\ ^{\textsf{q}\varvec{\Omega }}\textbf{v}^{\textsf{lex}}\), where \(\textsf{q}\varvec{\Omega }\) is the quantified reach-safe lex-objective defined in (7) and (8).
Proof
In this proof, we write \(\mathfrak {S}= \bigcup _{i\le n} S_i\) for the sake of readability. For the case \(s \in \mathfrak {S}\), a straightforward induction shows that \(q_i(s) =\) \( ^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}_i(s) =\) \(^{\textsf{q}\varvec{\Omega }}\textbf{v}^{\textsf{lex}}_i(s)\) for all \(1 \le i \le n\).
Now let \(s \notin \mathfrak {S}\). Let \({\sigma ,\tau }\) be arbitrary strategies of \(\textsf{Max}\) and \(\textsf{Min}\), respectively. Further, suppose that \(\Omega _i = \textsf{Reach}\left( S_i\right) \). In the induced Markov chain \(\mathcal {G}^{\sigma ,\tau }\), the following holds (all infima and suprema are taken over lexoptimal (counter)strategies with respect to \(\varvec{\Omega }_{<i}\)):
where we used the notation \( Paths _{ fin }(\mathfrak {S}) = \{\pi t \in (S \setminus \mathfrak {S})^* \times S \mid t \in \mathfrak {S}\}\) for the set of all finite paths to a state in \(\mathfrak {S}\) in \(\mathcal {G}^{\sigma ,\tau }\) and \(\mathbb {P}_s^{\sigma ,\tau }(\pi t)\) denotes the probability of such a path when the Markov chain \(\mathcal {G}^{\sigma ,\tau }\) starts in s.
If \(\Omega _i\) is a safety objective instead, then the claim follows from the above using the relationship \(\mathbb {P}_s^{\sigma ,\tau }(\textsf{Safe}\left( S_i\right) ) = 1 - \mathbb {P}_s^{\sigma ,\tau }(\textsf{Reach}\left( S_i\right) )\). \(\square \)
The reward functions \(q_i\) involved in \(\textsf{q}\varvec{\Omega }\) all have the same domain \(\bigcup _{j\le n} S_j\). Hence we can, as mentioned below Definition 7, consider \(\textsf{q}\varvec{\Omega }\) on the game where all states in \(\bigcup _{j\le n} S_j\) are sinks without changing the lex-value. This is precisely the definition of an absorbing game, and hence we can compute \(^{\textsf{q}\varvec{\Omega }}\textbf{v}^{\textsf{lex}}\) using Algorithm 1 from Sect. 3.1.2.
3.2.2 Algorithm for general SG
Algorithm 2 computes the lex-value \(^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}\) for a given lexicographic objective \(\varvec{\Omega }\) and an arbitrary SG \(\mathcal {G}\). We highlight the following technical details:

Reduction to absorbing case. We have just seen that once we have the quantitative objective vector \(\textsf{q}\varvec{\Omega }\), we can use the algorithm for absorbing SG (Line 13).

Computing the quantitative objective vector. To compute \(\textsf{q}\varvec{\Omega }\), the algorithm calls itself recursively on all states in the union of all target sets (Lines 5-7). We annotated this recursive call “With dynamic programming”, as we can reuse the results of the computations. In the worst case, we have to solve all \(2^n - 1\) possible non-empty stages. Finally, given the values \(^{\varvec{\Omega }(s)}\textbf{v}^{\textsf{lex}}\) for all \(s \in \bigcup _{j\le n} S_j\), we can construct the quantitative objective (Lines 10 and 12) that is used for the call to \(\texttt{SolveAbsorbingRS}\).

Termination. Since there are finitely many objectives in \(\varvec{\Omega }\) and in every recursive call at least one objective is removed from consideration, eventually we have a simple objective that can be solved by \(\texttt{SolveSingleObj}\) (Line 3).

Resulting strategy. The resulting strategy is composed in Line 14: It adheres to the strategy for the quantitative query \(^{\textsf{q}\varvec{\Omega }}\sigma \) until some \(s \in \bigcup _{j \le n} S_j\) is reached. Then, to achieve the values promised by \(q_i(s)\) for all i with \(s \notin S_i\), it adheres to \(^{\varvec{\Omega }(s)} \sigma \), the optimal strategy for stage \(\varvec{\Omega }(s)\) obtained by the recursive call.
Theorem 4
Given an SG \(\mathcal {G}\) and a reach-safe lex-objective \(\varvec{\Omega }= (\Omega _1,\dots ,\Omega _n)\), Algorithm 2 correctly computes the vector of lex-values \(\textbf{v}^{\textsf{lex}}\) and a deterministic lex-optimal strategy \(\sigma \) of player \(\textsf{Max}\) which uses memory of class-size at most \(2^n - 1\). The algorithm needs at most \(2^n - 1\) calls to \(\texttt{SolveAbsorbingRS}\) or \(\texttt{SolveSingleObj}\).
Proof
Correctness and termination of the algorithm follow from the discussion of the algorithm, Lemma 5 and Theorem 3. \(\square \)
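How the composed strategy uses its memory can be sketched as follows. This Python fragment is an illustration under assumptions, not the paper's implementation: the class name, the action names `to_r`, `to_p`, `to_q`, and the per-stage strategy tables are hypothetical and merely mimic the behaviour described in Example 8. The memory is the set of objective indices whose sets have already been visited, and in each stage the strategy plays a memoryless (MD) strategy.

```python
class StagedStrategy:
    """Deterministic strategy whose memory is the set of objectives already visited."""

    def __init__(self, target_sets, stage_strategies):
        self.target_sets = target_sets            # sets S_1..S_n
        self.stage_strategies = stage_strategies  # frozenset of indices -> {state: action}
        self.visited = frozenset()                # memory content

    def next_action(self, state):
        # Update the memory first, then play the MD strategy of the current stage.
        hit = {i for i, S_i in enumerate(self.target_sets, start=1) if state in S_i}
        self.visited = self.visited | hit
        return self.stage_strategies[self.visited].get(state)

# Hypothetical tables following Example 8 (S_1 = {q}, S_2 = {r}).
strategy = StagedStrategy(
    target_sets=[{"q"}, {"r"}],
    stage_strategies={
        frozenset(): {"p": "to_r"},                    # principal stage
        frozenset({2}): {"r": "to_p", "p": "to_q"},    # after visiting r
        frozenset({1}): {},                            # after visiting q only
        frozenset({1, 2}): {},                         # no objectives left
    },
)
trace = [strategy.next_action(s) for s in ["p", "r", "p", "q"]]
```

State p is visited twice with different memory contents and receives a different action each time, which is exactly why a memoryless strategy does not suffice in the example.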
3.2.3 Determinacy and complexity
Theorem 5 below states that lexicographic games are determined for arbitrary lex-objectives \(\varvec{\Omega }\). Intuitively, this means that the lex-value is independent of the player who fixes their strategy first. This property does not hold for non-lexicographic multi-reachability/safety objectives [12].
Theorem 5
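(Determinacy) For all SG \(\mathcal {G}\) and reach-safe lex-objectives \(\varvec{\Omega }\), it holds for all \(s \in S\) that:
$$\begin{aligned} \textbf{v}^{\textsf{lex}}(s) \quad = \quad \sup _{\sigma } \inf _{\tau } \mathbb {P}^{{\sigma ,\tau }}_s(\varvec{\Omega }) \quad = \quad \inf _{\tau } \sup _{\sigma } \mathbb {P}^{{\sigma ,\tau }}_s(\varvec{\Omega }). \end{aligned}$$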
Proof
This statement follows because single-objective games are determined [1] and Algorithm 2 obtains all values by either solving single-objective instances directly (Line 3) or calling Algorithm 1, which also reduces everything to the single-objective case (Line 5 of Algorithm 1). Thus the sup-inf values \(\textbf{v}^{\textsf{lex}}\) returned by the algorithm are in fact equal to the inf-sup values. \(\square \)
By analyzing Algorithm 2, we also get the following complexity results:
Theorem 6
(Complexity) For any SG \(\mathcal {G}\) and reach-safe lex-objective \(\varvec{\Omega }\) of size n:

1.
Strategy complexity: Deterministic strategies with \(2^n - 1\) memory classes (i.e., bit-size n) are sufficient and necessary for lex-optimal strategies.

2.
Computational complexity: The decision problem \(\textbf{v}^{\textsf{lex}}(s_0) \ge _{\textsf{lex}}^{?} \textbf{x}\) is \(\textsf{PSPACE}\)-hard and can be solved in \(\textsf{NEXPTIME}\cap \textsf{coNEXPTIME}\). If n is a constant or \(\varvec{\Omega }\) is absorbing, then it is contained in \(\textsf{NP}\cap \textsf{coNP}\).
Proof

1.
For each stage, Algorithm 2 computes an MD strategy for the quantitative objective. These strategies are then concatenated whenever a new stage is entered. Equivalently, every stage has an MD strategy for every state, so as there are at most \(2^n - 1\) stages (since there are n objectives), the strategy needs at most \(2^n - 1\) states of memory; these can be represented with n bits. Intuitively, we save for every target set whether it has been visited. The memory lower bound already holds in non-stochastic reachability games where all n targets have to be visited with certainty [50].

2.
The work of [28] shows that in MDP, it is \(\textsf{PSPACE}\)-hard to decide if n targets can be visited almost surely. This problem trivially reduces to ours. For the \(\textsf{NP}\) upper bound, observe that there are at most \(2^n - 1\) stages, i.e., a constant amount if n is assumed to be constant (or even just one stage if \(\varvec{\Omega }\) is absorbing). Thus we can guess an MD strategy for player \(\textsf{Max}\) in every stage. The guessed overall strategy can then be checked by analyzing the induced MDP in polynomial time [21]. The same procedure works for player \(\textsf{Min}\) and since the game is determined, we have membership in \(\textsf{coNP}\). In the same way, we obtain the \(\textsf{NEXPTIME}\cap \textsf{coNEXPTIME}\) upper bound in the general case where n is arbitrary.
\(\square \)
We leave the question of whether \(\textsf{PSPACE}\) is also an upper bound open. The main obstacle towards proving \(\textsf{PSPACE}\)-membership is that it is unclear if the lex-value – being dependent on the value of exponentially many stages in the worst case – may actually have exponential bit-complexity.
4 MDP with lexicographic \(\omega \)-regular objectives
In this section, we study one-player SG, also known as MDP, with \(\omega \)-regular lexicographic objectives. Similarly to the previous section, we first consider a simpler problem in Sect. 4.1: We show how to solve lexicographic \(\omega \)-regular objectives in end components – fragments of the state space where the player can remain forever if it chooses to do so. Then, using this, we solve arbitrary MDP in Sect. 4.2. We provide our motivation for restricting attention to MDP instead of the more general SG in the final Sect. 4.3.
Intuitively, the reason for first considering end components is that \(\omega \)-regular objectives depend on the infinite suffix of a path, and such an infinite suffix can only occur with positive probability inside an end component. Thus, a standard approach for dealing with \(\omega \)-regular objectives is to analyze the end components first. Afterwards, we consider the problem of reaching the end components where the objectives can be satisfied. The end components that are ranked higher in the lexicographic order are preferred. In this way we reduce the problem to lexicographic reachability.
There exist several acceptance conditions that capture exactly the class of \(\omega \)-regular objectives; here, we will use Streett acceptance for technical reasons. In practice, \(\omega \)-regular conditions are often specified as formulas in Linear Temporal Logic (LTL). Such formulas can be transformed to deterministic Streett automata (DSA) with a doubly exponential space blowup, see e.g. [52, 53]. Model checking an MDP against an LTL formula reduces to checking Streett acceptance of the product of the MDP with the corresponding DSA (see Definition 3). It is also possible to iterate this product construction for several LTL formulas or DSA. We assume in the following that we are already given a “product” MDP with n Streett conditions.
4.1 Lexicographic Streett in end components
Intuitively, an end component of an MDP is a subset of states where the player (i) can remain forever if it wants to, and (ii) can visit all states in the subset infinitely often with probability 1. We now give a formal definition. Let \(\mathcal {M}\) be an MDP with state space S, and let \(E\subseteq S\). For a state \(s \in E\) and an action \(a \in \textsf{Act}(s)\), we say that a exits \(E\) if \(P(s,a,s') > 0\) for some \(s' \notin E\).
Definition 9
(End component) Let \(\mathcal {M}= (S, \textsf{Act}, P)\) be an MDP. A set \(E\subseteq S\) is called an end component (EC) of \(\mathcal {M}\) if there exists \(A \subseteq \bigcup _{s \in E} \textsf{Act}(s)\) such that

1.
for all \(a \in A\), it holds that a does not exit \(E\), and

2.
for all \(s, s' \in E\), there exists a finite path \(s a_0 \ldots a_n s' \in (E\times A)^* \times E\).
An EC that is not contained in another EC is called maximal (MEC).
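Definition 9 translates directly into a check on a candidate pair of a state set and an action set. The Python sketch below is only an illustration; the encoding of the transition function as a dictionary from (state, action) pairs to successor distributions, the function name, and the example MDP are assumptions. It tests condition 1 (no action in A exits E) and condition 2 (mutual reachability inside E using only actions from A).

```python
from collections import deque

def is_end_component(P, E, A):
    """P: dict (state, action) -> {successor: prob}; E: set of states;
    A: set of (state, action) pairs with state in E."""
    # Condition 1: no action in A exits E.
    for (s, a) in A:
        if any(p > 0 and t not in E for t, p in P[(s, a)].items()):
            return False
    # Condition 2: every state of E reaches every state of E using only A.
    succ = {s: set() for s in E}
    for (s, a) in A:
        succ[s].update(t for t, p in P[(s, a)].items() if p > 0)
    for s in E:
        seen, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            for t in succ[u] - seen:
                seen.add(t)
                queue.append(t)
        if seen != set(E):
            return False
    return True

# Hypothetical MDP: u and v form a cycle, v may leave to the absorbing state w.
P = {
    ("u", "a"): {"v": 1.0},
    ("v", "b"): {"u": 0.5, "v": 0.5},
    ("v", "c"): {"w": 1.0},
    ("w", "d"): {"w": 1.0},
}
```

For instance, the states u and v with actions a and b pass the check, whereas adding the exiting action c violates condition 1.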
The set \(\mathcal {E}\) of all MEC of an MDP can be computed in polynomial time, see e.g. [54]. Further, recall that a Streett objective \(\Omega \) for an MDP with state space S is defined in terms of a set of pairs \(\Omega = \textsf{Streett}((\textsf{F}^{}_j,\textsf{I}^{}_j)_{1\le j\le m_{}})\) such that \(\textsf{F}_j\), \(\textsf{I}_j \subseteq S\) for all \(1 \le j \le m\). An infinite trajectory \(s_0 s_1 \ldots \in S ^\omega \) satisfies \(\Omega \) iff, for every \(1 \le j \le m\), it visits \(\textsf{F}_j\) only finitely often or visits \(\textsf{I}_j\) infinitely often.
4.1.1 Solving single Streett in end components
We first recall a polynomial-time algorithm from [55, Sec. 4] for deciding whether a given EC of some MDP is good or bad for a Streett objective. Here, good means that the player can satisfy the Streett objective with probability 1 without leaving the EC; a bad EC is one that is not good. Note that remaining in a bad EC forever would violate the Streett objective. Hence the Streett value of all states in an EC is either 0 or 1.
Let \(E\) be an EC of some MDP \(\mathcal {M}\), and let \((\textsf{F}^{}_j,\textsf{I}^{}_j)_{1\le j\le m_{}}\) be a Streett condition on \(\mathcal {M}\). The algorithm proceeds in an alternating fashion. It first identifies the bad states of \(E\). A state \(s \in E\) is bad if \(s \in \textsf{F}_j\) and \(\textsf{I}_j \cap E= \emptyset \) for some \(1 \le j \le m\). Intuitively, s is bad because the player may visit s at most finitely many times; thus, in order to satisfy the Streett condition, s must be eventually avoided forever with probability 1. The algorithm removes all bad states from \(E\) which results in a sub-MDP \(\widetilde{E}\). In general, \(\widetilde{E}\) is not an end component due to the state removal. However, if \(\widetilde{E}\) is an EC, then we can label it as good. Indeed, since \(\widetilde{E}\) contains no bad states, it suffices to visit all states in \(\widetilde{E}\) infinitely often with probability one (see Footnote 2). The original EC \(E\) is then also good; the player can satisfy the Streett objective by simply navigating to \(\widetilde{E}\) which is possible with an MD strategy. If \(\widetilde{E}\) is not an EC, then the algorithm computes a MEC decomposition of \(\widetilde{E}\), calls itself recursively on all the resulting MEC, and labels \(E\) as good iff it finds that at least one of the MEC is good. In summary:
Lemma 6
(from [55]) There is a polynomial-time algorithm \(\texttt{SingleStreettSat}\) which takes as input an EC \(E\) of some MDP \(\mathcal {M}\) together with a Streett condition \((\textsf{F}^{}_j,\textsf{I}^{}_j)_{1\le j\le m_{}}\) on \(\mathcal {M}\), and outputs \(\texttt{true}\) iff \(E\) is good. In case \(E\) is good, the algorithm additionally outputs an optimal strategy for the states inside the EC.
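The first step of the recalled procedure, identifying the bad states of an EC, can be written down in a few lines. The Python sketch below covers only this step and not the recursive MEC decomposition; the function name and the encoding of a Streett condition as a list of pairs of Python sets are assumptions made for illustration.

```python
def bad_states(E, pairs):
    """States of E that lie in some F_j whose partner I_j does not intersect E."""
    return {s for s in E for (F, I) in pairs if s in F and not (I & E)}

# Hypothetical EC and Streett pairs (F_j, I_j).
E = {"a", "b", "c"}
pairs = [({"a"}, {"d"}),   # I_1 lies outside E, so a is bad
         ({"b"}, {"c"})]   # I_2 intersects E, so b is not bad
bad = bad_states(E, pairs)
```

Removing the returned states from the EC yields the candidate sub-MDP on which the procedure continues.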
4.1.2 Algorithm for lexicographic Streett in end components
We now present an algorithm that computes the lex-value of a Streett lex-objective for an EC of some MDP. Note that this lex-value is a vector in \(\{0,1\}^n\) which is moreover independent of the initial state. The key idea of our algorithm is the observation that Streett objectives are closed under conjunction. Indeed, for Streett conditions \(\Omega _1 = \textsf{Streett}((\textsf{F}^{1}_j,\textsf{I}^{1}_j)_{1\le j\le m_{1}})\) and \( \Omega _2 = \textsf{Streett}((\textsf{F}^{2}_j,\textsf{I}^{2}_j)_{1\le j\le m_{2}})\) it holds that a given path satisfies both \(\Omega _1\) and \(\Omega _2\) iff it satisfies the single Streett condition obtained by taking the union of the two sets of pairs:
$$\begin{aligned} \Omega _1 \wedge \Omega _2 \quad = \quad \textsf{Streett}\left( (\textsf{F}^{1}_j,\textsf{I}^{1}_j)_{1\le j\le m_{1}},\ (\textsf{F}^{2}_j,\textsf{I}^{2}_j)_{1\le j\le m_{2}}\right) . \end{aligned}$$
With this observation and \(\texttt{SingleStreettSat}\) as described above, our algorithm is easy to implement and runs in polynomial time.
Lemma 7
Algorithm 3 computes the correct lex-values of an EC in polynomial time.
Proof
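The idea of the algorithm relies on the following observation: Let \(1 \le i \le n\), and let
$$\begin{aligned} \Omega _{<i} \quad = \quad \bigwedge _{j< i,\ ^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}_j = 1} \Omega _j . \end{aligned}$$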
Then it holds that \(^{\varvec{\Omega }} \textbf{v}^{\textsf{lex}}_i = 1\) iff the single Streett objective \(\Omega _{<i} \wedge \Omega _i\) can be achieved with probability 1 in \(E\), and otherwise \(^{\varvec{\Omega }} \textbf{v}^{\textsf{lex}}_i = 0\). This is because in an EC, a Streett objective is either achievable with probability 1 or 0. Our algorithm computes \(\Omega _{<i}\) iteratively; \(\Omega _{<i}\) is called \(\Omega _{ curr }\) in Algorithm 3.
Formally, we show the following invariant by induction: For all \(0 \le i \le n\) and \(j \le i\), it holds that after i iterations of the loop in Line 4, the objective \(\Omega _{ curr }\) contains \(\Omega _j\) iff \(^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}_j = 1\). This trivially holds for \(i = 0\), i.e., if the loop has not been executed yet. For \(i > 0\), observe that by the induction hypothesis we have after \(i-1\) iterations that \(^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}_i = 1\) iff \(\Omega _{ test }= \Omega _{ curr }\wedge \Omega _i\) can be achieved with probability 1 in \(E\) (Line 6).
If so, then after i iterations, \(\Omega _{ curr }\) will contain \(\Omega _i\), and otherwise \(\Omega _{ curr }\) will not contain \(\Omega _i\), which establishes the invariant again. Overall, correctness of the algorithm follows because after \(i = n\) iterations, we have for all \(1 \le j \le i\) that \(\varvec{v}_j = 1\) iff \(\Omega _j\) is contained in \(\Omega _{ curr }\) iff \(^{\varvec{\Omega }} \textbf{v}^{\textsf{lex}}_j = 1\). \(\square \)
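The greedy loop analysed in this proof can be sketched in Python as follows. The sketch is an illustration under assumptions and not the paper's Algorithm 3 verbatim: the satisfiability check is passed in as a callable, and `cycle_oracle` is a hypothetical toy replacement for \(\texttt{SingleStreettSat}\) that is valid only for an EC without proper sub-EC (for example a single cycle), where every state is necessarily visited infinitely often.

```python
def lex_streett_ec(E, objectives, single_streett_sat):
    """Greedy loop: keep objective i iff it is jointly satisfiable
    with the objectives kept so far."""
    current, values = [], []
    for pairs in objectives:
        test = current + pairs          # conjunction = union of the Streett pairs
        if single_streett_sat(E, test):
            current = test
            values.append(1)
        else:
            values.append(0)
    return tuple(values)

def cycle_oracle(E, pairs):
    """Toy oracle, valid only if E has no proper sub-EC: each pair then needs
    F_j disjoint from E or I_j intersecting E."""
    return all(not (F & E) or bool(I & E) for (F, I) in pairs)

# Hypothetical EC {a, b} and three Streett objectives, each a list of (F_j, I_j).
E = {"a", "b"}
objectives = [
    [({"a"}, {"c"})],    # a must be left eventually, impossible here
    [({"a"}, {"b"})],    # satisfiable
    [({"b"}, set())],    # b must be left eventually, impossible here
]
lex_value = lex_streett_ec(E, objectives, cycle_oracle)
```

Each objective is tested only against the objectives that were already found to be achievable, so a less important objective can never displace a more important one.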
4.2 Lexicographic Streett in general MDP
We now show how to compute the value of a lexicographic Streett objective in general MDP, i.e., not just in end components as in the previous subsection. We use the standard approach of first computing a MEC decomposition of the MDP, analysing the MEC in isolation, and then reducing the overall question to a reachability problem. In particular, in our case, we obtain a lexicographic reachability problem, and we can reuse the algorithm from Sect. 3. We state our algorithm formally as Algorithm 4.
Concretely, we construct a modified MDP \(\widetilde{\mathcal {M}}\) (Lines 2, 7 and 8), where for every MEC \(E\) we add a new sink state \(s_E\) that can be reached from every state in the MEC (Lines 7 and 8). This sink intuitively corresponds to remaining in the MEC forever, using the optimal strategy \(\sigma _E\) of the MEC as computed by Algorithm 3 (Line 6). Thus, choosing to remain in \(E\) achieves the lex-value \(\textbf{v}^{\textsf{lex}}_E\) of the MEC \(E\). We remodel our objective as a reachability lex-objective \(\Omega _{ reach }\) by having each target set \(T_i\) contain exactly those \(s_E\) where \(\textbf{v}^{\textsf{lex}}_{E,i} =1\) (Lines 3, 9 and 11). Now, by solving this absorbing reachability objective \(\Omega _{ reach }\) in the modified MDP \(\widetilde{\mathcal {M}}\) (Line 12), we compute the lex-value for the Streett objective in the original MDP \(\mathcal {M}\). Intuitively, we compute the optimal probability to reach a good MEC \(E\) and then, by going to the sink \(s_E\), use the optimal strategy in the MEC to indeed satisfy the best combination of Streett objectives. Thus, the resulting strategy is a combination of the strategy \(\sigma _{ reach }\) to reach the MEC \(E\) and the strategy \(\sigma _E\) to remain in the MEC \(E\) (Line 13).
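The construction of the modified MDP and of the target sets can be sketched as follows. This Python fragment is an illustration under assumptions: the dictionary encoding of the MDP, the names `sink_k` and `stay_k`, and the representation of a sink as a state without actions are hypothetical, and the MEC together with their 0/1 lex-values are taken as given rather than computed.

```python
def add_mec_sinks(mdp, mec_values):
    """mdp: state -> {action: {successor: prob}};
    mec_values: frozenset of MEC states -> tuple of 0/1 lex-values.
    Returns the modified MDP and the reachability target sets T_1..T_n."""
    modified = {s: dict(acts) for s, acts in mdp.items()}
    n = len(next(iter(mec_values.values())))
    targets = [set() for _ in range(n)]
    for k, (mec, value) in enumerate(mec_values.items()):
        sink = f"sink_{k}"
        modified[sink] = {}                       # absorbing: no outgoing actions
        for s in mec:
            modified[s][f"stay_{k}"] = {sink: 1.0}  # "remain in this MEC forever"
        for i, bit in enumerate(value):
            if bit == 1:
                targets[i].add(sink)
    return modified, targets

# Hypothetical MDP with MEC {u, v} (lex-value (1, 0)) and {w} (lex-value (0, 1)).
mdp = {
    "u": {"a": {"v": 1.0}},
    "v": {"b": {"u": 1.0}, "c": {"w": 1.0}},
    "w": {"d": {"w": 1.0}},
}
mec_values = {frozenset({"u", "v"}): (1, 0), frozenset({"w"}): (0, 1)}
modified, targets = add_mec_sinks(mdp, mec_values)
```

The resulting pair can then be handed to a solver for absorbing lexicographic reachability, since all target states are sinks.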
Theorem 7
Given an MDP \(\mathcal {M}\) and a Streett lex-objective \(\varvec{\Omega }\), Algorithm 4 computes the lex-values \(^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}:S \rightarrow [0,1]^n\) and a lex-optimal strategy \(\sigma \) in polynomial time.
Proof
The proof relies on the following standard observation: Independent of the chosen strategy, the player eventually reaches an EC of \(\mathcal {M}\) with probability 1 and stays there forever. That is, there may exist strategies such that the induced MC has infinite paths that never reach an EC; however, their total probability mass is zero. Moreover, whether or not a Streett objective is satisfied on a given infinite path that reaches (and stays in) some EC depends only on that EC, but not on whatever happened before reaching it. This discussion implies that the player should prefer reaching EC with higher lex-value. More formally, for a given initial MDP state \(s \in S\), the player should choose a strategy \(\sigma \) that (lexicographically) maximizes the following quantity:
The supremum (and maximum) of this quantity is indeed the lex-value \(\textbf{v}^{\textsf{lex}}(s)\). The strategy \(\sigma = \sigma _{ reach }\) computed in Line 12 maximizes (10) by construction of \(\Omega _{ reach }\) and due to the additional actions \(s \rightarrow s_E\) that we have added for all EC \(E\) and \(s \in E\) in Line 8; recall that these actions simulate staying forever in \(E\), thereby “earning” the lex-value of \(E\).
For the runtime, observe that \(\texttt{ComputeMEC}\) can be implemented in polynomial time [54], and there are at most \(|S|\) many MEC. The algorithm \(\texttt{LexStreettEC}\) can also be implemented in polynomial time as discussed in Sect. 4.1. Finally, note that \(\texttt{SolveAbsorbingRS}\) can be implemented in polynomial time for MDP due to Algorithm 1 in Sect. 3.1 and the fact that single-objective MDP reachability can be encoded as a linear program [2]. \(\square \)
Theorem 8
(Complexity) The following holds for all MDP \(\mathcal {M}\) with state space S and Streett lex-objective \(\varvec{\Omega }\).

1.
Strategy complexity: Memoryless (but randomized), or alternatively, deterministic strategies with at most \(|S|\) memory classes are sufficient to play optimally in \(\mathcal {M}\) w.r.t. \(\varvec{\Omega }\).

2.
Computational complexity: The decision problem \(^{\varvec{\Omega }}\textbf{v}^{\textsf{lex}}(s_0) \ge _{\textsf{lex}}^{?} \textbf{x}\) is in \(\textsf{P}\) for all \(s_0 \in S\).
Proof

1.
The strategy that Algorithm 4 returns is a combination of the strategy \(\sigma _{ reach }\) for the absorbing reachability lex-objective and the strategies \(\sigma _E\) inside the maximal end components \(E\in \mathcal {E}\). The former is MD by Algorithm 1, but the latter either needs randomization or memory, see Sect. 4.1. The construction from [56, Lem. 5.4] shows that at most \(|E|\) memory classes are sufficient for a deterministic \(\sigma _E\). Therefore, the overall strategy needs at most \(\sum _{E\in \mathcal {E}} |E| \le |S|\) memory since the MEC are pairwise disjoint.

2.
Algorithm 4 solves the decision problem in polynomial time.
\(\square \)
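To illustrate why a number of memory classes linear in the size of an end component can suffice for a deterministic strategy, the following minimal sketch uses a round-robin idea: the memory stores which state of the EC is the current target, and it advances once that target is reached. This is only one natural construction and is not claimed to be the one from [56, Lem. 5.4]. The toy component `EC` and the helper names `action_towards` and `simulate` are hypothetical.

```python
from collections import deque
import random

# Hypothetical end component: state -> action -> list of (probability, successor).
# Every successor stays inside the component, as required for an EC.
EC = {
    "a": {"go": [(0.5, "b"), (0.5, "a")]},
    "b": {"left": [(1.0, "a")], "right": [(1.0, "c")]},
    "c": {"back": [(1.0, "a")]},
}

def action_towards(ec, target):
    """Memoryless plan: in each state, pick an action that can decrease the distance to target."""
    dist, choice = {target: 0}, {}
    queue = deque([target])
    while queue:                                   # backward breadth-first search
        t = queue.popleft()
        for s, actions in ec.items():
            if s in dist:
                continue
            for a, succ in actions.items():
                if any(p > 0 and nxt == t for p, nxt in succ):
                    dist[s], choice[s] = dist[t] + 1, a
                    queue.append(s)
                    break
    return choice

def simulate(ec, start, steps, seed=0):
    """Run the deterministic round-robin strategy; only the environment is random."""
    rng = random.Random(seed)
    targets = sorted(ec)                           # one memory class per state of the EC
    plans = {t: action_towards(ec, t) for t in targets}
    memory, state = 0, start
    visits = {s: 0 for s in ec}
    for _ in range(steps):
        visits[state] += 1
        if state == targets[memory]:               # target reached: advance the memory
            memory = (memory + 1) % len(targets)
        plan = plans[targets[memory]]
        action = plan.get(state, next(iter(ec[state])))
        probs, succs = zip(*ec[state][action])
        state = rng.choices(succs, weights=probs)[0]
    return visits

visits = simulate(EC, "a", 2000)
```

In this sketch, every state of the toy component is visited repeatedly, while the strategy itself never randomizes and uses exactly as many memory values as there are states in the component.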
We stress that the complexity results from Theorem 8 crucially rely on the Streett objectives being directly defined on \(\mathcal {M}\). If we had instead defined \(\varvec{\Omega }\) through n deterministic Streett automata, then we would first have to construct the product of \(\mathcal {M}\) with all the automata, causing an exponential space blowup.
Now that we have established how to solve MDP with lexicographic \(\omega \)-regular objectives, we discuss the case of SG with such objectives.
4.3 On SG with lexicographic \(\omega \)-regular objectives
The techniques we introduced in Sect. 3 for solving reachability and safety are not applicable to \(\omega \)-regular objectives. This is because they rely on reachability and safety being achieved and violated in finite time, respectively. Every stage in the solution corresponds to a subset of objectives being achieved/violated, and this decomposition into stages is vital for the reduction to the single-dimensional case.
In contrast, for \(\omega \)-regular objectives, satisfaction or violation of an objective depends only on the infinite suffix of the path. This is why in MDP we analyze end components (EC), the parts of the state space where the infinite behaviour can occur. For SG, the concept of EC is not strong enough to fully understand the infinite behaviours. It still captures parts of the state space where a path can remain forever, but the standard definition assumes that the players cooperate to stay. However, the players are antagonistic, and \(\textsf{Min}\) will try to violate the objective. Thus, not all states in an EC will have the same value (as they do in MDP), and we need a finer decomposition of EC.
There are similar problems in single-dimensional reachability in SG [57]. There, the concepts of bloated and simple EC were used for a more fine-grained analysis. Similarly, in deterministic games with parity objectives, the concept of tangle [58] captures parts of an EC where one player can certainly win.
We conjecture that, inspired by these concepts, it is possible to decompose EC in SG such that an algorithm similar to our Algorithm 4 solves the problem. However, even mimicking the first step of the development of simple EC [59] is not obvious: Consider for instance SG with one-player EC, i.e., for every EC \(T\) it holds that either \(T \subseteq S_{\square }\) or \(T \subseteq S_{\lozenge }\). In this case, it still holds that all states in an EC have the same value, so the general algorithm is the same as in MDP. The difference is that we have to compute the lexicographic value in end components belonging to \(\textsf{Min}\). \(\textsf{Min}\) wants to violate a Streett objective, which is equivalent to achieving the Rabin objective with the same pairs [60]. But for minimizing the lexicographic value, we want to violate all the given Streett objectives, which means achieving a conjunction of Rabin objectives. This, however, calls for a more involved solution, because we mix conjunction (between the objectives) and disjunction (between the pairs of objectives). In contrast, our solution relies on the fact that a conjunction of Streett objectives is again a Streett objective. Thus, already in SG with one-player EC a solution needs more complex automata-theoretic tools, e.g. Emerson-Lei automata [61] for representing the mixed Boolean combination of requirements on infinite paths.
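The asymmetry between Streett and Rabin conditions described above can be checked on small examples. The sketch below evaluates both conditions on the set `inf` of states visited infinitely often. The pair convention (a Streett pair \((E, F)\) requires that visiting \(E\) infinitely often implies visiting \(F\) infinitely often) is an assumption of this sketch and may differ from the convention used in Sect. 4.1. The objectives `obj1` and `obj2` are hypothetical.

```python
from itertools import combinations

def streett(inf, pairs):
    """Streett: every pair (E, F) satisfies 'E infinitely often implies F infinitely often'."""
    return all(not (inf & E) or bool(inf & F) for E, F in pairs)

def rabin(inf, pairs):
    """Rabin with the same pairs: some pair has E infinitely often and F only finitely often."""
    return any(bool(inf & E) and not (inf & F) for E, F in pairs)

def nonempty_subsets(states):
    items = sorted(states)
    return [set(c) for r in range(1, len(items) + 1) for c in combinations(items, r)]

STATES = {0, 1, 2, 3}
obj1 = [({0}, {1})]                       # hypothetical Streett objective with one pair
obj2 = [({2}, {3}), ({1}, {0})]           # hypothetical Streett objective with two pairs

for inf in nonempty_subsets(STATES):
    # Violating a Streett objective is the same as achieving Rabin with the same pairs.
    assert rabin(inf, obj1) == (not streett(inf, obj1))
    # A conjunction of Streett objectives is the Streett objective over all pairs.
    assert (streett(inf, obj1) and streett(inf, obj2)) == streett(inf, obj1 + obj2)

# Merging the pairs does not work for Rabin: it yields a disjunction, not a conjunction.
merged = rabin({0}, obj1 + obj2)
conjunction = rabin({0}, obj1) and rabin({0}, obj2)
```

Here `merged` is true while `conjunction` is false, which is a small instance of the mixed conjunction and disjunction that a solution for \(\textsf{Min}\)-owned end components would have to handle.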
We decided to focus on MDP with \(\omega \)-regular objectives, where an elegant and easy-to-implement solution is possible by using the lexicographic reachability approach from Sect. 3.1 and the existing results about Streett objectives in MDP. We leave the development of new graph-theoretic concepts for SG with \(\omega \)-regular objectives as future work.
5 Experimental evaluation
In this section, we report the results of a series of experiments made with implementations of the algorithms in Sects. 3 and 4.
5.1 Reach-safe lex-objective in SG: experiments
5.1.1 Implementation
We have implemented Algorithms 1 and 2 as prototypes within PRISM-games [42]. The code is available online (see the Data availability statement). Since PRISM-games does not provide an exact algorithm to solve SG, we used the available value iteration approach to implement the single-objective black box \(\texttt{SolveSingleObj}\). Note that since value iteration is not exact for single-objective SG, PRISM-games cannot compute the exact lex-values. Nevertheless, we focus on measuring the overhead introduced by our algorithm compared to a single-objective solver. In our implementation, value iteration stops if the values do not increase by more than \(10^{-8}\) in one iteration, which is PRISM’s default configuration.
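The stopping criterion described above can be sketched as follows. This is a minimal illustration on a hypothetical toy game and not the PRISM-games implementation; the dictionary `GAME`, the function `value_iteration` and the parameter `eps` are assumptions of this sketch. Note that this criterion gives no guarantee on the distance to the true value in general.

```python
# Hypothetical toy game: state -> (owner, {action: [(probability, successor), ...]}).
GAME = {
    "s0":   ("max", {"a": [(0.5, "goal"), (0.5, "sink")], "b": [(1.0, "s1")]}),
    "s1":   ("min", {"c": [(0.9, "goal"), (0.1, "sink")], "d": [(0.4, "goal"), (0.6, "s0")]}),
    "goal": ("max", {}),
    "sink": ("max", {}),
}

def value_iteration(game, target, eps=1e-8):
    """Iterate the Bellman operator from below until no value changes by more than eps."""
    values = {s: (1.0 if s in target else 0.0) for s in game}
    while True:
        new_values = {}
        for s, (owner, actions) in game.items():
            if s in target or not actions:          # absorbing states keep their value
                new_values[s] = values[s]
                continue
            outcomes = [sum(p * values[t] for p, t in dist) for dist in actions.values()]
            new_values[s] = max(outcomes) if owner == "max" else min(outcomes)
        delta = max(abs(new_values[s] - values[s]) for s in game)
        values = new_values
        if delta <= eps:                            # stopping criterion
            return values

values = value_iteration(GAME, {"goal"})
```

In this toy game, Maximizer prefers to move to `s1`, and Minimizer's best reply there is action `c`, so both non-absorbing states end up with a reachability value of 0.9.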
5.1.2 Case studies
For our experiments, we have collected four case studies from the literature where lexicographic objectives seem particularly useful:

Dice This example is shipped with PRISM-games [42] and models a simple dice game between two players. The number of throws in this game is a configurable parameter, which we instantiate with 10, 20 and 50. The game has three possible outcomes: Player \(\textsf{Max}\) wins, Player \(\textsf{Min}\) wins or draw. A natural lex-objective is thus to maximize the winning probability and then the probability of a draw.

Charlton This case study [37] is also included in PRISM-games. It models an autonomous car navigating through a road network. A natural lex-objective is to minimize the probability of an accident (possibly damaging human life) and then maximize the probability to reach the destination.

Hallway (HW) This instance is based on the Hallway example which is standard in the AI literature [62, 63]. A robot can move north, east, south or west in a known environment, but each move only succeeds with a certain probability, and otherwise rotates or moves the robot in an undesired direction. We extend the example by a target wandering around based on a mixture of probabilistic and demonic nondeterministic behavior, thereby obtaining a stochastic game that models, for instance, a panicking person in a building on fire. Moreover, we assume a 0.01 probability of damaging the robot when executing certain movements; a damaged robot’s actions succeed with even smaller probability. The primary objective is to save the human, and the secondary objective is to avoid damaging the robot. We use gridworlds of sizes \(5 {\times } 5\), \(8 {\times } 8\), and \(10 {\times } 10\).

Avoid the observer (AV) This case study is inspired by a similar example in [64]. It models a game between an intruder and an observer in a grid world. The grid can have different sizes as in HW, and we use \(10 {\times } 10\), \(15 {\times } 15\), and \(20 {\times } 20\). The most important objective of the intruder is to avoid the observer, and its secondary objective is to exit the grid. We assume that the observer can only detect the intruder within a certain distance and otherwise makes random moves. At every position, the intruder moreover has the option to stay and search for a treasure. In our example, a treasure is found with probability 0.1 each time the intruder decides to search for it. Collecting a treasure is the tertiary (reachability) objective.
Note that the above case studies (and in fact also the ones from Sect. 5.2 further below) have just 2–3 objectives. We are currently not aware of any examples where a significantly larger number of objectives is useful or natural. In any case, this might not be computationally feasible as the runtime of our algorithm is exponential in the number of objectives in the worst case.
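To make the lexicographic preference in these case studies concrete, the following small sketch compares value vectors in lexicographic order. The numbers are hypothetical and only mimic a Dice-like setting with the vector (probability of winning, probability of a draw); `candidates` and `lex_geq` are names introduced for this illustration.

```python
# Hypothetical value vectors (P(win), P(draw)) achieved by three strategies.
candidates = {
    "cautious": (0.40, 0.50),
    "balanced": (0.45, 0.10),
    "greedy":   (0.45, 0.30),
}

def lex_geq(v, x):
    """Decide v >=_lex x; Python compares tuples lexicographically."""
    return tuple(v) >= tuple(x)

# The first component dominates; the second only breaks ties in the first.
best = max(candidates, key=candidates.get)
```

Here `best` is `"greedy"`: it ties with `"balanced"` on the primary objective and wins on the secondary one, while `"cautious"` loses on the primary objective despite its high draw probability. An objective that is to be minimized, such as the accident probability in Charlton, can be handled in the same way by storing the probability of the complementary safety objective.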
5.1.3 Experimental results
The experiments were conducted on a 2.4 GHz quad-core Intel Core i5 processor, with 4 GB of RAM available to the Java VM. The results are reported in Table 1. We only recorded the run time of the actual algorithms; the time needed to parse and build the model is excluded. All numbers are rounded to full seconds. All instances (even those with state spaces of order \(10^6\)) could be solved within a few minutes.
The two leftmost columns of Table 1 show the type of the lex-objective, the name of the case study with scaling parameters (if applicable), and the number of states in the model. The next three columns contain the verification times (excluding time to parse and build the model), rounded to full seconds. The columns labeled \(\mathcal {G}\) and \(\widetilde{\mathcal {G}}\) provide the average number of actions per state, i.e., the value \(|S|^{-1} \sum _{s \in S} |\textsf{Act}(s)|\) in the original SG \(\mathcal {G}\) as well as in all subgames \(\widetilde{\mathcal {G}}\) (which result from removing lex-suboptimal actions, see Algorithm 1) considered in the main stage. The rightmost column reports on the fraction of stages that had to be analyzed, i.e., the stages solved by the algorithm compared to the theoretically maximal possible number of stages (\(2^n-1\)).
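The two derived quantities in the table can be computed as in the following sketch. The function names and the tiny action map are hypothetical; the stage count of 4 out of 7 corresponds to what is reported for AV with three objectives.

```python
def average_actions(act):
    """Average number of actions per state, with act mapping each state to its action list."""
    return sum(len(actions) for actions in act.values()) / len(act)

def stage_fraction(stages_solved, n):
    """Fraction of analyzed stages relative to the maximum of 2^n - 1 for n objectives."""
    return stages_solved / (2 ** n - 1)

toy_act = {"s0": ["a", "b"], "s1": ["a"]}      # hypothetical action map
avg = average_actions(toy_act)
av_fraction = stage_fraction(4, 3)
```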
We compare the time of our algorithm on the lexicographic objective (column Lex) to the time for checking just the first single objective (First) and the sum of checking all single objectives individually (All). The runtimes of our algorithm and of checking all single objectives individually are always of the same order of magnitude. This shows that our algorithm works well in practice and that the overhead is often small. Even on SG of non-trivial size (HW[10\(\times \)10] and AV[20\(\times \)20]), our algorithm returns the result within a few minutes.
Regarding the average number of actions, we see that the decrease in the number of actions in the subgames \(\widetilde{\mathcal {G}}\) obtained by restricting the input game to optimal actions varies: For example, very few actions are removed in the Dice instances, in AV we have a moderate decrease, and in HW a significant decrease, where almost all the nondeterminism is eliminated after the first objective. We conjecture that the fewer actions are removed, the higher the overhead is compared to the individual single-objective solutions (column All). Consider the AV and HW examples: While for AV[20\(\times \)20], computing the lexicographic solution takes 1.7 times as long as all the single-objective solutions, it takes only about 25% longer for HW[10\(\times \)10]; this is the case because in HW, only very few actions remain after the first objective. In AV, on the other hand, lots of choices have to be considered even for the second and third objective. Note that the first objective sometimes (HW), but not always (AV) needs the majority of the runtime.
We also see that the algorithm does not always have to explore all possible stages. For example, for Dice we always just need a single stage, because the SG is absorbing. For Charlton and HW all stages are relevant for the lex-objective, while for AV only 4 of 7 need to be considered.
5.2 MDP with lexicographic LTL: experiments
5.2.1 Implementation
We implemented Algorithms 3 and 4 as an extension of Storm [44]. This allows us to reuse existing efficient implementations for basic algorithms, such as finding end components and solving single-objective reachability. The code is available online (see the Data availability statement).
Our implementation accepts as input an MDP and an ordered list of LTL formulas. We transform each LTL formula into a deterministic Streett automaton using the tool Spot [65], and then construct the standard product automaton of the MDP and all the automata. This results in a (larger) MDP with several Streett objectives. We then proceed as described in the algorithms in Sect. 4.
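The product step can be sketched as follows. This is a minimal illustration and not the Storm-based implementation: acceptance conditions are omitted, the automata are plain deterministic transition tables, and the convention that each automaton reads the label of the MDP state being entered is an assumption of this sketch. The names `product`, `MDP`, `LABEL`, `aut_f_y` and `aut_last` are hypothetical.

```python
from collections import deque

def product(mdp, label, s0, automata):
    """Reachable product of an MDP with deterministic automata.

    mdp:      state -> action -> list of (probability, successor)
    label:    state -> letter read by the automata
    automata: list of (initial_state, delta) with delta[(q, letter)] = q'
    """
    def step(qs, s):
        return tuple(delta[(q, label[s])] for q, (_, delta) in zip(qs, automata))

    init = (s0, step(tuple(q0 for q0, _ in automata), s0))
    prod, queue = {}, deque([init])
    while queue:
        node = queue.popleft()
        if node in prod:
            continue
        s, qs = node
        prod[node] = {}
        for a, dist in mdp[s].items():
            # Probabilities come from the MDP; the automata move deterministically.
            prod[node][a] = [(p, (t, step(qs, t))) for p, t in dist]
            queue.extend(succ for _, succ in prod[node][a])
    return init, prod

# Hypothetical two-state MDP with labels x and y.
MDP = {
    "s": {"stay": [(1.0, "s")], "move": [(0.5, "s"), (0.5, "t")]},
    "t": {"back": [(1.0, "s")]},
}
LABEL = {"s": "x", "t": "y"}
# Automaton 1 remembers whether y has been seen; automaton 2 remembers the last letter.
aut_f_y = (0, {(0, "x"): 0, (0, "y"): 1, (1, "x"): 1, (1, "y"): 1})
aut_last = ("miss", {("miss", "x"): "hit", ("miss", "y"): "miss",
                     ("hit", "x"): "hit", ("hit", "y"): "miss"})

init, prod = product(MDP, LABEL, "s", [aut_f_y, aut_last])
```

In this toy instance only 3 of the \(2 \cdot 2 \cdot 2 = 8\) possible product states are reachable. In general the product can have up to \(|S|\) times the product of the automata sizes many states, which is the source of the blowup when many automata are involved.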
5.2.2 Case studies
We evaluate our implementation on the benchmarks from [18] and on one additional new case study, the CleaningRobot. In summary, we consider the following case studies for our experiments:

CleaningRobot A robot cleans a house with two floors. It starts in the upper floor. At some point, it can decide to try to clean the stairs, but with a probability of 0.5 it will fall down and not clean the stairs. After this, it can only clean the ground floor, and never go up to the first floor again. The LTL objectives that we check are (in this order) (1) \(\textbf{GF}\;clean_{first}\), (2) \(\textbf{F}\;clean_{stairs}\), and (3) \(\textbf{GF}\;clean_{ground}\).
Gridworld This example models a robot that moves around in a grid world, has to visit several outposts, avoid dangerous zones, and recharge every now and then. It can only move left, right, up and down. The uncertainty comes into play when the robot chooses to go into one direction: With probability of 0.5, it will move one step, and with 0.5 probability it will move two steps. A visualization can be found in Fig. 3, together with the objectives that we checked.
Virus This example models a computer virus that is trying to breach a system. The system consists of a grid of nodes, 3\(\times \)3 in our case. If a node is infected, it can choose to attack its neighboring nodes. This will be detected with a probability of 0.5. After a node was successfully attacked, it will get infected with a probability of 0.5, as well. This model stems originally from [66]. The objectives to be verified are:

1.
\(\textbf{F}s_{3,2} = 2 ~\wedge ~\textbf{G} ((s_{3,2}=2 \wedge s_{3,1}\ne 2) \implies \textbf{XX}s_{3,1} = 2)\), and

2.
\(\textbf{F}s_{1,1} = 2~\wedge ~\textbf{G}(s_{2,2}\ne 2 \wedge s_{3,2}\ne 2)\) .
The first one makes sure that node (3, 2) will eventually be infected, and once it is infected, node (3, 1) should be infected in at most two steps. The second objective defines a specific path to infect node (1, 1), without infecting nodes (2, 2) and (3, 2).
UAV This model considers an unmanned aerial vehicle (UAV) that moves on a network of roads, originally presented in [67]. There are restricted operation zones that should be avoided (modelled by the first objective), and waypoints that should be visited (modelled by the second objective). Formally, the objectives are (1) \(\bigwedge _i \textbf{G}\lnot roz_i\), and (2) \(\textbf{F}w_1\wedge \textbf{F}w_2\wedge \textbf{F}w_6\).

5.2.3 Experimental results
The experiments were conducted on a server running Ubuntu 20.04.2 with 251 GB RAM and an Intel Xeon E5-2630 v4 CPU @ 2.20 GHz. The results are reported in Table 2. We only recorded the run time of the actual algorithms; the time needed to parse and build the model is excluded because it was below one second in all cases. All numbers are rounded to full seconds.
For each model of the case study, we report the number of states, the number of MEC, and the lexicographic preference order of the objectives. As a result we show the probabilities of each objective, and the runtime of our approach, as well as the runtime of [18], which solves the same problem using reinforcement learning. Apart from the existing benchmarks we also added some more models.
We see that our approach is significantly faster on most models, e.g., it takes less than one second on a 4\(\times \)7 grid in contrast to 44 seconds. However, on models with a larger state space, and especially a large number of MEC, the learning procedure of [18] scales better, e.g., it only takes 213 s for the UAV model, whereas our approach takes 1367 s. This scaling is a known advantage and the goal of learning procedures. However, for this they sacrifice guarantees on the resulting probabilities, which can be arbitrarily far off. In contrast, after termination, our algorithm outputs the exact lex-value and provably optimal strategies.
6 Conclusion and future work
In this work, we considered simple stochastic games with lexicographic objectives. Simple stochastic games are a standard model in reactive synthesis of stochastic systems, and lexicographic objectives allow for an analysis with multiple objectives with an order of preference. We first focused on the most basic objectives: safety and reachability. While simple stochastic games with lexicographic objectives have not been studied before, we have presented determinacy, strategy complexity, computational complexity, and algorithms for these games. Moreover, we have shown how these games can model different case studies and presented experimental results. Afterwards, we discussed stochastic games with \(\omega \)-regular objectives and gave a simple solution for the subclass of Markov decision processes with lexicographic \(\omega \)-regular objectives represented as Streett conditions.
There are several directions for future work. First, for lexicographic reachability-safety objectives in SG, closing the complexity gap (\(\textsf{NEXPTIME}\cap \textsf{coNEXPTIME}\) upper bound and \(\textsf{PSPACE}\) lower bound, see Theorem 6) is an open question. Second, the problem of solving stochastic games with lexicographic \(\omega \)-regular objectives remains open. Combinations of \(\omega \)-regular objectives in SG are a difficult open problem in general. For instance, even qualitative combinations for more than two objectives as well as quantitative combinations have not been solved yet [68]. Finally, one can consider lexicographic combinations of quantitative objectives such as mean payoff or total reward, which allow for modeling further practical applications.
Data availability
The software and data used to generate the experimental results of this paper are available online (see https://doi.org/10.5281/zenodo.5798108 and https://doi.org/10.5281/zenodo.7233605 for the experiments described in Sect. 5.1 and Sect. 5.2, respectively).
Notes
For technical reasons (discussed in Sect. 4.1), we prefer Streett over the equally expressive Rabin, Parity, or Muller conditions.
Visiting every state in an EC infinitely often with probability 1 is achieved by a memoryless randomized strategy that picks actions uniformly at random, or, alternatively, by a deterministic strategy using memory constructed as in [56, Lem. 5.4].
References
1. Condon A (1992) The complexity of stochastic games. Inf Comput 96(2):203–224. https://doi.org/10.1016/08905401(92)90048K
2. Puterman ML (2014) Markov decision processes: discrete stochastic dynamic programming. Wiley, New Jersey
3. Baier C, Katoen JP (2008) Principles of model checking. MIT Press, Massachusetts
4. Filar J, Vrieze K (1997) Competitive Markov decision processes. Springer, Switzerland
5. Chatterjee K, Henzinger TA (2012) A survey of stochastic \(\omega \)-regular games. J Comput Syst Sci 78(2):394–413. https://doi.org/10.1016/j.jcss.2011.05.002
6. Chatterjee K, Sen K, Henzinger TA (2008) Model-checking omega-regular properties of interval Markov chains. In: FoSSaCS. Lecture notes in computer science, vol 4962, Springer, Switzerland, pp 302–317
7. Weininger M, Meggendorfer T, Kretínský J (2019) Satisfiability bounds for \(\omega \)-regular properties in bounded-parameter Markov decision processes. In: CDC, IEEE, New York, pp 2284–2291. https://doi.org/10.1109/CDC40024.2019.9029460
8. Altman E (1999) Constrained Markov decision processes. CRC Press, Florida
9. Chatterjee K (2007) Markov decision processes with multiple long-run average objectives. In: FSTTCS. Lecture notes in computer science, vol 4855, Springer, Switzerland, pp 473–484. https://doi.org/10.1007/9783540770503_39
10. Delgrange F, Katoen J, Quatmann T, Randour M (2020) Simple strategies in multi-objective MDPs. In: TACAS (1). Lecture notes in computer science, vol 12078, Springer, Switzerland, pp 346–364. https://doi.org/10.1007/9783030451905_19
11. Berthon R, Guha S, Raskin J (2020) Mixing probabilistic and non-probabilistic objectives in Markov decision processes. In: LICS, ACM, New York, pp 195–208. https://doi.org/10.1145/3373718.3394805
12. Chen T, Forejt V, Kwiatkowska MZ, Simaitis A, Wiltsche C (2013) On stochastic games with multiple objectives. In: MFCS. Lecture notes in computer science, vol 8087, Springer, Switzerland, pp 266–277. https://doi.org/10.1007/9783642403132_25
13. Fishburn PC (1974) Exceptional paper - lexicographic orders, utilities and decision rules: a survey. Manage Sci 20(11):1442–1471. https://doi.org/10.1287/mnsc.20.11.1442
14. Blume L, Brandenburger A, Dekel E (1991) Lexicographic probabilities and choice under uncertainty. Econom: J Econom Soc. https://doi.org/10.2307/2938240
15. Bloem R, Chatterjee K, Henzinger TA, Jobstmann B (2009) Better quality in synthesis through quantitative objectives. In: CAV. Lecture notes in computer science, vol 5643, Springer, Switzerland, pp 140–156. https://doi.org/10.1007/9783642026584_14
16. Colcombet T, Jurdzinski M, Lazic R, Schmitz S (2017) Perfect half space games. In: Logic in Computer Science, LICS 2017, IEEE Computer Society, Washington, DC, pp 1–11. https://doi.org/10.1109/LICS.2017.8005105
17. Wray KH, Zilberstein S, Mouaddib A (2015) Multi-objective MDPs with conditional lexicographic reward preferences. In: AAAI, AAAI Press, California, pp 3418–3424. https://doi.org/10.5555/2888116.2888191
18. Hahn EM, Perez M, Schewe S, Somenzi F, Trivedi A, Wojtczak D (2021) Model-free reinforcement learning for lexicographic omega-regular objectives. In: FM. Lecture notes in computer science, vol 13047, Springer, Switzerland, pp 142–159. https://doi.org/10.1007/9783030908706_8
19. Chatterjee K, Katoen J, Weininger M, Winkler T (2020) Stochastic games with lexicographic reachability-safety objectives. In: CAV (2). Lecture notes in computer science, vol 12225, Springer, Switzerland, pp 398–420. https://doi.org/10.1007/9783030532918_21
20. Bouyer P, Roux SL, Oualhadj Y, Randour M, Vandenhove P (2022) Games where you can play optimally with arena-independent finite memory. Log Methods Comput Sci. https://doi.org/10.46298/lmcs18(1:11)2022
21. Etessami K, Kwiatkowska MZ, Vardi MY, Yannakakis M (2008) Multi-objective model checking of Markov decision processes. LMCS. https://doi.org/10.2168/LMCS4(4:8)2008
22. Brázdil T, Brozek V, Chatterjee K, Forejt V, Kucera A (2014) Two views on multiple mean-payoff objectives in Markov decision processes. LMCS. https://doi.org/10.2168/LMCS10(1:13)2014
23. Chatterjee K, Forejt V, Wojtczak D (2013) Multi-objective discounted reward verification in graphs and MDPs. In: LPAR. Lecture notes in computer science, vol 8312, Springer, Switzerland, pp 228–242. https://doi.org/10.1007/9783642452215_17
24. Forejt V, Kwiatkowska MZ, Norman G, Parker D, Qu H (2011) Quantitative multi-objective verification for probabilistic systems. In: TACAS, pp 112–127. https://doi.org/10.1007/9783642198359_11
25. Quatmann T, Katoen J (2021) Multi-objective optimization of long-run average and total rewards. In: TACAS (1). Lecture notes in computer science, vol 12651, Springer, Switzerland, pp 230–249. https://doi.org/10.1007/9783030720162_13
26. Brázdil T, Chatterjee K, Forejt V, Kucera A (2017) Trading performance for stability in Markov decision processes. J Comput Syst Sci 84:144–170. https://doi.org/10.1016/j.jcss.2016.09.009
27. Filar JA, Krass D, Ross KW (1995) Percentile performance criteria for limiting average Markov decision processes. IEEE Trans Autom Control 40(1):2–10
28. Randour M, Raskin J, Sankur O (2017) Percentile queries in multi-dimensional Markov decision processes. Formal Methods Syst Des 50(2–3):207–248. https://doi.org/10.1007/s1070301602627
29. Chatterjee K, Kretínská Z, Kretínský J (2017) Unifying two views on multiple mean-payoff objectives in Markov decision processes. LMCS. https://doi.org/10.23638/LMCS13(2:15)2017
30. Baier C, Dubslaff C, Klüppelholz S (2014) Trade-off analysis meets probabilistic model checking. In: CSL-LICS, pp 1:1–1:10. https://doi.org/10.1145/2603088.2603089
31. Baier C, Dubslaff C, Klüppelholz S, Daum M, Klein J, Märcker S, Wunderlich S (2014) Probabilistic model checking and non-standard multi-objective reasoning. In: FASE. Lecture notes in computer science, vol 8411, Springer, Switzerland, pp 1–16. https://doi.org/10.1007/9783642548048_1
32. Roijers DM, Whiteson S (2017) Multi-objective decision making. Synth Lect Artif Intell Mach Learn. https://doi.org/10.2200/S00765ED1V01Y201704AIM034
33. Svorenová M, Kwiatkowska M (2016) Quantitative verification and strategy synthesis for stochastic games. Eur J Control 30:15–30. https://doi.org/10.1016/j.ejcon.2016.04.009
34. Basset N, Kwiatkowska MZ, Topcu U, Wiltsche C (2015) Strategy synthesis for stochastic games with multiple long-run objectives. In: TACAS. Lecture notes in computer science, vol 9035, Springer, Switzerland, pp 256–271. https://doi.org/10.1007/9783662466810_22
35. Chatterjee K, Doyen L (2016) Perfect-information stochastic games with generalized mean-payoff objectives. In: LICS, ACM, New York, pp 247–256. https://doi.org/10.1145/2933575.2934513
36. Brenguier R, Forejt V (2016) Decidability results for multi-objective stochastic games. In: ATVA. Lecture notes in computer science, vol 9938, pp 227–243. https://doi.org/10.1007/9783319465203_15
37. Chen T, Kwiatkowska MZ, Simaitis A, Wiltsche C (2013) Synthesis for multi-objective stochastic games: an application to autonomous urban driving. In: QEST, pp 322–337. https://doi.org/10.1007/9783642401961_28
38. Ashok P, Chatterjee K, Kretínský J, Weininger M, Winkler T (2020) Approximating values of generalized-reachability stochastic games. In: LICS, ACM, New York, pp 102–115. https://doi.org/10.1145/3373718.3394761
39. Bruyère V, Hautem Q, Raskin J (2018) Parameterized complexity of games with monotonically ordered omega-regular objectives. In: CONCUR. LIPIcs, vol 118, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Wadern, pp 29:1–29:16. https://doi.org/10.4230/LIPIcs.CONCUR.2018.29
40. Bruyère V, Filiot E, Randour M, Raskin J (2017) Meet your expectations with guarantees: beyond worst-case synthesis in quantitative games. Inf Comput 254:259–295. https://doi.org/10.1016/j.ic.2016.10.011
41. Wray KH, Zilberstein S (2015) Multi-objective POMDPs with lexicographic reward preferences. In: IJCAI, AAAI Press, California, pp 1719–1725. http://ijcai.org/Abstract/15/245
42. Kwiatkowska M, Parker D, Wiltsche C (2018) PRISM-games: verification and strategy synthesis for stochastic multi-player games with multiple objectives. STTT 20(2):195–210. https://doi.org/10.1007/s100090170476z
43. Brázdil T, Chatterjee K, Forejt V, Kucera A (2015) MultiGain: a controller synthesis tool for MDPs with multiple mean-payoff objectives. In: TACAS. Lecture notes in computer science, vol 9035, Springer, Switzerland, pp 181–187. https://doi.org/10.1007/9783662466810_12
44. Dehnert C, Junges S, Katoen J, Volk M (2017) A storm is coming: a modern probabilistic model checker. In: CAV (2). Lecture notes in computer science, vol 10427, Springer, Switzerland, pp 592–600. https://doi.org/10.1007/9783319633909_31
45. Quatmann T, Junges S, Katoen J (2017) Markov automata with multiple objectives. In: CAV (1). Lecture notes in computer science, vol 10426, Springer, Switzerland, pp 140–159. https://doi.org/10.1007/9783319633879_7
46. Hartmanns A, Junges S, Katoen J, Quatmann T (2020) Multi-cost bounded tradeoff analysis in MDP. J Autom Reason 64:1483–1522. https://doi.org/10.1007/s10817020095749
47. Pranger S, Könighofer B, Posch L, Bloem R (2021) TEMPEST - synthesis tool for reactive systems and shields in probabilistic environments. In: ATVA. Lecture notes in computer science, vol 12971, Springer, Switzerland, pp 222–228. https://doi.org/10.1007/9783030888855_15
48. Tarski A (1955) A lattice-theoretical fixpoint theorem and its applications. Pacific J Math 5(2):285–309. https://doi.org/10.2140/pjm.1955.5.285
49. Forejt V, Kwiatkowska MZ, Parker D (2012) Pareto curves for probabilistic model checking. In: ATVA. Lecture notes in computer science, vol 7561, Springer, Switzerland, pp 317–332. https://doi.org/10.1007/9783642333866_25
50. Fijalkow N, Horn F (2010) The surprizing complexity of reachability games. CoRR abs/1010.2420. arXiv:1010.2420
51. Winkler T, Weininger M (2021) Stochastic games with disjunctions of multiple objectives. In: GandALF. EPTCS, vol 346, pp 83–100. https://doi.org/10.4204/EPTCS.346.6
52. Kupferman O, Vardi MY (1998) Freedom, weakness, and determinism: from linear-time to branching-time. In: LICS, IEEE Computer Society, Washington, DC, pp 81–92. https://doi.org/10.1109/LICS.1998.705645
53. Sickert S (2019) A unified translation of linear temporal logic to \(\omega \)-automata. PhD thesis, Technical University of Munich, Germany. https://nbnresolving.org/urn:nbn:de:bvb:91diss20190801148493214
54. Chatterjee K, Henzinger M (2011) Faster and dynamic algorithms for maximal end-component decomposition and related graph problems in probabilistic verification. In: SODA, SIAM, Philadelphia, pp 1318–1336. https://doi.org/10.1137/1.9781611973082.101
55. Chatterjee K, Dvorák W, Henzinger M, Loitzenbauer V (2016) Model and objective separation with conditional lower bounds: disjunction is harder than conjunction. In: LICS, ACM, New York, pp 197–206. https://doi.org/10.1145/2933575.2935304
56. Chatterjee K, Dvorák W, Henzinger M, Loitzenbauer V (2016) Model and objective separation with conditional lower bounds: disjunction is harder than conjunction. CoRR abs/1602.02670. arXiv:1602.02670
57. Kelmendi E, Krämer J, Kretínský J, Weininger M (2018) Value iteration for simple stochastic games: stopping criterion and learning algorithm. In: CAV (1). Lecture notes in computer science, vol 10981, Springer, Switzerland, pp 623–642. https://doi.org/10.1007/9783319961453_36
58. van Dijk T (2018) Attracting tangles to solve parity games. In: CAV (2). Lecture notes in computer science, vol 10982, Springer, Switzerland, pp 198–215. https://doi.org/10.1007/9783319961422_14
59. Ujma M (2015) On verification and controller synthesis for probabilistic systems at runtime. PhD thesis, University of Oxford, UK. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.711811
60. Löding C (1999) Optimal bounds for transformations of omega-automata. In: FSTTCS. Lecture notes in computer science, vol 1738, Springer, Switzerland, pp 97–109. https://doi.org/10.1007/3540466916_8
61. Müller D, Sickert S (2017) LTL to deterministic Emerson-Lei automata. In: GandALF. EPTCS, vol 256, pp 180–194. https://doi.org/10.4204/EPTCS.256.13
62. Littman ML, Cassandra AR, Kaelbling LP (1995) Learning policies for partially observable environments: scaling up. In: ICML, Morgan Kaufmann, Massachusetts, pp 362–370. https://doi.org/10.1016/b9781558603776.500529
63. Chatterjee K, Chmelik M, Gupta R, Kanodia A (2016) Optimal cost almost-sure reachability in POMDPs. Artif Intell 234:26–48. https://doi.org/10.1016/j.artint.2016.01.007
64. Chatterjee K, Chmelík M (2015) POMDPs under probabilistic semantics. Artif Intell 221:46–72. https://doi.org/10.1016/j.artint.2014.12.009
65. Duret-Lutz A, Lewkowicz A, Fauchille A, Michaud T, Renault E, Xu L (2016) Spot 2.0 - a framework for LTL and \(\omega \)-automata manipulation. In: ATVA. Lecture notes in computer science, vol 9938, pp 122–129. https://doi.org/10.1007/9783319465203_8
66. Kwiatkowska M, Norman G, Parker D, Vigliotti MG (2009) Probabilistic mobile ambients. Theor Comput Sci 410(12):1272–1303. https://doi.org/10.1016/j.tcs.2008.12.058
67. Feng L, Wiltsche C, Humphrey LR, Topcu U (2015) Controller synthesis for autonomous systems interacting with human operators. In: ICCPS, ACM, New York, pp 70–79. https://doi.org/10.1145/2735960.2735973
68. Chatterjee K, Piterman N (2019) Combinations of qualitative winning for stochastic parity games. In: CONCUR. LIPIcs, vol 140, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Wadern, pp 6:1–6:17. https://doi.org/10.4230/LIPIcs.CONCUR.2019.6
Funding
Tobias Winkler and Joost-Pieter Katoen are supported by the DFG RTG 2236 UnRAVeL and the innovation programme under the Marie Skłodowska-Curie grant agreement No. 101008233 (Mission). Krishnendu Chatterjee is supported by the ERC CoG 863818 (ForM-SMArt) and the Vienna Science and Technology Fund (WWTF) Project ICT15-003. Maximilian Weininger is supported by the DFG projects 383882557 Statistical Unbounded Verification (SUV) and 427755713 Group-By Objectives in Probabilistic Verification (GOPro). Stefanie Mohr is supported by the DFG RTG 2428 CONVEY. Open Access funding enabled and organized by Projekt DEAL.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chatterjee, K., Katoen, JP., Mohr, S. et al. Stochastic games with lexicographic objectives. Form Methods Syst Des (2023). https://doi.org/10.1007/s10703023004114
DOI: https://doi.org/10.1007/s10703023004114