The Complexity of Graph-Based Reductions for Reachability in Markov Decision Processes

We study the never-worse relation (NWR) for Markov decision processes with an infinite-horizon reachability objective. A state q is never worse than a state p if the maximal probability of reaching the target set of states from p is at most the same value from q, regardless of the probabilities labelling the transitions. Extremal-probability states, end components, and essential states are all special cases of the equivalence relation induced by the NWR. Using the NWR, states in the same equivalence class can be collapsed. Then, actions leading to sub-optimal states can be removed. We show the natural decision problem associated to computing the NWR is coNP-complete. Finally, we describe an incomplete polynomial-time iterative algorithm to under-approximate the NWR.


Introduction
Markov decision processes (MDPs) are a useful model for decision-making in the presence of a stochastic environment. They are used in several fields, including robotics, automated control, economics, manufacturing and in particular planning [19], model-based reinforcement learning [21], and formal verification [1]. We elaborate on the use of MDPs and the need for graph-based reductions in verification and reinforcement learning applications below.
[Figure 1: The maximal reachability probability values from p and q are the same since, from both, one can enforce reaching s with probability 1, or t with probability 1, using different strategies.]
Several verification problems for MDPs reduce to reachability [1,5]. For instance, MDPs can be model checked against linear-time objectives (expressed in, say, LTL) by constructing an omega-automaton recognizing the set of runs that satisfy the objective and considering the product of the automaton with the original MDP [6]. In this product MDP, accepting end components (a generalization of strongly connected components) are identified and selected as target components. The question of maximizing the probability that the MDP behaviours satisfy the linear-time objective is thus reduced to maximizing the probability of reaching the target components. The maximal reachability probability is computable in polynomial time by reduction to linear programming [6,1]. In practice, however, most model checkers use value iteration to compute this value [16,9]. The worst-case time complexity of value iteration is pseudo-polynomial. Hence, when implementing model checkers it is usual for a graph-based pre-processing step to remove as many unnecessary states and transitions as possible while preserving the maximal reachability probability. Well-known reductions include the identification of extremal-probability states and maximal end components [5,1]. The intended outcome of this pre-processing step is a reduced number of probability values that need to be considered when computing the number of iterations required by value iteration.
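To make the value-iteration step concrete, the following is a minimal sketch in Python. The dictionary encoding `delta[state][action]`, the function name `max_reach_values`, the tolerance `eps`, and the example MDP are all assumptions made for illustration; they are not taken from any model checker. The stopping rule (small change between sweeps) is a common heuristic and does not by itself guarantee an error bound.

```python
from typing import Dict, Hashable, Set

# Assumed encoding: delta[state][action] maps successor states to probabilities.
MDP = Dict[Hashable, Dict[Hashable, Dict[Hashable, float]]]


def max_reach_values(delta: MDP, targets: Set[Hashable],
                     eps: float = 1e-12,
                     max_sweeps: int = 100_000) -> Dict[Hashable, float]:
    """Approximate, from below, the maximal probability of reaching `targets`."""
    values = {q: (1.0 if q in targets else 0.0) for q in delta}
    for _ in range(max_sweeps):
        change = 0.0
        for q, actions in delta.items():
            if q in targets or not actions:
                continue
            # Bellman update: best expected value over the available actions.
            best = max(sum(pr * values[r] for r, pr in dist.items())
                       for dist in actions.values())
            change = max(change, abs(best - values[q]))
            values[q] = best
        if change < eps:  # heuristic stopping criterion
            break
    return values


# Hypothetical four-state MDP: "t" is the target, "z" is a losing sink.
example = {
    "p": {"a": {"t": 0.5, "z": 0.5}, "b": {"q": 1.0}},
    "q": {"a": {"t": 0.75, "z": 0.25}},
    "t": {"loop": {"t": 1.0}},
    "z": {"loop": {"z": 1.0}},
}
example_values = max_reach_values(example, {"t"})
```

In this hypothetical example, p and q end up with the same value because p can move to q with probability 1; this is the kind of redundancy that graph-based pre-processing tries to detect without looking at the numbers.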
The main idea behind MDP reduction heuristics is to identify sets of states from which the maximal probability of reaching the target set of states is the same. Such states are in fact redundant and can be "collapsed". Figure 1 depicts an MDP with actions and probabilities omitted for clarity. From p and q there are strategies to ensure that s is reached with probability 1. The same holds for t. For instance, from p, to get to t almost surely, one plays to go to the distribution directly below q; from q, to the distribution above q. Unfortunately, since from the state p, there is no strategy to ensure that q is reached with probability 1, p and q do not form an end component. In fact, to the best of our knowledge, no known MDP reduction heuristic captures this example.
In reinforcement learning the actual probabilities labelling the transitions of an MDP are not assumed to be known in advance. Thus, they have to be estimated by experimenting with different actions in different states and collecting statistics about the observed outcomes [13]. In order for the statistics to be good approximations, the number of experiments has to be high enough. In particular, when the approximations are required to be probably approximately correct [22] then the necessary number of experiments is pseudo-polynomial [12]. The fact that an excessive number of experiments is required is a known drawback of reinforcement learning [14,18].
A natural question to ask in this context is whether the maximal reachability probability does indeed depend on the actual value of the probability labelling a particular transition of the MDP. If this is not the case, then it need not be learnt. One natural way to remove transition probabilities which do not affect the maximal reachability value is to apply model checking MDP reduction techniques.
Contributions and structure of the paper.
We view the directed graph underlying an MDP as a directed bipartite graph. Vertices in this graph are controlled by players Protagonist and Nature. Nature is only allowed to choose full-support probability distributions for each one of his vertices, thus instantiating an MDP from the graph; Protagonist has strategies just as he would in an MDP. Hence, we consider infinite families of MDPs with the same support. In the game played between Protagonist and Nature, and for vertices u and v, we are interested in knowing whether the maximal reachability probability from u is never (in any of the MDPs with the game as its underlying directed graph) worse than the same value from v.
In Section 2 we give all the required definitions. We formalize the never-worse relation in Section 3. We also show that we can "collapse" sets of equivalent vertices with respect to the NWR (Theorem 1) and remove sub-optimal edges according to the NWR (Theorem 2). Finally, we also argue that the NWR generalizes most known heuristics to reduce MDP size before applying linear programming or value iteration. Then, in Section 4 we give a graph-based characterization of the relation (Theorem 3), which in turn gives us a coNP upper bound on its complexity. A matching lower bound is presented in Section 5 (Theorem 4). To conclude, in Section 6 we recall and extend an iterative algorithm from [2] that under-approximates the never-worse relation in polynomial time.

Previous and related work.
Reductions for MDP model checking were considered in [7] and [5]. From the reductions studied in both papers, extremal-probability states, essential states, and end components are computable using only graph-based algorithms. In [3], learning-based techniques are proposed to obtain approximations of the maximal reachability probability in MDPs. Their algorithms, however, do rely on the probability values of the MDP.
In [2] a preliminary version of the iterative algorithm we give in Section 6 was described, implemented, and shown to be efficient in practice. Proposition 1 was first stated therein. In contrast with [2], we focus chiefly on characterizing the never-worse relation and determining its computational complexity.

Preliminaries
We use set-theoretical notation to indicate whether a letter $b \in \Sigma$ occurs in a word $\alpha = a_0 \dots a_k \in \Sigma^*$, i.e. $b \in \alpha$ if and only if $b = a_i$ for some $0 \le i \le k$.
Consider a directed graph $G = (V, E)$ and a vertex $u \in V$. We write $uE$ for the set of successors of $u$, that is, $uE := \{v \in V \mid (u, v) \in E\}$. We say that a path $\pi = u_0 \dots u_k \in V^*$ in $G$ visits a vertex $v$ if $v$ occurs in $\pi$.

Stochastic models
Let $S$ be a finite set. We denote by $D(S)$ the set of all (rational) probability distributions on $S$, i.e. the set of all functions $f : S \to \mathbb{Q}_{\ge 0}$ such that $\sum_{s \in S} f(s) = 1$.
Definition 1 (Markov chains). A Markov chain $C$ is a tuple $(Q, \delta)$ where $Q$ is a finite set of states and $\delta$ is a probabilistic transition function $\delta : Q \to D(Q)$.
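A run of a Markov chain is a finite non-empty word $\varrho = p_0 \dots p_n$ over $Q$. We say $\varrho$ reaches $q$ if $q = p_i$ for some $0 \le i \le n$. The probability of the run is $\prod_{0 \le i < n} \delta(p_i)(p_{i+1})$.
Let $T \subseteq Q$ be a set of states. The probability of (eventually) reaching $T$ in $C$ from $q_0$, denoted by $\mathbb{P}^{q_0}_{C}[\Diamond T]$, is the measure of the runs of $C$ that start at $q_0$ and reach $T$. For convenience, let us first define the probability of staying in states from $S \subseteq Q$ until $T$ is reached, written $\mathbb{P}^{q_0}_{C}[S \mathbin{\mathsf{U}} T]$, as the sum of the probabilities of all runs $p_0 \dots p_n \in (S \setminus T)^* T$ with $p_0 = q_0$. We then define $\mathbb{P}^{q_0}_{C}[\Diamond T] := \mathbb{P}^{q_0}_{C}[Q \mathbin{\mathsf{U}} T]$.
When all runs from $q_0$ to $T$ reach some set $U \subseteq Q$ before, the probability of reaching $T$ can be decomposed into a finite sum.
Lemma 1. Consider a Markov chain $C = (Q, \delta)$, a set of states $U \subseteq Q$, and a state $q_0 \in Q \setminus U$. If $\mathbb{P}^{q_0}_{C}[(Q \setminus U) \mathbin{\mathsf{U}} T] = 0$, then
$$\mathbb{P}^{q_0}_{C}[\Diamond T] = \sum_{u \in U} \mathbb{P}^{q_0}_{C}[(Q \setminus U) \mathbin{\mathsf{U}} u] \cdot \mathbb{P}^{u}_{C}[\Diamond T].$$
Definition 2 (Markov decision processes). A Markov decision process $M$, MDP for short, is a tuple $(Q, A, \delta, T)$ where $Q$ is a finite set of states, $A$ is a finite set of actions, $\delta : Q \times A \to D(Q)$ is a probabilistic transition function, and $T \subseteq Q$ is a set of target states. For convenience, we write $\delta(q|p, a)$ instead of $\delta(p, a)(q)$.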

Definition 3 (Strategies). A (memoryless deterministic) strategy $\sigma$ in an MDP $M = (Q, A, \delta, T)$ is a function $\sigma : Q \to A$.
Note that we have deliberately defined only memoryless deterministic strategies. This is at no loss of generality since, in this work, we focus on maximizing the probability of reaching a set of states. It is known that for this type of objective, memoryless deterministic strategies suffice [17].
From MDPs to chains. An MDP $M = (Q, A, \delta, T)$ and a strategy $\sigma$ induce the Markov chain $M^\sigma = (Q, \mu)$ where $\mu(q) = \delta(q, \sigma(q))$ for all $q \in Q$.
[Figure caption: The labels on arrows from states to distributions are actions; those on arrows from distributions to states, probabilities.]
Consider the strategy σ that plays from p the action a and from q the action b, i.e. σ(p) = a and σ(q) = b. The Markov chain on the right is the chain induced by σ and the MDP on the left. Note that we no longer have action labels.
The probability of reaching a target state from q under σ is easily seen to be 3/4. In other words, if we write $M$ for the MDP and $T$ for the set of target states then $\mathbb{P}^{q}_{M^\sigma}[\Diamond T] = 3/4$.
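The construction of the induced chain can be sketched in a few lines of Python. The MDP below is hypothetical (the figure of the example is not reproduced here); it was chosen only so that the strategy σ(p) = a, σ(q) = b yields probability 3/4 from q. The names `induce_chain` and `reach_probabilities` and the dictionary encoding are assumptions of this sketch. The iteration computes lower bounds that become exact after finitely many sweeps when the chain has no cycles other than self-loops on absorbing states, as is the case here.

```python
from fractions import Fraction


def induce_chain(delta, sigma):
    """Markov chain mu with mu(q) = delta(q, sigma(q)) for every state q."""
    return {q: delta[q][sigma[q]] for q in delta}


def reach_probabilities(chain, targets, sweeps=100):
    """Lower bounds on the probability of reaching `targets` from each state."""
    x = {q: Fraction(1 if q in targets else 0) for q in chain}
    for _ in range(sweeps):
        # Simultaneous update of all non-target states.
        x = {q: x[q] if q in targets
             else sum(pr * x[r] for r, pr in chain[q].items())
             for q in chain}
    return x


half, one = Fraction(1, 2), Fraction(1)
# Hypothetical MDP: "t" is the only target, "z" is a losing sink.
delta = {
    "p": {"a": {"t": half, "z": half}, "b": {"q": one}},
    "q": {"a": {"z": one}, "b": {"t": half, "p": half}},
    "t": {"a": {"t": one}, "b": {"t": one}},
    "z": {"a": {"z": one}, "b": {"z": one}},
}
sigma = {"p": "a", "q": "b", "t": "a", "z": "a"}
chain = induce_chain(delta, sigma)
probs = reach_probabilities(chain, {"t"})
```

Note that the induced chain no longer mentions actions: each state is mapped directly to a distribution over successors.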

Reachability games against Nature
We will speak about families of MDPs whose probabilistic transition functions have the same support. To do so, we abstract away the probabilities and focus on a game played on a graph. That is, given an MDP $M = (Q, A, \delta, T)$ we consider its underlying directed graph $G_M = (V, E)$ where $V := Q \cup (Q \times A)$ and $E := \{(q, \langle q, a \rangle) \in Q \times (Q \times A)\} \cup \{(\langle p, a \rangle, q) \mid \delta(q|p, a) > 0\}$. In $G_M$, Nature controls the vertices $Q \times A$. We formalize the game and the arena it is played on below.
Definition 4 (Target arena). A target arena $A$ is a tuple $(V, V_P, E, T)$ such that $(V_P, V_N := V \setminus V_P, E)$ is a bipartite directed graph, $T \subseteq V_P$ is a set of target vertices, and $uE \neq \emptyset$ for all $u \in V_N$.
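As an illustration, the underlying graph can be built as follows. This is a sketch under an assumed encoding: `delta[p][a]` is a dictionary from successors to probabilities, and only the actions listed for a state are treated as available in it (the definition above pairs every state with every action). Nature vertices are represented as Python tuples `(p, a)`.

```python
def underlying_graph(delta):
    """Return (V, E) for the graph underlying an MDP given as delta[p][a][q]."""
    vertices = set(delta)          # Protagonist vertices: the states
    edges = set()
    for p, actions in delta.items():
        for a, dist in actions.items():
            nature = (p, a)        # Nature vertex <p, a>
            vertices.add(nature)
            edges.add((p, nature))
            for q, prob in dist.items():
                if prob > 0:       # keep only the support of delta(p, a)
                    edges.add((nature, q))
    return vertices, edges


# Hypothetical two-state MDP; the zero-probability entry produces no edge.
mdp = {
    "p": {"a": {"q": 0.5, "p": 0.5}},
    "q": {"b": {"q": 1.0, "p": 0.0}},
}
V, E = underlying_graph(mdp)
```

Two MDPs with the same support yield the same graph, which is why the graph stands for a whole family of MDPs.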
Informally, there are two agents in a target arena: Nature, who controls the vertices in V N , and Protagonist, who controls the vertices in V P .
From arenas to MDPs. A target arena $A = (V, V_P, E, T)$ together with a family of probability distributions $\mu = (\mu_u \in D(uE))_{u \in V_N}$ induces an MDP. Formally, let $A_\mu$ be the MDP $(Q, A, \delta, T)$ where $Q = V_P \uplus \{\bot\}$, $A = V_N$, $\delta(q|p, a)$ is $\mu_a(q)$ if $(p, a), (a, q) \in E$ and $0$ otherwise, and for all $p \in V_P$ and $a \in A$ we have $\delta(\bot|p, a) = 1$ if $(p, a) \notin E$.
The value of a vertex. Consider a target arena $A = (V, V_P, E, T)$ and a vertex $v \in V_P$. We define its (maximal reachability probability) value with respect to a family of full-support probability distributions $\mu$ as follows:
$$\mathrm{Val}^\mu(v) := \max_{\sigma} \mathbb{P}^{v}_{A_\mu^\sigma}[\Diamond T].$$

The never-worse relation
We are now in a position to define the relation we study in this work. Let us fix a target arena A = (V, V P , E, T ).
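Definition 5 (The never-worse relation). A subset $W \subseteq V$ of vertices is never worse than a vertex $v \in V$, written $v \trianglelefteq W$, if and only if for all families $\mu = (\mu_u \in D(uE))_{u \in V_N}$ there exists $w \in W$ such that $\mathrm{Val}^\mu(v) \le \mathrm{Val}^\mu(w)$, where all the $\mu_u$ have full support. We write $v \sim w$ if $v \trianglelefteq \{w\}$ and $w \trianglelefteq \{v\}$.
It should be clear from the definition that ∼ is an equivalence relation. For $u \in V_P$ let us denote by $\tilde{u}$ the Protagonist vertex equivalence class of $u$, i.e. $\tilde{u} := \{v \in V_P \mid v \sim u\}$.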
Example 2. Consider the left target arena depicted in Figure 3. Using Lemma 1, it is easy to show that neither p nor q is ever worse than the other since t is visited before fin by all paths starting from p or q.
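Because the relation quantifies over all full-support families µ, a single family under which every w ∈ W has a strictly smaller value than v shows that W is not never worse than v. The sketch below searches for such a family by random sampling. It is only a numerical refuter: it can disprove the relation on small arenas but can never prove it, and a returned family should be double-checked exactly. The arena encoding (`prot_succ`, `nat_succ`), the function names, and the `margin` parameter are assumptions of this sketch.

```python
import random


def arena_values(prot_succ, nat_succ, targets, mu, sweeps=200):
    """Approximate (from below) the value of every vertex under the family mu."""
    val = {v: (1.0 if v in targets else 0.0)
           for v in list(prot_succ) + list(nat_succ)}
    for _ in range(sweeps):
        for u, succs in nat_succ.items():      # Nature: expectation
            val[u] = sum(mu[u][s] * val[s] for s in succs)
        for v, succs in prot_succ.items():     # Protagonist: maximum
            if v not in targets:
                val[v] = max((val[u] for u in succs), default=0.0)
    return val


def find_separating_family(prot_succ, nat_succ, targets, v, W,
                           tries=100, margin=1e-6, seed=0):
    """Sample full-support families; return one where all of W is worse than v."""
    rng = random.Random(seed)
    for _ in range(tries):
        mu = {}
        for u, succs in nat_succ.items():
            weights = [rng.random() + 1e-3 for _ in succs]  # strictly positive
            total = sum(weights)
            mu[u] = {s: wt / total for s, wt in zip(succs, weights)}
        val = arena_values(prot_succ, nat_succ, targets, mu)
        if all(val[w] + margin < val[v] for w in W):
            return mu
    return None


# Hypothetical arena: from v the target t is reached surely, from w only maybe.
prot_succ = {"v": ["n1"], "w": ["n2"], "t": ["nt"], "z": ["nz"]}
nat_succ = {"n1": ["t"], "n2": ["t", "z"], "nt": ["t"], "nz": ["z"]}
refutation = find_separating_family(prot_succ, nat_succ, {"t"}, "v", ["w"])
no_refutation = find_separating_family(prot_succ, nat_succ, {"t"}, "w", ["v"])
```

In this hypothetical arena the search finds a family separating v from {w}, while no sampled family makes v worse than w, which is consistent with {v} being never worse than w.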
The literature contains various heuristics which consist in computing sets of states and "collapsing" them to reduce the size of the MDP without affecting the maximal reachability probability of the remaining states. We now show that we can collapse equivalence classes and, further, remove sub-optimal distributions using the NWR.

The usefulness of the NWR
We will now formalize the idea of "collapsing" equivalent vertices with respect to the NWR. For convenience, we will also remove self-loops while doing so.
Consider a target arena $A = (V, V_P, E, T)$. We denote by $A_{/\sim}$ its ∼-quotient. That is, $A_{/\sim}$ is the target arena $(S, S_P, R, U)$ where
For a family $\mu = (\mu_u \in D(uE))_{u \in V_N}$ of full-support probability distributions we denote by $\mu_{/\sim}$ the family $\nu = (\nu_u \in D(uR))_{u \in V_N}$ defined as follows. For all $u \in V_N$ and all $\tilde{v} \in uR$ we have
The following property of the ∼-quotient follows from the fact that all the vertices in $\tilde{v}$ have the same maximal probability of reaching the target set of vertices.

Theorem 1. Consider a target arena
We can further remove all edges that lead to sub-optimal Nature vertices. That is, for a target arena A = (V, V P , E, T ), we denote by A / its optimal trimming defined as the target arena (V, When the optimal trimming is applied after ∼-quotienting the maximal reachability probabilities are preserved.

Theorem 2. Consider a target arena
For all families µ = (µ u ∈ D(uE)) u∈VN of full-support probability distributions and all v ∈ V P we have

Known efficiently-computable special cases
We now recall the definitions of the set of extremal-probability states, end components, and essential states. Then, we observe that within each of these sets of states the maximal reachability probabilities coincide, and that their definitions are independent of the probabilities labelling the transitions of the MDP. Hence, they are subsets of equivalence classes induced by ∼.

Extremal-probability states.
The set of extremal-probability states of an MDP $M = (Q, A, \delta, T)$ consists of the states with maximal reachability probability 0 or 1. Both sets (the states with value 0 and those with value 1) can be computed in polynomial time [1,4]. We give below a game-based definition of both sets inspired by the classical polynomial-time algorithm to compute them (see, e.g., [1]). Let us fix a target arena $A = (V, V_P, E, T)$ for the sequel. For a set $T \subseteq V$, let us write $Z_T := \{v \in V \mid T \text{ is not reachable from } v\}$.
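The set of vertices from which T is not reachable is a purely graph-theoretic object and can be obtained by a backward breadth-first search from T. The following sketch assumes the graph is given as a collection of vertices and a collection of edge pairs; the function name `cannot_reach` is hypothetical.

```python
from collections import deque


def cannot_reach(vertices, edges, targets):
    """Return the set of vertices from which no vertex of `targets` is reachable."""
    preds = {v: set() for v in vertices}
    for u, v in edges:
        preds[v].add(u)
    seen = set(targets)
    queue = deque(targets)
    while queue:                    # backward BFS from the targets
        v = queue.popleft()
        for u in preds[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return set(vertices) - seen


# Hypothetical arena: z and m form a trap that never reaches t.
vs = {"v", "n", "t", "z", "m"}
es = [("v", "n"), ("n", "t"), ("n", "z"), ("z", "m"), ("m", "z")]
trap = cannot_reach(vs, es, {"t"})
```

The search visits every edge at most once, so the running time is linear in the size of the graph.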
(Almost-surely winning) strategies. A strategy for Protagonist in a target arena is a function σ : We say that a strategy for Protagonist is almost-surely winning from We denote the set of all such strategies by Win v0 T ′ . The following properties regarding almost-surely winning strategies in a target arena follow from the correctness of the graph-based algorithm used to compute extremal-probability states in an MDP [1, Lemma 10.108].
Lemma 2 (From [1]). Consider a target arena $A = (V, V_P, E, T)$. For all families $\mu = (\mu_u \in D(uE))_{u \in V_N}$ of full-support probability distributions, for all $v \in V_P$ the following hold.
It follows immediately from the definition of end component that the maximal probability of reaching $T$ from states in the same end component is the same.

Let us consider an MDP
It is easy to see that if $S, U \subseteq Q$ are end components in $M$ and $S \cap U \neq \emptyset$ then $S \cup U$ is also an end component in $M$. Thus, we can speak about maximal end components. Furthermore, from the definition of end components in MDPs and Lemma 2 it follows that we can lift the notion of end component to target arenas. More precisely, a set $S \subseteq V_P$ is an end component in $A$ if and only if for some family of full-support probability distributions $\mu$ we have that $S$ is an end component in $A_\mu$.
The set of all maximal end components of a target arena can be computed in polynomial time using an algorithm based on the strongly connected components of the graph [8,1].

Essential states.
Consider a target arena $A = (V, V_P, E, T)$ and let ⊑ be the smallest relation satisfying the following. For all $u \in V_P$ we have $u \sqsubseteq u$. For all $u_0, v \in V_P \setminus Z_T$ such that $u_0 \neq v$ we have $u_0 \sqsubseteq v$ if for all paths $u_0 u_1 u_2$ we have that $u_2 \sqsubseteq v$ and there is at least one such path. Intuitively, $u \sqsubseteq v$ holds whenever all paths starting from $u$ reach $v$. In [7], the maximal vertices according to ⊑ are called essential states.
Lemma 4 (From [7]). Consider a target arena $A = (V, V_P, E, T)$. For all families $\mu = (\mu_u \in D(uE))_{u \in V_N}$ of full-support probability distributions, for all $v \in V_P$ and all essential states
Note that, in the left arena in Figure 3, $p \sqsubseteq t$ does not hold since there is a cycle between p and q which does not visit t.
It was also shown in [7] that the ⊑ relation is computable in polynomial time.

Graph-based characterization of the NWR
In this section we give a characterization of the NWR that is reminiscent of the topological-based value iteration proposed in [5]. The main intuition behind our characterization is as follows. If $v \trianglelefteq W$ does not hold, then for some family µ of full-support distributions the maximal probability of reaching the target set $T$ of states from $v$ can be set arbitrarily close to 1 while the same value from all $w \in W$ can be set arbitrarily close to 0. In turn, this must mean that there is a path from $v$ to $T$ which can be assigned a high probability by µ while, from $W$, all paths go with high probability to $Z_T$.
We capture the idea of separating a "good" v-T path from all paths starting from $W$ by partitioning $V$ into layers $S_i \subseteq V$. Intuitively, we would like it to be easy to construct a family µ of probability distributions such that from all vertices in $S_{i+1}$ all paths going to vertices outside of $S_{i+1}$ end up, with high probability, in lower layers, i.e. some $S_j$ with $j \le i$. A formal definition follows.
Definition 6 (Drift partition and vertices). Consider a target arena $A = (V, V_P, E, T)$ and a partition $(S_i)_{0 \le i \le k}$ of $V$. For all $0 \le i \le k$, let $S_i^+ := \bigcup_{i<j} S_j$ and $S_i^- := \bigcup_{j<i} S_j$, and let $D_i := \{v \in S_i \cap V_N \mid vE \cap S_i^- \neq \emptyset\}$. We define the set $D := \bigcup_{0<i<k} D_i$ of drift vertices. The partition is called a drift partition if the following hold.
• For all $i \le k$ and all $u \in S_i \cap V_P$ we have $uE \cap S_i^+ = \emptyset$.
• For all $i \le k$ and all $u \in S_i \cap V_N$ we have $uE \cap S_i^+ \neq \emptyset \implies uE \cap S_i^- \neq \emptyset$.
Using drift partitions, we can now formalize our characterization of the negation of the NWR.
Theorem 3. Consider a target arena $A = (V, V_P, E, T)$, a non-empty set of vertices $W \subseteq V$, and a vertex $v \in V$. The following are equivalent.
(i) $v \trianglelefteq W$ does not hold.
(ii) There exists a drift partition $(S_i)_{0 \le i \le k}$ and a simple path π starting in $v$ and ending in $T$ such that $\pi \subseteq S_k$ and $W \subseteq S_k^-$.
Before proving Theorem 3 we need an additional definition and two intermediate results.
It can be shown that from any path in a target arena ending in T one can obtain a simple non-decreasing one.
Lemma 5. Consider a target arena $A = (V, V_P, E, T)$ and a family of full-support probability distributions $\mu = (\mu_u \in D(uE))_{u \in V_N}$. If there is a path from some $v \in V$ to $T$, there is also a simple µ-non-decreasing one.
Additionally, we will make use of the following properties regarding vertex values. They formalize the relation between the value of a vertex, its owner, and the values of its successors.
Lemma 6. Consider a target arena $A = (V, V_P, E, T)$ and a family of full-support probability distributions $\mu = (\mu_u \in D(uE))_{u \in V_N}$.
(i) For all Protagonist-owned vertices u ∈ V P , for all successors v ∈ uE it holds that Val µ (v) ≤ Val µ (u).
Proof of Theorem 3. Recall that, by definition, (i) holds if and only if there exists a family $\mu = (\mu_u \in D(uE))_{u \in V_N}$ of full-support probability distributions such that $\forall w \in W : \mathrm{Val}^\mu(w) < \mathrm{Val}^\mu(v)$. Let us prove (i) ⟹ (ii). Let $x_0 < x_1 < \dots$ be the finitely many (i.e. at most $|V|$) values that occur in the MDP $A_\mu$, and let $k$ be such that $\mathrm{Val}^\mu(v) = x_k$. For all $0 \le i < k$ let $S_i := \{u \in V \mid \mathrm{Val}^\mu(u) = x_i\}$, and let $S_k := V \setminus \bigcup_{i<k} S_i$. Let us show below that the $S_i$ form a drift partition.
• $\forall i \le k, \forall u \in S_i \cap V_P : uE \cap S_i^+ = \emptyset$ by Lemma 6.(i) (for $i < k$) and since $S_k^+ = \emptyset$.
• $\forall i \le k, \forall u \in S_i \cap V_N : uE \cap S_i^+ \neq \emptyset \implies uE \cap S_i^- \neq \emptyset$ by Lemma 6.(ii) (for $i < k$) and since $S_k^+ = \emptyset$.
We have that $\mathrm{Val}^\mu(w) < \mathrm{Val}^\mu(v) = x_k$ for all $w \in W$, by assumption, so $W \subseteq S_k^-$ by construction. By Lemma 5 there exists a simple µ-non-decreasing path π from $v$ to $T$, so all the vertices occurring in π have values at least $\mathrm{Val}^\mu(v)$, so $\pi \subseteq S_k$.
We will prove (ii) =⇒ (i) by defining some full-support distribution family µ. The definition will be partial only, first on π ∩ V N , and then on the drift vertices in V \ S k . Let 0 < ε < 1, which is meant to be small enough. Let us write π = v 0 . . . v n so that v 0 = v and v n ∈ T . Let us define µ on π ∩ V N as follows: for all i < n, if v i ∈ V N let µ vi (v i+1 ) := 1 − ε. Let σ be an arbitrary Protagonist strategy such that for all i < n, if v i ∈ V P then σ(v i ) := v i+1 . Therefore , which will prove (ii) =⇒ (i). However, the last part of the proof is more difficult.
For all 1 ≤ i ≤ k, for all drift vertices u ∈ S i , let ̺(u) be a successor of u in S − i . Such a ̺(u) exists by definition of the drift vertices. Then let µ u (̺(u)) := 1 − ε. So (2) is the probability that, starting at u and following σ, T is never reached; and (1 − ε) ( is the probability that, starting at u and following σ, the second vertex is ̺(u) and T is never reached. Now let σ be an arbitrary strategy, and let us prove the following by induction on j.
Base case, j = 0: by assumption W is non-empty and included in S − k , so 0 < k. Also by assumption T ⊆ S k , so T ∩ S 0 = ∅. By definition of a drift partition, there are no edges going out of S 0 , regardless of whether the starting vertex is in V P or V N . So there is no path from w to T , which implies Val µ (w) = 0 for all w ∈ S 0 , and the claim holds for the base case. Inductive case, let w ∈ S j , let D ′ := D ∩ (S j ∪ S − j ) and let us argue that every path π from w to T must at some point leave S j ∪ S − j to reach a vertex with higher index, i.e. there is some edge (π i , π i+1 ) from π i ∈ S j ∪ S − j to some π i+1 ∈ S ℓ with j < ℓ. By definition of a drift partition, π i must also be a drift vertex, i.e.

Intractability of the NWR
It follows from Theorem 3 that we can decide whether a vertex is sometimes worse than a set of vertices by guessing a partition of the vertices and verifying that it is a drift partition. The verification can clearly be done in polynomial time.
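The polynomial-time verification step can be sketched as follows. The conditions checked are the ones that appear in the proofs of Theorems 3 and 4: a Protagonist vertex in layer i has no successor in a higher layer, and a Nature vertex in layer i with a successor in a higher layer also has a successor in a lower layer. The witness check additionally requires a simple path from v into T inside the top layer and W inside the lower layers. The encoding (a list of layers, a set of Protagonist vertices, a list of edges) and both function names are assumptions of this sketch.

```python
def is_drift_partition(layers, protagonist, edges):
    """Check the two drift-partition conditions for layers S_0, ..., S_k."""
    layer_of = {}
    for i, layer in enumerate(layers):
        for v in layer:
            if v in layer_of:           # layers must be pairwise disjoint
                return False
            layer_of[v] = i
    succ = {v: set() for v in layer_of}
    for u, v in edges:
        if u not in layer_of or v not in layer_of:
            return False                # layers must cover every vertex
        succ[u].add(v)
    for u, i in layer_of.items():
        goes_up = any(layer_of[v] > i for v in succ[u])
        goes_down = any(layer_of[v] < i for v in succ[u])
        if u in protagonist and goes_up:
            return False
        if u not in protagonist and goes_up and not goes_down:
            return False
    return True


def is_witness(layers, protagonist, edges, targets, path, W):
    """Check a certificate (drift partition plus path) that W is sometimes worse."""
    top = set(layers[-1])
    lower = set().union(*layers[:-1])
    edge_set = set(edges)
    is_path = all((a, b) in edge_set for a, b in zip(path, path[1:]))
    is_simple = len(set(path)) == len(path)
    return (bool(path) and bool(W) and is_path and is_simple
            and path[-1] in targets and set(path) <= top and set(W) <= lower
            and is_drift_partition(layers, protagonist, edges))


# Hypothetical arena: v reaches t surely; w reaches t or the sink z.
prot = {"v", "w", "t", "z"}
arena_edges = [("v", "n1"), ("n1", "t"), ("w", "n2"), ("n2", "t"), ("n2", "z"),
               ("t", "nt"), ("nt", "t"), ("z", "nz"), ("nz", "z")]
good = [{"z", "nz"}, {"w", "n2"}, {"v", "n1", "t", "nt"}]
bad = [{"z", "nz"}, {"v", "n1"}, {"w", "n2", "t", "nt"}]
```

Each check is a single pass over the vertices and edges, which matches the claim that a guessed partition can be verified in polynomial time.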
Corollary 1. Given a target arena $A = (V, V_P, E, T)$, a non-empty vertex set $W \subseteq V$, and a vertex $v \in V$, determining whether $v \trianglelefteq W$ is decidable and in coNP.
We will now show that the problem is in fact coNP-complete already for Markov chains. The idea is to reduce the 2-Disjoint Paths problem (2DP) to the existence of a drift partition witnessing that $v \trianglelefteq \{w\}$ does not hold, for some $v \in V$. Recall that 2DP asks, given a directed graph $G = (V, E)$ and vertex pairs $(s_1, t_1), (s_2, t_2) \in V \times V$, whether there exists an $s_1$-$t_1$ path $\pi_1$ and an $s_2$-$t_2$ path $\pi_2$ such that $\pi_1$ and $\pi_2$ are vertex disjoint, i.e. $\pi_1 \cap \pi_2 = \emptyset$. The problem is known to be NP-complete [11,10]. In the sequel, we assume without loss of generality that (a) $t_1$ and $t_2$ are reachable from all $s \in V \setminus \{t_1, t_2\}$; and (b) $t_1$ and $t_2$ are the only sinks of $G$.
Proof of Theorem 4. From the 2DP input instance, we construct the target arena
We will show there are vertex-disjoint $s_1$-$t_1$ and $s_2$-$t_2$ paths in $G$ if and only if there is a drift partition $(S_i)_{0 \le i \le k}$ and a simple $s_1$-$t_1$ path π such that $\pi \subseteq S_k$ and $s_2 \in S_k^-$. The result will then follow from Theorem 3.
Suppose we have a drift partition (S i ) 0≤i≤k with s 2 ∈ S − k and a simple path all paths from vertices in the set visit only vertices from it, we can assume that S 0 = {t 2 , t 2 , t 2 }. (Indeed, for any drift partition, one can obtain a new drift partition by moving any trapping set to a new lowest layer.) Now, using the assumption that t 2 is reachable from all s ∈ V \ {t 1 , t 2 } one can show by induction that for all 0 ≤ j < k and for all ϱ = u 0 ∈ S j there is a path u 0 . . . u m in G with u m = t 2 and ϱ ⊆ S − j+1 . This implies that there is an s 2 -t 2 path π 2 in G such that π 2 ⊆ S − k . It follows that π 2 is vertex disjoint with the
Now, let us suppose that we have s 1 -t 1 and s 2 -t 2 vertex disjoint paths π 1 = u 0 . . . u n and π 2 = v 0 . . . v m . Clearly, we can assume both π 1 , π 2 are simple. We will construct a partition (S i ) 0≤i≤m+1 and show that it is indeed a drift partition, that u 0 u 0 , u 1 . . . u n−1 , u n u n ⊆ S m+1 , and s 2 = v 0 ∈ S − m+1 . Let us set and S m+1 := S \ ∪ 0≤i≤m S i . Since π 2 is simple, (S i ) 0≤i≤m+1 is a partition of V . Furthermore, we have that s 2 = v 0 ∈ S − m+1 , and u 0 u 0 , u 1 . . . u n−1 , u n u n ⊆ S m+1 since π 1 and π 2 are vertex disjoint. Thus, it only remains for us to argue that for all 0 ≤ i ≤ m + 1: for all w ∈ S i ∩ S P we have wR ∩ S + i = ∅, and for all w ∈ S i ∩ S N we have wR ∩ S + i ≠ ∅ ⟹ wR ∩ S − i ≠ ∅. By construction of the S i , we have that eR ⊆ S i for all 0 ≤ i ≤ m and all e ∈ S i ∩ S P . Furthermore, for all 0 < i ≤ m, for all
To conclude, we observe that since

Efficiently under-approximating the NWR
Although the full NWR cannot be efficiently computed for a given MDP (unless P = NP), we can hope for "under-approximations" that are accurate and efficiently computable. Let ⪯ be a relation between vertices and sets of vertices. We denote by ⪯* the pseudo transitive closure of ⪯. That is, ⪯* is the smallest relation such that ⪯ ⊆ ⪯* and for all u ∈ V, X ⊆ V, if there exists W ⊆ V such that u ⪯* W and ∀w ∈ W : w ⪯* X, then u ⪯* X.
Remark 1. The following hold.
• The empty set is an under-approximation of the NWR.
• For all under-approximations ⪯ of the NWR, the pseudo transitive closure ⪯* of ⪯ is also an under-approximation of the NWR.
In [2], efficiently-decidable sufficient conditions for the NWR were given. In particular, those conditions suffice to infer relations such as those in the right MDP from Figure 3. We recall (Proposition 1) and extend (Proposition 2) these conditions below.
Proposition 1 (From [2]). Consider a target arena $A = (V, V_P, E, T)$ and an under-approximation ⪯ of the NWR. For all vertices $v_0 \in V$, and sets $W \subseteq V$ the following hold.
Sketch. The main idea of the proof of item (i) is to note that S is visited before T . The desired result then follows from Lemma 1. For item (ii), we intuitively have that there is a strategy to visit T with some probability or visit W , where the chances of visiting T are worse than before. We then show that it is never worse to start from v 0 to have better odds of visiting T .
The above "rules" give an iterative algorithm to obtain increasingly better under-approximations of the NWR: from ⪯_i apply the rules and obtain a new under-approximation ⪯_{i+1} by adding the new pairs and taking the pseudo transitive closure; then repeat until convergence. Using the special cases from Section 3.2 we can obtain a nontrivial initial under-approximation ⪯_0 of the NWR in polynomial time.
The main problem is how to avoid testing all subsets W ⊆ V in every iteration. One natural way to ensure we do not consider all subsets of vertices in every iteration is to apply the rules from Proposition 1 only on the successors of Protagonist vertices.
In the same spirit of the iterative algorithm described above, we now give two new rules to infer equivalence with respect to the NWR.
(ii) For all $u, v \in V_P \setminus T$, if for all $w \in uE$ such that $w \preceq (uE \setminus \{w\})$ does not hold we have that $w \preceq vE$, then $u \trianglelefteq \{v\}$.
The rules stated in Proposition 2 can be used to infer relations like those depicted in Figure 4 and are clearly seen to be computable in polynomial time as they speak only of successors of vertices.

Conclusions
We have shown that deciding the never-worse relation is coNP-complete; hence, unfortunately, the relation is not computable in polynomial time unless P = NP. On the bright side, we have extended the iterative polynomial-time algorithm from [2] to under-approximate the relation. In that paper, a prototype implementation of the algorithm was used to empirically show that interesting MDPs (from the set of benchmarks included in PRISM [16]) can be drastically reduced.
As future work, we believe it would be interesting to implement an exact algorithm to compute the NWR using SMT solvers. Symbolic implementations of the iterative algorithms should also be tested in practice. In a more theoretical direction, we observe that the planning community has also studied maximizing the probability of reaching a target set of states under the name of MAXPROB (see, e.g., [15,20]). There, online approximations of the NWR would make more sense than the under-approximation we have proposed here. Finally, one could define a notion of never-worse for finite-horizon reachability objectives or quantitative objectives.

A Preliminaries
A.1 Definitions
A finite-memory strategy σ is a strategy that can be encoded as a deterministic Mealy machine $\mathcal{A} = (S, s_I, Q, A, \lambda_u, \lambda_o)$ where $S$ is a finite set of (memory) states, $s_I$ is the initial state, $\lambda_u : S \times Q \to S$ is the update function and $\lambda_o : S \times Q \to A$ is the output function. The machine encodes σ in the following sense: $\sigma(q_0 a_0 \dots q_n) = \lambda_o(s_n, q_n)$ where $s_0 = s_I$ and $s_{i+1} = \lambda_u(s_i, q_i)$ for all $0 \le i < n$. We then say that $\mathcal{A}$ realizes the strategy σ and that σ has memory $|S|$. In particular, strategies which have memory 1 are said to be positional (or memoryless).

A.2 Bellman equations
The following result [1, Theorem 10.100] about MDPs will be useful.
• if $p \in Z_T$ then $x_p = 0$,
• otherwise $x_p = \max_{a \in A} \sum_{q \in Q} \delta(q|p, a) \cdot x_q$.

B Proof of Theorem 1
Proof. It suffices to prove that for two vertices u, v ∈ V P such that u ∼ v, we can collapse them into one single vertex, and that probability-1 self-loops can be removed. Removing self-loops is clearly correct, so we focus on correctness of collapsing vertices. For convenience, let us assume that extremal-probability vertices have already been collapsed so that u, v ∈ V P are not extremal-probability vertices. (Correctness of extremal-probability vertices is trivial.) From Lemma 7 we have that adding edges from u to v and from v to u (with intermediate Nature vertices to preserve bipartiteness) does not affect the maximal reachability probability values. Now, u and v form an end component. Since collapsing end components preserves the desired value [8,5], the result follows.

C Proof of Theorem 2
Proof. Intuitively, removing edges that lead to sub-optimal vertices should be clearly correct. However, self-loops pose a risk. That is, imagine we remove all edges except for those involved in a loop that does not contain a target vertex. We have now reduced the probability of reaching the target set of vertices! Fortunately, the claim assumes that we have applied ∼-quotienting. Since end components are a special case of ∼ equivalence classes, we have no end components nor probability-1 self-loops in the target arena. In other words, for all families µ of full-support distributions, and for all cycles in the target arena, the cycle will have probability strictly less than 1. To conclude, we observe that by removing sub-optimal edges one at a time, we preserve the Bellman optimality equations from Lemma 7. (The latter was not true before quotienting since the cycles that could be formed would create new vertices with probability 0 of reaching the target set of states.)

D Proof of Lemma 1
Proof. Since $\mathbb{P}^{q_0}_{C}[(Q \setminus U) \mathbin{\mathsf{U}} T] = 0$, we have $q_0 \notin T$. Thus, by definition, we have
The set of runs that start at $q_0$ and reach $T$ is the union of: the set of runs that start at $q_0$ and stay in a set $S \subseteq Q$ until they reach $T$; and the set of runs that start at $q_0$ and reach some state from $Q \setminus S$ before eventually reaching $T$. The measure of the runs from the first set is $\mathbb{P}^{q_0}_{C}[S \mathbin{\mathsf{U}} T]$. Let us denote by $\mathbb{P}[\tau_{Q \setminus S} < \tau_T]$ the measure of the runs from the second set. Since in Markov chains we have that for all runs $p_0 \dots p_i \dots p_m$ the probabilities of the prefix $p_0 \dots p_i$ and the suffix $p_i \dots p_m$ are independent, we can rewrite (3) as follows.
$$\mathbb{P}^{q_0}_{C}[\Diamond T] = \mathbb{P}^{q_0}_{C}[(Q \setminus U) \mathbin{\mathsf{U}} T] + \mathbb{P}[\tau_U < \tau_T]$$
By assumption we have that the first summand is equal to 0, so $\mathbb{P}^{q_0}_{C}[\Diamond T] = \mathbb{P}[\tau_U < \tau_T]$. Also by assumption, we know that there are no runs of $C$ starting at $q_0$ and staying in $Q \setminus U$ until reaching $T$. Hence, the set of runs from $q_0$ that reach some state in $U$ and then eventually reach $T$ is exactly the union, over all $u \in U$, of the sets of runs $q_0 \dots q_\ell \dots q_n \in Q^*$ that satisfy
• $q_\ell = u$,
• $\forall \ell < j < n : q_j \notin T$, and
Once more using the fact that events (i.e. transition probabilities) in a Markov chain are independent of the history (i.e. run prefixes) we can write the measure of the above set as follows.
This concludes the proof.

E Proof of Lemma 3
Proof. Let $\sigma$ be a strategy maximizing the value $P^{q}_{M^{\sigma}}[\Diamond T]$. We can then construct a (finite-memory) strategy $\sigma'$ which, from $p$, ensures $P^{p}_{M^{\sigma'}}[S \,\mathsf{U}\, q] = 1$ and from $q$ onwards behaves as $\sigma$. If $q \in T$ then we are done. Otherwise, Lemma 1 implies that $P^{p}_{M^{\sigma'}}[\Diamond T] = P^{q}_{M^{\sigma}}[\Diamond T]$. Furthermore, we know that memoryless strategies suffice for reachability [17,1], so $\sigma'$ can be replaced by a memoryless strategy.

F Proof of Lemma 5
Let us denote by λ the empty word.
Lemma 8. Let w be a word. s(w) is repetition-free; w and s(w) have the same starting and ending letters; a letter occurs in s(w) only if it occurs in w; a two-letter word xy occurs as a factor in s(w) only if it occurs in w.
Proof. By induction. The base case is clear. For the inductive case, let $w \in \Sigma^*$ and let $a \in \Sigma$. First case, $a \notin s(w)$. Then $s(wa) = s(w)a$ is repetition-free, as $s(w)$ is by IH, and $wa$ and $s(wa)$ share the same ending letter $a$. Since $w$ and $s(w)$ share the same starting letter by IH, so do $wa$ and $s(wa) = s(w)a$. (It is $a$ if $w = \lambda$.) Let $b \in s(wa)$. If $b = a$, the claim clearly holds; if $b \neq a$ then $b \in s(w)$, so $b \in w$ by IH, and $b \in wa$. Let $xy$ occur as a two-letter factor in $s(wa)$. If $y \neq a$ then $xy$ actually occurs in $s(w)$, so by IH it occurs in $w$, and thus in $wa$; if $y = a$ then $s(w)$ ends with $x$, and by IH so does $w$, so $wa$ ends with $xy$.
Second case, $a \in s(w)$. Since $s(wa)$ is a prefix of $s(w)$, it is also repetition-free, and $s(wa)$ ends with $a$ by definition. As a prefix, $s(wa)$ starts with the same letter as $s(w)$, i.e. the same letter as $w$ by IH, i.e. the same as $wa$. If $b \in s(wa)$ then $b \in s(w)$, so $b \in w$ by IH, and $b \in wa$. If $xy$ occurs in $s(wa)$, it also occurs in its extension $s(w)$, hence in $w$ by IH, and finally in $wa$.
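The two cases of this proof suggest a direct implementation of $s$. The definition of $s$ is not restated in this appendix, so the sketch below reads it off the proof as an assumption: $s(\lambda) = \lambda$, $s(wa) = s(w)a$ if $a$ does not occur in $s(w)$, and otherwise $s(wa)$ is the prefix of $s(w)$ that ends with $a$. The helper names `factors` and `check_lemma8` are hypothetical.

```python
def s(word):
    """Erase loops from `word`, following the two cases of the proof."""
    out = []
    for a in word:
        if a in out:
            # Second case: a occurs in s(w), so keep the prefix ending with a.
            out = out[: out.index(a) + 1]
        else:
            # First case: a does not occur in s(w), so s(wa) = s(w)a.
            out.append(a)
    return "".join(out)


def factors(word):
    """Set of two-letter factors of `word`."""
    return {word[i : i + 2] for i in range(len(word) - 1)}


def check_lemma8(w):
    """Check the four properties stated in Lemma 8 for the word `w`."""
    r = s(w)
    return (
        len(set(r)) == len(r)  # s(w) is repetition-free
        and r[:1] == w[:1]  # same starting letter
        and r[-1:] == w[-1:]  # same ending letter
        and set(r) <= set(w)  # only letters of w occur
        and factors(r) <= factors(w)  # only two-letter factors of w occur
    )
```

For instance, `s("abcbd")` yields `"abd"`: the second `b` cuts the word back to `ab`, and `d` is then appended.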
Lemma 9. For all paths γ, s(γ) is a simple path starting (ending) with the same vertex as γ. It only visits vertices that are already visited in γ, and it only takes edges that are already taken in γ.
Proof. By Lemma 8 the word s(γ) is repetition-free, and it starts and ends like γ. Again by Lemma 8, if two vertices xy occur consecutively in s(γ), they also occur consecutively and in the same order in γ, i.e. (x, y) ∈ E. Therefore s(γ) is a path, and moreover it takes only edges that are already taken in γ.
We can now proceed with the proof of the Lemma.
Proof of Lemma 5. Since there is a path from $x$ to $T$ by assumption, the set $L$ of all the simple paths from $x$ to $T$ is non-empty. Let $\mathcal{T}$ be the prefix closure of $L$, and let $\mathcal{T}'$ be the set of the $\mu$-non-decreasing paths in $\mathcal{T}$, so that $\mathcal{T}' \subseteq \mathcal{T}$. The set $\mathcal{T}$ is a tree by construction as a prefix closure, and $\mathcal{T}'$ is a tree since the $\mu$-non-decreasing paths are closed under taking prefixes. Moreover, the elements of $\mathcal{T}$ (and thus of $\mathcal{T}'$) are simple paths, since the prefixes of simple paths are again simple paths. Let $L'$ be the set of prefix-wise maximal paths in $\mathcal{T}'$. The claimed lemma amounts to $L \cap L' \neq \emptyset$, equivalently $L \cap \mathcal{T}' \neq \emptyset$. Also note that a path in $\mathcal{T}$ is in $L$ iff it ends in $T$, so $L \cap \mathcal{T}' \neq \emptyset$ iff some path in $\mathcal{T}'$ ends in $T$.
Towards a contradiction, let us assume that $L \cap L' = \emptyset$. We can thus let $M$ consist of the one-vertex extensions in $\mathcal{T}$ of the paths in $L'$. More formally, a path $\gamma \in \mathcal{T}$ is in $M$ iff there exist a vertex $y$ and a path $\beta \in L'$ such that $\beta y = \gamma$. Clearly, $\mathcal{T}' \cap M = \emptyset$ by prefix-wise maximality of $L'$ within $\mathcal{T}'$. So the elements of $M$ are the prefix-wise minimal paths in $\mathcal{T}$ that are not $\mu$-non-decreasing. This implies that $\mathrm{Val}_\mu(z) < \mathrm{Val}_\mu(y)$ for all $\gamma y z \in M$ (where $\gamma$ is a path and $y, z$ are vertices), since $\gamma y \in L'$. Let $B'$ (resp. $B$) be the set of ending vertices of the paths in $L'$ (resp. $M$). Let us proceed with a few remarks.
Remark 2.
1. $x$ occurs as the first vertex of all paths in $\mathcal{T}$, and only as the first vertex, since otherwise the paths would not be simple. So $x \notin B$, since $x$ (as a path) is clearly in $\mathcal{T}'$ and since $\mathcal{T}' \cap M = \emptyset$.
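which also holds in the graph restricted by $\sigma_0$, so $P^{y} \ldots$ Therefore, $\mathrm{Val}_\mu(y) = \max_{\sigma} P^{y}_{A^{\sigma}_{\mu}}[\Diamond T] \leq \max_{b \in B} \mathrm{Val}_\mu(b)$. This contradicts the above claim that $\mathrm{Val}_\mu(z) < \mathrm{Val}_\mu(y)$ for all $z \in B$.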

G Proof of Proposition 1
The following observation regarding convex combinations will be useful.
Lemma 10. Consider a finite set of values $N \subseteq \mathbb{Q}$ and a probability distribution $\delta \in \mathcal{D}(N)$. Then
• $\exists m \in N : m \leq \sum_{n \in N} \delta(n) \cdot n$, and
• $\exists m \in N : \sum_{n \in N} \delta(n) \cdot n \leq m$.
Proof of Proposition 1. Let us start with item (i). From Lemma 1 we have that … for all families $\mu$ of full-support distributions and strategies $\sigma$. It follows from Lemma 10 that there is some $s \in S$ such that $P^{v_0} \ldots$ By definition, this means that $v_0 \trianglelefteq S$. Hence, by choice of $S$ and Remark 1, we have $v_0 \trianglelefteq W$.
For item (ii), let us assume, without loss of generality, that the vertices from $T \cup Z_T$ are all sinks. We thus have that there is a strategy $\sigma$ such that, in the graph restricted by $\sigma$, there is no path from $v_0$ to $Z_T$ without first visiting $T \cup S$. We will also assume that $\sigma$, after visiting any vertex from $S \cup T$, starts playing optimally in order to maximize the probability of visiting $T$. (This may require memory.) We have from Lemma 1 that … for all full-support distribution families $\mu$. Since all the vertices from $T$ are sinks, $P^{s}_{A^{\sigma}_{\mu}}[\Diamond Z_T] = 0$ for all $s \in T$. Therefore, we can rewrite the above as … From Lemma 10 we get that there is some $s \in S$ such that $P^{v_0} \ldots$ By choice of $\sigma$, this implies that $s \trianglelefteq \{v_0\}$ (recall that memoryless strategies suffice to maximize the probability of reaching a target set of states). Then, by choice of $S \ni s$, we have that $w \trianglelefteq \{v_0\}$.
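The convex-combination bound of Lemma 10, which both items of the proof rely on, can be illustrated numerically. The following is a minimal sketch, assuming the distribution is given as a dictionary from values to probabilities; the function names and the example distribution are hypothetical.

```python
from fractions import Fraction as F


def expectation(delta):
    """Convex combination: the sum of delta(n) * n over all values n."""
    assert sum(delta.values()) == 1, "delta must be a probability distribution"
    return sum((p * n for n, p in delta.items()), F(0))


def bounding_values(delta):
    """Return values (lo, hi) of the set with lo <= expectation(delta) <= hi."""
    values = list(delta)
    return min(values), max(values)


# Hypothetical distribution over the values 1/4, 1/2 and 1.
delta = {F(1, 4): F(1, 2), F(1, 2): F(1, 3), F(1): F(1, 6)}
avg = expectation(delta)
lo, hi = bounding_values(delta)
```

Here the weighted average is 11/24, which lies between the smallest value 1/4 and the largest value 1, so both witnesses required by the lemma exist.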

H Proof of Proposition 2
Proof. Item (i) follows from the definition of Val.
For item (ii), we suppose the assumptions hold. Hence, by definition of $\trianglelefteq$, we have that for all families $\mu$ of full-support distributions $\max_{w \in uE} \sum_{x \in wE} \ldots$ The result thus follows from Lemma 7.