Partial and Conditional Expectations in Markov Decision Processes with Integer Weights

The paper addresses two variants of the stochastic shortest path problem ('optimize the accumulated weight until reaching a goal state') in Markov decision processes (MDPs) with integer weights. The first variant optimizes partial expected accumulated weights, where paths not leading to a goal state are assigned weight 0, while the second variant considers conditional expected accumulated weights, where the probability mass is redistributed to paths reaching the goal. Both variants constitute useful approaches to the analysis of systems without guarantees on the occurrence of an event of interest (reaching a goal state), but have only been studied in structures with non-negative weights. Our main results are as follows. There are polynomial-time algorithms to check the finiteness of the supremum of the partial or conditional expectations in MDPs with arbitrary integer weights. If finite, then optimal weight-based deterministic schedulers exist. In contrast to the setting of non-negative weights, optimal schedulers can need infinite memory and their value can be irrational. However, the optimal value can be approximated up to an absolute error of $\epsilon$ in time exponential in the size of the MDP and polynomial in $\log(1/\epsilon)$.


Introduction
Stochastic shortest path (SSP) problems generalize the shortest path problem on graphs with weighted edges. The SSP problem is formalized using finite-state Markov decision processes (MDPs), which are a prominent model combining probabilistic and nondeterministic choices. In each state of an MDP, one is allowed to choose nondeterministically from a set of actions, each of which is equipped with a probability distribution over the successor states and a weight (cost or reward). The SSP problem asks for a policy to choose actions (here called a scheduler) maximizing or minimizing the expected accumulated weight until reaching a goal state. In the classical setting, one seeks an optimal proper scheduler, where proper means that a goal state is reached almost surely. Polynomial-time solutions exist, exploiting the fact that optimal memoryless deterministic schedulers exist (provided the optimal value is finite) and can be computed using linear programming techniques, possibly in combination with model transformations (see [5,10,1]). The restriction to proper schedulers, however, is often too restrictive. First, there are models that have no proper scheduler. Second, even if proper schedulers exist, the expectation of the accumulated weight of schedulers missing the goal with a positive probability should be taken into account as well. Important applications of this kind include the semantics of probabilistic programs (see e.g. [12,14,4,7,16]), where no guarantee for almost sure termination can be given and the analysis of program properties at termination time gives rise to stochastic shortest (longest) path problems in which the goal (halting configuration) is not reached almost surely. 
Other examples are the fault-tolerance analysis (e.g., expected costs of repair mechanisms) in selected error scenarios that can appear with some positive, but small probability or the trade-off analysis with conjunctions of utility and cost constraints that are achievable with positive probability, but not almost surely (see e.g. [2]).
This motivates the switch to variants of classical SSP problems where the restriction to proper schedulers is relaxed. One option (e.g., considered in [8]) is to seek a scheduler optimizing the expectation of the random variable that assigns weight 0 to all paths not reaching the goal and the accumulated weight of the shortest prefix reaching the goal to all other paths. We refer to this expectation as partial expectation. Second, we consider the conditional expectation of the accumulated weight until reaching the goal under the condition that the goal is reached. In general, partial expectations describe situations in which some reward (positive and negative) is accumulated but only retrieved if a certain goal is met. In particular, partial expectations can be an appropriate replacement for the classical expected weight before reaching the goal if we want to include schedulers which miss the goal with some (possibly very small) probability. In contrast to conditional expectations, the resulting scheduler still has an incentive to reach the goal with a high probability, while schedulers maximizing the conditional expectation might reach the goal with a very small positive probability. Previous work on partial or conditional expected accumulated weights was restricted to the case of non-negative weights. More precisely, partial expectations have been studied in the setting of stochastic multiplayer games with non-negative weights [8]. Conditional expectations in MDPs with non-negative weights have been addressed in [3]. In both cases, optimal values are achieved by weight-based deterministic schedulers that depend on the current state and the weight that has been accumulated so far, while memoryless schedulers are not sufficient. Both [8] and [3] prove the existence of a saturation point for the accumulated weight from which on optimal schedulers behave memoryless and maximize the probability to reach a goal state. 
This yields exponential-time algorithms for computing optimal schedulers using an iterative linear programming approach. Moreover, [3] proves that the threshold problem for conditional expectations ("does there exist a scheduler S such that the conditional expectation under S exceeds a given threshold?") is PSPACE-hard even for acyclic MDPs.
The purpose of the paper is to study partial and conditional expected accumulated weights for MDPs with integer weights. The switch from non-negative to integer weights indeed causes several additional difficulties. We start with the following observation. While optimal partial or conditional expectations in non-negative MDPs are rational, they can be irrational in the general setting. Consider the MDP M of Figure 1. In the initial state $s_{init}$, two actions are enabled. Action τ leads to goal with probability 1 and weight 0. Action σ leads to the states s and t with probability 1/2 each, from where we return to $s_{init}$ with weight −2 or +1, respectively. The scheduler choosing τ immediately leads to an expected weight of 0 and is optimal among schedulers reaching the goal almost surely. As long as we choose σ in $s_{init}$, the accumulated weight follows an asymmetric random walk increasing by 1 or decreasing by 2 with probability 1/2 before we return to $s_{init}$. It is well known that the probability to ever reach accumulated weight +1 in this random walk is $1/\Phi$ where $\Phi = \frac{1+\sqrt{5}}{2}$ is the golden ratio. Likewise, ever reaching accumulated weight n has probability $1/\Phi^n$ for all $n \in \mathbb{N}$. Consider the scheduler $S_k$ choosing τ as soon as the accumulated weight reaches k in $s_{init}$. Its partial expectation is $k/\Phi^k$ as the paths which never reach weight k are assigned weight 0. The maximum is reached at k = 2. In Section 4, we prove that there are optimal schedulers whose decisions only depend on the current state and the weight accumulated so far. With this result we can conclude that the maximal partial expectation is indeed $2/\Phi^2$, an irrational number.
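The numbers in this example can be checked numerically. The following is a minimal sanity check, not part of the paper's method. It assumes that the probability p of ever gaining one unit of weight is the least solution of p = 1/2 + p^3/2 (a step of +1 succeeds immediately, a step of −2 requires climbing three levels). The function names are illustrative.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def hit_plus_one_probability(iterations=200):
    """Least fixed point of p = 1/2 + p**3 / 2, approximated by iterating from 0."""
    p = 0.0
    for _ in range(iterations):
        p = 0.5 + 0.5 * p ** 3
    return p


def partial_expectation(k):
    """Partial expectation k / PHI**k of the scheduler S_k from the example."""
    return k / PHI ** k


# The fixed point should agree with 1/PHI, and k = 2 should maximize k / PHI**k.
p_hit = hit_plus_one_probability()
best_k = max(range(1, 50), key=partial_expectation)
```

Comparing `p_hit` with `1 / PHI` and inspecting `best_k` reproduces the two claims made in the text.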
The conditional expectation of $S_k$ in M is k, as $S_k$ reaches the goal with accumulated weight k whenever it reaches the goal. So, the conditional expectation is not bounded. If we add a new initial state making sure that the goal is reached with positive probability as in the MDP N, we can obtain an irrational maximal conditional expectation as well: The scheduler $T_k$ choosing τ in c as soon as the weight reaches k has conditional expectation $\frac{k/(2\Phi^k)}{1/2 + 1/(2\Phi^k)} = \frac{k}{\Phi^k + 1}$. The maximum is obtained for k = 3; the maximal conditional expectation is $\frac{3}{\Phi^3 + 1} = \frac{3}{3 + \sqrt{5}}$.

Moreover, while the proposed algorithms of [8,3] crucially rely on the monotonicity of the accumulated weights along the prefixes of paths, the accumulated weights of prefixes of paths can oscillate when there are positive and negative weights. As we will see later, this implies that the existence of saturation points is no longer ensured and optimal schedulers might require infinite memory (more precisely, a counter for the accumulated weight). These observations provide evidence why linear-programming techniques as used in the case of non-negative MDPs [8,3] cannot be expected to be applicable for the general setting.
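The conditional expectation of the schedulers $T_k$ can be evaluated in the same way. The sketch below assumes, as the stated formula suggests, that the new initial state of N moves to goal directly with probability 1/2 and weight 0, and otherwise enters a copy of the random walk. This reading of N and all names in the code are assumptions made for illustration.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def conditional_expectation(k):
    """Conditional expectation of T_k under the assumed structure of N."""
    reach_k = 1 / PHI ** k                  # probability that the walk ever reaches weight k
    partial = 0.5 * k * reach_k             # weight k is collected only via the walk branch
    goal_probability = 0.5 + 0.5 * reach_k  # direct branch plus successful walk
    return partial / goal_probability


best_k = max(range(1, 50), key=conditional_expectation)
best_value = conditional_expectation(best_k)
closed_form = 3 / (3 + math.sqrt(5))
```

Here `best_value` can be compared against `closed_form`, which uses the identity $\Phi^3 = 2 + \sqrt{5}$.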
Contributions. We study the problem of maximizing the partial and conditional expected accumulated weight in MDPs with integer weights. Our first result is that the finiteness of the supremum of partial and conditional expectations in MDPs with integer weights can be checked in polynomial time (Section 3). For both variants we show that there are optimal weight-based deterministic schedulers if the supremum is finite (Section 4). Although the suprema might be irrational and optimal schedulers might need infinite memory, the suprema can be $\epsilon$-approximated in time exponential in the size of the MDP and polynomial in $\log(1/\epsilon)$ (Section 5). By duality of maximal and minimal expectations, analogous results hold for the problem of minimizing the partial or conditional expected accumulated weight. (Note that we can multiply all weights by −1 and then apply the results for maximal partial resp. conditional expectations.)

Related work. Closest to our contribution is the above-mentioned work on partial expected accumulated weights in stochastic multiplayer games with non-negative weights in [8] and on computation schemes for maximal conditional expected accumulated weights in non-negative MDPs [3]. Conditional expected termination time in probabilistic push-down automata has been studied in [11], which can be seen as analogous considerations for a class of infinite-state Markov chains with non-negative weights. The recent work on notions of conditional value at risk in MDPs [15] also studies conditional expectations, but the considered random variables are limit averages and a notion of (non-accumulated) weight-bounded reachability.

Preliminaries
We give basic definitions and present our notation. More details can be found in textbooks, e.g. [17].
Notations for Markov decision processes. A Markov decision process (MDP) is a tuple $M = (S, Act, P, s_{init}, wgt)$ where $S$ is a finite set of states, $Act$ a finite set of actions, $s_{init} \in S$ the initial state, $P : S \times Act \times S \to [0,1] \cap \mathbb{Q}$ the transition probability function and $wgt : S \times Act \to \mathbb{Z}$ the weight function. We require that $\sum_{t \in S} P(s, \alpha, t) \in \{0, 1\}$ for all $(s, \alpha) \in S \times Act$. We write $Act(s)$ for the set of actions that are enabled in $s$, i.e., $\alpha \in Act(s)$ iff $\sum_{t \in S} P(s, \alpha, t) = 1$. We assume that $Act(s)$ is non-empty for all $s$ and that all states are reachable from $s_{init}$. We call a state absorbing if the only enabled action leads to the state itself with probability 1 and weight 0. The paths of $M$ are finite or infinite sequences $s_0\,\alpha_0\,s_1\,\alpha_1\,s_2\,\alpha_2 \ldots$ where states and actions alternate such that $P(s_i, \alpha_i, s_{i+1}) > 0$ for all $i \geq 0$. If $\pi = s_0\,\alpha_0\,s_1\,\alpha_1 \ldots \alpha_{k-1}\,s_k$ is finite, then $wgt(\pi) = wgt(s_0, \alpha_0) + \ldots + wgt(s_{k-1}, \alpha_{k-1})$ denotes the accumulated weight of $\pi$, $P(\pi) = P(s_0, \alpha_0, s_1) \cdot \ldots \cdot P(s_{k-1}, \alpha_{k-1}, s_k)$ its probability, and $last(\pi) = s_k$ its last state. The size of $M$, denoted $size(M)$, is the sum of the number of states plus the total sum of the logarithmic lengths of the non-zero probability values $P(s, \alpha, s')$ as fractions of co-prime integers and the weight values $wgt(s, \alpha)$.
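As an illustration of these definitions, the following sketch encodes the MDP from the introductory example and evaluates the accumulated weight and probability of a finite path. The dictionary-based encoding and the action names `back` and `loop` are hypothetical choices made here, not notation from the paper.

```python
from fractions import Fraction

# Hypothetical encoding: P[(s, a)] maps successors to probabilities,
# wgt[(s, a)] is the integer weight of action a in state s.
P = {
    ("s_init", "tau"): {"goal": Fraction(1)},
    ("s_init", "sigma"): {"s": Fraction(1, 2), "t": Fraction(1, 2)},
    ("s", "back"): {"s_init": Fraction(1)},
    ("t", "back"): {"s_init": Fraction(1)},
    ("goal", "loop"): {"goal": Fraction(1)},
}
wgt = {
    ("s_init", "tau"): 0,
    ("s_init", "sigma"): 0,
    ("s", "back"): -2,
    ("t", "back"): 1,
    ("goal", "loop"): 0,
}


def enabled(state):
    """Act(s): the actions with a probability distribution attached in `state`."""
    return sorted(a for (s, a) in P if s == state)


def path_weight(path):
    """wgt(pi) for a finite path given as [s0, a0, s1, a1, ..., sk]."""
    return sum(wgt[(path[i], path[i + 1])] for i in range(0, len(path) - 1, 2))


def path_probability(path):
    """P(pi): product of the transition probabilities along the path."""
    prob = Fraction(1)
    for i in range(0, len(path) - 1, 2):
        prob *= P[(path[i], path[i + 1])].get(path[i + 2], Fraction(0))
    return prob


example_path = ["s_init", "sigma", "t", "back", "s_init", "tau", "goal"]
```

For `example_path`, the accumulated weight is 0 + 1 + 0 and the probability is the single factor 1/2 contributed by σ.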

Scheduler.
A (history-dependent, randomized) scheduler for $M$ is a function $S$ that assigns to each finite path $\pi$ a probability distribution over $Act(last(\pi))$. $S$ is called memoryless if $S(\pi) = S(\pi')$ for all finite paths $\pi, \pi'$ with $last(\pi) = last(\pi')$, in which case $S$ can be viewed as a function that assigns to each state $s$ a distribution over $Act(s)$. $S$ is called deterministic if $S(\pi)$ is a Dirac distribution for each path $\pi$, in which case $S$ can be viewed as a function that assigns an action to each finite path $\pi$. Scheduler $S$ is said to be weight-based if $S(\pi) = S(\pi')$ for all finite paths $\pi, \pi'$ with $wgt(\pi) = wgt(\pi')$ and $last(\pi) = last(\pi')$. Thus, deterministic weight-based schedulers can be viewed as functions that assign actions to state-weight pairs. By $HR^M$ we denote the class of all schedulers, by $WR^M$ the class of weight-based schedulers, by $WD^M$ the class of weight-based, deterministic schedulers, and by $MD^M$ the class of memoryless deterministic schedulers. Given a scheduler $S$, $\varsigma = s_0\,\alpha_0\,s_1\,\alpha_1 \ldots$ is an $S$-path iff $\varsigma$ is a path and $S(s_0\,\alpha_0\,s_1\,\alpha_1 \ldots \alpha_{k-1}\,s_k)(\alpha_k) > 0$ for all $k \geq 0$.
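To make the notion of a weight-based deterministic scheduler concrete, the sketch below writes the scheduler $S_k$ from the introduction as a function on state-weight pairs and lifts it to finite paths. The weight table, the action name `back`, and the function names are hypothetical.

```python
# Weights of the introductory example; the action name "back" is a placeholder.
wgt = {
    ("s_init", "tau"): 0,
    ("s_init", "sigma"): 0,
    ("s", "back"): -2,
    ("t", "back"): 1,
}


def threshold_scheduler(k):
    """S_k as a map from (state, accumulated weight) to an action."""
    def choose(state, weight):
        if state == "s_init":
            return "tau" if weight >= k else "sigma"
        return "back"
    return choose


def as_history_scheduler(choose):
    """View a weight-based scheduler as a function on finite paths [s0, a0, ..., sk]."""
    def on_path(path):
        weight = sum(wgt[(path[i], path[i + 1])] for i in range(0, len(path) - 1, 2))
        return choose(path[-1], weight)
    return on_path


s2 = as_history_scheduler(threshold_scheduler(2))
short_path = ["s_init"]
long_path = ["s_init", "sigma", "t", "back", "s_init", "sigma", "t", "back",
             "s_init", "sigma", "s", "back", "s_init"]  # weight 1 + 1 - 2 = 0
high_path = ["s_init", "sigma", "t", "back", "s_init", "sigma", "t", "back", "s_init"]
```

The two paths `short_path` and `long_path` end in the same state with the same accumulated weight, so the lifted scheduler must treat them identically, while `high_path` has reached weight 2.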
Probability measure. We write $\Pr^S_{M,s}$ or briefly $\Pr^S_s$ to denote the probability measure induced by $S$ and $s$. For details, see [17]. We will use LTL-like formulas to denote measurable sets of paths and also write $\Diamond(wgt \bowtie x)$ to describe the set of infinite paths having a prefix $\pi$ with $wgt(\pi) \bowtie x$ for $x \in \mathbb{Z}$ and $\bowtie \in \{<, \leq, =, \geq, >\}$. Given a measurable set $\psi$ of infinite paths, we define $\Pr^{\min}_{M,s}(\psi) = \inf_S \Pr^S_{M,s}(\psi)$ and $\Pr^{\max}_{M,s}(\psi) = \sup_S \Pr^S_{M,s}(\psi)$ where $S$ ranges over all schedulers for $M$. Throughout the paper, we suppose that the given MDP has a designated state goal. Then, $p^{\max}_s$ and $p^{\min}_s$ denote the maximal resp. minimal probability of reaching goal from $s$. That is, $p^{\max}_s = \sup_S \Pr^S_s(\Diamond goal)$ and $p^{\min}_s = \inf_S \Pr^S_s(\Diamond goal)$. Let $Act^{\max}(s) = \{\alpha \in Act(s) \mid \sum_{t \in S} P(s, \alpha, t) \cdot p^{\max}_t = p^{\max}_s\}$ and $Act^{\min}(s) = \{\alpha \in Act(s) \mid \sum_{t \in S} P(s, \alpha, t) \cdot p^{\min}_t = p^{\min}_s\}$.

Mean payoff. A well-known measure for the long-run behavior of a scheduler $S$ in an MDP $M$ is the mean payoff. Intuitively, the mean payoff is the amount of weight accumulated per step on average in the long run. Formally, the mean payoff is a random variable $MP$ on infinite paths $\zeta = s_0\,\alpha_0\,s_1\,\alpha_1 \ldots$. The mean payoff of the scheduler $S$ starting in $s_{init}$ is then defined as the expected value $E^S_{s_{init}}(MP)$. The maximal mean payoff is the supremum over all schedulers, which is equal to the maximum over all MD-schedulers: $E^{\max}_{s_{init}}(MP) = \max_{S \in MD} E^S_{s_{init}}(MP)$. In strongly connected MDPs, the maximal mean payoff does not depend on the initial state.
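For orientation, the mean payoff of an infinite path is commonly formalized as a long-run average of the weights. The formula below is the standard textbook form and is stated here as an assumption; in particular, the use of $\liminf$ rather than $\limsup$ is assumed.

```latex
\[
  MP(\zeta) \;=\; \liminf_{k \to \infty} \frac{1}{k+1} \sum_{i=0}^{k} wgt(s_i, \alpha_i),
  \qquad \zeta = s_0\,\alpha_0\,s_1\,\alpha_1 \ldots
\]
```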
End components, MEC-quotient. An end component of $M$ is a strongly connected sub-MDP. End components can be formalized as pairs $\mathcal{E} = (E, A)$ where $E$ is a nonempty subset of $S$ and $A$ a function that assigns to each state $s \in E$ a nonempty subset of $Act(s)$ such that the graph induced by $\mathcal{E}$ is strongly connected. The MEC-quotient of an MDP $M$ is the MDP $MEC(M)$ arising from $M$ by collapsing all states that belong to the same maximal end component $\mathcal{E}$ to a state $s_{\mathcal{E}}$. All actions enabled in some state in $E$ that do not belong to $\mathcal{E}$ are enabled in $s_{\mathcal{E}}$. Details and the formal construction can be found in [9]. We call an end component $\mathcal{E}$ positively weight-divergent if there is a scheduler $S$ for $\mathcal{E}$ such that $\Pr^S_{\mathcal{E},s}(\Diamond(wgt \geq n)) = 1$ for all $s \in E$ and $n \in \mathbb{N}$. In [1], it is shown that the existence of positively weight-divergent end components can be decided in polynomial time.

Partial and Conditional Expectations in MDPs
We define partial and conditional expectations in MDPs. We extend the definition of [8] by introducing partial expectations with bias which are closely related to conditional expectations. Afterwards, we sketch the computation of maximal partial expectations in MDPs with non-negative weights and in Markov chains.
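Partial and conditional expectation. In the sequel, let $M$ be an MDP with a designated absorbing goal state goal. Furthermore, we collapse all states from which goal is not reachable to one absorbing state fail. Let $b \in \mathbb{R}$. We define the random variable $\oplus^b goal$ on infinite paths $\zeta$ as follows: if $\zeta$ reaches goal and $\pi$ is the shortest prefix of $\zeta$ ending in goal, then $\oplus^b goal(\zeta) = wgt(\pi) + b$; if $\zeta$ does not reach goal, then $\oplus^b goal(\zeta) = 0$. For a scheduler $S$, we write $PE^S_{M,s_{init}}[b]$ for the expectation of $\oplus^b goal$ under $S$, the partial expectation with bias $b$, and $PE^{sup}_{M,s_{init}}[b]$ for its supremum over all schedulers; for $b = 0$, we omit the bias and speak of the partial expectation. If $M$ is clear from the context, we drop the subscript. In order to maximize the partial expectation, intuitively one has to find the right balance between reaching goal with high probability and accumulating a high positive amount of weight before reaching goal. The bias can be used to shift this balance by additionally rewarding or penalizing a high probability to reach goal.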
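The effect of the bias can be made explicit by a short calculation. The sketch below assumes that $\oplus^b goal$ assigns the accumulated weight plus $b$ to paths reaching goal and 0 to all other paths, and that the partial expectation of $S$ is finite. The indicator $\mathbf{1}_{\Diamond goal}$ is a symbol introduced here only for illustration.

```latex
\begin{align*}
PE^{S}_{s_{init}}[b]
  &= E^{S}_{s_{init}}\bigl(\oplus^{b} goal\bigr)
   = E^{S}_{s_{init}}\bigl(\oplus^{0} goal + b \cdot \mathbf{1}_{\Diamond goal}\bigr) \\
  &= PE^{S}_{s_{init}} + b \cdot \Pr\nolimits^{S}_{s_{init}}(\Diamond goal).
\end{align*}
```

Under these assumptions, the partial expectation with bias of a fixed scheduler is affine in $b$, with slope equal to the probability of reaching goal.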
The conditional expectation of $S$ is defined as the expectation of $\oplus^0 goal$ under the condition that goal is reached. It is defined if $\Pr^S_{M,s_{init}}(\Diamond goal) > 0$. We write $CE^S_{M,s_{init}} := E^S_{M,s_{init}}(\oplus^0 goal \mid \Diamond goal)$ and $CE^{sup}_{M,s_{init}} = \sup_S CE^S_{M,s_{init}}$ where the supremum is taken over all schedulers $S$ with $\Pr^S_{M,s_{init}}(\Diamond goal) > 0$. We can express the conditional expectation as $CE^S_{M,s_{init}} = PE^S_{M,s_{init}} / \Pr^S_{M,s_{init}}(\Diamond goal)$. The following proposition establishes a close connection between conditional expectations and partial expectations with bias.

In [3], it is shown that deciding whether $CE^{sup}_{s_{init}} \bowtie \theta$ for $\bowtie \in \{<, \leq, \geq, >\}$ and $\theta \in \mathbb{Q}$ is PSPACE-hard even for acyclic MDPs. We conclude:

Corollary 3. Given an MDP $M$, $\bowtie \in \{<, \leq, \geq, >\}$, and $\theta \in \mathbb{Q}$, deciding whether $PE^{sup}_{M,s_{init}} \bowtie \theta$ is PSPACE-hard.

Finiteness. We present criteria for the finiteness of $PE^{sup}_{s_{init}}[b]$ and $CE^{sup}_{s_{init}}$. Detailed proofs can be found in Appendix A.1. By slightly modifying the construction from [1] which removes end components only containing 0-weight cycles, we obtain the following result.

To obtain an analogous result for conditional expectations, we observe that the finiteness of the maximal partial expectation is necessary for the finiteness of the maximal conditional expectation. However, this is not sufficient. In [3], a critical scheduler is defined as a scheduler $S$ for which there is a path containing a positive cycle and for which $\Pr^S_{s_{init}}(\Diamond goal) = 0$. Given a critical scheduler, it is easy to construct a sequence of schedulers with unbounded conditional expectation (see Appendix A.1 and [3]). On the other hand, if $\Pr^{\min}_{M,s_{init}}(\Diamond goal) > 0$, then $CE^{sup}_{s_{init}}$ is finite if and only if $PE^{sup}_{s_{init}}$ is finite. We will show how we can restrict ourselves to this case if there are no critical schedulers: So, let $M$ be an MDP with $\Pr^{\min}_{M,s_{init}}(\Diamond goal) = 0$ and suppose there are no critical schedulers for $M$. 
Let $S_0$ be the set of all states reachable from $s_{init}$ while only choosing actions in $Act^{\min}$. As there are no critical schedulers, $(S_0, Act^{\min})$ does not contain positive cycles. So, there is a finite maximal weight $w_s$ among paths leading from $s_{init}$ to $s$ in $S_0$. Consider the following MDP $N$: It contains the MDP $M$ and a new initial state $t_{init}$. For each $s \in S_0$ and each $\alpha \in Act(s) \setminus Act^{\min}(s)$, $N$ also contains a new state $t_{s,\alpha}$ which is reachable from $t_{init}$ via an action $\beta_{s,\alpha}$ with weight $w_s$ and probability 1. In $t_{s,\alpha}$, only action $\alpha$ with the same probability distribution over successors and the same weight as in $s$ is enabled. So in $N$, one has to decide immediately in which state to leave $S_0$ and one accumulates the maximal weight which can be accumulated in $M$ to reach this state in $S_0$. In this way, we ensure that $\Pr^{\min}_{N,t_{init}}(\Diamond goal) > 0$.
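The maximal weights $w_s$ used in this construction can be obtained with a longest-path variant of the Bellman-Ford algorithm, which is well defined because the relevant graph contains no positive cycles. The sketch below is one possible way to do this, not the paper's procedure. The edge-list encoding and the function name are assumptions.

```python
def max_path_weights(states, edges, source):
    """Maximal accumulated weight of paths from `source` to each state.

    `edges` is a list of (u, v, weight) triples. The result is only meaningful
    if no positive-weight cycle is reachable from `source`; such a cycle is
    reported with a ValueError.
    """
    neg_inf = float("-inf")
    best = {s: neg_inf for s in states}
    best[source] = 0
    for _ in range(len(states) - 1):
        changed = False
        for u, v, w in edges:
            if best[u] != neg_inf and best[u] + w > best[v]:
                best[v] = best[u] + w
                changed = True
        if not changed:
            break
    # One extra pass: any further improvement reveals a positive cycle.
    for u, v, w in edges:
        if best[u] != neg_inf and best[u] + w > best[v]:
            raise ValueError("positive cycle reachable from source")
    return best


# Small example graph; the cycle a -> b -> c -> a has weight 2 - 1 - 3 = -2.
example_edges = [("a", "b", 2), ("b", "c", -1), ("a", "c", 0), ("c", "a", -3)]
example_weights = max_path_weights(["a", "b", "c"], example_edges, "a")
```

In the example, the best way to reach `c` is via `b`, which yields weight 1 rather than 0.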
We can rely on this reduction to an MDP in which goal is reached with positive probability for $\epsilon$-approximations and the exact computation of the optimal conditional expectation. In particular, the values $w_s$ for $s \in S_0$ are easy to compute by classical shortest path algorithms on weighted graphs. Furthermore, we can now decide the finiteness of the maximal conditional expectation.

Proof. Let $\alpha$ be the only action available in the Markov chain $C$. Assume that all states from which goal is not reachable have been collapsed to an absorbing state fail. Then $PE_{C,s_{init}}$ is the value of $x_{s_{init}}$ in the unique solution to the following system of linear equations with one variable $x_s$ for each state $s$: $x_{goal} = x_{fail} = 0$ and $x_s = wgt(s, \alpha) \cdot p_s + \sum_{t \in S} P(s, \alpha, t) \cdot x_t$ for $s \in S \setminus \{goal, fail\}$, where $p_s$ denotes the probability to reach goal from $s$. The existence of a unique solution follows from the fact that $\{goal\}$ and $\{fail\}$ are the only end components (see [17]). It is straightforward to check that $(PE_{C,s})_{s \in S}$ is this unique solution. The conditional expectation is obtained from the partial expectation by dividing by the probability $p_{s_{init}}$ to reach the goal.
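The computation for Markov chains can be sketched in a few lines of exact arithmetic. The code below assumes the equations $x_s = wgt(s) \cdot p_s + \sum_t P(s, t) \cdot x_t$ with $x_{goal} = x_{fail} = 0$, where $p_s$ is the probability to reach goal from $s$. The dictionary encoding, the example chain, and the function names are hypothetical.

```python
from fractions import Fraction


def solve_linear(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square and invertible)."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        inv = 1 / M[col][col]
        M[col] = [v * inv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def partial_expectations(P, wgt, goal="goal", fail="fail"):
    """Return (reach, pe): goal-reachability probabilities and partial expectations.

    P[s] maps successors of s to Fractions, wgt[s] is the weight of the single
    action in s. Every non-absorbing state must reach goal or fail almost surely.
    """
    inner = [s for s in P if s not in (goal, fail)]
    A = [[Fraction(int(s == t)) - P[s].get(t, Fraction(0)) for t in inner] for s in inner]
    reach = solve_linear(A, [P[s].get(goal, Fraction(0)) for s in inner])
    pe = solve_linear(A, [wgt[s] * p for s, p in zip(inner, reach)])
    return dict(zip(inner, reach)), dict(zip(inner, pe))


chain = {
    "s0": {"goal": Fraction(1, 2), "s1": Fraction(1, 2)},
    "s1": {"s0": Fraction(1, 3), "fail": Fraction(2, 3)},
}
weights = {"s0": 2, "s1": -1}
reach, pe = partial_expectations(chain, weights)
conditional = pe["s0"] / reach["s0"]
```

Dividing the partial expectation by the reachability probability, as in the last line, gives the conditional expectation from `s0`.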
This result can be seen as a special case of the following result. Restricting ourselves to schedulers which reach the goal with maximal or minimal probability in an MDP without positively weight-divergent end components, linear programming allows us to compute the following two memoryless deterministic schedulers (see [8,3]). These schedulers will play a crucial role for the approximation of the maximal partial expectation and the exact computation of maximal partial expectations in MDPs with non-negative weights.
Partial expectations in MDPs with non-negative weights. In [8], the computation of maximal partial expectations in stochastic multiplayer games with non-negative weights is presented. We adapt this approach to MDPs with non-negative weights. A key result is the existence of a saturation point, a bound on the accumulated weight above which optimal schedulers do not need memory.
In the sequel, let $R \in \mathbb{Q}$ be arbitrary, let $M$ be an MDP with non-negative weights, $PE^{sup}_{s_{init}} < \infty$, and assume that end components have negative maximal mean payoff (see Proposition 4). A saturation point for bias $R$ is a natural number $p$ such that there is an optimal scheduler $S$ for the partial expectation with bias $R$ which is memoryless and deterministic as soon as the accumulated weight reaches $p$, i.e., for any two paths $\pi$ and $\pi'$ with $last(\pi) = last(\pi')$ and $wgt(\pi), wgt(\pi') > p$, we have $S(\pi) = S(\pi')$.
Transferring the idea behind the saturation point for conditional expectations given in [3], we provide the following saturation point which can be considerably smaller than the saturation point given in [8] in stochastic multiplayer games. Detailed proofs to this section are given in Appendix A.2.
The saturation point $p_R$ is chosen such that, as soon as the accumulated weight exceeds $p_R$, the scheduler Max is better than any scheduler deviating from Max for only one step. So, the proposition states that Max is then also better than any other scheduler.
As all values involved in the computation can be determined by linear programming, the saturation point $p_R$ is computable in polynomial time. This also means that the logarithmic length of $p_R$ is polynomial in the size of $M$ and hence $p_R$ itself is at most exponential in the size of $M$.
Proposition 11. Let $R \in \mathbb{Q}$, let $B_R$ be the least integer greater than or equal to $p_R + \max_{s \in S, \alpha \in Act(s)} wgt(s, \alpha)$, and let $S' := S \setminus \{goal, fail\}$. The values $(PE^{sup}_{s}[r+R])_{s \in S', 0 \leq r \leq B_R}$ form the unique solution to the following linear program in the variables $(x_{s,r})_{s \in S', 0 \leq r \leq B_R}$ ($r$ ranges over integers): Minimize $\sum_{s \in S', 0 \leq r \leq B_R} x_{s,r}$ under the following constraints:

From a solution $x$ to the linear program, we can easily extract an optimal weight-based deterministic scheduler. This scheduler only needs finite memory because the accumulated weight increases monotonically along paths and, as soon as the saturation point is reached, Max provides the optimal decisions. As $B_R$ is exponential in the size of $M$, the computation of the optimal partial expectation via this linear program runs in time exponential in the size of $M$.

Existence of Optimal Schedulers
We prove that there are optimal weight-based deterministic schedulers for partial and conditional expectations. After showing that, if finite, $PE^{sup}_{s_{init}}$ is equal to $\sup_{S \in WD^M} PE^S_{s_{init}}$, we take an analytic approach to show that there is indeed a weight-based deterministic scheduler maximizing the partial expectation. We define a metric on $WD^M$ turning it into a compact space. Then, we prove that the function assigning the partial expectation to schedulers is upper semi-continuous. We conclude that there is a weight-based deterministic scheduler obtaining the maximum. Proofs for this section can be found in Appendix B.
Proof sketch. We can assume that all end components have negative maximal expected mean payoff (see Proposition 4). Given a scheduler $S \in HR^M$, we take the expected number of times $\theta_{s,w}$ that $s$ is visited with accumulated weight $w$ under $S$ for each state-weight pair $(s, w)$, and the expected number of times $\theta_{s,w,\alpha}$ that $S$ then chooses $\alpha$. These values are finite due to the negative maximal mean payoff in end components. We define the scheduler $T \in WR^M$ choosing $\alpha$ in $s$ with probability $\theta_{s,w,\alpha} / \theta_{s,w}$ when weight $w$ has been accumulated. Then, we show by standard arguments that we can replace all probability distributions that $T$ chooses by Dirac distributions to obtain a scheduler in $WD^M$.

It remains to show that the supremum is obtained by a weight-based deterministic scheduler. Given an MDP $M$ with arbitrary integer weights, we define the following metric $d_M$ on the set of weight-based deterministic schedulers, i.e. on the set of functions from $S \times \mathbb{Z}$ to $Act$: For two such schedulers $S$ and

Having defined this compact space of schedulers, we can rely on the analytic notion of upper semi-continuity.

The function assigning $PE^S_{s_{init}}$ to a weight-based deterministic scheduler $S$ is upper semi-continuous.
The technical proof of this lemma can be found in Appendix B. We arrive at the main result of this section.

In MDPs with non-negative weights, the optimal decision in a state $s$ only depends on $s$ as soon as the accumulated weight exceeds a saturation point. In MDPs with arbitrary integer weights, it is possible that the optimal choice of action does not become stable for increasing values of accumulated weight, as we see in the following example.

Figure 2. The MDP N and the MDP M. All non-trivial transition probabilities are 1/2. In the MDP M, the optimal choice to maximize the partial expectation in t depends on the parity of the accumulated weight.
Example 17. Let us first consider the MDP N depicted in Figure 2. Let π be a path reaching t for the first time with accumulated weight r. Consider a scheduler which chooses β for the first k times and then α. In this situation, the partial expectation from this point on is:

For r ≥ 2, this partial expectation has its unique maximum for the choice k = r−2. This already shows that an optimal scheduler needs infinite memory. No matter how much weight r has been accumulated when reaching t, the optimal scheduler has to count the r−2 times it chooses β. Furthermore, we can transfer the optimal scheduler for the MDP N to the MDP M. In state t, we have to make a nondeterministic choice between two actions leading to the states $q_0$ and $q_1$, respectively. In both of these states, action β is enabled, which behaves like the same action in the MDP N except that it moves between the two states if goal is not reached. So, the action α is only enabled every other step. As in N, we want to choose α after choosing β r−2 times if we arrived in t with accumulated weight r ≥ 2. So, the choice in t depends on the parity of r: For r = 1 or r even, we choose δ. For odd r ≥ 3, we choose γ. This shows that the optimal scheduler in the MDP M needs specific information about the accumulated weight, in this case the parity, no matter how much weight has been accumulated.
In the example, the optimal scheduler has a periodic behavior when fixing a state and looking at optimal decisions for increasing values of accumulated weight. The question whether an optimal scheduler always has such a periodic behavior remains open.

Approximation
As the optimal values for partial and conditional expectation can be irrational, there is no hope to compute these values by linear programming as in the case of non-negative weights. In this section, we show how we can nevertheless approximate the values. The main result is the following.

We first prove that upper bounds for $PE^{sup}_{M,s_{init}}$ and $CE^{sup}_{M,s_{init}}$ can be computed in polynomial time. Then, we show that there are $\epsilon$-optimal schedulers for the partial expectation which become memoryless as soon as the accumulated weight leaves a sufficiently large weight window around 0. We compute the optimal partial expectation of such a scheduler by linear programming. The result can then be extended to conditional expectations.
Upper Bounds. Let $M$ be an MDP in which all end components have negative maximal mean payoff. Let $\delta$ be the minimal non-zero transition probability in $M$ and $W := \max_{s \in S, \alpha \in Act(s)} |wgt(s, \alpha)|$. Moving through the MEC-quotient, the probability to reach an accumulated weight of $|S| \cdot W$ is bounded by $1 - \delta^{|S|}$, as goal or fail is reached within $|S|$ steps with probability at least $\delta^{|S|}$. It remains to show similar bounds inside an end component.
We will use the characterization of the maximal mean payoff in terms of super-harmonic vectors due to Hordijk and Kallenberg [13] to define a supermartingale controlling the growth of the accumulated weight in an end component under any scheduler. As the value vector for the maximal mean payoff in an end component is constant and negative in our case, the results of [13] yield: Proposition 19 (Hordijk, Kallenberg). Let $\mathcal{E} = (S, Act)$ be an end component with maximal mean payoff $-t$ for some $t > 0$. Then there is a vector $(u_s)_{s \in S}$ such that $-t + u_s \geq wgt(s, \alpha) + \sum_{s' \in S} P(s, \alpha, s') \cdot u_{s'}$ for all $s \in S$ and $\alpha \in Act(s)$.
Furthermore, let $v$ be the vector $(-t, \ldots, -t)$ in $\mathbb{R}^S$. Then, $(v, u)$ is the solution to a linear program with $2|S|$ variables, $2|S||Act|$ inequalities, and coefficients formed from the transition probabilities and weights in $\mathcal{E}$.
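The inequality in Proposition 19 is easy to check for a candidate vector. The sketch below verifies it on a two-state end component whose single cycle has weight 1 − 3 = −2 over two steps, so its mean payoff is −1. The encoding, the example, and the function name are assumptions made for illustration.

```python
from fractions import Fraction


def is_super_potential(P, wgt, u, t):
    """Check -t + u[s] >= wgt[(s, a)] + sum_{s'} P[(s, a)][s'] * u[s'] for all (s, a)."""
    return all(
        -t + u[s] >= wgt[(s, a)] + sum(p * u[succ] for succ, p in dist.items())
        for (s, a), dist in P.items()
    )


# Hypothetical end component: a --(+1)--> b --(-3)--> a, both with probability 1.
P = {("a", "go"): {"b": Fraction(1)}, ("b", "go"): {"a": Fraction(1)}}
wgt = {("a", "go"): 1, ("b", "go"): -3}
u = {"a": 2, "b": 0}

holds_for_true_mean_payoff = is_super_potential(P, wgt, u, 1)
holds_for_too_small_value = is_super_potential(P, wgt, u, 2)
```

With t = 1 both inequalities hold with equality, while t = 2 would claim a mean payoff of −2 and is rejected for this vector.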
We will call the vector $u$ a super-potential because the expected accumulated weight after $i$ steps is at most $u_s - \min_{s' \in S} u_{s'} - i \cdot t$ when starting in state $s$. Let $S$ be a scheduler for $\mathcal{E}$ starting in some state $s$. We define the following random variables on $S$-runs in $\mathcal{E}$: let $s(i) \in S$ be the state after $i$ steps, let $\alpha(i)$ be the action chosen after $i$ steps, let $w(i)$ be the accumulated weight after $i$ steps, and let $\pi(i)$ be the history, i.e. the finite path after $i$ steps.
We are going to apply the following theorem by Blackwell [6].
Corollary 22. For any scheduler $S$ and any starting state $s$ in $\mathcal{E}$, we have

Let $MEC$ be the set of maximal end components in $M$. For each $\mathcal{E} \in MEC$, let $\lambda_{\mathcal{E}}$ and $c_{\mathcal{E}}$ be as in Corollary 22. Define $\lambda_M := 1 - (\delta^{|S|} \cdot \prod_{\mathcal{E} \in MEC} (1 - \lambda_{\mathcal{E}}))$ and $c_M := |S| \cdot W + \sum_{\mathcal{E} \in MEC} c_{\mathcal{E}}$. Then an accumulated weight of $c_M$ cannot be reached with a probability greater than $\lambda_M$, because reaching accumulated weight $c_M$ would require reaching weight $c_{\mathcal{E}}$ in some end component $\mathcal{E}$ or reaching weight $|S| \cdot W$ in the MEC-quotient, and $1 - \lambda_M$ is a lower bound on the probability that none of this happens (under any scheduler).
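The combination of the per-component bounds into $\lambda_M$ and $c_M$ is plain arithmetic. The sketch below assumes that the pairs $(\lambda_{\mathcal{E}}, c_{\mathcal{E}})$ are already given; the input values and the function name are hypothetical.

```python
from fractions import Fraction
from math import prod


def global_bounds(num_states, delta, max_abs_weight, mec_bounds):
    """Combine per-component bounds into (lambda_M, c_M).

    `mec_bounds` is a list of (lambda_E, c_E) pairs, one per maximal end component.
    """
    # Lower bound on the probability that no component reaches its weight c_E
    # and the MEC-quotient is left towards goal or fail within |S| steps.
    escape = delta ** num_states * prod(
        (1 - lam for lam, _ in mec_bounds), start=Fraction(1)
    )
    lambda_m = 1 - escape
    c_m = num_states * max_abs_weight + sum(c for _, c in mec_bounds)
    return lambda_m, c_m


# Hypothetical inputs: |S| = 3, delta = 1/2, W = 2, two end components.
lambda_m, c_m = global_bounds(
    3, Fraction(1, 2), 2, [(Fraction(1, 2), 5), (Fraction(3, 4), 7)]
)
```

Exact fractions are used so that the result can be compared with the closed-form expression without rounding.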
Proposition 23. Let M be an MDP with PE sup sinit < ∞. There is an upper bound PE ub for the partial expectation in M computable in polynomial time.
Proof. In any end component $\mathcal{E}$, the maximal mean payoff $-t$ and the super-potential $u$ are computable in polynomial time. Hence, $c_{\mathcal{E}}$ and $\lambda_{\mathcal{E}}$, and in turn $c_M$ and $\lambda_M$, are also computable in polynomial time. When we reach accumulated weight $c_M$ for the first time, the actual accumulated weight is at most $c_M + W$. So, we conclude that

The partial expectation can now be bounded by

Corollary 24. Let $M$ be an MDP with $CE^{sup}_{M,s_{init}} < \infty$. There is an upper bound $CE^{ub}$ for the conditional expectation in $M$ computable in polynomial time.
Proof. By Proposition 6, we can construct in polynomial time an MDP $N$ in which goal is reached with probability $q > 0$ and with $CE^{sup}_{M,s_{init}} = CE^{sup}_{N,s_{init}}$. Now, $CE^{ub} := PE^{ub} / q$ is an upper bound for the conditional expectation in $M$.
Approximating optimal partial expectations. The idea for the approximation is to assume that the partial expectation is $PE^{Max}_{s} + w \cdot p^{\max}_s$ if a high weight $w$ has been accumulated in state $s$. Similarly, for small weights $w'$, we use the value $PE^{Min}_{s} + w' \cdot p^{\min}_s$. We will first provide a lower "saturation point" making sure that only actions minimizing the probability to reach the goal are used by an optimal scheduler as soon as the accumulated weight drops below this saturation point. Proofs for this section can be found in Appendix C.1.

Theorem 26. There is a weight-based deterministic scheduler $S$ such that the scheduler $T$ defined by

This result now allows us to compute an $\epsilon$-approximation and an $\epsilon$-optimal scheduler with finite memory by linear programming, similar to the case of non-negative weights, in a linear program with $R^+_\epsilon + R^-_\epsilon$ many variables and $|Act|$ times as many inequalities. The proof can be found in Appendix C.2. This finishes the proof of Theorem 18.

Conclusion
Compared to the setting of non-negative weights, the optimization of partial and conditional expectations faces substantial new difficulties in the setting of integer weights. The optimal values can be irrational showing that the linear programming approaches from the setting of non-negative weights cannot be applied for the computation of optimal values. We showed that this approach can nevertheless be adapted for approximation algorithms. Further, we were able to show that there are optimal weight-based deterministic schedulers. These schedulers, however, can require infinite memory and it remains open whether we can further restrict the class of schedulers necessary for the optimization. In examples, we have seen that optimal schedulers can switch periodically between actions they choose for increasing values of accumulated weight. Further insights on the behavior of optimal schedulers would be helpful to address threshold problems ("Is PE sup sinit ≥ θ?").

Appendix A Partial and Conditional Expectations in Markov Decision Processes
In this section, we give proofs to the claims of Section 3.

A.1 Finiteness and Preprocessing
The finiteness of maximal partial expectations depends on the existence of positively weight-divergent end components. Using the construction from [1] which removes end components only containing 0-weight cycles, we can show the following: Proof. In an end component which has non-negative maximal expected mean payoff and which is not positively weight-divergent, all cycles have weight 0 (see [1]). We will use the so-called spider construction from [1] with a small modification to remove such end components: So, let M be an MDP and let E = (E, A) be an end component of M in which all cycles have weight 0. The spider construction successively flattens sub-end components which contain exactly one action per state: So, let E ′ ⊆ E and for each s ∈ E ′ , let α s ∈ A(s) such that E ′ = {(s, α s )|s ∈ E ′ } is an end component. We pick a state s 0 ∈ E ′ . As all cycles in E ′ have weight 0, there is a unique weight w s for each s ∈ E ′ such that all paths from s to s 0 in E ′ have weight w s . The spider construction now does the following: We extend the construction by adding an absorbing state fail and additionally enabling a new action τ in s 0 with P (s 0 , τ, fail ) = 1 and wgt(s 0 , τ ) = 0. We call the resulting MDP after one application of the construction N ′ . In [1], it is shown that polynomially many applications (in polynomial time in total) of the construction result in an MDP N satisfying the first requirement in the statement. Hence, it is sufficient to show the correspondence between schedulers claimed in the second requirement for the MDPs M and N ′ . Given a scheduler S for M, we construct the following scheduler S ′ for N ′ : Whenever a run in M under S reaches E ′ , let p fail be the probability that S will never leave E ′ again. Further, for each state s in E ′ and each action β ∈ Act(s) not belonging to E, let p s,β be the probability that S leaves E ′ from s via β. 
This behavior can now be mimicked in N ′ : S ′ goes to fail with probability p fail and takes the action β s in s 0 with probability p s,β . It is straightforward to check that this does not affect the partial expectation or the probability to reach goal .
Conversely, a scheduler $S'$ for $\mathcal{N}'$ can easily be transformed to a scheduler $S$ for $\mathcal{M}$: Whenever $S'$ moves to fail from $s_0$, the scheduler $S$ stays in $\mathcal{E}'$ forever. If $S'$ chooses $\beta_s$ in $s_0$, $S$ moves through $\mathcal{E}'$ until it reaches $s$. This happens almost surely. Then, $S$ chooses $\beta$. Again, it is easy to check that the partial expectation and the probability to reach goal are preserved.

Proof. Suppose there is a positively weight-divergent end component $\mathcal{E}$. Since $\mathcal{E}$ is reachable and we can accumulate arbitrarily high weights inside $\mathcal{E}$ with probability 1, we can construct a sequence of schedulers whose partial expectation diverges to $+\infty$: the schedulers stay in the positively weight-divergent end component until an arbitrarily high weight has been accumulated before they try to reach the goal.

Now, suppose that there are no positively weight-divergent end components. We can assume that all end components have negative maximal mean payoff (see Proposition 4). We claim that there is a natural number $W$ such that $p := \max_s Pr^{\max}_s(\Diamond wgt \geq W) < 1$. Let $M := \max_{s,\alpha} |wgt(s,\alpha)|$. Given this claim, the finiteness of the partial expectation follows: For all $n \in \mathbb{N}$ we get that $\max_{s \in S} Pr^{\max}_{\mathcal{M},s}(\Diamond wgt \geq n \cdot W + M) \leq p^n$. Then the partial expectation of any scheduler is bounded by $\sum_{n=0}^{\infty} (n+1) \cdot W \cdot p^n = \frac{W}{(1-p)^2}$.

It remains to prove the claim. For each end component $\mathcal{E}$, there is a number $W_{\mathcal{E}}$ and a probability $p_{\mathcal{E}}$ such that in $\mathcal{E}$ we have $p_{\mathcal{E}} := \max_{s \in \mathcal{E}} Pr^{\max}_{\mathcal{E},s}(\Diamond wgt \geq W_{\mathcal{E}}) < 1$. On the other hand, in the MEC-quotient of $\mathcal{M}$ the probability to reach goal or fail in $|S|$ steps is at least $\delta^{|S|}$ where $\delta$ is the minimal transition probability. Then we can conclude that max

All in all, it is impossible for a scheduler to almost surely reach an accumulated weight above $M \cdot |S| + \sum_{\mathcal{E} \text{ end component}} W_{\mathcal{E}}$.
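The closed form of the series bounding the partial expectation in the proof above can be checked numerically. The sketch below compares a truncated sum with the closed form $W/(1-p)^2$; the function names and the sample values of $W$ and $p$ are hypothetical and serve only as an illustration.

```python
def truncated_bound(W, p, terms):
    """Partial sum of sum_{n >= 0} (n + 1) * W * p**n with the given number of terms."""
    return sum((n + 1) * W * p ** n for n in range(terms))

def closed_form_bound(W, p):
    """Closed form W / (1 - p)**2 of the series, valid for 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("the series only converges for 0 <= p < 1")
    return W / (1 - p) ** 2

# Illustrative values: W = 10, p = 0.5.
approx = truncated_bound(10, 0.5, 200)
exact = closed_form_bound(10, 0.5)
```

The bound grows quadratically in $1/(1-p)$, so it becomes very loose when $p$ is close to 1. It is only needed to establish finiteness.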
Recall that we define a critical scheduler to be a scheduler S, for which there is a path containing a positive cycle, and for which Pr S sinit (♦goal ) = 0 Proposition 31. Let M be an MDP. The optimal conditional expectation CE sup sinit = ∞ if PE sup sinit = ∞ or if there is a critical scheduler S. Proof. If PE sup sinit = ∞ clearly also CE sup sinit = ∞. So, let S be a scheduler which can reach a positive cycle but almost surely does not reach goal . Then, for any n, we can construct the following scheduler S n . The scheduler S n attempts to reach the positive cycle directly, i.e. without visiting a state twice before. Then, it attempts to take the cycle n times in a row. Only if S n succeeds to do so, it maximizes the probability to reach the goal from then on. Otherwise, it avoids the goal . The scheduler S n reaches the goal with positive probability and CE Sn sinit → ∞ for n → ∞. In Section 3, we gave the following construction: Let M be an MDP with Pr min M,sinit (♦goal ) = 0 and CE sup M,sinit < ∞. In particular, this means that there are no critical schedulers for M. Let S 0 be the set of all states reachable from s init while only choosing actions in Act min . As there are no critical schedulers, (S 0 , Act min ) does not contain positive cycles. So, there is a unique maximal weight w s of paths leading from s init to s in S 0 . Consider the following MDP N : It contains the MDP M and a new initial state t init . For each s ∈ S 0 and each α ∈ Act(s) \ Act min (s), N also contains a new state t s,α which is reachable from t init via an action β s,α with weight w s and probability 1. In t s,α , only action α with the same probability distribution over successors and the same weight as in s is enabled. In this way, we ensure that Pr min N ,tinit (♦goal ) > 0. Proposition 32. The constructed MDP N satisfies CE sup N ,tinit = CE sup M,sinit . Proof. 
For each pair $(s,\alpha)$ with $s \in S_0$ and $\alpha \in Act(s) \setminus Act^{\min}(s)$, let $c_{s,\alpha} := \sup_S CE^S_{\mathcal{N},t_{init}}$ where the supremum is taken over all schedulers $S$ for $\mathcal{N}$ which assign probability 1 to the action $\beta_{s,\alpha}$ in $t_{init}$. Then, $CE^{\sup}_{\mathcal{N},t_{init}} = \max_{s,\alpha} c_{s,\alpha} =: c$. A scheduler reaching the goal with positive probability has to choose an action not in $Act^{\min}$ after at least one path. Let $s \in S_0$ and $\alpha \in Act(s) \setminus Act^{\min}(s)$ be such that $c = c_{s,\alpha}$. For any scheduler $T$ for $\mathcal{N}$ starting with $\beta_{s,\alpha}$, we define the following scheduler $T'$ for $\mathcal{M}$: $T'$ starts by following a path with maximal accumulated weight from $s_{init}$ to $s$. If it reaches $s$ with accumulated weight $w_s$, it chooses $\alpha$ and follows the choices of $T$ from then on. If it does not reach $s$ with accumulated weight $w_s$, $T'$ just picks actions in $Act^{\min}$, in this way making sure that the goal will not be reached. Hence, $CE^{T'}_{\mathcal{M},s_{init}} = CE^{T}_{\mathcal{N},t_{init}}$. So, $CE^{\sup}_{\mathcal{M},s_{init}} \geq c$.

Before we show the other direction, we define, given a finite path $\pi$, a finite path $\rho$ starting in $last(\pi)$, and a scheduler $Q$, the scheduler $Q{\uparrow}\pi$ by $Q{\uparrow}\pi(\rho) := Q(\pi;\rho)$ where $\pi;\rho$ denotes the concatenation of the paths $\pi$ and $\rho$.
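The residual scheduler $Q{\uparrow}\pi$ just defined is simply $Q$ with the history $\pi$ prepended. A minimal sketch, assuming (for illustration only) that a path is a tuple of states and that a scheduler is a function from such tuples to an action; the name `residual` is hypothetical.

```python
def residual(Q, pi):
    """Return the scheduler Q↑pi, which behaves like Q after the history pi."""
    def q_up(rho):
        if rho[0] != pi[-1]:
            raise ValueError("rho must start in last(pi)")
        # pi;rho: the shared state last(pi) = first(rho) appears only once.
        return Q(pi + rho[1:])
    return q_up

# Toy scheduler: the chosen action depends on the parity of the path length.
Q = lambda path: "a" if len(path) % 2 else "b"
Q_up = residual(Q, ("s0", "s1"))
```

Here `Q_up(("s1",))` returns whatever `Q` chooses after the history `("s0", "s1")`. In the paper, paths also record the chosen actions; the sketch drops them to keep the concatenation short.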
To show that for any scheduler $S$ for $\mathcal{M}$ with $Pr^S_{s_{init}}(\Diamond goal) > 0$ we have $CE^S_{\mathcal{M},s_{init}} \leq c$, let $S$ be such a scheduler and consider the set $\Pi$ of finite $S$-paths $\pi$ in $\mathcal{M}^{\min}$ such that $S(\pi) \in Act(last(\pi)) \setminus Act^{\min}$. We know that for each $\pi \in \Pi$, $wgt(\pi) + PE^{S{\uparrow}\pi}$

We conclude that also

Proof. We have seen that $CE^{\sup}_{s_{init}} = \infty$ if there is a positively weight-divergent end component or a critical scheduler. On the other hand, we can rely on the reduction from the previous proposition if there are no critical schedulers. In $\mathcal{N}$, the maximal partial expectation is finite as there are no positively weight-divergent end components. As the minimal probability to reach goal is furthermore positive, the maximal conditional expectation is finite as well. Hence, $CE^{\sup}_{\mathcal{M},s_{init}} = CE^{\sup}_{\mathcal{N},t_{init}} < \infty$.
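The construction of $\mathcal{N}$ in this section relies on the maximal weights $w_s$ of paths from $s_{init}$ to $s$ inside the sub-MDP that uses only probability-minimizing actions, which are well defined because that sub-MDP contains no positive cycles. One way to compute such maximal weights is a Bellman-Ford style relaxation for longest paths. This is a sketch of a standard technique rather than the procedure used by the authors; the edge-list representation and the name `max_path_weights` are assumptions made for illustration.

```python
def max_path_weights(states, edges, s_init):
    """Maximal accumulated weight of paths from s_init to each state.

    edges -- list of (source, target, weight) triples, one for every transition
             with positive probability under an action in Act_min (assumed input).
    Unreachable states keep the value -inf.
    """
    neg_inf = float("-inf")
    best = {s: neg_inf for s in states}
    best[s_init] = 0
    # Without positive cycles, |states| - 1 relaxation rounds suffice.
    for _ in range(len(states) - 1):
        changed = False
        for u, v, w in edges:
            if best[u] != neg_inf and best[u] + w > best[v]:
                best[v] = best[u] + w
                changed = True
        if not changed:
            break
    # Any further improvement would reveal a reachable positive cycle.
    for u, v, w in edges:
        if best[u] != neg_inf and best[u] + w > best[v]:
            raise ValueError("a positive cycle is reachable from s_init")
    return best

weights = max_path_weights(
    ["a", "b", "c"],
    [("a", "b", 2), ("a", "c", 1), ("c", "b", 3), ("b", "a", -5)],
    "a",
)
```

The final pass doubles as a sanity check for the assumption that there are no positive cycles, which in the paper follows from the absence of critical schedulers.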

A.2 Partial Expectations in MDPs with Non-Negative Weights
Let $R \in \mathbb{Q}$ be arbitrary. In this section, we consider an MDP $\mathcal{M}$ in which all weights are non-negative, and we assume: 1. $PE^{\sup}_{s_{init}} < \infty$, 2. the only end components are the two distinct absorbing states goal and fail, 3. goal can be reached from any state $s \in S \setminus \{fail\}$.
Assumption 2 is justified as all weights are non-negative and hence the maximal expected mean payoff of an end component cannot be negative.
Saturation point. Recall that a saturation point for bias $R$ is a natural number $p$ such that there is a scheduler $S$ with $PE^S_{s_{init}}[R] = PE^{\sup}_{s_{init}}[R]$ which is memoryless and deterministic as soon as the accumulated weight reaches $p$. That is, for any two paths $\pi$ and $\pi'$ with $last(\pi) = last(\pi')$ and $wgt(\pi), wgt(\pi') > p$, we have $S(\pi) = S(\pi')$. We first provide the following saturation point which we need in the proof of the smaller saturation point given in Section 3.
is a saturation point for bias R.
(If the minimum in the definition of δ is taken over an empty set, Max is already the optimal scheduler and hence any value is an upper saturation point.) Proof. Given two schedulers S and T and some x ∈ R, we define the scheduler S ⊳ x T via: Given a finite path π and a path ρ starting in last(π), we further define the scheduler S ↑ π by S ↑ π (ρ) := S(π; ρ) where π; ρ denotes the concatenation of the paths π and ρ.
In the setting of non-negative weights, it has been shown in [8] that there is a weight-based deterministic scheduler maximizing the partial expectation. Of course, this also follows from our results in Section 4 for MDPs with arbitrary weights. This allows us to conclude the following.
Corollary 35. The supremum in PE max,R s := sup S PE S,R s can also be taken over weight-based schedulers which behave memoryless as soon as the accumulated weight reaches q. As there are only finitely many such schedulers, the supremum is furthermore in fact a maximum. Then, is an upper saturation point for M.
Proof. It is enough to show that for all states s and all schedulers S. Since Max maximizes the probability of reaching the goal, this implies that E Max for some state s and some scheduler S. Now, let for each state s. As we know that there are only finitely many relevant schedulers for the supremum we can actually choose a (deterministic weight-based) scheduler S such that sup T E T s (⊕ pR+R goal ) = E S s (⊕ pR+R goal ) for all states s. Now, let t be a state such that D t > 0 is maximal, and such that the first action α that S chooses starting in t leads to a state r with D r < D t with positive probability. As D goal = 0 such a state exists. Then, .
But the right-hand side evaluates to 0 by the definition of $p_R$, leading to a contradiction.

Computation of the Partial Expectation.
Proposition 37. Let $R \in \mathbb{Q}$, let $B_R$ be the least integer greater than or equal to $p_R + \max_{s \in S, \alpha \in Act(s)} wgt(s,\alpha)$, and let $S' := S \setminus \{goal, fail\}$. Consider the following linear program in the variables $(x_{s,r})_{s \in S', 0 \leq r \leq B_R}$ ($r$ ranges over integers): Minimize $\sum_{s \in S', 0 \leq r \leq B_R} x_{s,r}$ under the following constraints: for $r < p_R$ and $\alpha \in Act(s)$:

The values $(PE^{\sup}_{s}[r+R])_{s \in S, 0 \leq r \leq B_R}$ form the unique solution to this linear program.
We prove the unique solvability of this linear program in detail. Linear programs claimed to be uniquely solvable below can be treated analogously.
Proof. Following a standard approach by Veinott [19], we want to show that the linear program is uniquely solvable by defining a contraction mapping with respect to a weighted maximum norm whose fixed point is the optimal solution. The definition of the weights we use is made explicit by Tseng [18].
We define a function $T^R : \mathbb{R}^{S' \times \{0,\dots,B_R\}} \to \mathbb{R}^{S' \times \{0,\dots,B_R\}}$. For $s \in S$ and $r \leq B_R$, let

In order to define a suitable weighted maximum norm, we begin by recursively defining the following partition $S_0, \dots, S_k$ of $S$:

If at some point $S_0, \dots, S_i$ is not yet a partition of $S$, then $S_{i+1}$ is non-empty. If it were empty, then $T := S \setminus (S_0 \cup \dots \cup S_i)$ would contain an end component, as for each $t \in T$ there would be an action $\alpha$ such that $P(t,\alpha,T) = 1$. So, the recursive definition produces a partition $S = S_0 \cup \dots \cup S_k$ in finitely many steps. Now, for each $s \in S_i$ let $w_s := 1 - \delta^{2i}$ where $\delta := \min\{P(s,\alpha,t) \mid s, \alpha, t \text{ s.t. } P(s,\alpha,t) > 0\}$.
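The layering and the resulting weights can be sketched as follows. The exact recursive rule is not legible in the text above, so the sketch assumes the rule suggested by the non-emptiness argument: $S_0$ consists of the absorbing states goal and fail, and $S_{i+1}$ collects the remaining states from which every action reaches $S_0 \cup \dots \cup S_i$ with positive probability. The dictionary-based MDP encoding and the name `norm_weights` are hypothetical.

```python
def norm_weights(states, actions, P, absorbing):
    """Layer the states and return (layer index, weight w_s = 1 - delta**(2*i)).

    actions[s] -- list of actions enabled in s
    P[(s, a)]  -- dict mapping successor states to probabilities
    absorbing  -- the states forming layer S_0 (assumed: goal and fail)
    """
    delta = min(pr for dist in P.values() for pr in dist.values() if pr > 0)
    layer = {s: 0 for s in absorbing}
    i = 0
    while len(layer) < len(states):
        done = set(layer)
        # Assumed rule: every action moves into earlier layers with positive probability.
        nxt = [s for s in states if s not in done
               and all(any(pr > 0 and t in done for t, pr in P[(s, a)].items())
                       for a in actions[s])]
        if not nxt:
            raise ValueError("the remaining states contain an end component")
        i += 1
        for s in nxt:
            layer[s] = i
    weights = {s: 1 - delta ** (2 * layer[s]) for s in states}
    return layer, weights

states = ["goal", "fail", "a", "b"]
actions = {"goal": ["stay"], "fail": ["stay"], "a": ["x"], "b": ["y", "z"]}
P = {("goal", "stay"): {"goal": 1.0}, ("fail", "stay"): {"fail": 1.0},
     ("a", "x"): {"goal": 0.5, "b": 0.5},
     ("b", "y"): {"a": 1.0}, ("b", "z"): {"fail": 0.25, "a": 0.75}}
layer, weights = norm_weights(states, actions, P, ["goal", "fail"])
```

In this toy MDP, state `a` lands in layer 1 and state `b` in layer 2, because action `y` of `b` only leads to `a`. The `ValueError` branch corresponds to the case excluded in the text, where the unassigned states would contain an end component.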
These $w_s$ will serve as weights for our weighted supremum norm. We use the following fact [18, Lemma 3]: Let $\gamma :$

For all $s \in \{goal, fail\}$ and all $\alpha \in Act(s)$, we have

We define the following norm on $\mathbb{R}^{S' \times \{0,\dots,B_R\}}$:

We show that $T^R$ is a contraction with respect to this norm: For $x, y \in \mathbb{R}^{S' \times \{0,\dots,B_R\}}$, we claim

Let $s \in S'$ and $r < p_R$.
That this fixed point is the unique solution of the linear program is now easy to see. The map $T^R$ is defined such that any $z$ satisfying the constraints of the linear program satisfies $z \geq T^R(z)$. But if there are coordinates $s, r$ such that $z_{s,r} > T^R_{s,r}(z)$, then replacing $z_{s,r}$ by $T^R_{s,r}(z)$ leads to $z'$ still satisfying the constraints and resulting in a smaller value of $\sum_{s,r} z'_{s,r}$. So, the unique fixed point of $T^R$ is the unique optimal solution of the linear program.
Finally, we can easily check that (E max s (⊕ r+R goal )) s∈S,0≤r≤BR is indeed a fixed point of T R as we know that p R is a saturation point.

B Existence of Optimal Schedulers
We provide the proofs to Section 4 here.
Recall that we consider an MDP $\mathcal{M}$ with finite maximal partial expectation. In particular, we assume that all states are reachable from $s_{init}$ and that goal is reachable from all states except fail. Furthermore, there are no positively weight-divergent end components and so we can assume that all end components have negative maximal expected mean payoff (see Proposition 4).
We split the proof of Proposition 12 into the following two propositions:

Proposition 38. Let $\mathcal{M}$ be an MDP with $PE^{\sup}_{s_{init}} < \infty$. For each scheduler $S \in HR^{\mathcal{M}}$, there is a scheduler $T \in WR^{\mathcal{M}}$ such that $PE^S = PE^T$ and $Pr^S_{s_{init}}(\Diamond goal) = Pr^T_{s_{init}}(\Diamond goal)$.

Proof. Let $S \in HR$. For each state-weight pair $(s,w)$ with $s \in S \setminus \{goal, fail\}$ and $w \in \mathbb{Z}$, we let $\theta^S_{s,w}$ be the expected number of times that $s$ is reached with accumulated weight $w$ under $S$, and we let $\theta^S_{s,w,\alpha}$ be the expected number of times that $\alpha$ is chosen in this situation by $S$. We have that $\theta^S_{s,w} = \sum_{\pi} Pr^S_{s_{init}}(\pi)$ and $\theta^S_{s,w,\alpha} = \sum_{\pi} Pr^S_{s_{init}}(\pi) \cdot S(\pi)(\alpha)$, where both sums range over the finite paths $\pi$ with $last(\pi) = s$ and $wgt(\pi) = w$.
Note that $\theta^S_{s,w}$ is finite for all $s \in S \setminus \{goal, fail\}$, $w \in \mathbb{Z}$ as all end components have negative maximal expected mean payoff. Now, we define a weight-based randomized scheduler $T$ by

where $\delta_{s,w} = 1$ iff $s = s_{init}$ and $w = 0$, and $\delta_{s,w} = 0$ otherwise. By spelling out the last steps of the paths in the definition of $\theta^S_{s,w}$, one can see that $\theta^S_{s,w}$ provides the solution to this set of equations and hence $\theta^S_{s,w} = \theta^T_{s,w}$ for all $(s,w)$. By the definition of $T$, the expected number of times action $\alpha$ is chosen in $(s,w)$ under $T$ is hence $\theta^S_{s,w,\alpha}$ as well and the claim follows.
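The idea of the proof of Proposition 38 can be illustrated on a small acyclic example. The defining formula for $T$ is not legible in the text above; the sketch assumes the natural choice $T(s,w)(\alpha) := \theta^S_{s,w,\alpha} / \theta^S_{s,w}$, which is consistent with the final sentence of the proof. Paths are simplified to tuples of states, and all names (`visit_frequencies`, `weight_based_scheduler`, the toy MDP) are hypothetical.

```python
from collections import defaultdict

def visit_frequencies(P, wgt, scheduler, s_init, terminal, max_depth):
    """Enumerate finite paths and accumulate theta[(s, w)] and theta_act[(s, w, a)]."""
    theta, theta_act = defaultdict(float), defaultdict(float)
    stack = [((s_init,), 0, 1.0)]          # (path, accumulated weight, probability)
    while stack:
        path, w, pr = stack.pop()
        s = path[-1]
        if s in terminal or len(path) > max_depth:
            continue
        theta[(s, w)] += pr
        for a, pa in scheduler(path).items():
            if pa == 0:
                continue
            theta_act[(s, w, a)] += pr * pa
            for t, pt in P[(s, a)].items():
                stack.append((path + (t,), w + wgt[(s, a)], pr * pa * pt))
    return theta, theta_act

def weight_based_scheduler(theta, theta_act):
    """Assumed definition: T(s, w)(a) = theta_act[(s, w, a)] / theta[(s, w)]."""
    T = defaultdict(dict)
    for (s, w, a), freq in theta_act.items():
        T[(s, w)][a] = freq / theta[(s, w)]
    return T

# Toy MDP: s0 branches to u or v; both lead to m, where the choice depends on history.
P = {("s0", "a"): {"u": 0.25, "v": 0.75}, ("u", "x"): {"m": 1.0}, ("v", "x"): {"m": 1.0},
     ("m", "g"): {"goal": 1.0}, ("m", "f"): {"fail": 1.0}}
wgt = {("s0", "a"): 1, ("u", "x"): 0, ("v", "x"): 0, ("m", "g"): 0, ("m", "f"): 0}

def S(path):
    if path[-1] == "m":                     # history-dependent choice in m
        return {"g": 1.0} if "u" in path else {"f": 1.0}
    return {"a": 1.0} if path[-1] == "s0" else {"x": 1.0}

theta, theta_act = visit_frequencies(P, wgt, S, "s0", {"goal", "fail"}, 10)
T = weight_based_scheduler(theta, theta_act)
```

In the toy example, the history-dependent deterministic choice in `m` is replaced by a randomized choice that depends only on the state and the accumulated weight. The path enumeration terminates only because the example is acyclic and depth-bounded; in general the frequencies solve a system of linear equations, as stated in the proof.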
Proposition 39. Let $\mathcal{M}$ be an MDP with $PE^{\sup}_{s_{init}} < \infty$. Then, we have

Proof. Let $S$ be a weight-based randomized scheduler and let $(s,w)$ be a state-weight pair such that $S(s,w)$ is not a Dirac distribution. We define $S{\uparrow}w$ by

But then there is an action $\beta \in Act(last(\pi))$ such that

because $S(s,w)$ is a probability distribution. We conclude that the scheduler $S'$ which agrees with $S$ on all state-weight pairs except $(s,w)$ and assigns probability 1 to $\beta$ for $(s,w)$ satisfies $PE^{S'}_{s_{init}} \geq PE^{S}_{s_{init}}$. In this way, we can replace all probability distributions that $S$ chooses by Dirac distributions and generate a sequence of schedulers with non-decreasing partial expectations. Ultimately, we obtain a weight-based deterministic scheduler $T$ with $PE^{T}_{s_{init}} \geq PE^{S}_{s_{init}}$.
Definition 40 (Metric on weight-based deterministic schedulers). Given an MDP $\mathcal{M}$ with arbitrary integer weights, we define the following metric $d_{\mathcal{M}}$ on the set of weight-based deterministic schedulers, i.e. on the set of functions from $S \times \mathbb{Z}$ to $Act$: For two such schedulers $S$ and $T$, we let

Proof. We can identify $Act^{S \times \mathbb{Z}}$ with $(Act^{S \times \{+,-\}})^{\mathbb{N}}$. Then it is easy to see that the metric $d_{\mathcal{M}}$ induces the usual tree topology on this finitely branching tree of height $\omega$. Therefore, the space is homeomorphic to the Cantor space $2^{\omega}$ and hence compact.
We define $l^S_\epsilon$ to be the smallest natural number such that $\sum_{n=l^S_\epsilon}^{\infty} Pr^S_{s_{init}}(\Diamond^{=-n} goal) \cdot n < \epsilon$.
We claim that R := max{R − , R + } does the job. So let T be a scheduler with Recall the following definition. Given a finite path π and a path ρ starting in last(π) and a scheduler Q, we further define the scheduler Q ↑ π by Q ↑ π (ρ) := Q(π; ρ) where π; ρ denotes the concatenation of the paths π and ρ.
For the first sum, we have the following estimation: $\sum_{\pi \in \Pi^+_\epsilon} Pr^S_{s_{init}}(\pi) \cdot \bigl(PE^{\sup}_{last(\pi)}[wgt(\pi)] - PE^{Max}_{last(\pi)} - p^{\max}_{last(\pi)} \cdot wgt(\pi)\bigr)$

For the second sum, consider the following scheduler. On extensions of paths in $\Pi^-_\epsilon$, let $S'$ be the scheduler which behaves like $S$ until the accumulated weight is at least $q$ again and then switches to the choices of Min. We know that $S$ only chooses actions in $Act^{\min}(s)$ when in a state $s$ with accumulated weight below $q$. On the other hand, Min is optimal among these schedulers. So, Min is at least as good as $S'$ on extensions of paths in $\Pi^-_\epsilon$ with respect to maximizing the partial expectation. Further, starting at a path in $\Pi^-_\epsilon$ we reach an accumulated weight of at least $q$ only if we accumulate a weight of at least $R^+$. Afterwards, we can bound the advantage of $S$ over Min by $D$. So, we get the following estimation: $\sum_{\pi \in \Pi^-_\epsilon} Pr^S_{s_{init}}(\pi) \cdot \bigl(PE^{\sup}_{last(\pi)}[wgt(\pi)] - PE^{Min}_{last(\pi)} - p^{\min}_{last(\pi)} \cdot wgt(\pi)\bigr) \leq \sum_{\pi \in \Pi^-_\epsilon} Pr^S_{s_{init}}(\pi) \cdot \bigl(Pr^{\max}_{last(\pi)}(\Diamond wgt \geq R^+) \cdot D\bigr) \leq \epsilon/2$.
So, $PE^{\sup}_{s_{init}} - PE^{T}_{s_{init}} \leq \epsilon$. This result now allows us to compute an $\epsilon$-approximation and an $\epsilon$-optimal scheduler with finite memory by linear programming, similar to the case of non-negative weights. The linear program has $R^+_\epsilon + R^-_\epsilon$ many variables and $|Act|$ times as many inequalities.
Corollary 46. The maximal partial expectation $PE^{\sup}_{s_{init}}$ can be approximated up to an absolute error of $\epsilon$ in time exponential in the size of $\mathcal{M}$ and polynomial in $\log(1/\epsilon)$.
Proof. Consider the following linear program with one variable $x_{s,w}$ for each $s \in S$ and $R^- - W \leq w \leq R^+ + W$: Minimize $\sum_{s,w} x_{s,w}$ under the following constraints: $x_{goal,w} = w$ and $x_{fail,w} = 0$, for $w \geq R^+$ and $s \in S \setminus \{goal, fail\}$, and for $R^- < w < R^+$, $s \in S \setminus \{goal, fail\}$, and $\alpha \in Act(s)$, $x_{s,w} \geq \sum_{t \in S} P(s,\alpha,t) \cdot x_{t,w+wgt(s,\alpha)}$.
The unique solvability can be shown as in Proposition 37 using that all end components have negative mean payoff: We can interpret the linear program on an MDP with state space S × {R − − W, . . . , R + } and the transitions induced by M. This MDP now has no end components.

C.2 Transfer to Conditional Expectations
We restate the algorithm given in Section 5. Let $\mathcal{M}$ be an MDP with $CE^{\sup}_{s_{init}} < \infty$ and let $\epsilon > 0$. By Proposition 6, we can assume that $p := Pr^{\min}_{\mathcal{M},s_{init}}(\Diamond goal)$ is positive. We know that $CE^{\sup}_{s_{init}} \in [CE^{Max}_{s_{init}}, CE^{ub}]$. We perform a binary search to approximate $CE^{\sup}_{s_{init}}$: We put $A_0 := CE^{Max}_{s_{init}}$ and $B_0 := CE^{ub}$. Given $A_i$ and $B_i$, let $\theta_i := (A_i + B_i)/2$. Then, we approximate $PE^{\sup}_{s_{init}}[-\theta_i]$ up to an absolute error of $p \cdot \epsilon$. Let $E_i$ be the value of this approximation. If $E_i \in [-2p\epsilon, 2p\epsilon]$, terminate and return $\theta_i$ as the approximation for $CE^{\sup}_{s_{init}}$. If $E_i < -2p\epsilon$, put $A_{i+1} := A_i$ and $B_{i+1} := \theta_i$, and repeat. If $E_i > 2p\epsilon$, put $A_{i+1} := \theta_i$ and $B_{i+1} := B_i$, and repeat.
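The binary search described above can be sketched as follows. The approximation of the biased partial expectation is treated as a black box: `approx_pe(bias, err)` is a hypothetical oracle that is assumed to return $PE^{\sup}_{s_{init}}[\mathit{bias}]$ up to absolute error `err`. The iteration cap `max_iter` is an addition for safety and is not part of the procedure in the text.

```python
def approximate_ce(ce_max, ce_ub, p, eps, approx_pe, max_iter=10_000):
    """Binary search for the maximal conditional expectation.

    ce_max, ce_ub -- initial bounds A_0 and B_0
    p             -- minimal probability to reach goal (must be positive)
    approx_pe     -- hypothetical oracle: approx_pe(bias, err) approximates
                     the maximal partial expectation with the given bias
    """
    a, b = ce_max, ce_ub
    for _ in range(max_iter):
        theta = (a + b) / 2
        e = approx_pe(-theta, p * eps)
        if abs(e) <= 2 * p * eps:
            return theta                    # E_i in [-2p*eps, 2p*eps]
        if e < 0:
            b = theta                       # theta is too large
        else:
            a = theta                       # theta is too small
    raise RuntimeError("no result within max_iter iterations")

# Toy oracle for a Markov chain that reaches goal with probability q and has
# conditional expectation c; then the biased partial expectation is q * (c + bias).
q, c = 0.5, 3.0
toy_oracle = lambda bias, err: q * (c + bias)
result = approximate_ce(0.0, 10.0, p=0.5, eps=0.01, approx_pe=toy_oracle)
```

With the toy oracle, the search narrows the interval $[0, 10]$ around the conditional expectation 3 and stops as soon as the approximated value falls into the acceptance band. In the actual procedure, each oracle call is itself an exponential-time approximation of a partial expectation.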
Proposition 47. The procedure terminates after at most $\lceil \log((B_0 - A_0)/(p \cdot \epsilon)) \rceil$ iterations and returns a $3\epsilon$-approximation of $CE^{\sup}_{s_{init}}$ in time exponential in the size of $\mathcal{M}$ and polynomial in $\log(1/\epsilon)$.
This contradicts $PE^{\sup}_{s_{init}}[-\theta_i] \leq 3p\epsilon$. Therefore, the algorithm indeed returns a $3\epsilon$-approximation of $CE^{\sup}_{s_{init}}$. Finally, we show that the claimed running time is correct: The algorithm stops after at most $\lceil \log((B_0 - A_0)/(\epsilon \cdot p)) \rceil$ iterations. As all values involved can be computed in polynomial time, this is polynomial in the size of $\mathcal{M}$ and linear in $\log(1/\epsilon)$. In each iteration, we have to approximate the maximal partial expectation $PE^{\sup}_{s_{init}}[-\theta_i]$ up to an absolute error of $p \cdot \epsilon$. As the logarithmic length of $\theta_i$ is polynomial in the size of $\mathcal{M}$ as well, this can be done in time exponential in the size of $\mathcal{M}$ and polynomial in $\log(1/\epsilon)$.