
1 Introduction

Stochastic shortest (or longest) path problems are a prominent class of optimization problems where the task is to find a policy for traversing a probabilistic graph structure such that the expected value of the generated paths satisfying a certain objective is minimal (or maximal). In the classical setting (see e.g. [15, 22, 25, 29]), the underlying graph structure is given by a finite-state Markov decision process (MDP), i.e., a state-transition graph with nondeterministic choices between several actions for each of its non-terminal states, probability distributions specifying the probabilities for the successor states for each state-action pair and a reward function that assigns rational values to the state-action pairs. The stochastic shortest (longest) path problem asks to find a scheduler, i.e., a function that resolves the nondeterministic choices, possibly in a history-dependent way, which minimizes (maximizes) the expected accumulated reward until reaching a goal state. To ensure the existence of the expectation for given schedulers, one often assumes that the given MDP is contracting, i.e., the goal is reached almost surely under all schedulers, in which case the optimal expected accumulated reward is achieved by a memoryless deterministic scheduler that optimizes the expectation from each state and is computable using a linear program with one variable per state (see e.g. [25]). The contraction assumption can be relaxed by requiring the existence of at least one scheduler that reaches the goal almost surely and taking the extremum over all those schedulers [15, 16, 22]. These algorithms and corresponding value or policy iteration approaches have been implemented in various tools and used in many application areas.

The restriction to schedulers that reach the goal almost surely, however, limits the applicability and significance of the results. First, the known algorithms for computing extremal expected accumulated rewards are not applicable for models where the probability for never visiting a goal state is positive under each scheduler. Second, statements about the expected rewards for schedulers that reach the goal with probability 1 are not sufficient to draw any conclusion about the best- or worst-case behavior if there exist schedulers that miss the goal with positive probability. This motivates the consideration of conditional stochastic path problems where the task is to compute the optimal expected accumulated reward until reaching a goal state, under the condition that a goal state will indeed be reached, and where the extrema are taken over all schedulers that reach the goal with positive probability. More precisely, we address here a slightly more general problem where we are given two sets F and G of states in an MDP \(\mathcal {M}\) with non-negative integer rewards and ask for the maximal expected accumulated reward until reaching F, under the condition that G will be visited (denoted \(\mathbb {CE}^{\max }_{\mathcal {M},s_{ \scriptscriptstyle init }}\), where \(s_{ \scriptscriptstyle init }\) is the initial state of \(\mathcal {M}\)). Computation schemes for conditional expectations of this type can, e.g., be used to answer the following questions (assuming the underlying model is a finite-state MDP):

  (Q1) What is the maximal termination time of a probabilistic and nondeterministic program, under the condition that the program indeed terminates?

  (Q2) What are the maximal expected costs of the repair mechanisms that are triggered in cases where a specific failure scenario occurs, under the condition that the failure scenario indeed occurs?

  (Q3) What is the maximal energy consumption, under the condition that all jobs of a given list will be successfully executed within one hour?

The relevance of question (Q1) and related problems becomes clear from the work [14, 20, 23, 24, 26] on the semantics of probabilistic programs where no guarantees for almost-sure termination can be given. Question (Q2) is natural for a worst-case analysis of resilient systems or other types of systems where conditional probabilities serve to provide performance guarantees on the protocols triggered in exceptional cases that appear with positive, but low probability. Question (Q3) is typical when the task is to study the trade-off between cost and utility functions (see e.g. [9]). Given the work on anonymity and related notions for information leakage using conditional probabilities in MDP-like models [7, 21] or the formalization of posterior vulnerability as an expectation [4], the concept of conditional expected accumulated rewards might also be useful to specify the degree of protection of secret data or to study the trade-off between privacy and utility, e.g., using gain functions [3, 5]. Other areas where conditional expectations play a crucial role are risk management, where the conditional value-at-risk is used to formalize the expected loss under the assumption that very large losses occur [2, 32], and regression analysis, where conditional expectations serve to predict the relation between random variables [31].

Fig. 1. MDP \(\mathcal {M}[\mathfrak {r}]\) for Example 1.1

Example 1.1

To illustrate the challenges for designing algorithms to compute maximal conditional expectations we regard the MDP \(\mathcal {M}[\mathfrak {r}]\) shown in Fig. 1. The reward of the state-action pair \((s_1,\gamma )\) is given by a reward parameter \(\mathfrak {r}\in \mathbb {N}\). Let \(s_{ \scriptscriptstyle init }=s_0\) be the initial state and \(F=G=\{ goal \}\). The only nondeterministic choice is in state \(s_2\), while states \(s_0\) and \(s_1\) behave purely probabilistically and \( goal \) and \( fail \) are trap states. Given a scheduler \(\mathfrak {S}\), we write \(\mathbb {CE}^{\mathfrak {S}}\) for the conditional expectation of the accumulated reward until reaching \( goal \), under the condition that \( goal \) is reached. (See also Sect. 2 for our notations.) For the two memoryless schedulers that choose \(\alpha \) resp. \(\beta \) in state \(s_2\) we have:

$$ \mathbb {CE}^{\alpha } = \frac{\frac{1}{2} \cdot \mathfrak {r}+ \frac{1}{2}\cdot 0}{\frac{1}{2}+\frac{1}{2}} = \frac{\mathfrak {r}}{2} \quad \text {and} \quad \mathbb {CE}^{\beta } = \frac{\frac{1}{2} \cdot \mathfrak {r}+ 0}{\frac{1}{2}+0} = \mathfrak {r}$$

We now regard the schedulers \(\mathfrak {S}_n\) for \(n =1,2,\ldots \) that choose \(\beta \) for the first n visits of \(s_2\) and action \(\alpha \) for the \((n{+}1)\)-st visit of \(s_2\). Then:

$$ \mathbb {CE}^{\mathfrak {S}_n} = \frac{\frac{1}{2} \cdot \mathfrak {r}+ \frac{1}{2}\cdot \frac{1}{2^n} \cdot n}{\frac{1}{2} + \frac{1}{2}\cdot \frac{1}{2^n}} = \mathfrak {r}\ + \ \frac{n-\mathfrak {r}}{2^n{+}1} $$

Thus, \(\mathbb {CE}^{\mathfrak {S}_n} > \mathbb {CE}^{\beta }\) iff \(n > \mathfrak {r}\), and the maximum is achieved for \(n=\mathfrak {r}{+}2\).
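As a quick sanity check of these formulas, the following small Python script (not part of the paper; the helper name ce is ours) evaluates \(\mathbb {CE}^{\mathfrak {S}_n} = \mathfrak {r}+ (n{-}\mathfrak {r})/(2^n{+}1)\) and confirms numerically that the optimum is attained at \(n=\mathfrak {r}{+}2\) for small reward parameters:

```python
# Numeric check of Example 1.1: CE^{S_n} = r + (n - r) / (2^n + 1).
def ce(r: int, n: int) -> float:
    """Conditional expectation of the scheduler S_n in M[r] (formula from the text)."""
    return r + (n - r) / (2**n + 1)

for r in range(6):
    best_n = max(range(1, 40), key=lambda n: ce(r, n))
    assert best_n == r + 2
    print(f"r = {r}: best n = {best_n}, CE^(S_n) = {ce(r, best_n):.4f}, CE^beta = {r}")
```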

This example illustrates three phenomena that distinguish conditional from unconditional expected accumulated rewards and make reasoning about maximal conditional expectations harder than about unconditional ones. First, optimal schedulers for \(\mathcal {M}[\mathfrak {r}]\) need a counter for the number of visits to state \(s_2\). Hence, memoryless schedulers are not powerful enough to maximize the conditional expectation. Second, while the maximal conditional expectation for \(\mathcal {M}[\mathfrak {r}]\) with initial state \(s_{ \scriptscriptstyle init }= s_0\) is finite, the maximal conditional expectation for \(\mathcal {M}[\mathfrak {r}]\) with starting state \(s_2\) is infinite: started in \(s_2\), the scheduler \(\mathfrak {S}_n\) reaches \( goal \) with probability \(\frac{1}{2^n}\) and accumulated reward n, so its conditional expectation from \(s_2\) equals n and thus grows beyond every bound.

Third, as \(\mathfrak {S}_2\) maximizes the conditional expected accumulated reward for \(\mathfrak {r}=0\), while \(\mathfrak {S}_3\) is optimal for \(\mathfrak {r}=1\), optimal decisions for paths ending in state \(s_2\) depend on the reward value \(\mathfrak {r}\) of the \(\gamma \)-transition from state \(s_1\), although state \(s_1\) is not reachable from \(s_2\). Thus, optimal decisions for a path \(\pi \) depend not only on the past (given by \(\pi \)) and the possible future (given by the sub-MDP that is reachable from \(\pi \)’s last state), but require global reasoning. \({\scriptscriptstyle \blacksquare }\)

The main results of this paper are the following theorems. We write \(\mathbb {CE}^{\max }\) for the maximal conditional expectation, i.e., the supremum of the conditional expectations \(\mathbb {CE}^{\mathfrak {S}}\), when ranging over all schedulers \(\mathfrak {S}\) where \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s_{ \scriptscriptstyle init }}(\Diamond G)\) is positive and \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s_{ \scriptscriptstyle init }}(\Diamond F |\Diamond G)=1\). (See also Sect. 2 for our notations.)

Theorem 1

(Checking finiteness and upper bound). There is a polynomial-time algorithm that checks if \(\mathbb {CE}^{\max }\) is finite. If so, an upper bound \(\mathbb {CE}^{\mathrm {ub}}\) for \(\mathbb {CE}^{\max }\) is computable in pseudo-polynomial time for the general case and in polynomial time if \(F=G\) and \(\mathrm {Pr}^{\min }_{\mathcal {M},s}(\Diamond G) >0\) for all states s with \(s \models \exists \Diamond G\).

The threshold problem asks whether the maximal conditional expectation exceeds or misses a given rational threshold \(\vartheta \).

Theorem 2

(Threshold problem). The problem “does \(\mathbb {CE}^{\max } \bowtie \vartheta \) hold?” (where \(\bowtie \in \{>,\geqslant ,<,\leqslant \}\)) is PSPACE-hard and solvable in exponential (even pseudo-polynomial) time. It is PSPACE-complete for acyclic MDPs.

For the computation of an optimal scheduler, we suggest an iterative scheduler-improvement algorithm that interleaves calls of the threshold algorithm with linear programming techniques to handle zero-reward actions. This yields:

Theorem 3

(Computing optimal schedulers). The value \(\mathbb {CE}^{\max }\) and an optimal scheduler \(\mathfrak {S}\) are computable in exponential time.

Algorithms for checking finiteness and computing an upper bound (Theorem 1) will be sketched in Sect. 3. Section 4 presents a pseudo-polynomial threshold algorithm and a polynomially space-bounded algorithm for acyclic MDPs (Theorem 2) as well as an exponential-time computation scheme for the construction of an optimal scheduler (Theorem 3). Further details, soundness proofs and a proof for the PSPACE-hardness as stated in Theorem 2 can be found in [13]. The general feasibility of the algorithms will be shown by experimental studies with a prototypical implementation (for details, see Appendix K of [13]).

Related Work. Although conditional expectations appear rather naturally in many applications and despite the large amount of publications on variants of stochastic path problems and other forms of expectations in MDPs (see e.g. [18, 30]), we are not aware that they have been addressed in the context of MDPs. Computation schemes for extremal conditional probabilities \(\mathrm {Pr}^{\max }(\varphi | \psi )\) or \(\mathrm {Pr}^{\min }(\varphi | \psi )\) where both the objective \(\varphi \) and the assumption \(\psi \) are path properties specified in some temporal logic have been studied in [6, 8, 11]. For reachability properties \(\varphi \) and \(\psi \), the algorithm of [6, 8] has exponential time complexity, while the algorithm of [11] runs in polynomial time. Although the approach of [11] is not applicable for calculating maximal conditional expectations (see Appendix B of [13]), it can be used to compute an upper bound for \(\mathbb {CE}^{\max }\) (see Sect. 3). Conditional expected rewards in Markov chains can be computed using the rescaling technique of [11] for finite Markov chains or the approximation techniques of [1, 19] for certain classes of infinite-state Markov chains. The conditional weakest precondition operator of [26] yields a technique to compute conditional expected rewards for purely probabilistic programs (without non-determinism).

2 Preliminaries

We briefly summarize our notations used for Markov decision processes. Further details can be found in textbooks, see e.g. [25, 29] or Chapter 10 in [10].

A Markov decision process (MDP) is a tuple \(\mathcal {M}= (S, Act ,P,s_{ \scriptscriptstyle init }, rew )\) where S is a finite set of states, \( Act \) a finite set of actions, \(s_{ \scriptscriptstyle init }\in S\) the initial state, \(P : S \times Act \times S \rightarrow [0,1] \cap \mathbb {Q}\) is the transition probability function and \( rew : S \times Act \rightarrow \mathbb {N}\) the reward function. We require that \(\sum _{s'\in S} P(s,\alpha ,s') \in \{0,1\}\) for all \((s,\alpha )\in S\times Act \). We write \( Act (s)\) for the set of actions that are enabled in s, i.e., \(\alpha \in Act (s)\) iff \(P(s,\alpha ,\cdot )\) is not the null function. State s is called a trap if \( Act (s)=\varnothing \). The paths of \(\mathcal {M}\) are finite or infinite sequences \(s_0 \, \alpha _0 \, s_1 \, \alpha _1 \, s_2 \, \alpha _2 \ldots \) where states and actions alternate such that \(P(s_i,\alpha _i,s_{i+1}) >0\) for all \(i\geqslant 0\). A path \(\pi \) is called maximal if it is either infinite or finite and its last state is a trap. If \(\pi = s_0 \, \alpha _0 \, s_1 \, \alpha _1 \, s_2 \, \alpha _2 \ldots \alpha _{k-1} \, s_k\) is finite then \( rew (\pi )= rew (s_0,\alpha _0) + rew (s_1,\alpha _1) + \ldots + rew (s_{k-1},\alpha _{k-1})\) denotes the accumulated reward and \( first (\pi )=s_0\), \( last (\pi )=s_k\) its first resp. last state. The size of \(\mathcal {M}\), denoted \( size (\mathcal {M})\), is the sum of the number of states plus the total sum of the logarithmic lengths of the non-zero probability values \(P(s,\alpha ,s')\) and the reward values \( rew (s,\alpha )\).
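To make the notation concrete for readers who like to experiment, here is a minimal Python encoding of such an MDP (a sketch of our own; the class and method names are not taken from the paper or from any tool). It mirrors the tuple \((S, Act ,P,s_{ \scriptscriptstyle init }, rew )\), the enabled actions \( Act (s)\), trap states and the accumulated reward of a finite path.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State, Action = str, str

@dataclass
class MDP:
    states: List[State]
    s_init: State
    P: Dict[Tuple[State, Action, State], float] = field(default_factory=dict)  # P(s, a, t), absent = 0
    rew: Dict[Tuple[State, Action], int] = field(default_factory=dict)         # rew(s, a), absent = 0

    def enabled(self, s: State) -> List[Action]:
        """Act(s): actions a such that P(s, a, .) is not the null function."""
        return sorted({a for (s_, a, _t), p in self.P.items() if s_ == s and p > 0})

    def is_trap(self, s: State) -> bool:
        return not self.enabled(s)

    def path_reward(self, path: List) -> int:
        """Accumulated reward of a finite path s0 a0 s1 a1 ... sk, given as a flat list."""
        return sum(self.rew.get((path[i], path[i + 1]), 0) for i in range(0, len(path) - 1, 2))
```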

An end component of \(\mathcal {M}\) is a strongly connected sub-MDP. End components can be formalized as pairs \(\mathcal {E}= (E,\mathfrak {A})\) where E is a nonempty subset of S and \(\mathfrak {A}\) a function that assigns to each state \(s\in E\) a nonempty subset of \( Act (s)\) such that the graph induced by \(\mathcal {E}\) is strongly connected.

A (randomized) scheduler for \(\mathcal {M}\), often also called policy or adversary, is a function \(\mathfrak {S}\) that assigns to each finite path \(\pi \) where \( last (\pi )\) is not a trap a probability distribution over \( Act ( last (\pi ))\). \(\mathfrak {S}\) is called memoryless if \(\mathfrak {S}(\pi )=\mathfrak {S}(\pi ')\) for all finite paths \(\pi \), \(\pi '\) with \( last (\pi )= last (\pi ')\), in which case \(\mathfrak {S}\) can be viewed as a function that assigns to each non-trap state s a distribution over \( Act (s)\). \(\mathfrak {S}\) is called deterministic if \(\mathfrak {S}(\pi )\) is a Dirac distribution for each path \(\pi \), in which case \(\mathfrak {S}\) can be viewed as a function that assigns an action to each finite path \(\pi \) where \( last (\pi )\) is not a trap. We write \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}\) or briefly \(\mathrm {Pr}^{\mathfrak {S}}_{s}\) to denote the probability measure induced by \(\mathfrak {S}\) and s. Given a measurable set \(\psi \) of maximal paths, then \(\mathrm {Pr}^{\min }_{\mathcal {M},s}(\psi ) = \inf _{\mathfrak {S}} \mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\psi )\) and \(\mathrm {Pr}^{\max }_{\mathcal {M},s}(\psi ) = \sup _{\mathfrak {S}} \mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\psi )\). We will use LTL-like notations to specify measurable sets of maximal paths. For these it is well-known that optimal deterministic schedulers exist. If \(\psi \) is a reachability condition then even optimal deterministic memoryless schedulers exist.
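A deterministic memoryless scheduler is thus simply a map from non-trap states to actions, and it induces a Markov chain. The following sketch (again our own helper, building on the hypothetical MDP class above) approximates \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\Diamond F)\) for such a scheduler by value iteration from below; it is only meant to illustrate the induced probability measure, not the algorithms of this paper.

```python
from typing import Dict

def reach_prob(mdp, scheduler: Dict[str, str], F: set, iters: int = 10_000) -> Dict[str, float]:
    """Approximate Pr^S_s(eventually F) for a deterministic memoryless scheduler S on the
    MDP sketch above, by iterating x(s) := sum_t P(s, S(s), t) * x(t) from the indicator of F."""
    x = {s: (1.0 if s in F else 0.0) for s in mdp.states}
    for _ in range(iters):
        x = {s: (1.0 if s in F else
                 0.0 if mdp.is_trap(s) else
                 sum(mdp.P.get((s, scheduler[s], t), 0.0) * x[t] for t in mdp.states))
             for s in mdp.states}
    return x
```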

Let \(\varnothing \not = F \subseteq S\). For a comparison operator \(\bowtie \ \in \{=,>,\geqslant ,<,\leqslant \}\) and \(r\in \mathbb {N}\), \(\Diamond ^{\bowtie r} F\) denotes the event “reaching F along some finite path \(\pi \) with \( rew (\pi )\bowtie r\)”. We further use the random variable that assigns to each maximal path \(\varsigma \) in \(\mathcal {M}\) the reward \( rew (\pi )\) of the shortest prefix \(\pi \) of \(\varsigma \) where \( last (\pi )\in F\), and the value \(\infty \) if \(\varsigma \not \models \Diamond F\). For \(s\in S\), the expectation of this random variable in \(\mathcal {M}\) with starting state s under \(\mathfrak {S}\) is infinite if \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\Diamond F) <1\); the corresponding maximal expectation is obtained by taking the supremum over all schedulers \(\mathfrak {S}\) with \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\Diamond F)=1\). Let \(\psi \) be a measurable set of maximal paths. The conditional expectation of the random variable under \(\mathfrak {S}\) is taken w.r.t. the conditional probability measure \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\ \cdot \ | \psi )\) given by \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\varphi | \psi ) = \mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\varphi \wedge \psi )/\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\psi )\). The maximal conditional expectation is the supremum of these conditional expectations over all schedulers \(\mathfrak {S}\) with \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\psi )>0\) and \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\Diamond F| \psi )=1\); likewise, \(\mathrm {Pr}^{\max }_{\mathcal {M},s}(\varphi | \psi ) = \sup _{\mathfrak {S}} \mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\varphi | \psi )\) where \(\mathfrak {S}\) ranges over all schedulers with \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\psi ) >0\). As usual, \(\sup \varnothing = -\infty \).

For the remainder of this paper, we suppose that two nonempty subsets F and G of S are given such that \(\mathrm {Pr}^{\max }_{\mathcal {M},s_{ \scriptscriptstyle init }}(\Diamond F |\Diamond G)=1\). The task addressed in this paper is to compute the maximal conditional expectation given by:

$$\begin{aligned} \mathbb {CE}^{\max }_{\mathcal {M},s_{ \scriptscriptstyle init }} \ = \ \sup _{\mathfrak {S}} \ \mathbb {CE}^{\mathfrak {S}}_{\mathcal {M},s_{ \scriptscriptstyle init }} \end{aligned}$$

Here, \(\mathfrak {S}\) ranges over all schedulers with \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s_{ \scriptscriptstyle init }}(\Diamond G)>0\) and \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s_{ \scriptscriptstyle init }}(\Diamond F |\Diamond G)=1\). If \(\mathcal {M}\) and its initial state are clear from the context, we often simply write \(\mathbb {CE}^{\max }\) resp. \(\mathbb {CE}^{\mathfrak {S}}\). We assume that all states in \(\mathcal {M}\) are reachable from \(s_{ \scriptscriptstyle init }\) and \(s_{ \scriptscriptstyle init }\notin F \cup G\) (as \(\mathbb {CE}^{\max }=0\) if \(s_{ \scriptscriptstyle init }\in F\), and the conditional expectation reduces to an unconditional one if \(s_{ \scriptscriptstyle init }\in G \setminus F\)).

3 Finiteness and Upper Bound

Checking Finiteness. We sketch a polynomially time-bounded algorithm that takes as input an MDP \(\mathcal {M}= (S, Act ,P,s_{ \scriptscriptstyle init }, rew )\) with two distinguished subsets F and G of S such that \(\mathrm {Pr}^{\max }_{\mathcal {M},s_{ \scriptscriptstyle init }}(\Diamond F|\Diamond G)=1\). If \(\mathbb {CE}^{\max }_{\mathcal {M},s_{ \scriptscriptstyle init }}\) is infinite then the output is “no”. Otherwise, the output is an MDP \(\hat{\mathcal {M}} = (\hat{S},\hat{ Act },\hat{P},\hat{s}_{ \scriptscriptstyle init },\hat{ rew })\) with two trap states \( goal \) and \( fail \) such that:

  (1) the maximal conditional expectation is preserved, i.e., \(\mathbb {CE}^{\max }_{\mathcal {M},s_{ \scriptscriptstyle init }}\) agrees with the maximal conditional expectation of \(\hat{\mathcal {M}}\) for \(F=G=\{ goal \}\),

  (2) \(\hat{s} \models \exists \Diamond goal \) and \(\mathrm {Pr}^{\min }_{\hat{\mathcal {M}},\hat{s}}\bigl (\Diamond ( goal \vee fail )\bigr )=1\) for all states \(\hat{s}\in \hat{S} \setminus \{ fail \}\), and

  (3) \(\hat{\mathcal {M}}\) does not have critical schedulers, where a scheduler \(\mathfrak {U}\) for \(\hat{\mathcal {M}}\) is said to be critical iff \(\mathrm {Pr}^{\mathfrak {U}}_{\hat{\mathcal {M}},\hat{s}_{ \scriptscriptstyle init }}(\Diamond fail )=1\) and there is a reachable positive \(\mathfrak {U}\)-cycle.

We provide here the main ideas of the algorithm and refer to Appendix C of [13] for the details. The algorithm first transforms \(\mathcal {M}\) into an MDP \(\tilde{\mathcal {M}}\) that permits us to assume \(F=G=\{ goal \}\). Intuitively, \(\tilde{\mathcal {M}}\) simulates \(\mathcal {M}\), while operating in four modes: “normal mode”, “after G”, “after F” and “goal”. \(\tilde{\mathcal {M}}\) starts in normal mode where it behaves as \(\mathcal {M}\) as long as neither F nor G have been visited. If a \(G\setminus F\)-state has been reached in normal mode then \(\tilde{\mathcal {M}}\) switches to the mode “after G”. Likewise, as soon as an \(F\setminus G\)-state has been reached in normal mode then \(\tilde{\mathcal {M}}\) switches to the mode “after F”. \(\tilde{\mathcal {M}}\) enters the goal mode (consisting of a single trap state \( goal \)) as soon as a path fragment containing a state in F and a state in G has been generated. This is the case if \(\mathcal {M}\) visits an F-state in mode “after G” or a G-state in mode “after F”, or a state in \(F \cap G\) in the normal mode. The rewards in the normal mode and in mode “after G” are precisely as in \(\mathcal {M}\), while the rewards are 0 in all other cases. We then remove all states \(\tilde{s}\) in the “after G” mode with \(\mathrm {Pr}^{\max }_{\tilde{\mathcal {M}},\tilde{s}}(\Diamond goal ) <1\), collapse all states \(\tilde{s}\) in \(\tilde{\mathcal {M}}\) with \(\tilde{s}\not \models \exists \Diamond goal \) into a single trap state called \( fail \) and add zero-reward transitions to \( fail \) from all states \(\tilde{s}\) that are not in the “after G” mode and satisfy \(\mathrm {Pr}^{\max }_{\tilde{\mathcal {M}},\tilde{s}}(\Diamond goal ) =0\). Using techniques as in the unconditional case [22] we can check whether \(\tilde{\mathcal {M}}\) has positive end components, i.e., end components with at least one state-action pair \((s,\alpha )\) with \( rew (s,\alpha )>0\). If so, then \(\mathbb {CE}^{\max }=\infty \) and the output is “no”. Otherwise, we collapse each maximal end component of \(\tilde{\mathcal {M}}\) into a single state.

Let \(\hat{\mathcal {M}}\) denote the resulting MDP. It satisfies (1) and (2). Property (3) holds iff \(\mathbb {CE}^{\max }\) is finite. This condition can be checked in polynomial time using a graph analysis in the sub-MDP of \(\hat{\mathcal {M}}\) consisting of the states \(\hat{s}\) with \(\mathrm {Pr}^{\min }_{\hat{\mathcal {M}},\hat{s}}(\Diamond goal )=0\) (see Appendix C of [13]).
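The mode-tracking part of this transformation can be sketched in a few lines of code (an illustrative reconstruction under the assumptions stated above, using the hypothetical MDP class from Sect. 2; the subsequent pruning of “after G” states, the collapsing into \( fail \) and the end-component analysis are omitted):

```python
def mode_product(mdp, F: set, G: set):
    """Sketch of tilde-M (using the MDP class from Sect. 2): simulate M in the modes
    'normal', 'afterG', 'afterF' with a single trap 'goal'.  Rewards are kept in the
    modes 'normal' and 'afterG' and set to 0 otherwise; unreachable combinations
    and the later pruning/collapsing steps are not handled here."""
    def next_mode(mode, t):
        in_F, in_G = t in F, t in G
        if mode == "normal":
            return "goal" if (in_F and in_G) else "afterG" if in_G else "afterF" if in_F else "normal"
        if mode == "afterG":
            return "goal" if in_F else "afterG"
        return "goal" if in_G else "afterF"          # mode == "afterF"

    prod = MDP(states=["goal"], s_init=f"{mdp.s_init}@normal")
    for s in mdp.states:
        for mode in ("normal", "afterG", "afterF"):
            src = f"{s}@{mode}"
            prod.states.append(src)
            for a in mdp.enabled(s):
                prod.rew[(src, a)] = mdp.rew.get((s, a), 0) if mode in ("normal", "afterG") else 0
                for t in mdp.states:
                    p = mdp.P.get((s, a, t), 0.0)
                    if p > 0:
                        m = next_mode(mode, t)
                        dst = "goal" if m == "goal" else f"{t}@{m}"
                        prod.P[(src, a, dst)] = prod.P.get((src, a, dst), 0.0) + p
    return prod
```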

Computing an Upper Bound. Due to the transformation used for checking finiteness of the maximal conditional expectation, we can now suppose that \(\mathcal {M}=\hat{\mathcal {M}}\), \(F=G=\{ goal \}\) and that (2) and (3) hold. We now present a technique to compute an upper bound \(\mathbb {CE}^{\mathrm {ub}}\) for \(\mathbb {CE}^{\max }\). The upper bound will be used later to determine a saturation point from which on optimal schedulers behave memoryless (see Sect. 4).

We consider the MDP \(\mathcal {M}'\) simulating \(\mathcal {M}\), while operating in two modes. In its first mode, \(\mathcal {M}'\) attaches the reward accumulated so far to the states. More precisely, the states of \(\mathcal {M}'\) in its first mode have the form \(\langle s,r\rangle \in S \times \mathbb {N}\) where \(0 \leqslant r \leqslant R\) and \(R = \sum _{s\in S} \max \{ rew (s,\alpha ):\alpha \in Act (s) \}\). The initial state of \(\mathcal {M}'\) is \(s_{ \scriptscriptstyle init }'=\langle s_{ \scriptscriptstyle init },0\rangle \). The reward for the state-action pairs \((\langle s,r\rangle ,\alpha )\) where \(r{+} rew (s,\alpha ) \leqslant R\) is 0. If \(\mathcal {M}'\) fires an action \(\alpha \) in state \(\langle s,r\rangle \) where \(r'\mathop {=}\limits ^{\text {\tiny def}}r{+} rew (s,\alpha ) > R\) then it switches to the second mode, while earning reward \(r'\). In its second mode \(\mathcal {M}'\) behaves as \(\mathcal {M}\) without additional annotations of the states and earning the same rewards as \(\mathcal {M}\). From the states \(\langle goal ,r\rangle \), \(\mathcal {M}'\) moves to \( goal \) with probability 1 and reward r. There is a one-to-one correspondence between the schedulers for \(\mathcal {M}\) and \(\mathcal {M}'\), and the switch from \(\mathcal {M}\) to \(\mathcal {M}'\) does not affect the probabilities and the accumulated rewards until reaching \( goal \).

Let \(\mathcal {N}\) denote the MDP resulting from \(\mathcal {M}'\) by adding reset-transitions from \( fail \) (as a state of the second mode) and the copies \(\langle fail ,r\rangle \) in the first mode to the initial state \(s_{ \scriptscriptstyle init }'\). The reward of all reset transitions is 0. The reset-mechanism has been taken from [11] where it has been introduced as a technique to compute maximal conditional probabilities for reachability properties. Intuitively, \(\mathcal {N}\) “discards” all paths of \(\mathcal {M}'\) that eventually enter \( fail \) and “redistributes” their probabilities to the paths that eventually enter the goal state. In this way, \(\mathcal {N}\) mimics the conditional probability measures \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M}',s_{ \scriptscriptstyle init }'}(\ \cdot \ |\Diamond goal ) = \mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s_{ \scriptscriptstyle init }}(\ \cdot \ |\Diamond goal )\) for prefix-independent path properties. Paths \(\pi \) from \(s_{ \scriptscriptstyle init }\) to \( goal \) in \(\mathcal {M}\) are simulated in \(\mathcal {N}\) by paths of the form \(\varrho = \xi _1; \ldots \xi _k ; \pi \) where \(\xi _i\) is a cycle in \(\mathcal {N}\) with \( first (\xi _i)=s_{ \scriptscriptstyle init }'\) and \(\xi _i\)’s last transition is a reset-transition from some fail-state to \(s_{ \scriptscriptstyle init }'\). Thus, \( rew (\pi ) \leqslant rew _{\mathcal {N}}(\varrho )\). The distinction between the first and second mode together with property (3) ensures that the new reset-transitions do not generate positive end components in \(\mathcal {N}\). By the results of [22], the maximal unconditional expected accumulated reward until reaching \( goal \) in \(\mathcal {N}\) is therefore finite, and it is an upper bound for \(\mathbb {CE}^{\max }\).

Hence, we can use this maximal unconditional expected accumulated reward as the upper bound \(\mathbb {CE}^{\mathrm {ub}}\); it is computable in time polynomial in the size of \(\mathcal {N}\) by the algorithm proposed in [22]. As \( size (\mathcal {N})={ \Theta }(R \cdot size (\mathcal {M}))\) we obtain a pseudo-polynomial time bound for the general case. If, however, \(\mathrm {Pr}^{\min }_{\mathcal {M},s}(\Diamond goal )>0\) for all states \(s\in S \setminus \{ fail \}\) then there is no need for the detour via \(\mathcal {M}'\) and we can apply the reset-transformation \(\mathcal {M}\leadsto \mathcal {N}\) directly by adding a reset-transition from \( fail \) to \(s_{ \scriptscriptstyle init }\) with reward 0, in which case the upper bound is obtained in time polynomial in the size of \(\mathcal {M}\). For details we refer to Appendix C of [13].
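In this simple case, the reset transformation amounts to a single extra zero-reward transition; a minimal sketch (again based on the hypothetical MDP class from Sect. 2, with a fresh action name reset of our choosing):

```python
def add_reset(mdp):
    """Sketch of the reset transformation M ~> N (simple case): redirect 'fail'
    to the initial state with probability 1 and reward 0, using a fresh action 'reset'."""
    n = MDP(states=list(mdp.states), s_init=mdp.s_init, P=dict(mdp.P), rew=dict(mdp.rew))
    n.P[("fail", "reset", n.s_init)] = 1.0     # 'fail' is no longer a trap in N
    n.rew[("fail", "reset")] = 0
    return n
```

The upper bound \(\mathbb {CE}^{\mathrm {ub}}\) is then the maximal unconditional expected accumulated reward until reaching \( goal \) in the resulting MDP, computed with the methods of [22].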

4 Threshold Algorithm and Computing Optimal Schedulers

In what follows, we suppose that \(\mathcal {M}=(S, Act ,P,s_{ \scriptscriptstyle init }, rew )\) is an MDP with two trap states \( goal \) and \( fail \) such that \(s\models \exists \Diamond goal \) for all states \(s\in S \setminus \{ fail \}\), \(\min _{s\in S} \mathrm {Pr}^{\min }_{\mathcal {M},s}(\Diamond ( goal \vee fail ))=1\) and \(\mathbb {CE}^{\max }\) is finite.

A scheduler \(\mathfrak {S}\) is said to be reward-based if \(\mathfrak {S}(\pi )=\mathfrak {S}(\pi ')\) for all finite paths \(\pi \), \(\pi '\) with \(( last (\pi ), rew (\pi )) = ( last (\pi '), rew (\pi '))\). Thus, deterministic reward-based schedulers can be seen as functions \(\mathfrak {S}: S \times \mathbb {N}\rightarrow Act \). We show in Appendix D of [13] that \(\mathbb {CE}^{\max }\) equals the supremum of the values \(\mathbb {CE}^{\mathfrak {S}}\), when ranging over all deterministic reward-based schedulers \(\mathfrak {S}\) with \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s_{ \scriptscriptstyle init }}(\Diamond goal )>0\).

Our algorithms are based on the following two observations. First, there exists a saturation point \(\wp \in \mathbb {N}\) such that the optimal decision for all paths \(\pi \) with \( rew (\pi )\geqslant \wp \) is to maximize the probability for reaching the goal state (see Proposition 4.1 below). The second observation is a technical statement that will be used in several places. Let \(\rho ,\theta ,\zeta ,r,x,y,z,p\in \mathbb {R}\) with \(0\leqslant p,x,y,z \leqslant 1\), \(p>0\), \(y > z\) and \(x+z>0\) and let

$$\begin{aligned} \mathsf {A} = \frac{\displaystyle \rho + p(ry + \theta )}{\displaystyle x+py}, \quad \mathsf {B} = \frac{\displaystyle \rho + p(rz + \zeta )}{\displaystyle x+pz} \quad \text {and} \quad \mathsf {C} = \max \{\mathsf {A},\mathsf {B}\} \end{aligned}$$

Then:

$$\begin{aligned} \mathsf {A} \geqslant \mathsf {B}\quad \text {iff} \quad r + \frac{\theta {-} \zeta }{y {-} z} \geqslant \mathsf {C}\quad \text {iff} \quad \theta - (\mathsf {C}{-}r)y \ \geqslant \ \zeta - (\mathsf {C}{-}r)z \end{aligned}$$
(†)

and the analogous statement for > rather than \(\geqslant \). For details, see Appendix G of [13]. We will apply this observation in several variants. To give an idea how to apply statement (†), suppose \(\mathsf {A} = \mathbb {CE}^{\mathfrak {T}}\) and \(\mathsf {B}=\mathbb {CE}^{\mathfrak {U}}\) where \(\mathfrak {T}\) and \(\mathfrak {U}\) are reward-based schedulers that agree for all paths \(\varrho \) that do not have a prefix \(\pi \) with \( rew (\pi )=r\) where \( last (\pi )\) is a non-trap state, in which case x denotes the probability for reaching \( goal \) from \(s_{ \scriptscriptstyle init }\) along such a path \(\varrho \) and \(\rho \) stands for the corresponding partial expectation, while p denotes the probability of the paths \(\pi \) from \(s_{ \scriptscriptstyle init }\) to some non-trap state with \( rew (\pi )=r\). The crucial observation is that \(r+(\theta {-}\zeta )/(y{-}z)\) does not depend on \(x,\rho ,p\). Thus, if \(r+(\theta {-}\zeta )/(y{-}z) \geqslant \mathbb {CE}^{\mathrm {ub}}\) for some upper bound \(\mathbb {CE}^{\mathrm {ub}}\) of \(\mathbb {CE}^{\max }\) then (†) allows us to conclude that \(\mathfrak {T}\)’s decisions for the state-reward pairs \((s,r)\) are better than \(\mathfrak {U}\)’s, independent of \(x,\rho \) and p.
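Since statement (†) is invoked repeatedly, a quick randomized sanity check (a throwaway script of ours, not part of the paper's proofs) may help to build confidence in the two equivalences:

```python
import random

def check_dagger(trials: int = 100_000) -> None:
    """Sample parameters satisfying the side conditions of (†) and check that the
    three inequalities A >= B, r + (theta-zeta)/(y-z) >= C and
    theta - (C-r)*y >= zeta - (C-r)*z agree with each other."""
    for _ in range(trials):
        rho, theta, zeta, r = (random.uniform(-5, 5) for _ in range(4))
        p = random.uniform(0.01, 1.0)
        y = random.uniform(0.0, 1.0)
        z = random.uniform(0.0, y)
        x = random.uniform(0.0, 1.0)
        if y == z or x + z == 0.0:          # side conditions y > z and x + z > 0
            continue
        A = (rho + p * (r * y + theta)) / (x + p * y)
        B = (rho + p * (r * z + zeta)) / (x + p * z)
        C = max(A, B)
        c1 = A >= B
        c2 = r + (theta - zeta) / (y - z) >= C
        c3 = theta - (C - r) * y >= zeta - (C - r) * z
        assert c1 == c2 == c3, (A, B, C)

check_dagger()
```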

Let \(R\in \mathbb {N}\) and \(\mathfrak {S}\), \(\mathfrak {T}\) be reward-based schedulers. The residual scheduler \(\mathfrak {S} {\uparrow } {R}\) is given by \((\mathfrak {S} {\uparrow } {R})(s,r) = \mathfrak {S}(s,R{+}r)\). \(\mathfrak {S} \lhd _{R} \mathfrak {T}\) denotes the unique scheduler that agrees with \(\mathfrak {S}\) for all state-reward pairs \((s,r)\) where \(r < R\) and \((\mathfrak {S} \lhd _{R} \mathfrak {T}) {\uparrow } {R} = \mathfrak {T}\). We write \(\mathrm {E}^{\mathfrak {S}}_{\mathcal {M},s}\) for the partial expectation

$$\begin{aligned} \mathrm {E}^{\mathfrak {S}}_{\mathcal {M},s} = \sum \limits _{r=0}^{\infty } \ \mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\Diamond ^{=r} goal ) \cdot r \end{aligned}$$

Thus, the partial expectation \(\mathrm {E}^{\mathfrak {T}}_{\mathcal {M},s}\) coincides with the expected accumulated reward until reaching \( goal \) if \(\mathrm {Pr}^{\mathfrak {T}}_{\mathcal {M},s}(\Diamond goal )=1\), while the two differ if \(\mathrm {Pr}^{\mathfrak {T}}_{\mathcal {M},s}(\Diamond goal )<1\) (in which case the latter is infinite).

Proposition 4.1

There exists a natural number \(\wp \) (called saturation point of \(\mathcal {M}\)) and a deterministic memoryless scheduler \(\mathfrak {M}\) such that:

  (a) \(\mathbb {CE}^{\mathfrak {T}} \ \leqslant \ \mathbb {CE}^{\mathfrak {T} \lhd _{\wp } \mathfrak {M}}\) for each scheduler \(\mathfrak {T}\) with \(\mathrm {Pr}^{\mathfrak {T}}_{\mathcal {M},s_{ \scriptscriptstyle init }}(\Diamond goal )>0\), and

  (b) \(\mathbb {CE}^{\mathfrak {S}} \ =\ \mathbb {CE}^{\max }\) for some deterministic reward-based scheduler \(\mathfrak {S}\) such that \(\mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s_{ \scriptscriptstyle init }}(\Diamond goal )>0\) and \(\mathfrak {S} {\uparrow } {\wp }=\mathfrak {M}\).

The proof of Proposition 4.1 (see Appendices E and F of [13]) is constructive and yields a polynomial-time algorithm for generating a scheduler \(\mathfrak {M}\) as in Proposition 4.1 and a pseudo-polynomial algorithm for the computation of a saturation point \(\wp \).

Scheduler \(\mathfrak {M}\) maximizes the probability to reach \( goal \) from each state. If there are two or more such schedulers, then \(\mathfrak {M}\) is one where the conditional expected accumulated reward until reaching goal is maximal under all schedulers \(\mathfrak {U}\) with \(\mathrm {Pr}^{\mathfrak {U}}_{\mathcal {M},s}(\Diamond goal ) = \mathrm {Pr}^{\max }_{\mathcal {M},s}(\Diamond goal )\) for all states s. Such a scheduler \(\mathfrak {M}\) is computable in polynomial time using linear programming techniques. (See Appendix E of [13].)
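The paper computes \(\mathfrak {M}\) via linear programming; as a lightweight illustration of the first of the two requirements, the maximal reachability probabilities \(\mathrm {Pr}^{\max }_{\mathcal {M},s}(\Diamond goal )\) can also be approximated by value iteration (a sketch with our own helper name; extracting an actual maximizing scheduler, and the tie-breaking w.r.t. the conditional expected reward, require the additional care described above and are omitted here):

```python
from typing import Dict

def max_reach_prob(mdp, goal: str = "goal", iters: int = 10_000) -> Dict[str, float]:
    """Approximate Pr^max_s(eventually goal) on the MDP sketch from Sect. 2 by value
    iteration x(s) := max_a sum_t P(s, a, t) * x(t), starting from the indicator of goal."""
    x = {s: (1.0 if s == goal else 0.0) for s in mdp.states}
    for _ in range(iters):
        x = {s: (1.0 if s == goal else
                 max((sum(mdp.P.get((s, a, t), 0.0) * x[t] for t in mdp.states)
                      for a in mdp.enabled(s)), default=0.0))
             for s in mdp.states}
    return x
```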

The idea for the computation of the saturation point is to compute the threshold \(\wp \) above which the scheduler \(\mathfrak {M}\) becomes optimal. For this we rely on statement (†), where \(\theta /y\) stands for the conditional expectation under \(\mathfrak {M}\), \(\zeta /z\) for the conditional expectation under an arbitrary scheduler \(\mathfrak {S}\) and \(\mathsf {C}=\mathbb {CE}^{\mathrm {ub}}\) is an upper bound of \(\mathbb {CE}^{\max }\) (see Theorem 1), while \(r=\wp \) is the wanted value. More precisely, for \(s\in S\), let \(\theta _s=\mathrm {E}^{\mathfrak {M}}_{\mathcal {M},s}\) and \(y_s = \mathrm {Pr}^{\mathfrak {M}}_{\mathcal {M},s}(\Diamond goal ) = \mathrm {Pr}^{\max }_{\mathcal {M},s}(\Diamond goal )\). To compute a saturation point we determine the smallest value \(\wp \in \mathbb {N}\) such that

$$\begin{aligned} \theta _s - (\mathbb {CE}^{\mathrm {ub}}{-}\wp )\cdot y_s = \max \limits _{\mathfrak {S}} \ \bigl ( \ \mathrm {E}^{\mathfrak {S}}_{\mathcal {M},s} - (\mathbb {CE}^{\mathrm {ub}} {-}\wp )\cdot \mathrm {Pr}^{\mathfrak {S}}_{\mathcal {M},s}(\Diamond goal ) \ \bigr ) \end{aligned}$$

for all states s where \(\mathfrak {S}\) ranges over all schedulers for \(\mathcal {M}\). In Appendix F of [13] we show that instead of the maximum over all schedulers \(\mathfrak {S}\) it suffices to take the local maximum over all “one-step-variants” of \(\mathfrak {M}\). That is, a saturation point is obtained by \(\wp = \max \{ \lceil \mathbb {CE}^{\mathrm {ub}} - D \rceil , 0 \} \) where

$$\begin{aligned} D \ = \ \min \ \bigl \{ (\theta _s - \theta _{s,\alpha })/(y_s - y_{s,\alpha }) \, : \, s \in S, \alpha \in Act (s), y_{s,\alpha } < y_s \bigr \} \end{aligned}$$

and \(y_{s,\alpha } = \sum \limits _{t\in S} P(s,\alpha ,t)\cdot y_t\) and \(\theta _{s,\alpha } = rew (s,\alpha )\cdot y_{s,\alpha } + \sum \limits _{t\in S} P(s,\alpha ,t)\cdot \theta _t\).
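A direct transcription of this formula into code might look as follows (a sketch; it assumes that the values \(\theta _s\), \(y_s\) of the scheduler \(\mathfrak {M}\) have already been computed, and it reuses the hypothetical MDP class from Sect. 2):

```python
import math
from typing import Dict

def saturation_point(mdp, CE_ub: float, y: Dict[str, float], theta: Dict[str, float]) -> int:
    """Compute wp = max{ceil(CE_ub - D), 0} with
    D = min{(theta_s - theta_{s,a}) / (y_s - y_{s,a}) : s in S, a in Act(s), y_{s,a} < y_s}."""
    candidates = []
    for s in mdp.states:
        for a in mdp.enabled(s):
            y_sa = sum(mdp.P.get((s, a, t), 0.0) * y[t] for t in mdp.states)
            th_sa = mdp.rew.get((s, a), 0) * y_sa + \
                    sum(mdp.P.get((s, a, t), 0.0) * theta[t] for t in mdp.states)
            if y_sa < y[s]:
                candidates.append((theta[s] - th_sa) / (y[s] - y_sa))
    if not candidates:       # every action maximizes the reachability probability;
        return 0             # placeholder for this edge case (not covered by the formula above)
    return max(math.ceil(CE_ub - min(candidates)), 0)
```

For \(\mathcal {M}[\mathfrak {r}]\) with the values from Example 4.2 below, the only candidate quotient is \((0-\frac{1}{2})/(1-\frac{1}{2})=-1\), so the function returns \(\lceil \mathbb {CE}^{\mathrm {ub}}+1\rceil \).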

Example 4.2

The saturation point obtained in this way for the MDP \(\mathcal {M}[\mathfrak {r}]\) in Fig. 1 is \(\wp = \lceil \mathbb {CE}^{\mathrm {ub}}{+}1\rceil \). Note that only state \(s=s_2\) behaves nondeterministically, and \(\mathfrak {M}(s)=\alpha \), \(y_s=y_{s,\alpha }=1\), \(\theta _s= \theta _{s,\alpha }=0\), while \(y_{s,\beta }=\theta _{s,\beta }=\frac{1}{2}\). This yields \(D = (0{-}\frac{1}{2})/(1{-}\frac{1}{2})=-1\). Thus, \(\wp \geqslant \mathfrak {r}{+}2\) as \(\mathbb {CE}^{\mathrm {ub}}\geqslant \mathbb {CE}^{\max } > \mathfrak {r}\). \({\scriptscriptstyle \blacksquare }\)

The logarithmic length of \(\wp \) is polynomial in the size of \(\mathcal {M}\). Thus, the value of \(\wp \) (i.e., the length of a unary encoding) can be exponential in \( size (\mathcal {M})\). This is unavoidable as there are families \((\mathcal {M}_k)_{k \in \mathbb {N}}\) of MDPs where the size of \(\mathcal {M}_k\) is in \(\mathcal {O}(k)\), while \(2^k\) is a lower bound for the smallest saturation point of \(\mathcal {M}_k\). This, for instance, applies to the MDPs \(\mathcal {M}_k = \mathcal {M}[2^k]\) where \(\mathcal {M}[\mathfrak {r}]\) is as in Fig. 1. Recall from Example 1.1 that the scheduler \(\mathfrak {S}_{\mathfrak {r}{+}2}\) that selects \(\beta \) for the first \(\mathfrak {r}{+} 2\) visits of s and \(\alpha \) for the \((\mathfrak {r}{+} 3)\)-rd visit of s is optimal for \(\mathcal {M}[\mathfrak {r}]\). Hence, the smallest saturation point for \(\mathcal {M}[2^k]\) is \(2^k{+}2\).

Threshold Algorithm. The input of the threshold algorithm is an MDP \(\mathcal {M}\) as above and a non-negative rational number \(\vartheta \). The task is to generate a deterministic reward-based scheduler \(\mathfrak {S}\) with \(\mathfrak {S} {\uparrow } {\wp }=\mathfrak {M}\) (where \(\mathfrak {M}\) and \(\wp \) are as in Proposition 4.1) such that \(\mathbb {CE}^{\mathfrak {S}} > \vartheta \) if \(\mathbb {CE}^{\max } > \vartheta \), and \(\mathbb {CE}^{\mathfrak {S}} = \vartheta \) if \(\mathbb {CE}^{\max } = \vartheta \). If \(\mathbb {CE}^{\max } < \vartheta \) then the output of the threshold algorithm is “no”.

The algorithm operates level-wise and determines feasible actions \( action (s,r)\) for all non-trap states s and \(r=\wp {-}1,\wp {-}2,\ldots ,0\), using the decisions \( action (\cdot ,i)\) for the levels \(i \in \{r{+}1,\ldots ,\wp \}\) that have been treated before and linear programming techniques to treat zero-reward loops. In this context, feasibility is understood with respect to the following condition: If \(\mathbb {CE}^{\max } \unrhd \vartheta \) where \(\unrhd \in \{>,\geqslant \}\) then there exists a reward-based scheduler \(\mathfrak {S}\) with \(\mathbb {CE}^{\mathfrak {S}} \unrhd \vartheta \) and \(\mathfrak {S}(s,R)= action (s,\min \{\wp ,R\})\) for all \(R \geqslant r\).

The algorithm stores for each state-reward pair \((s,r)\) the probabilities \(y_{s,r}\) to reach \( goal \) from s and the corresponding partial expectation \(\theta _{s,r}\) for the scheduler given by the decisions in the action table. The values for \(r=\wp \) are given by \( action (s,\wp )=\mathfrak {M}(s)\), \(y_{s,\wp }=\mathrm {Pr}^{\mathfrak {M}}_{\mathcal {M},s}(\Diamond goal )\) and \(\theta _{s,\wp }=\mathrm {E}^{\mathfrak {M}}_{\mathcal {M},s}\). The candidates for the decisions at level \(r < \wp \) are given by the deterministic memoryless schedulers \(\mathfrak {P}\) for \(\mathcal {M}\). We write \(\mathfrak {P}_{+}\) for the reward-based scheduler given by \(\mathfrak {P}_{+}(s,0)=\mathfrak {P}(s)\) and \(\mathfrak {P}_{+}(s,i)= action (s,\min \{\wp ,r{+}i\})\) for \(i \geqslant 1\). Let \(y_{s,r,\mathfrak {P}} = \mathrm {Pr}^{\mathfrak {P}_{+}}_{\mathcal {M},s}(\Diamond goal )\) be the corresponding reachability probability and \(\theta _{s,r,\mathfrak {P}} = \mathrm {E}^{\mathfrak {P}_{+}}_{\mathcal {M},s}\) the corresponding partial expectation.

To determine feasible actions for level r, the threshold algorithm makes use of a variant of (†) stating that if \(\theta - (\vartheta {-} r)y \geqslant \zeta - (\vartheta {-}r)z\) and \(\mathsf {B} \unrhd \vartheta \) then \(\mathsf {A} \unrhd \vartheta \), where \(\mathsf {A}\) and \(\mathsf {B}\) are as in (†) and the requirement \(y>z\) is dropped. Thus, the aim of the threshold algorithm is to compute a deterministic memoryless scheduler \(\mathfrak {P}^*\) for \(\mathcal {M}\) such that the following condition (*) holds:

$$\begin{aligned} \theta _{s,r,\mathfrak {P}^*} - (\vartheta {-} r)\cdot y_{s,r,\mathfrak {P}^*} = \max \limits _{\mathfrak {P}} \ \bigl ( \ \theta _{s,r,\mathfrak {P}} - (\vartheta {-} r)\cdot y_{s,r,\mathfrak {P}} \ \bigr ) \end{aligned}$$
(*)

Such a scheduler \(\mathfrak {P}^*\) is computable in time polynomial in the size of \(\mathcal {M}\) (without the explicit consideration of all schedulers \(\mathfrak {P}\) and their extensions \(\mathfrak {P}_{+}\)) using the following linear program with one variable \(x_s\) for each state. The objective is to minimize \(\sum \limits _{s\in S} x_s\) subject to the following conditions:

  (1) If \(s \in S \setminus \{ goal , fail \}\) then for each action \(\alpha \in Act (s)\) with \( rew (s,\alpha )=0\):

    $$x_s \geqslant \sum \limits _{t\in S} P(s,\alpha ,t) \cdot x_t$$

  (2) If \(s \in S \setminus \{ goal , fail \}\) then for each action \(\alpha \in Act (s)\) with \( rew (s,\alpha )>0\):

    $$\begin{aligned} x_s \geqslant \sum \limits _{t\in S} P(s,\alpha ,t) \cdot \bigl ( \, \theta _{t,R} + rew (s,\alpha ) \cdot y_{t,R} \, - \, (\vartheta {-} r) \cdot y_{t,R} \, \bigr ) \end{aligned}$$

    where \(R=\min \{\wp ,r{+} rew (s,\alpha )\}\)

  (3) For the trap states: \(x_{ goal } = r-\vartheta \) and \(x_{ fail }=0\).
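To make the shape of this linear program concrete, here is a sketch of how it could be assembled with scipy.optimize.linprog (our own helper names: theta_thr stands for \(\vartheta \), sat for \(\wp \), and y_tab, theta_tab for the tables of values \(y_{t,R}\) and \(\theta _{t,R}\) from the previously treated levels, assumed to contain the natural entries 1, 0 for \( goal \) and 0, 0 for \( fail \)). It is an illustration of constraints (1)–(3) on top of the MDP sketch from Sect. 2, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import linprog
from typing import Dict, Tuple

def threshold_lp(mdp, r: int, theta_thr: float, sat: int,
                 y_tab: Dict[Tuple[str, int], float],
                 theta_tab: Dict[Tuple[str, int], float]) -> Dict[str, float]:
    """Solve the LP (1)-(3) for reward level r and threshold theta_thr."""
    fixed = {"goal": r - theta_thr, "fail": 0.0}                 # constraint (3)
    var_states = [s for s in mdp.states if s not in fixed]
    idx = {s: i for i, s in enumerate(var_states)}
    A_ub, b_ub = [], []
    for s in var_states:
        for a in mdp.enabled(s):
            rew_sa = mdp.rew.get((s, a), 0)
            row = np.zeros(len(var_states))
            row[idx[s]] = -1.0                                   # encode x_s >= ... as -x_s + ... <= ...
            if rew_sa == 0:                                      # constraint (1)
                rhs = 0.0
                for t in mdp.states:
                    p = mdp.P.get((s, a, t), 0.0)
                    if t in fixed:
                        rhs -= p * fixed[t]
                    else:
                        row[idx[t]] += p
                A_ub.append(row); b_ub.append(rhs)
            else:                                                # constraint (2)
                R = min(sat, r + rew_sa)
                const = sum(mdp.P.get((s, a, t), 0.0) *
                            (theta_tab[(t, R)] + (rew_sa - (theta_thr - r)) * y_tab[(t, R)])
                            for t in mdp.states)
                A_ub.append(row); b_ub.append(-const)
    res = linprog(c=np.ones(len(var_states)), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * len(var_states))       # objective: minimize sum of x_s
    sol = dict(fixed)
    sol.update({s: float(res.x[idx[s]]) for s in var_states})
    return sol
```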

This linear program has a unique solution \((x_s^*)_{s\in S}\). Let \( Act ^*(s)\) denote the set of actions \(\alpha \in Act (s)\) such that the following constraints (E1) and (E2) hold:

$$\begin{aligned} \text {(E1)}&\text {If} \,\, rew (s,\alpha )=0 \,\,\text {then: } x_s^* = \sum _{t\in S} P(s,\alpha ,t)\cdot x_t^* \\ \text {(E2)}&\text {If}\,\, rew (s,\alpha )>0\,\, \text {and}\,\, R\, =\, \min \, \bigl \{\, \wp ,\, r{+} rew (s,\alpha )\, \bigr \}\,\, \text {then:} \\&\quad \quad \ x_s^* \ = \ \sum _{t\in S} P(s,\alpha ,t)\cdot \bigl ( \theta _{t,R} + rew (s,\alpha )\cdot y_{t,R} \, - \, (\vartheta {-} r)\cdot y_{t,R} \bigr ) \end{aligned}$$

Let \(\mathcal {M}^*=\mathcal {M}^*_{r,\vartheta }\) denote the MDP with state space S induced by the state-action pairs \((s,\alpha )\) with \(\alpha \in Act ^*(s)\) where the positive-reward actions are redirected to the trap states. Formally, for \(s,t\in S\), \(\alpha \in Act ^*(s)\) we let \(P_{\mathcal {M}^*}(s,\alpha ,t)=P(s,\alpha ,t)\) if \( rew (s,\alpha )=0\) and \(P_{\mathcal {M}^*}(s,\alpha , goal ) = \sum _{t\in S} P(s,\alpha ,t) \cdot y_{t,R}\) and \(P_{\mathcal {M}^*}(s,\alpha , fail ) = 1 - P_{\mathcal {M}^*}(s,\alpha , goal )\) if \( rew (s,\alpha )>0\) and \(R=\min \{\wp ,r{+} rew (s,\alpha )\}\). The reward structure of \(\mathcal {M}^*\) is irrelevant for our purposes.

A scheduler \(\mathfrak {P}^*\) satisfying (*) is obtained by computing a memoryless deterministic scheduler for \(\mathcal {M}^*\) with \(\mathrm {Pr}^{\mathfrak {P}^*}_{\mathcal {M}^*,s}(\Diamond goal ) = \mathrm {Pr}^{\max }_{\mathcal {M}^*,s}(\Diamond goal )\) for all states s. This scheduler \(\mathfrak {P}^*\) indeed provides feasible decisions for level r, i.e., if \(\mathbb {CE}^{\max } \unrhd \vartheta \) where \(\unrhd \in \{>,\geqslant \}\) then there exists a reward-based scheduler \(\mathfrak {S}\) with \(\mathbb {CE}^{\mathfrak {S}} \unrhd \vartheta \), \(\mathfrak {S}(s,r)=\mathfrak {P}^*(s)\) and \(\mathfrak {S}(s,R)= action (s,\min \{\wp ,R\})\) for all \(R > r\).

The threshold algorithm then puts \( action (s,r)=\mathfrak {P}^*(s)\) and computes the values \(y_{s,r}\) and \(\theta _{s,r}\) as follows. Let T denote the set of states \(s\in S \setminus \{ goal , fail \}\) where \( rew (s,\mathfrak {P}^*(s))>0\). For \(s\in T\), the values \(y_{s,r}=y_{s,r,\mathfrak {P}^*}\) and \(\theta _{s,r}= \theta _{s,r,\mathfrak {P}^*}\) can be derived directly from the results obtained for the previously treated levels \(r{+}1,\ldots ,\wp \) as we have:

$$\begin{aligned} y_{s,r} = \sum \limits _{t\in S} P(s,\alpha ,t)\cdot y_{t,R} \quad \text {and} \quad \theta _{s,r} = rew (s,\alpha )\cdot y_{s,r} + \sum \limits _{t\in S} P(s,\alpha ,t)\cdot \theta _{t,R} \end{aligned}$$

where \(\alpha = \mathfrak {P}^*(s)\) and \(R = \min \{\wp , r{+} rew (s,\alpha )\}\). For the states \(s\in S \setminus T\):

$$\begin{aligned} y_{s,r} = \sum \limits _{t\in T} \mathrm {Pr}^{\mathfrak {P}^*}_{\mathcal {M},s}(\lnot T {{\mathrm{\mathrm {U}}}}t) \cdot y_{t,r} \quad \text {and} \quad \theta _{s,r} = \sum \limits _{t\in T} \mathrm {Pr}^{\mathfrak {P}^*}_{\mathcal {M},s}(\lnot T {{\mathrm{\mathrm {U}}}}t) \cdot \theta _{t,r} \end{aligned}$$

Having treated the last level \(r=0\), the output of the algorithm is as follows. Let \(\mathfrak {S}\) be the scheduler given by the action table \( action (\cdot )\). For the conditional expectation we have \(\mathbb {CE}^{\mathfrak {S}} = \theta _{s_{ \scriptscriptstyle init },0}/y_{s_{ \scriptscriptstyle init },0}\) if \(y_{s_{ \scriptscriptstyle init },0}>0\). If \(y_{s_{ \scriptscriptstyle init },0}=0\) or \(\theta _{s_{ \scriptscriptstyle init },0}/y_{s_{ \scriptscriptstyle init },0} < \vartheta \) then the algorithm returns the answer “no”. Otherwise, the algorithm returns \(\mathfrak {S}\), in which case \(\mathbb {CE}^{\mathfrak {S}} > \vartheta \) or \(\mathbb {CE}^{\mathfrak {S}}=\vartheta = \mathbb {CE}^{\max }\). Proofs for the soundness and the pseudo-polynomial time complexity are provided in Appendix G of [13].

Example 4.3

For the MDP \(\mathcal {M}[\mathfrak {r}]\) in Example 1.1, scheduler \(\mathfrak {M}\) selects action \(\alpha \) for state \(s=s_2\). Thus, \( action (s,\wp )=\alpha \) for the computed saturation point \(\wp \geqslant \mathfrak {r}+ 2\) (see Example 4.2). For each positive rational threshold \(\vartheta \), the threshold algorithm computes for each level \(r=\wp {-}1, \wp {-}2,\ldots ,1,0\) with \( action (s,r{+}1)=\alpha \) the value \(x_s^*=\max \{ r{-}\vartheta , \frac{1}{2}+\frac{1}{2}(r{-}\vartheta ) \}\) and the action set \( Act ^*(s)=\{\alpha \}\) if \(r>\vartheta {+}1\), \( Act ^*(s)=\{\alpha ,\beta \}\) if \(r=\vartheta {+}1\) and \( Act ^*(s)=\{\beta \}\) if \(r<\vartheta {+}1\). Thus, if \(n=\min \{ \wp , \lceil \vartheta {+} 1 \rceil \}\) then \( action (s,r)=\alpha \), \(y_{s,r}=1\), \(\theta _{s,r}=0\) for \(r \in \{n,\ldots ,\wp \}\), while \( action (s,n{-}k)=\beta \), \(y_{s,n{-}k}=1/2^k\), \(\theta _{s,n{-}k}=k/2^k\) for \(k=1,\ldots ,n\). That is, the threshold algorithm computes the scheduler \(\mathfrak {S}_n\) that selects \(\beta \) for the first n visits of s and \(\alpha \) for the \((n{+}1)\)-st visit of s. Thus, if \(\mathfrak {r}\leqslant \vartheta < \mathfrak {r}{+} 1\) then \(n = \mathfrak {r}{+} 2\), in which case the computed scheduler \(\mathfrak {S}_n\) is optimal (see Example 1.1). The returned answer depends on whether \(\vartheta \leqslant \mathbb {CE}^{\max }\). If, for instance, \(\vartheta = \frac{\mathfrak {r}}{2}\) and \(\mathfrak {r}>0\) is even then the threshold algorithm returns the scheduler \(\mathfrak {S}_{n}\) where \(n=\frac{\mathfrak {r}}{2}{+}1\), whose conditional expectation is \(\mathfrak {r}- (\frac{\mathfrak {r}}{2}{-}1)/(2^{\frac{\mathfrak {r}}{2}+1}{+}1) > \frac{\mathfrak {r}}{2}=\vartheta \). \({\scriptscriptstyle \blacksquare }\)

MDPs Without Zero-Reward Cycles and Acyclic MDPs. If \(\mathcal {M}\) does not contain zero-reward cycles then there is no need for the linear program. Instead we can use a topological sorting of the states in the graph of the sub-MDP consisting of the zero-reward actions and determine a scheduler \(\mathfrak {P}^*\) satisfying (*) directly. For acyclic MDPs, there is not even a need for a saturation point. We can explore \(\mathcal {M}\) using a recursive procedure and determine feasible decisions for each reachable state-reward pair \((s,r)\) on the basis of (*). This yields a polynomially space-bounded algorithm to decide whether \(\mathbb {CE}^{\max } \unrhd \vartheta \) in acyclic MDPs. (See Appendix I of [13].)

Construction of an Optimal Scheduler. Let \( ThresAlgo [\vartheta ]\) denote the scheduler that is generated by calling the threshold algorithm for the threshold value \(\vartheta \). A simple approach is to apply the threshold algorithm iteratively:

  • let \(\mathfrak {S}\) be the scheduler \(\mathfrak {M}\) as in Proposition 4.1;

  • REPEAT \(\vartheta := \mathbb {CE}^{\mathfrak {S}}\); \(\mathfrak {S}:= ThresAlgo [\vartheta ]\) UNTIL \(\vartheta = \mathbb {CE}^{\mathfrak {S}}\);

  • return \(\vartheta \) and \(\mathfrak {S}\)

The above algorithm generates a sequence of deterministic reward-based schedulers that are memoryless from \(\wp \) on with strictly increasing conditional expectations. The number of such schedulers is bounded by \( md ^{\wp }\) where \( md \) denotes the number of memoryless deterministic schedulers for \(\mathcal {M}\). Hence, the algorithm terminates and correctly returns \(\mathbb {CE}^{\max }\) and an optimal scheduler. As \( md \) can be exponential in the number of states, this simple algorithm has double-exponential time complexity.
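In code, this naive procedure is just a fixed-point loop around the threshold algorithm (a sketch; thres_algo and cond_exp stand for the routines described in this section and are not actual APIs):

```python
def naive_optimum(scheduler_M, thres_algo, cond_exp):
    """Iterate: raise the threshold to the conditional expectation of the current
    scheduler until the threshold algorithm cannot improve on it any more."""
    scheduler = scheduler_M                      # start with the scheduler M of Proposition 4.1
    while True:
        threshold = cond_exp(scheduler)          # theta := CE^S
        scheduler = thres_algo(threshold)        # S := ThresAlgo[theta]
        if cond_exp(scheduler) == threshold:
            return threshold, scheduler
```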

To obtain a (single) exponential-time algorithm, we seek better (larger, but still promising) threshold values than the conditional expectation of the current scheduler. We propose an algorithm that operates level-wise and freezes optimal decisions for levels \(r=\wp ,\wp {-}1,\wp {-}2,\ldots ,1,0\). The algorithm maintains and successively improves a left-closed and right-open interval \(I = [A,B[\) with \(\mathbb {CE}^{\max }\in I\) and \(\mathbb {CE}^{\mathfrak {S}} \in I\) for the current scheduler \(\mathfrak {S}\).

Initialization. The algorithm starts with the scheduler \(\mathfrak {S}= ThresAlgo [\mathbb {CE}^{\mathfrak {M}}]\) where \(\mathfrak {M}\) is as above. If \(\mathbb {CE}^{\mathfrak {S}}=\mathbb {CE}^{\mathfrak {M}}\) then the algorithm immediately terminates. Suppose now that \(\mathbb {CE}^{\mathfrak {S}} > \mathbb {CE}^{\mathfrak {M}}\). The initial interval is \(I = [A,B[\) where \(A = \mathbb {CE}^{\mathfrak {S}}\) and \(B = \mathbb {CE}^{\mathrm {ub}}{+}1\) where \(\mathbb {CE}^{\mathrm {ub}}\) is as in Theorem 1.

Level-wise Scheduler Improvement. The algorithm successively determines optimal decisions for the levels \(r=\wp {-}1,\wp {-}2,\ldots ,1,0\). The treatment of level r consists of a sequence of scheduler-improvement steps where at the same time the interval I is replaced with proper sub-intervals. The current scheduler \(\mathfrak {S}\) has been obtained by the last successful run of the threshold algorithm, i.e., it has the form \(\mathfrak {S}= ThresAlgo [\vartheta ]\) where \(\mathbb {CE}^{\mathfrak {S}} > \vartheta \). Besides the decisions of \(\mathfrak {S}\) (i.e., the actions \(\mathfrak {S}(s,R)\) for all state-reward pairs \((s,R)\) where \(s\in S \setminus \{ goal , fail \}\) and \(R\in \{0,1,\ldots ,\wp \}\)), the algorithm also stores the values \(y_{s,R}\) and \(\theta _{s,R}\) that have been computed in the threshold algorithm. For the current level r, the algorithm also computes for each state \(s\in S \setminus \{ goal , fail \}\) and each action \(\alpha \in Act (s)\) the values \(y_{s,r,\alpha } = \sum _{t\in S} P(s,\alpha ,t) \cdot y_{t,R}\) and \(\theta _{s,r,\alpha } = rew (s,\alpha ) \cdot y_{s,r,\alpha } \ + \ \sum _{t\in S} P(s,\alpha ,t) \cdot \theta _{t,R}\) where \(R = \min \{ \wp , r+ rew (s,\alpha )\}\).

Scheduler-improvement Step. Let r be the current level, \(I = [A,B[\) the current interval and \(\mathfrak {S}\) the current scheduler with \(\mathbb {CE}^{\max }\in I\). At the beginning of the scheduler-improvement step we have \(\mathbb {CE}^{\mathfrak {S}} =A\). Let

$$ \begin{array}{rcl} \mathcal {I}_{\mathfrak {S},r} & = & \Bigl \{ \ r + \frac{\theta _{s,r} - \theta _{s,r,\alpha }}{y_{s,r}-y_{s,r,\alpha }} \ : \ s \in S \setminus \{ goal , fail \}, \ \alpha \in Act (s), \ y_{s,r} > y_{s,r,\alpha } \ \Bigr \} \\ \\ \mathcal {I}^{\uparrow }_{\mathfrak {S},r} & = & \bigl \{ \ d \in \mathcal {I}_{\mathfrak {S},r} \ : \ d \ \geqslant \ \mathbb {CE}^{\mathfrak {S}} \bigr \} \quad \quad \quad \mathcal {I}^{B}_{\mathfrak {S},r} \ = \ \bigl \{ \ d \in \mathcal {I}_{\mathfrak {S},r} \ : \ d < B \ \bigr \} \end{array} $$

Intuitively, the values \(d \in \mathcal {I}^B_{\mathfrak {S},r}\) are the “most promising” threshold values, as according to statement (†) these are the points where the decision of the current scheduler \(\mathfrak {S}\) for some state-reward pair \((s,r)\) can be improved, provided that \(\mathbb {CE}^{\max } > d\). (Note that the values in \(\mathcal {I}_{\mathfrak {S},r}\setminus \mathcal {I}^B_{\mathfrak {S},r}\) can be discarded as \(\mathbb {CE}^{\max } < B\).)

The algorithm proceeds as follows. If \(\mathcal {I}^{B}_{\mathfrak {S},r} = \varnothing \) then no further improvements at level r are possible as the function \(\mathfrak {P}^* = \mathfrak {S}(\cdot ,r)\) satisfies (*) for the (still unknown) value \(\vartheta =\mathbb {CE}^{\max }\). See Appendix H of [13]. In this case:

  • If \(r=0\) then the algorithm terminates with the answer \(\mathbb {CE}^{\max }=\mathbb {CE}^{\mathfrak {S}}\) and \(\mathfrak {S}\) as an optimal scheduler.

  • If \(r > 0\) then the algorithm goes to the next level \(r{-}1\) and performs the scheduler-improvement step for \(\mathfrak {S}\) at level \(r{-}1\).

Suppose now that \(\mathcal {I}^B_{\mathfrak {S},r}\) is nonempty. Let \(\mathcal {K}= \mathcal {I}^{\uparrow }_{\mathfrak {S},r} \cup \{\mathbb {CE}^{\mathfrak {S}}\}\). The algorithm seeks the largest value \(\vartheta ' \in \mathcal {K}\cap I\) such that \(\mathbb {CE}^{\max }\geqslant \vartheta '\). More precisely, it successively calls the threshold algorithm for the threshold value \(\vartheta '=\max (\mathcal {K}\cap I)\) and performs the following steps for the generated scheduler \(\mathfrak {S}' = ThresAlgo [\vartheta ']\):

  • If the result of the threshold algorithm is “no” and \(\mathrm {Pr}^{\mathfrak {S}'}_{\mathcal {M},s_{ \scriptscriptstyle init }}(\Diamond goal )\) is positive (in which case \(\mathbb {CE}^{\mathfrak {S}'} \leqslant \mathbb {CE}^{\max } < \vartheta '\)), then:

    • If \(\mathbb {CE}^{\mathfrak {S}'} \leqslant A\) then the algorithm refines I by putting \(B:=\vartheta '\).

    • If \(\mathbb {CE}^{\mathfrak {S}'} > A\) then the algorithm refines I by putting \(A := \mathbb {CE}^{\mathfrak {S}'}\), \(B := \vartheta '\) and adds \(\mathbb {CE}^{\mathfrak {S}'}\) to \(\mathcal {K}\). (Note that then \(\mathbb {CE}^{\mathfrak {S}'}\in \mathcal {K}\cap I\), while \(\mathbb {CE}^{\mathfrak {S}} \in \mathcal {K}\setminus I\).)

  • Suppose now that \(\mathbb {CE}^{\mathfrak {S}'} \geqslant \vartheta '\). The algorithm terminates if \(\mathbb {CE}^{\mathfrak {S}'} = \vartheta '\), in which case \(\mathfrak {S}'\) is optimal. Otherwise, i.e., if \(\mathbb {CE}^{\mathfrak {S}'} > \vartheta '\), then the algorithm aborts the loop by putting \(\mathcal {K}:= \varnothing \), refines the interval I by putting \(A := \mathbb {CE}^{\mathfrak {S}'}\), updates the current scheduler by setting \(\mathfrak {S}:= \mathfrak {S}'\) and performs the next scheduler-improvement step.

The soundness proof and complexity analysis can be found in Appendix H of [13], where (among others) we show that the scheduler-improvement step for schedulers \(\mathfrak {S}\) with \(\mathbb {CE}^{\mathfrak {S}}<\mathbb {CE}^{\max }\) terminates with some scheduler \(\mathfrak {S}'\) such that \(\mathbb {CE}^{\mathfrak {S}} < \mathbb {CE}^{\mathfrak {S}'}\). The total number of calls of the threshold algorithm is in \(\mathcal {O}(\wp \cdot md \cdot |S|\cdot | Act |)\). This yields an exponential time bound as stated in Theorem 3.

Example 4.4

We regard again the MDP \(\mathcal {M}[\mathfrak {r}]\) of Example 1.1 where we suppose \(\mathfrak {r}\) is positive and even. The algorithm first computes \(\mathbb {CE}^{\mathrm {ub}}\) (see Sect. 3), a saturation point \(\wp \geqslant \mathfrak {r}{+} 2\) (see Example 4.2), the scheduler \(\mathfrak {M}\), its conditional expectation \(\mathbb {CE}^{\mathfrak {M}}=\frac{\mathfrak {r}}{2}\) and the scheduler \(\mathfrak {S}= ThresAlgo [\frac{\mathfrak {r}}{2}]\). The initial interval is \(I=[A,B[\) where \(A=\mathbb {CE}^{\mathfrak {S}} = \mathfrak {r}- (\frac{\mathfrak {r}}{2}{-}1)/(2^{\frac{\mathfrak {r}}{2}+1}{+}1)\) (see Example 4.3) and \(B = \mathbb {CE}^{\mathrm {ub}}{+}1\). The scheduler improvement step for \(\mathfrak {S}\) at levels \(r=\wp {-}1,\ldots ,\mathfrak {r}{+}1\) determines the set \(\mathcal {I}_{\mathfrak {S},r}=\{r{-}1\}\) and calls the threshold algorithm for \(\vartheta '=r{-}1\). These calls are not successful for \(r=\wp {-}1,\ldots ,\mathfrak {r}{+}2\). That is, the scheduler \(\mathfrak {S}\) remains unchanged and the upper bound B is successively improved to \(r{-}1\). At level \(r=\mathfrak {r}{+}1\), the threshold algorithm is called for \(\vartheta ' = \mathfrak {r}\), which yields the optimal scheduler \(\mathfrak {S}'= ThresAlgo [\vartheta ']\) (see Example 4.3). \({\scriptscriptstyle \blacksquare }\)

Implementation and Experiments. We have implemented the algorithms presented in this paper as a prototypical extension of the model checker PRISM [27, 28] and carried out initial experiments to demonstrate the general feasibility of our approach (see https://wwwtcs.inf.tu-dresden.de/ALGI/PUB/TACAS17/ and Appendix K of [13] for details).

5 Conclusion

Although the switch to conditional expectations appears rather natural to escape from the limitations of known solutions for unconditional extremal expected accumulated rewards, to the best of our knowledge computation schemes for conditional expected accumulated rewards have not been addressed before. Our results show that new techniques are needed to compute maximal conditional expectations, as optimal schedulers might need memory and local reasoning in terms of the past and possible future is not sufficient (Example 1.1). The key observations for our algorithms are the existence of a saturation point \(\wp \) for the reward that has been accumulated so far, from which on optimal schedulers can behave memoryless, and a linear correlation between optimal decisions for all state-reward pairs \((s,r)\) of the same reward level r (see (*) and the linear program used in the threshold algorithm). The difficulty of reasoning about conditional expectations is also reflected in the achieved complexity-theoretic results stating that all variants of the threshold problem lie between PSPACE and EXPTIME. While PSPACE-completeness has been established for acyclic MDPs (Appendix I of [13]), the precise complexity for cyclic MDPs is still open. In contrast, optimal schedulers for unconditional expected accumulated rewards as well as for conditional reachability probabilities are computable in polynomial time [11, 22].

Using standard automata-based approaches, our method can easily be generalized to compute maximal conditional expected rewards for regular co-safety conditions (rather than reachability conditions \(\Diamond G\)) and/or where the accumulation of rewards is “controlled” by a deterministic finite automaton as in the logics considered in [12, 17] (rather than being accumulated up to the first visit of F). In this paper, we restricted ourselves to MDPs with non-negative integer rewards. Non-negative rational rewards can be treated by multiplying all reward values by the least common multiple of their denominators (Appendix J.1 of [13]). In the case of acyclic MDPs, our methods are even applicable if the MDP has negative and positive rational rewards (Appendix J.2 of [13]). By swapping the sign of all rewards, this yields a technique to compute minimal conditional expectations in acyclic MDPs. We expect that minimal conditional expectations in cyclic MDPs with non-negative rewards can be computed using similar algorithms as we suggested for maximal conditional expectations. This as well as MDPs with negative and positive rewards will be addressed in future work.