Qualitative Controller Synthesis for Consumption Markov Decision Processes

Consumption Markov Decision Processes (CMDPs) are probabilistic decision-making models of resource-constrained systems. In a CMDP, the controller possesses a certain amount of a critical resource, such as electric power. Each action of the controller can consume some amount of the resource. Resource replenishment is only possible in special reload states, in which the resource level can be reloaded up to the full capacity of the system. The task of the controller is to prevent resource exhaustion, i.e. ensure that the available amount of the resource stays non-negative, while ensuring an additional linear-time property. We study the complexity of strategy synthesis in consumption MDPs with almost-sure Büchi objectives. We show that the problem can be solved in polynomial time. We implement our algorithm and show that it can efficiently solve CMDPs modelling real-world scenarios.

1 Introduction

Theorem 1. Given a consumption MDP M with a capacity cap, an initial resource level 0 ≤ d ≤ cap, and a set T of accepting states, we can decide, in polynomial time, whether there exists a strategy σ such that when playing according to σ, the following consumption-Büchi objectives are satisfied:
- Starting with resource level d, the resource level never drops below 0.
- With probability 1, the system visits some state in T infinitely often.
Moreover, if such a strategy exists then we can compute, in polynomial time, its polynomial-size representation.
For the sake of clarity, we restrict ourselves to proving Theorem 1 for a natural sub-class of CMDPs called decreasing consumption MDPs, in which there are no cycles of zero consumption. The restriction is natural (in typical resource-constrained systems, each action, even idling, consumes some energy, so zero cycles are unlikely) and greatly simplifies the presentation. In addition to the theoretical analysis, we implemented the algorithm behind Theorem 1 and evaluated it on several benchmarks, including a realistic model of an autonomous electric vehicle (AEV) navigating the streets of Manhattan. The experiments show that our algorithm is able to efficiently solve large CMDPs, offering good scalability.
Significance. Some comments on Theorem 1 are in order. First, all the numbers in the MDP, and in particular the capacity cap, are encoded in binary. Hence, "polynomial time" means time polynomial in the encoding size of the MDP itself and in log(cap). In particular, a naive "unfolding" of the MDP, i.e. encoding the resource levels between 0 and cap into the states, does not yield a polynomial-time algorithm but an exponential-time one, since the unfolded MDP has size proportional to cap. Instead, we employ a value-iteration-like algorithm to compute the minimal energy levels with which one can achieve the consumption-Büchi objectives.
A similar concern applies to the "polynomial-size representation" of the strategy σ. To satisfy a consumption-Büchi objective, σ generally needs to keep track of the current resource level. Hence, under the standard notion of a finite-memory (FM) strategy (which views FM strategies as transducers), σ would require memory proportional to cap, i.e. memory exponentially large w.r.t. the size of the input. However, we show that for each state s we can partition the integer interval [0, …, cap] into polynomially many sub-intervals I^s_1, …, I^s_k such that, for each 1 ≤ j ≤ k, the strategy σ picks the same action whenever the current state is s and the current resource level is in I^s_j. Hence, the endpoints of the intervals are the only extra knowledge required to represent σ, a representation which we call a counter selector. We instrument our main algorithm so as to compute, in polynomial time, a polynomial-size counter selector representing the witness strategy σ.
Finally, we consider linear-time properties encoded by Büchi objectives over the states of the MDP. In essence, we assume that the translation of the specification to a Büchi automaton and its product with the original MDP model of the system have already been performed. Probabilistic analysis typically requires the use of deterministic Büchi automata, which cannot express all linear-time properties. However, in this paper we consider qualitative analysis, which can be performed using restricted versions of nondeterministic Büchi automata that are still powerful enough to express all ω-regular languages. Examples of such automata are limit-deterministic Büchi automata [51] or good-for-MDPs automata [41]. Alternatively, consumption MDPs with parity objectives could be reduced to consumption-Büchi MDPs using the standard parity-to-Büchi MDP construction [25,33,32,30]. We abstract from these aspects and focus on the technical core of our problem, solving consumption-Büchi MDPs.
Consequently, to the best of our knowledge, we present the first polynomial-time algorithm for controller synthesis in resource-constrained MDPs with ω-regular objectives.
Related Work. There is an enormous body of work on energy models. Stemming from the models introduced in [23,11], the subsequent work covered energy games with various combinations of objectives [27,13,48,12,21,20,18,10], energy games with multiple resource types [37,43,31,57,44,24,15,28] or the variants of the above in the MDP [17,49], infinite-state [1], or partially observable [34] settings. As argued previously, the controller synthesis within these models is at least as hard as solving mean-payoff games. The paper [29] presents polynomial-time algorithms for non-stochastic energy games with special weight structures. Recently, an abstract algebraic perspective on energy models was presented in [22,35,36].
Consumption systems were introduced in [14] in the form of consumption games with multiple resource types. Minimizing mean-payoff in automata with consumption constraints was studied in [16].
Our main result requires, as a technical sub-component, solving the resource-safety (or just safety) problem in consumption MDPs, i.e. computing a strategy which prevents resource exhaustion. The solution to this problem consists (in principle) of a Turing reduction to the problem of minimum cost reachability in two-player games with non-negative costs. The latter problem was studied in [46], with an extension to arbitrary costs considered in [19] (see also [40]). We present our own, conceptually simple, value-iteration-like algorithm for the problem, which is also used in our implementation.
Elements of resource-constrained optimization and minimum-cost reachability are also present in the line of work concerning energy-utility quantiles in MDPs [5,7,6,4,42]. In this setting, there is no reloading in the consumption- or energy-model sense, and the task is typically to minimize the total amount of the resource consumed while maximizing the probability that some other objective is satisfied.
Paper Organization & Outline of Techniques After the preliminaries (Section 2), we present counter selectors in Section 3. The next three sections contain the three main steps of our analysis. In Section 4, we solve the safety problem in consumption MDPs. The technical core of our approach is presented in Section 5, where we solve the problem of safe positive reachability: finding a resource-safe strategy which ensures that the set T of accepting states is visited with positive probability. Solving consumption-Büchi MDPs then, in principle, consists of repeatedly applying a strategy for safe positive reachability of T , ensuring that the strategy is "re-started" whenever the attempt to reach T fails. Details are given in Section 6. Finally, Section 7 presents our experiments. Due to space constraints, most technical proofs were moved to the appendix.
Preliminaries

We denote by N the set of all non-negative integers and by N̄ the set N ∪ {∞}. Given a set I and a vector v ∈ N̄^I of integers indexed by I, we use v(i) to denote the i-component of v. We assume familiarity with basic notions of probability theory. In particular, a probability distribution on an at most countable set X is a function f : X → [0, 1] such that Σ_{x∈X} f(x) = 1. We use D(X) to denote the set of all probability distributions on X.

Definition 1 (CMDP). A consumption Markov decision process (CMDP) is a tuple M = (S, A, ∆, C, R, cap) where S is a finite set of states, A is a finite set of actions, ∆ : S × A → D(S) is a total transition function, C : S × A → N is a total consumption function, R ⊆ S is a set of reload states in which the resource can be reloaded, and cap ∈ N is a resource capacity.
Given a set R' ⊆ S, we denote by M(R') the CMDP obtained from M by changing the set of reload states to R'. Given s ∈ S and a ∈ A, we denote by Succ(s, a) the set {t | ∆(s, a)(t) > 0}. A path is a (finite or infinite) state-action sequence α = s_1 a_1 s_2 a_2 s_3 ⋯ ∈ (S × A)^ω ∪ (S · A)* · S such that s_{i+1} ∈ Succ(s_i, a_i) for all i. We define α_i = s_i and Act_i(α) = a_i. We use α_{..i} for the finite prefix s_1 a_1 s_2 … s_i of α, α_{i..} for the suffix s_i a_i s_{i+1} …, and α_{i..j} for the infix s_i a_i … s_j. The length of a path α is the number len(α) of actions on α (len(α) = ∞ if α is infinite).
A finite path α is simple if no state appears more than once on α. A finite path is a cycle if it starts and ends in the same state. A CMDP is decreasing if for every simple cycle s_1 a_1 s_2 … a_{k−1} s_k there exists 1 ≤ i < k such that C(s_i, a_i) > 0. Throughout this paper we consider only decreasing CMDPs. The only place where this assumption is used is in the proofs of Theorem 4 and Theorem 8.
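Since we assume the decreasing property throughout, it is worth noting that it can be checked in time linear in the size of the CMDP: a CMDP has a simple cycle of zero consumption if and only if the directed graph consisting of its zero-consumption edges has a cycle. Below is a minimal Python sketch of this check; the dictionary-based encoding of the CMDP (succ, cons) is our own illustration, not the data model of any particular tool.

```python
def is_decreasing(states, actions, succ, cons):
    """Check that no simple cycle has zero total consumption.

    succ[s][a] -- set of successor states of action a in state s
    cons[s][a] -- consumption C(s, a)
    A CMDP is decreasing iff the graph restricted to
    zero-consumption edges is acyclic.
    """
    zero_adj = {s: set() for s in states}
    for s in states:
        for a in actions:
            if cons[s][a] == 0:
                zero_adj[s] |= succ[s][a]

    # Iterative DFS cycle detection (0 = new, 1 = on stack, 2 = done).
    color = {s: 0 for s in states}
    for root in states:
        if color[root]:
            continue
        stack = [(root, iter(zero_adj[root]))]
        color[root] = 1
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if color[nxt] == 1:
                    return False  # zero-consumption cycle found
                if color[nxt] == 0:
                    color[nxt] = 1
                    stack.append((nxt, iter(zero_adj[nxt])))
                    break
            else:
                color[node] = 2
                stack.pop()
    return True
```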
An infinite path is called a run. We typically name runs by variants of the symbol ϱ. The set of all runs in M is denoted Runs_M, or simply Runs if M is clear from context. A finite path is also called a history. The set of all possible histories of M is hist_M, or simply hist. We denote by last(α) the last state of a history α. Let α be a history with last(α) = s_1 and let β = s_1 a_1 s_2 a_2 …; we define the joint path α ⊙ β = α a_1 s_2 a_2 ….
A strategy for M is a function σ : hist_M → A assigning to each history an action to play. A strategy is memoryless if σ(α) = σ(β) whenever last(α) = last(β), i.e. when the decision depends only on the current state. We do not consider randomized strategies in this paper, as they are not necessary for qualitative ω-regular objectives on finite MDPs [33,32,30].
A computation of M under the control of a given strategy σ from some initial state s ∈ S creates a path. The path starts with s_1 = s. Assume that the current path is α and let s_i = last(α) (we say that M is currently in s_i). Then the next action on the path is a_i = σ(α), and the next state s_{i+1} is chosen randomly according to ∆(s_i, a_i). Repeating this process ad infinitum yields an infinite sample run ϱ. We say that a run is σ-compatible if it can be produced using this process, and s-initiated if it starts in s. We denote the set of all σ-compatible s-initiated runs by Comp_M(σ, s).
We denote by P^σ_{M,s}(A) the probability that a sample run from Comp_M(σ, s) belongs to a given measurable set of runs A (the subscript M is dropped when M is known from context). For details on the formal construction of measurable sets of runs, as well as of the probability measure P^σ_{M,s}, see [2].

Resource: Consumption, Levels, and Objectives
We denote by cap(M) the battery capacity in the CMDP M. The resource is consumed along paths and can be reloaded in the reload states, up to the full capacity. For a path α = s_1 a_1 s_2 … we define the consumption of α as cons(α) = Σ_{i=1}^{len(α)} C(s_i, a_i) (since the consumption is non-negative, the sum is always well defined, though possibly diverging). Note that cons does not take reload states into account at all. To accurately track the remaining amount of the resource, we use the concept of a resource level.
Definition 2 (Resource level). Let M be a CMDP with a set of reload states R, let α be a history, and let 0 ≤ d ≤ cap(M) be an integer called the initial load. The energy level after α initialized by d, denoted by RL^M_d(α) or simply RL_d(α), is defined inductively as follows. For a zero-length history s we have RL^M_d(s) = d. For a non-zero-length history α = βat we denote c = C(last(β), a) and e = RL_d(β), and put

RL_d(α) = e − c if last(β) ∉ R, e ≠ ⊥, and e − c ≥ 0;
RL_d(α) = cap(M) − c if last(β) ∈ R, e ≠ ⊥, and cap(M) − c ≥ 0;
RL_d(α) = ⊥ otherwise,

where ⊥ denotes an undefined value (the resource is exhausted). Let α be a history and let f and l be the minimal and maximal indices i such that α_i ∈ R, respectively (if they exist). Following the inductive definition of RL_d(α), it is easy to see that RL_d(α) ≠ ⊥ requires d to cover the consumption of the prefix α_{..f} up to the first reload state and cap(M) to cover the consumption between any two consecutive visits of reload states; in particular, if RL_d(α) ≠ ⊥ and l exists, then RL_d(α) = cap(M) − cons(α_{l..}). Further, for each history α and each d such that e = RL_d(α) ≠ ⊥, and each history β suitable for joining with α, it holds that RL_d(α ⊙ β) = RL_e(β).
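The inductive definition of the resource level translates directly into code. The following Python sketch computes RL_d(α) for a history given as an alternating list of states and actions; the dict-based encoding is again our own illustration, and None plays the role of ⊥.

```python
def resource_level(history, d, cap, reloads, cons):
    """Compute RL_d for history = [s1, a1, s2, a2, ..., sn].

    Returns None (the undefined value) as soon as the level would
    drop below zero.  Leaving a reload state first sets the level
    to the full capacity cap, then pays the consumption.
    """
    level = d
    for i in range(0, len(history) - 1, 2):
        s, a = history[i], history[i + 1]
        if s in reloads:
            level = cap          # reload up to full capacity on departure
        level -= cons[s][a]
        if level < 0:
            return None
    return level
```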
A run ϱ is d-safe if and only if the energy level initialized by d is a non-negative number for each finite prefix of ϱ, i.e. if for all i > 0 we have RL_d(ϱ_{..i}) ≠ ⊥. We say that a run is safe if it is cap(M)-safe. The next lemma follows immediately from the definition of the energy level.

Lemma 1. Let ϱ = s_1 a_1 s_2 … be a d-safe run for some d and let α be a history such that last(α) = s_1. Then the run α ⊙ ϱ is e-safe if RL_e(α) ≥ d.
Objectives. An objective is a set of runs. The objective SafeRuns(d) contains exactly the d-safe runs. Given a target set T ⊆ S and i ∈ N, we define Reach^i_T = {ϱ ∈ Runs | ϱ_j ∈ T for some 1 ≤ j ≤ i + 1} to be the set of all runs that reach some state from T within the first i steps, and we put Reach_T = ∪_{i∈N} Reach^i_T. Finally, Büchi_T = {ϱ ∈ Runs | ϱ_i ∈ T for infinitely many i ∈ N}.
Problems. We solve three main qualitative problems for CMDPs, namely safety, positive reachability, and Büchi.
Let us fix a state s and a target set of states T. We say that a strategy σ is d-safe in s if Comp(σ, s) ⊆ SafeRuns(d). We say that σ is T-positive d-safe in s if it is d-safe in s and P^σ_s(Reach_T) > 0, which means that there exists a run in Comp(σ, s) that visits T. Finally, we say that σ is T-Büchi d-safe in a state s if it is d-safe in s and P^σ_s(Büchi_T) = 1.
The vectors Safe, SafePR_T (PR for "positive reachability"), and SafeBüchi_T of type N̄^S contain, for each s ∈ S, the minimal d such that there exists a strategy that is d-safe in s, T-positive d-safe in s, and T-Büchi d-safe in s, respectively, and ∞ if no such strategy exists. The problems we consider for a given CMDP are:
- Safety: compute the vector Safe and a strategy that is Safe(s)-safe in every s ∈ S.
- Positive reachability: compute the vector SafePR_T and a strategy that is T-positive SafePR_T(s)-safe in every state s.
- Büchi: compute the vector SafeBüchi_T and a strategy that is T-Büchi SafeBüchi_T(s)-safe in every state s.

We illustrate the key concepts on the example CMDP M in Figure 1. Consider the parameterized history α_i = (s_1 a_2 s_5 a_2)^i s_1. Then cons(α_i) = 3i, while RL_2(α_i) = 19 for all i ≥ 1. Thus, a strategy that always picks a_2 in s_1 is d-safe in s_1 for all d ≥ 2. On the other hand, a strategy that always picks a_1 in s_1 is not d-safe in s_1 for any 0 ≤ d ≤ 20. Now consider again the strategy that always picks a_2; such a strategy is 2-safe in s_1, but it is of no use if we attempt to eventually reach T. Hence, memoryless strategies are not sufficient in our setting. Consider instead a strategy σ that, in s_1, picks a_1 whenever the current resource level is at least 10 and picks a_2 otherwise. Such a strategy is 2-safe in s_1 and guarantees reaching s_2 with positive probability: we need at least 10 units of energy to return to s_5 in case we are unlucky and picking a_1 leads us to s_3. If we are lucky, a_1 leads us to s_2 at a cost of just 5 units of the resource, witnessing that σ is T-positive. As a matter of fact, upon every revisit of s_5 there is a 1/2 chance of hitting s_2 during the next attempt, so σ actually ensures that s_2 is visited with probability 1.
We note that solving a CMDP is very different from solving a consumption 2-player game [14]. Indeed, imagine that in Figure 1 the outcome of the action a_1 from state s_1 is resolved by an adversarial player. In such a game, there is no strategy that would guarantee reaching T at all.
The strategy σ we discussed above uses finite memory to track the resource level exactly. An efficient representation of such strategies is described in the next section.
Counter Selectors

In this section, we define a succinct representation of finite-memory strategies via so-called counter selectors. Under the standard definition, a strategy σ is a finite-memory strategy if σ can be encoded by a memory structure, a type of finite transducer. Formally, a memory structure is a tuple µ = (M, nxt, up, m_0), where M is a finite set of memory elements, nxt : M × S → A is a next-action function, up : M × A × S → M is a memory-update function, and m_0 : S → M selects an initial memory element for each state. The update function extends to histories via up*(m, s) = m and up*(m, α a t) = up(up*(m, α), a, t). The structure µ encodes the strategy σ_µ such that for each history α = s_1 a_1 s_2 … s_n we have σ_µ(α) = nxt(up*(m_0(s_1), α), s_n).
In our setting, strategies need to track the energy levels of histories. Let us fix a CMDP M = (S, A, ∆, C, R, cap). A non-exhausted energy level is always a number between 0 and cap(M), which can be represented by a binary-encoded bounded counter. We call strategies equipped with such counters finite counter (FC) strategies. An FC strategy selects actions to play according to selection rules.

Definition 3 (Selection rule).
A selection rule ϕ for M is a partial function from the set {0, …, cap(M)} to A. An undefined value for some n is indicated by ϕ(n) = ⊥.
We use dom(ϕ) = {n | ϕ(n) ≠ ⊥} to denote the domain of ϕ, and we use Rules_M, or simply Rules, for the set of all selection rules for M. Intuitively, a selection according to the rule ϕ picks the action that corresponds to the largest value from dom(ϕ) that is not larger than the current energy level. To be more precise, if dom(ϕ) consists of numbers n_1 < n_2 < ⋯ < n_k, then the action to be selected in a given moment is ϕ(n_i), where n_i is the largest element of dom(ϕ) which is less than or equal to the current amount of the resource. In other words, ϕ(n_i) is selected if the current resource level is in [n_i, n_{i+1}) (putting n_{k+1} = ∞).
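The selection just described is a predecessor search in the sorted domain of ϕ, which the standard bisect module implements directly. A minimal sketch, with a rule represented as a dict from thresholds to actions (our own illustrative encoding):

```python
import bisect

def select(rule, level):
    """Return rule[n] for the largest threshold n <= level, or None."""
    thresholds = sorted(rule)                 # n_1 < n_2 < ... < n_k
    i = bisect.bisect_right(thresholds, level)
    return rule[thresholds[i - 1]] if i else None

# A rule mimicking the example discussed earlier: play a2 below
# level 10, play a1 from level 10 on.
phi = {0: "a2", 10: "a1"}
assert select(phi, 7) == "a2" and select(phi, 10) == "a1"
```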

Definition 4 (Counter selector).
A counter selector for M is a function Σ : S → Rules.
A counter selector itself is not enough to describe a strategy: a strategy also needs to keep track of the energy level along the path. Given a vector r ∈ {0, …, cap(M)}^S of initial resource levels, each counter selector Σ defines a strategy Σ_r encoded by the following memory structure (M, nxt, up, m_0), with M = {⊥, 0, 1, …, cap(M)}, with m_0(s) = r(s), and with a ∈ A being a globally fixed action (used only to make nxt total). We stipulate that ⊥ < n for all n ∈ N.
- Let m ∈ M be a memory element, let s ∈ S be a state, and let n be the largest element of dom(Σ(s)) such that n ≤ m. Then nxt(m, s) = Σ(s)(n) if such an n exists, and nxt(m, s) = a otherwise.
- The function up mimics the update of the resource level: for each m ∈ M, a' ∈ A, and s, t ∈ S, where a' is played in s with t as the outcome, we put up(m, a', t) = e − C(s, a') if m ≠ ⊥ and e − C(s, a') ≥ 0, where e = cap(M) if s ∈ R and e = m otherwise; we put up(m, a', t) = ⊥ in all other cases.
A strategy σ is a finite counter (FC) strategy if there is a counter selector Σ and a vector r such that σ = Σ_r. The counter selector can be imagined as a finite-state device that implements σ using O(log(cap(M))) bits of additional memory (the counter) used to represent the numbers 0, 1, …, cap(M). The device uses the counter to keep track of the current resource level, with the element ⊥ representing energy exhaustion. Note that a counter selector can be exponentially more succinct than the corresponding memory structure.
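Putting the pieces together, one step of an FC strategy looks up the current counter value in the rule of the current state and then updates the counter exactly as the resource level evolves. A sketch under the same illustrative encoding (select is the helper from the previous snippet; we assume the rule offers a threshold below the current counter, which well-formed selectors guarantee):

```python
def fc_step(selector, counter, s, cap, reloads, cons):
    """One move of Sigma_r in state s: pick an action and update
    the counter; the counter is None once the resource is exhausted."""
    if counter is None:
        return None, None                    # exhausted
    a = select(selector[s], counter)         # largest threshold <= counter
    level = cap if s in reloads else counter
    level -= cons[s][a]
    return a, (level if level >= 0 else None)
```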

Safety
In this section, we present an algorithm that computes, for each state, the minimal value d (if it exists) such that there exists a d-safe strategy from that state. We also construct the corresponding strategy. For the remainder of the section we fix a CMDP M.
A d-safe run has the following two properties: (i) it consumes at most d units of the resource before it reaches the first reload state, and (ii) it never consumes more than cap(M) units of the resource between two visits of reload states. To ensure (ii), we need to identify a maximal subset R' ⊆ R of reload states for which there is a strategy σ that, starting in any r ∈ R', can always reach R' again (within at least one step) using at most cap(M) resource units. The d-safe strategy we seek can then be assembled from σ and from a strategy that suitably navigates towards R', which is needed for (i).
At the core of both properties (i) and (ii) lies the problem of minimum cost reachability. Hence, in the next subsection, we start by presenting the necessary results on this problem.

Minimum Cost Reachability
The problem of minimum cost reachability with non-negative costs was studied before [46]. Here we present a simple approach to the problem used in our implementation.
Definition 5. Let T ⊆ S be a set of target states, let α = s_1 a_1 s_2 … be a finite or infinite path, and let f ≥ 1 be the smallest index such that s_f ∈ T. We define the consumption of α to T as ReachCons_{M,T}(α) = cons(α_{..f}) if f exists, and we set ReachCons_{M,T}(α) = ∞ otherwise. For a strategy σ and a state s ∈ S we define ReachCons_{M,T}(σ, s) = sup_{ϱ∈Comp(σ,s)} ReachCons_{M,T}(ϱ). The minimum cost reachability of T from s is defined by the vector MinReach_{M,T} with MinReach_{M,T}(s) = min_σ ReachCons_{M,T}(σ, s), the minimum being taken over all strategies.

As usual, we drop the subscript M when M is clear from context. Intuitively, d = MinReach_T(s) is the minimal initial load with which some strategy can ensure reaching T with consumption at most d, when starting in s. We say that a strategy σ is optimal for MinReach_T if MinReach_T(s) = ReachCons_T(σ, s) for all states s ∈ S.
We also define the function ReachCons^+_{M,T} and the vector MinReach^+_{M,T} in a similar fashion, with one exception: we require the index f from the definition of ReachCons_{M,T}(α) to be strictly larger than 1, which enforces taking at least one step to reach T.
For the rest of this section, fix a target set T and consider the following functional F on vectors x ∈ N̄^S:

F(x)(s) = 0 if s ∈ T, and F(x)(s) = min_{a∈A} ( C(s, a) + max_{t∈Succ(s,a)} x(t) ) otherwise.

F is a simple generalization of the standard Bellman functional used for computing shortest paths in graphs. The proof of the following theorem is rather standard and is omitted for brevity.

Theorem 2. Denote by x_T the vector with x_T(s) = 0 for s ∈ T and x_T(s) = ∞ otherwise. Iterating F on x_T reaches a fixed point in at most |S| iterations, and this fixed point equals MinReach_T. Moreover, there exists a memoryless strategy optimal for MinReach_T.

To compute MinReach^+_{M,T}, we construct a new CMDP M̃ from M by adding a copy s̃ of each state s ∈ S such that the dynamics in s̃ is the same as in s; i.e. for each a ∈ A, ∆(s̃, a) = ∆(s, a) and C(s̃, a) = C(s, a). We denote the new state set by S̃. We do not change the set of reload states, and s̃ is never in T, even if s is. Given the new CMDP M̃ and the new state set S̃, the following lemma is straightforward.

Lemma 2. For each s ∈ S it holds that MinReach^+_{M,T}(s) = MinReach_{M̃,T}(s̃).
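Iterating F on x_T is a Bellman-Ford-style computation. A direct Python sketch, using the illustrative dict encoding from before and math.inf for ∞:

```python
from math import inf

def min_reach(states, actions, succ, cons, targets):
    """Fixed point of F: by Theorem 2, at most |S| iterations on the
    initial vector x_T (0 on targets, infinity elsewhere) are needed."""
    x = {s: (0 if s in targets else inf) for s in states}
    for _ in range(len(states)):
        y = {s: (0 if s in targets else
                 min(cons[s][a] + max(x[t] for t in succ[s][a])
                     for a in actions))
             for s in states}
        if y == x:
            break
        x = y
    return x
```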

Safely Reaching Reload States
In the following, we use MinInitCons_M (read minimal initial consumption) for the vector MinReach^+_{M,R}, i.e. the minimal resource level that ensures that we can surely reach a reload state in at least one step. By Lemma 2 and Theorem 2 we can construct M̃ and iterate the operator F for |S̃| steps to compute MinInitCons_M; in fact, |S| iterations suffice, since introducing the new states into M̃ does not increase the length of the maximal simple path. However, we can avoid the construction of M̃ altogether and still compute MinInitCons_M, using a truncated version of the functional F; this is the approach used in our implementation. We first introduce the following truncation operator ⌊·⌋_R on vectors x ∈ N̄^S:

⌊x⌋_R(s) = 0 if s ∈ R, and ⌊x⌋_R(s) = x(s) otherwise.

Then, we define the truncated functional G as follows:

G(x)(s) = min_{a∈A} ( C(s, a) + max_{t∈Succ(s,a)} ⌊x⌋_R(t) ).

The following lemma connects the iteration of G on M with the iteration of F on M̃. We denote by x_R the vector with x_R(s) = 0 for s ∈ R and x_R(s) = ∞ otherwise, and by ∞ the vector with all components equal to ∞.

Lemma 3. For each i ≥ 0 and each s ∈ S it holds that

G^i(∞)(s) = F^i(x_R)(s̃),   (1)

and moreover G^i(∞)(s) = F^i(x_R)(s) for each s ∈ S \ R. The proof can be found in Section A.2.

Algorithm 1 computes MinInitCons_M by simply iterating G on the vector ∞ (in a repeat-loop) until a fixed point is reached.

Theorem 3. Algorithm 1 computes the vector MinInitCons_M in time polynomial in the size of M.

Proof. The repeat-loop performs the iteration of the operator G. We show that the fixed point of the iteration equals MinInitCons_M. Consider the iterations of F and G on x_R and ∞, respectively. Let i be the number of steps (possibly infinite) after which the F-iteration reaches a fixed point, and let j be the number of steps after which the G-iteration reaches a fixed point. We prove that i = j. Indeed, for each step k we have G^k(∞)(s) = F^k(x_R)(s̃) for all s ∈ S (by Lemma 3); hence, i ≥ j. For the reverse inequality, assume that i > j. Then there is t ∈ S̃ such that F^{j+1}(x_R)(t) ≠ F^j(x_R)(t). From (1) and from the fact that the G-iteration has already reached a fixed point after j steps, we get that t ∈ S. Then either t ∈ S \ R, in which case G^{j+1}(∞)(t) ≠ G^j(∞)(t) by the second part of Lemma 3, a contradiction with the G-iteration already being at a fixed point; or t ∈ S ∩ R, but then F^{j+1}(x_R)(t) = F^j(x_R)(t) = 0, again a contradiction. Hence, iterating G also reaches a fixed point, in at most |S̃| steps by Theorem 2. Moreover, for each s ∈ S we have G^i(∞)(s) = F^i(x_R)(s̃) = MinReach_{M̃,R}(s̃) = MinInitCons_M(s), the first equality coming from Lemma 3, the second from Theorem 2 and from the fact that i = j, and the last one from Lemma 2.

Solving the Safety Problem
We want to identify a set R' ⊆ R such that from each r ∈ R' we can reach R' again in at least one step, with consumption at most cap = cap(M). This entails identifying the maximal R' ⊆ R such that MinInitCons_{M(R')}(r) ≤ cap for each r ∈ R'. This can be done by initially setting R' = R and iteratively removing from R' the states r with MinInitCons_{M(R')}(r) > cap, as in Algorithm 2. The algorithm maintains the current candidate set in the variable Rel and the vector mic = MinInitCons_{M(Rel)} (recomputed on line 5 in each iteration); upon termination it outputs the vector out with out(s) = mic(s) if mic(s) ≤ cap and out(s) = ∞ otherwise.

Theorem 4. Algorithm 2 runs in time polynomial in the size of M and, upon termination, out = Safe_M.

Proof. The algorithm clearly terminates. Computing MinInitCons_{M(Rel)} on line 5 takes a polynomial number of steps per call, due to Theorem 3 and since M(Rel) has asymptotically the same size as M. Since the repeat-loop performs at most |R| iterations, the complexity bound follows.
As for correctness, we first prove that out ≤ Safe_M. It suffices to prove for each s ∈ S that upon termination, mic(s) ≤ Safe_M(s) whenever the latter value is finite. Since MinInitCons_M(s) ≤ Safe_M(s) holds for each CMDP M and each of its states s with Safe_M(s) < ∞, it suffices to show that Safe_{M(Rel)} ≤ Safe_M is an invariant of the algorithm (as a matter of fact, we prove that Safe_{M(Rel)} = Safe_M). To this end, it suffices to show that at every point of the execution, Safe_M(t) = ∞ for each t ∈ R \ Rel: indeed, if this holds, then no strategy that is safe in some state can ever play, after any history, an action a from a state s' such that t ∈ Succ(s', a), so declaring such states t non-reloading does not influence the Safe_M-values. So denote by Rel_i the contents of Rel after the i-th iteration. We prove, by induction on i, that Safe_M(s) = ∞ for all s ∈ R \ Rel_i. For i = 0 we have Rel_0 = R, so the statement holds vacuously. For i > 0, let s ∈ R \ Rel_i, and let σ be any strategy. If some run from Comp(σ, s) visits a state from R \ Rel_{i−1}, then σ is not cap-safe, by the induction hypothesis. So assume that all such runs visit only reload states from Rel_{i−1}. Then, since MinInitCons_{M(Rel_{i−1})}(s) > cap, there must be a run ϱ ∈ Comp(σ, s) with ReachCons^+_{Rel_{i−1}}(ϱ) > cap. Assume that ϱ is cap-safe in s. Since we consider only decreasing CMDPs, ϱ must infinitely often visit a reload state (as it cannot get stuck in a zero cycle). Hence, there exists an index f > 1 such that ϱ_f ∈ Rel_{i−1}, and for the smallest such f we have RL_cap(ϱ_{..f}) = ⊥, a contradiction. So again, σ is not cap-safe in s. Since there is no safe strategy from s, we have Safe_M(s) = ∞.
Finally, we need to prove that upon termination, out ≥ Safe_M. Informally, by the definition of out, from every state s we can ensure reaching a state of Rel while consuming at most out(s) units of the resource. Once in Rel, we can ensure that we can again return to Rel without consuming more than cap units of the resource. Hence, when starting with out(s) units, we can surely prevent resource exhaustion. The formal argument is postponed to the appendix (Section A.3).

The witness strategy can be described in terms of safe actions.

Definition 6 (Safe action). Let s be a state with Safe_M(s) < ∞ and let d = cap(M) if s ∈ R and d = Safe_M(s) otherwise. An action a is safe in s if C(s, a) + max_{t∈Succ(s,a)} Safe_M(t) ≤ d.

Theorem 5. Any memoryless strategy which, in each state s with Safe_M(s) < ∞, picks an action safe in s is Safe_M(s)-safe in every state s ∈ S. Moreover, such a strategy exists and can be computed, together with the vector Safe_M, in polynomial time.

Proof. The first part of the theorem follows directly from Definition 6, Definition 2 (resource levels), and the definition of d-safe runs. The second part is a corollary of Theorem 4 and of the fact that we can fix one safe action in each state, so that the resulting strategy is memoryless. The complexity bound follows from Theorem 4.
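For intuition, the skeleton of Algorithm 2 can be sketched as follows. The helper min_init_cons(rel), computing MinInitCons of M(rel) (e.g. by iterating the truncated functional G), and the exact shape of the output vector reflect our reading of the algorithm described above, not verbatim pseudocode.

```python
def solve_safety(states, reloads, cap, min_init_cons):
    """Prune reload states until all survive the cap constraint,
    then derive the output vector out (= Safe_M by Theorem 4)."""
    rel = set(reloads)
    while True:
        mic = min_init_cons(rel)             # vector for M(rel)
        bad = {r for r in rel if mic[r] > cap}
        if not bad:
            break
        rel -= bad                           # declare them non-reloading
    return {s: (mic[s] if mic[s] <= cap else float("inf"))
            for s in states}
```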

Positive Reachability
In this section, we focus on strategies that are safe and such that at least one run they produce visits a given set T ⊆ S of targets. The main contribution of this section is Algorithm 3, used to compute such strategies as well as the vector SafePR_{M,T} of minimal initial resource levels for which such a strategy exists. As before, for the rest of this section we fix a CMDP M.
We define, for a state s, an action a, and a vector x ∈ N̄^S, the value

SPR-Val_M(s, a, x) = C(s, a) + min_{t∈Succ(s,a)} max ( {x(t)} ∪ {Safe_M(t') | t' ∈ Succ(s, a), t' ≠ t} ).

The max operator considers, for given t, the value x(t) and the values needed to survive from all possible outcomes of a other than t. Let v = SPR-Val_M(s, a, x) and let t be the outcome selected by the min operator. Intuitively, v is the minimal amount of the resource needed to reach t with at least x(t) resource units, or to survive if the outcome of a is different from t. We now define a functional whose fixed point characterizes SafePR_{M,T}. We first define a two-sided version of the truncation operator from the previous section: the operator ⌈·⌉_M such that

⌈x⌉_M(s) = 0 if s ∈ R and x(s) ≤ cap(M); ⌈x⌉_M(s) = ∞ if x(s) > cap(M); and ⌈x⌉_M(s) = x(s) otherwise.

Using the function SPR-Val and the operator ⌈·⌉_M, we now define an auxiliary operator A and the main operator B as follows:

A(x)(s) = min_{a∈A} SPR-Val_M(s, a, x), and B(x)(s) = x(s) if s ∈ T, B(x)(s) = ⌈A(x)⌉_M(s) otherwise.
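Under the definitions above, SPR-Val is computable by enumerating the candidate witness successor t. A Python sketch (illustrative dict encoding; safe is the vector Safe_M from the previous section):

```python
from math import inf

def spr_val(s, a, x, succ, cons, safe):
    """SPR-Val(s, a, x): pay C(s, a), then pick the cheapest witness t,
    needing x[t] units for t and safe[t'] for every other outcome t'."""
    best = inf
    for t in succ[s][a]:
        others = max((safe[u] for u in succ[s][a] if u != t), default=0)
        best = min(best, max(x[t], others))
    return cons[s][a] + best
```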
Let SafePR^i_T be the vector such that for each state s ∈ S the number d = SafePR^i_T(s) is the minimal number such that there exists a strategy that is d-safe in s and produces at least one run that visits T within the first i steps. Further, we denote by y_T the vector such that y_T(s) = Safe_M(s) for s ∈ T and y_T(s) = ∞ otherwise. The following lemma can be proved by a rather straightforward but technical induction (see Section A.4).

Lemma 4. For each i ≥ 0 it holds that SafePR^i_T = B^i_M(y_T).

The following lemma says that iterating B_M reaches a fixed point in a polynomial number of iterations. Intuitively, this is because when trying to reach T, it does not make sense to perform a cycle between two visits of a reload state (as this can only increase the resource consumption), and at the same time it does not make sense to visit the same reload state twice (since the resource is reloaded to the full capacity upon each visit). The proof is straightforward and is omitted in the interest of brevity.

Lemma 5. Iterating B_M on y_T reaches a fixed point after a number of iterations polynomial in the size of M, and this fixed point equals SafePR_{M,T}.

Theorem 6. The vector SafePR_{M,T} is computable in time polynomial in the size of M.

Algorithm 3 performs this iteration, maintaining the current vector in the variable r and building a counter selector Σ along the way: line 3 initializes each Σ(s) with a safe action, line 8 selects, for the current state s, an action a(s) minimizing SPR-Val_M(s, a(s), r), line 10 applies the two-sided truncation, and line 13 sets Σ(s)(r(s)) = a(s) whenever the value r(s) decreases.

Theorem 7. Algorithm 3 computes, in time polynomial in the size of M, the vector r = SafePR_{M,T} and a polynomial-size counter selector Σ such that the FC strategy Σ_r is T-positive r(s)-safe in every state s with r(s) ≤ cap(M).

The rest of this section is devoted to the proof of Theorem 7. The complexity follows from Theorem 6; indeed, since the algorithm performs a polynomial number of steps, the size of Σ is polynomial as well. The correctness proof is based on the following invariant of the main repeat-loop: the finite counter strategy π = Σ_r has these properties:

(a) π is Safe_M(s)-safe in every state s ∈ S; in particular, for l = min{r(s), cap(M)} we have RL_l(α) ≠ ⊥ for every finite path α produced by π from s.

(b) For each state s ∈ S such that r(s) ≤ cap(M) there exists a π-compatible finite path α = s_1 a_1 s_2 … s_n such that s_1 = s and s_n ∈ T, and such that "the resource level with initial load r(s) never decreases below r along α", which means that for each prefix α_{..i} of α it holds that RL_{r(s)}(α_{..i}) ≥ r(s_i).
The theorem then follows from this invariant (parts (a) and the first half of (b)) and from Theorem 6. We start with the following supporting invariant, which is easy to prove.

Lemma 6. The inequality r ≥ Safe_M is an invariant of the main repeat-loop.
Proving part (a) of the main invariant. We use the following auxiliary lemma.

Lemma 7.
Assume that Σ is a counter selector such that for each s ∈ S with Safe(s) < ∞ the following two conditions hold:
(1.) there exists x ∈ dom(Σ(s)) such that x ≤ Safe(s); and
(2.) for each x ∈ dom(Σ(s)) and a = Σ(s)(x) it holds that ⊥ ≠ RL_x(s a t) ≥ Safe(t) for each t ∈ Succ(s, a).
Then for each vector y ≥ Safe, the strategy π = Σ_y is y(s)-safe in every state s with y(s) < ∞.

Proof. Let s be a state such that y(s) < ∞. It suffices to prove that for every π-compatible finite path α started in s it holds that RL_{y(s)}(α) ≠ ⊥. We actually prove a stronger statement: ⊥ ≠ RL_{y(s)}(α) ≥ Safe(last(α)). We proceed by induction on the length of α. If len(α) = 0, we have RL_{y(s)}(α) = y(s) ≥ Safe_M(s) ≥ 0. Now let α = β ⊙ t_1 a t_2 for some shorter path β with last(β) = t_1, an action a ∈ A, and states t_1, t_2 ∈ S. By the induction hypothesis, l = RL_{y(s)}(β) ≥ Safe_M(t_1), from which it follows that Safe_M(t_1) < ∞. Due to (1.), there exists at least one x ∈ dom(Σ(t_1)) such that x ≤ l; we select the maximal such x, so that a = Σ(t_1)(x). We have RL_{y(s)}(α) = RL_l(t_1 a t_2) by definition, and from (2.) it follows that ⊥ ≠ RL_x(t_1 a t_2) ≥ Safe(t_2) ≥ 0. Altogether, as l ≥ x, we have RL_{y(s)}(α) = RL_l(t_1 a t_2) ≥ RL_x(t_1 a t_2) ≥ Safe(t_2) ≥ 0.

Now we prove part (a) of the main invariant. We show that throughout the execution of Algorithm 3, Σ satisfies the assumptions of Lemma 7. Property (1.) is ensured by the initialization on line 3. Property (2.) holds upon the first entry to the main loop by the definition of a safe action (Definition 6). Now assume that Σ(s)(r(s)) is redefined on line 13, and let a be the action a(s).
We first handle the case when s ∉ R. Since a was selected on line 8, it follows from the definition of SPR-Val that there is t ∈ Succ(s, a) such that after the loop iteration,

⊥ ≠ RL_{r(s)}(s a t) ≥ r_old(t) ≥ Safe_M(t),   (2)

the latter inequality following from Lemma 6, and moreover RL_{r(s)}(s a t') ≥ Safe_M(t') for every other successor t' ∈ Succ(s, a). Satisfaction of property (2.) in s then follows immediately from equation (2).
If s ∈ R, then the truncation on line 10 sets r(s) to 0 whenever the value v computed on line 8 satisfies v ≤ cap(M) (and to ∞ otherwise, in which case Σ(s) is not redefined). Since leaving s reloads the resource to the full capacity, we have RL_x(s a t') = cap(M) − C(s, a) ≥ v − C(s, a) for every x ≥ 0 and every t' ∈ Succ(s, a), so property (2.) holds in this case as well.

Proving part (b) of the main invariant. Clearly, (b) holds right after the initialization. Now assume that an iteration of the main repeat-loop was performed. Denote by π_old the strategy Σ_{r_old} and by π the strategy Σ_r. Let s be any state such that r(s) ≤ cap(M). If r(s) = r_old(s), then (b) follows directly from the induction hypothesis: indeed, by the induction hypothesis there is an s-initiated π_old-compatible path α ending in a target state such that the r_old(s)-initiated resource level along α never drops below r_old, i.e. for each prefix β of α it holds that RL_{r_old(s)}(β) ≥ r_old(last(β)). But then α is also π-compatible, since for each state q, Σ(q) was only redefined for values smaller than r_old(q).
The case r(s) < r_old(s) is treated similarly. As in the proof of part (a), denote by a the action a(s) assigned on line 13. There must be a state t ∈ Succ(s, a) such that (2) holds before the truncation on line 10. In particular, for this t it holds that RL_{r(s)}(s a t) ≥ r_old(t). By the induction hypothesis, there is a t-initiated π_old-compatible path β ending in T and satisfying the conditions in (b). We put α = s a t ⊙ β. Clearly, α is s-initiated and reaches T. Moreover, it is π-compatible. To see this, note that Σ(s)(r(s)) = a; moreover, the resource level after the first transition is e = RL_{r(s)}(s a t) ≥ r_old(t), and due to the assumed properties of β (and the monotonicity of resource levels in the initial load), the resource level along β with initial load e never decreases below r_old. Since Σ was only redefined for values smaller than those given by the vector r_old, π mimics π_old along β. Since r ≤ r_old, we conclude that along α, the r(s)-initiated resource level never decreases below r. This finishes the proof of part (b) of the invariant, and thus also the proof of Theorem 7.

Büchi
This section proves Theorem 1, which is the main theoretical result of the paper. The proof is broken down into the following steps.
(1.) We identify the largest set R' ⊆ R of reload states such that from each r ∈ R' we can reach R' again (in at least one step) while consuming at most cap resource units, restricting ourselves to strategies that (i) avoid R \ R' and (ii) guarantee positive reachability of T in M(R'). This is done by a pruning loop in the style of Algorithm 2, with SafePR in place of MinInitCons (see the sketch after the proof below).
(2.) We compute the vector r = SafePR_{M(R'),T} and a corresponding counter selector Σ using Algorithm 3.
(3.) We let σ be the FC strategy that keeps re-starting Σ_r, re-initializing its counter upon every visit of T, and show that r = SafeBüchi_{M,T}. The whole procedure constitutes Algorithm 4 and yields the following theorem.

Theorem 8. Algorithm 4 computes, in time polynomial in the size of M, the vector SafeBüchi_{M,T} and a strategy σ that is T-Büchi SafeBüchi_{M,T}(s)-safe in every state s with SafeBüchi_{M,T}(s) ≤ cap.

Proof. We first show that σ is T-Büchi r(s)-safe in M(R') for all s ∈ S with r(s) ≤ cap. Clearly it is r(s)-safe, so it remains to prove that T is visited infinitely often with probability 1. We know that upon every visit of a state r ∈ R', σ guarantees a future visit of T with positive probability. As a matter of fact, since σ is a finite-memory strategy, there is δ > 0 such that upon every visit of some r ∈ R', the probability of a future visit of T is at least δ. As M(R') is decreasing, every s-initiated σ-compatible run must visit the set R' infinitely many times. Hence, with probability 1 we reach T at least once. The argument can then be repeated from the first point of visit of T to show that with probability 1 we visit T at least twice, three times, etc., ad infinitum. By the monotonicity of probability, P^σ_{M,s}(Büchi_T) = 1.

It remains to show that r ≤ SafeBüchi_{M,T}. Assume that there is a state s ∈ S and a strategy σ' such that σ' is d-safe in s for some d < r(s) = SafePR_{M(R'),T}(s). We show that σ' is not T-Büchi d-safe in M. If all σ'-compatible runs reach T, then there must be at least one history α produced by σ' that visits some r ∈ R \ R' before reaching T (otherwise d ≥ r(s)). Then either (a) SafePR_{M,T}(r) = ∞, in which case some σ'-compatible extension of α avoids T; or (b) since SafePR_{M(R'),T}(r) > cap, there must be an extension of α that visits, between the visit of r and T, another r'' ∈ R \ R' with r'' ≠ r. We can then repeat the argument, eventually reaching the case (a) or running out of the resource, a contradiction with σ' being d-safe.
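Step (1.) mirrors the pruning loop of Algorithm 2, with SafePR taking the role of MinInitCons. The sketch below is one plausible reading of that outer loop; the helper safe_pr(rel), returning SafePR_{M(rel),T} (e.g. via iteration of the operator B), is assumed.

```python
def buchi_reload_set(reloads, cap, safe_pr):
    """Compute the largest R' of reload states from which a reload
    state of R' can be re-reached, T-positively, within cap units."""
    rel = set(reloads)
    while True:
        pr = safe_pr(rel)                 # SafePR of M(rel) w.r.t. T
        bad = {r for r in rel if pr[r] > cap}
        if not bad:
            return rel
        rel -= bad
```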
We can finally proceed to prove Theorem 1.
Proof (of Theorem 1). The theorem follows immediately from Theorem 8: (1.) we can compute SafeBüchi_{M,T} and the corresponding strategy σ_T in polynomial time (see Theorem 7 and Algorithm 4); (2.) we can easily check whether d ≥ SafeBüchi_{M,T}(s), and if so, then σ_T is the desired strategy σ; and (3.) σ_T can be represented in polynomial space, as it is a finite counter strategy given by a polynomial-size counter selector.

Implementation and Case Studies
We have implemented the presented algorithms in Python in a tool called FiMDP (Fuel in MDP), available at https://github.com/xblahoud/FiMDP. The Docker artifact is available at https://hub.docker.com/r/xblahoud/fimdp and can be run without installation via the Binder project [50]. We investigate the practical behavior of our algorithms using two case studies: (1) an autonomous electric vehicle (AEV) routing problem in the streets of Manhattan, modeled using realistic traffic and electric-car energy-consumption data, and (2) a multi-agent grid-world model inspired by the Mars Helicopter Scout [8], to be deployed from the planned Mars 2020 rover. The first scenario demonstrates the utility of our algorithm for solving real-world problems [59], while the second one reaches the algorithm's scalability limits.
The consumption-Büchi objective can also be solved by a naive approach that encodes the energy constraints into the state space of the MDP and solves the result using techniques for standard MDPs [33]. The states of such an MDP are tuples (s, e), where s is a state of the input CMDP and e is the current energy level. Naturally, all actions that would lead to states with e < 0 lead instead to a special sink state. The standard techniques rely on a decomposition of the MDP into maximal end-components (MECs). We implemented the explicit encoding of a CMDP into an MDP, together with the MEC-decomposition algorithm.
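For comparison with our approach, this explicit encoding can be sketched as follows (illustrative dict encoding, independent of FiMDP's actual data structures); note how the product grows linearly in cap and hence exponentially in the length of its binary encoding.

```python
def unfold(states, actions, succ_dist, cons, reloads, cap):
    """Explicit product: states (s, e) with 0 <= e <= cap plus a sink
    absorbing resource exhaustion; succ_dist[s][a] maps successors
    to probabilities."""
    SINK = ("sink", None)
    prod = {SINK: {a: {SINK: 1.0} for a in actions}}
    for s in states:
        for e in range(cap + 1):
            prod[(s, e)] = {}
            for a in actions:
                rest = (cap if s in reloads else e) - cons[s][a]
                dist = {}
                for t, p in succ_dist[s][a].items():
                    key = (t, rest) if rest >= 0 else SINK
                    dist[key] = dist.get(key, 0.0) + p
                prod[(s, e)][a] = dist
    return prod
```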
All computations presented in the following were performed on a PC with an Intel Core i7-8700 3.20 GHz 12-core processor and 16 GB of RAM, running Ubuntu 18.04 LTS. All running times are means over at least 5 runs, and the standard deviation was always below 5% among these runs.

Electric Vehicle Routing

We consider the area in the middle of Manhattan, from 42nd to 116th Street, see Fig. 2. Street intersections and the directions of feasible movement form the state and action spaces of the MDP. Intersections in the proximity of real-world fast charging stations [56] represent the set of reload states.

After the AEV picks a direction, it reaches the next intersection in that direction deterministically, but with a stochastic energy consumption. We base our model of consumption on distributions of vehicle travel times in the area [55] and on a conversion of velocity and travel times to energy consumption [52]. We discretize the consumption distribution into three possible values (c_1, c_2, c_3), attained with the corresponding probabilities (p_1, p_2, p_3). The transition from one intersection (I_1) to another (I_2) is then modeled using three dummy states, as explained in Fig. 3 (a sketch of this encoding is given below). In this fashion, we model the street network of Manhattan as a CMDP with 7378 states and 8473 actions.

For a fixed set of 100 randomly selected target states, Fig. 4 shows the influence of the requested capacity on the running times of (a) the computation of a strategy for the Büchi objective on the CMDP (our approach) and (b) the MEC decomposition of the corresponding explicit MDP. The plots show that while our algorithm runs reasonably fast for all capacities (its running time stabilizes for cap > 95), this is not the case for the explicit approach: the running time of the MEC decomposition depends on the numbers of states and actions of the explicit MDP, which keep growing with the capacity. The explicit MDP for capacity 95 has 527475 states, while the original CMDP still has only 7378. Note also that actually solving the Büchi objective on the explicit MDP would further require computing almost-sure reachability of MECs containing target states. We can therefore expect that our approach would outperform the explicit one even for small capacities (Fig. 4 (c)).
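The transition gadget of Fig. 3 can be sketched as follows; the encoding below is one natural reading of the construction (and, for brevity, it ignores the formal requirement that ∆ and C be total), not the paper's exact definition.

```python
def add_stochastic_edge(succ_dist, cons, i1, i2, outcomes):
    """Encode a move i1 -> i2 whose consumption is c_k with
    probability p_k, for outcomes = [(c1, p1), (c2, p2), (c3, p3)].
    Since C(s, a) is a deterministic function of (s, a), we branch
    through one dummy state per outcome: a zero-consumption action
    from i1 picks the outcome, and the dummy's single action pays
    c_k and moves to i2.  The dummies consume > 0, so no
    zero-consumption cycles are introduced.
    """
    go = ("go", i2)
    succ_dist.setdefault(i1, {})[go] = {}
    cons.setdefault(i1, {})[go] = 0
    for k, (c, p) in enumerate(outcomes):
        d = ("dummy", i1, i2, k)
        succ_dist[i1][go][d] = p
        succ_dist[d] = {("pay",): {i2: 1.0}}
        cons[d] = {("pay",): c}
```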

Multi-agent Grid World
We use a multi-agent grid world to generate CMDPs with huge numbers of states, in order to reach the scalability limits of the proposed algorithms. We model the rover and the helicopter of the Mars 2020 mission with the following realistic considerations: the rover enjoys an infinite energy supply, while the helicopter is restricted by a battery recharged at the rover. The two vehicles jointly operate on a mission where the helicopter reaches areas inaccessible to the rover. The outcomes of the helicopter's actions are deterministic, while those of the rover, influenced by terrain dynamics, are stochastic. For a grid world of size n, this system is naturally modeled as a CMDP with n^4 states. Fig. 5 shows the running times for the Büchi objective for growing grid sizes and capacities of the CMDP. It also shows the running time of the MEC decomposition of the corresponding explicit MDP for capacity 10. We observe that the computation time for the CMDP grows roughly linearly with the number of states, and our implementation deals with an MDP with 1.6 × 10^5 states in no more than seven minutes.

Conclusion & Future Work
We presented the first study of consumption Markov decision processes (CMDPs) with qualitative ω-regular objectives. We developed and implemented a polynomial-time algorithm for CMDPs with an objective of probability-1 satisfaction of a given Büchi condition. Possible directions for future work are extensions to quantitative analysis (e.g. minimizing the expected resource consumption), stochastic games, or the partially observable setting.

A Proofs
A.1 Proof of Theorem 2

Lemma 8. There exists a memoryless strategy optimal for the objective MinReach_T.
Proof. It is clear that in every state s, the player must play a good action, i.e. an action a such that C(s, a) + max_{s'∈Succ(s,a)} MinReach_T(s') ≤ MinReach_T(s). If there are multiple good actions in some state s, we proceed as follows.
We first assign a ranking to states. All states in T have rank 0. Now assume that we have assigned ranks ≤ i and we want to assign rank i + 1 to some of the yet unranked states. We say that an action a is progressing in an unranked state s if all s' ∈ Succ(s, a) have a rank (smaller than i + 1). If all good actions in the yet unranked states are non-progressing, then all the unranked states s have MinReach_T(s) = ∞, since by playing any good action the player cannot force reaching a ranked state, and thus also cannot force reaching a target state; hence, in this case we assign all the unranked states the rank ∞ and finish the construction. Otherwise, we assign rank i + 1 to all unranked states that have a good progressing action, and continue with the construction. It is easy to see that a state s is assigned a finite rank if and only if MinReach_T(s) < ∞.
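The ranking construction is essentially an attractor computation over good actions; a Python sketch under our illustrative encoding, with good[s] the set of good actions in s:

```python
from math import inf

def compute_ranks(states, succ, good, targets):
    """Rank 0 for targets; rank i+1 for unranked states with a good
    progressing action (all successors already ranked); inf for the
    rest."""
    rank = {s: (0 if s in targets else None) for s in states}
    i = 0
    while True:
        newly = {s for s in states if rank[s] is None
                 and any(all(rank[t] is not None for t in succ[s][a])
                         for a in good[s])}
        if not newly:
            break
        i += 1
        for s in newly:
            rank[s] = i
    return {s: (r if r is not None else inf) for s, r in rank.items()}
```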
We now fix a memoryless strategy σ such that in a state of finite rank, σ chooses a good progressing action, while in states of infinite rank, σ chooses an arbitrary (but fixed) action. Since σ only uses good actions, a straightforward induction shows that each s-initiated σ-compatible run that reaches T does so with consumption at most MinReach_T(s). So it remains to show that σ does not admit runs, initiated in a state of finite rank, that never reach T. But all σ-compatible runs initiated in a state of finite rank decrease the rank in every step, since σ only plays progressing actions. The result follows.
Given a target set T, a number i ∈ N, and a run ϱ, we define ReachCons^i_T(ϱ) as ReachCons_T(ϱ) with the additional restriction that f ≤ i. Intuitively, ReachCons^i_T(ϱ) = ∞ if ϱ does not visit T within the first i − 1 steps. We then put MinReach^i_T(s) = min_σ sup_{ϱ∈Comp(σ,s)} ReachCons^i_T(ϱ). The following lemma is proved by a straightforward induction on i.

Lemma 9. For each i ≥ 0 it holds that F^i(x_T) = MinReach^i_T.

Proof (of Theorem 2). It is easy to see that for each s and each i ≥ 0 it holds that MinReach^i_T(s) ≥ MinReach_T(s). Furthermore, an easy computation shows that MinReach_T is a fixed point of F. Hence, it suffices to show that MinReach^{|S|}_T(s) = MinReach_T(s). Assume, for the sake of contradiction, that we have some s ∈ S such that MinReach^{|S|}_T(s) > MinReach_T(s). Fix σ to be the memoryless optimal strategy from Lemma 8. By the definition of MinReach^{|S|}_T(s), we have a run ϱ ∈ Comp(σ, s) such that ReachCons^{|S|}_T(ϱ) > MinReach_T(s). This is only possible if ReachCons^{|S|}_T(ϱ) = ∞, since otherwise this would contradict the optimality of σ (note that if ReachCons^i_T(ϱ) < ∞, then ReachCons^j_T(ϱ) = ReachCons^i_T(ϱ) for all j ≥ i). Hence, there is a run compatible with σ whose prefix of length |S| does not contain a target state. But this prefix must contain a cycle, and since σ is memoryless, we can iterate this cycle forever. Hence, the optimal strategy σ admits a run from s that never reaches T, a contradiction with MinReach_T(s) < MinReach^{|S|}_T(s) ≤ ∞.

A.2 Proof of Lemma 3
By induction on i. The base case is clear. Now assume that the equality holds for some i. For each s ∈ S we have

G^{i+1}(∞)(s) = min_{a∈A} ( C(s, a) + max_{t∈Succ(s,a)} ⌊G^i(∞)⌋_R(t) ) = min_{a∈A} ( C(s, a) + max_{t∈Succ(s,a)} F^i(x_R)(t) ) = F^{i+1}(x_R)(s̃),

where the second equality holds since for t ∈ R we have ⌊G^i(∞)⌋_R(t) = 0 = F^i(x_R)(t), while for t ∉ R the induction hypothesis applies (the last equation following from the definition of F and from the fact that s̃ is never a reload state). Moreover, from Lemma 9 it follows that F^{i+1}(x_R)(s̃) = MinReach^{i+1}_{M̃,R}(s̃). But if s is not a reload state, then MinReach^{i+1}_{M̃,R}(s̃) = MinReach^{i+1}_{M̃,R}(s), since s and s̃ have identical dynamics and neither of them belongs to the target set R. So for s ∈ S \ R we have G^{i+1}(∞)(s) = F^{i+1}(x_R)(s̃) = F^{i+1}(x_R)(s), which proves the second part.

A.3 Completion of the proof of Theorem 4
Now we prove that upon termination, out ≥ Safe_M. For every t ∈ S there exists, by the definition of MinInitCons, a strategy σ_t such that ReachCons^+_{M(Rel),Rel}(σ_t, t) = mic(t) = out(t), and thus ReachCons^+_{M(Rel),Rel}(ϱ) is bounded by out(t) for each ϱ ∈ Comp(σ_t, t). We construct a new strategy π that, starting in some state t, initially mimics σ_t until the first visit of a state r_1 ∈ Rel. Once this happens, the strategy π begins to mimic σ_{r_1} until the next visit of some r_2 ∈ Rel, when π begins to mimic σ_{r_2}, and so on ad infinitum.
Fix any state s such that, upon termination, mic(s) ≤ cap (for other states, the inequality out(s) ≥ Safe_M(s) clearly holds). We prove that every run ϱ ∈ Comp_{M(Rel)}(π, s) is actually mic(s)-safe from s in the original CMDP M. In fact, it suffices to show this in M(Rel), since each reload state of M(Rel) is also a reload state of M.
Let i_1 < i_2 < … be all the indices i such that ϱ_i ∈ Rel. Since π mimics σ_s until i_1, we have RL_{mic(s)}(ϱ_{..j}) = mic(s) − cons(ϱ_{..j}) for all j ≤ i_1, where cons(ϱ_{..j}) is bounded by mic(s) by the definition of σ_s. Now let m ≥ 1 and set k = i_m and l = i_{m+1}. As π mimics σ_{ϱ_k} between k and l, we have, for k < j ≤ l, that RL_{mic(s)}(ϱ_{..j}) = RL_cap(ϱ_{k..j}) = cap − cons(ϱ_{k..j}), where cons(ϱ_{k..j}) is bounded by ReachCons^+_{Rel}(σ_{ϱ_k}, ϱ_k) = mic(ϱ_k) ≤ cap. Therefore, RL_{mic(s)}(ϱ_{..j}) ≥ 0 for all j, and ϱ is mic(s)-safe. This finishes the proof.

A.4 Proof of Lemma 4
By induction on i. The base case is clear. Now assume that the statement holds for some i ≥ 0. Fix any s and denote b = B^{i+1}(y_T)(s) and d = SafePR^{i+1}_T(s). We show that b = d. The equality holds whenever s ∈ T, so in the remainder of the proof we assume that s ∉ T.
We first prove that b ≥ d. If b = ∞, this is clearly true. Otherwise, let a_min be the action minimizing SPR-Val(s, a_min, B^i(y_T)) (whose value equals b if s ∉ R), and let t_min ∈ Succ(s, a_min) be the successor used to achieve this value. By the induction hypothesis, there exists a strategy σ_1 that is B^i(y_T)(t_min)-safe in t_min with P^{σ_1}_{t_min}(Reach^i_T) > 0, and there also exists a strategy σ_2 that is Safe(t)-safe in t for all t ∈ Succ(s, a_min), t ≠ t_min.
Consider now a strategy π which, starting in s, plays a_min. If the outcome of a_min is t_min, π starts to mimic σ_1, otherwise it starts to mimic σ_2. We claim that π is b-safe in s and that P^π_s(Reach^{i+1}_T) > 0, by showing the following two points: 1. there is at least one run ϱ_T ∈ Comp(π, s) that reaches T in at most i + 1 steps; 2. all runs in Comp(π, s) are b-safe. We construct ϱ_T easily as ϱ_T = α ⊙ ϱ, where α = s a_min t_min and ϱ is the witness of P^{σ_1}_{t_min}(Reach^i_T) > 0. Now let ϱ ∈ Comp(π, s) be a run produced by π from s. It has to be of the form ϱ = s a_min t ⊙ ϱ', where ϱ' ∈ Comp(σ_1, t) if t = t_min and ϱ' ∈ Comp(σ_2, t) if t ≠ t_min; in both cases, ϱ' is Safe(t)-safe in t. By the definition of SPR-Val and by the induction hypothesis, we have for s ∉ R that b ≥ C(s, a_min) + Safe(t), thus RL_b(s a_min t) ≥ Safe(t), and thus ϱ is b-safe by Lemma 1. If s ∈ R and b = 0, then by similar arguments, as C(s, a_min) + Safe(t) ≤ cap (otherwise b would be ∞), we have RL_b(s a_min t) ≥ Safe(t), and thus ϱ is again b-safe. Now we prove that b ≤ d. This clearly holds if d = ∞, so in the remainder of the proof we assume d ≤ cap(M). By the definition of d, there exists a strategy σ such that σ is d-safe in s and P^σ_s(Reach^{i+1}_T) > 0. Let a = σ(s) be the action selected by σ in the first step when starting in s. We denote by τ the strategy such that for all histories α we have τ(α) = σ(s a α). For each t ∈ Succ(s, a) we assign a number d_t defined as d_t = 0 if t ∈ R, and d_t = RL_d(s a t) otherwise.
We finish the proof by proving these two claims: