1 Introduction

Markov Decision Processes (MDPs) are the standard model for sequential decision making under probabilistic uncertainty, and as such are central to the modelling of randomized algorithms and distributed systems with lossy channels, and serve as the underlying formalism in reinforcement learning. A key question in the verification of MDPs is: What is the maximal probability that some error state is reached? This question accounts for the probabilistic nature as well as the inherent (potentially adversarial) nondeterminism of the system. Various state-of-the-art probabilistic model checkers, such as Storm [20], Prism [27] and Modest [17], implement a variety of methods that automatically compute such maximal probabilities. Most widespread are variations of value iteration that iteratively apply a transition function to converge towards the requested probability.
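As a concrete illustration of such a value-iteration scheme, the following sketch computes maximal reachability probabilities on a toy MDP; the model, its state names, and the numbers are illustrative, not taken from the tools above.

```python
# Value-iteration sketch for maximal reachability probabilities on a
# toy MDP (all names and numbers are illustrative).  States are keys;
# mdp[s] maps each enabled action to a list of (successor, prob) pairs.

def max_reach_prob(mdp, targets, iterations=1000):
    """Iterate the Bellman operator
       x(s) <- max_a sum_{s'} P(s, a, s') * x(s'), with x = 1 on targets."""
    x = {s: (1.0 if s in targets else 0.0) for s in mdp}
    for _ in range(iterations):
        for s in mdp:
            if s in targets:
                continue
            x[s] = max(sum(p * x[t] for t, p in succs)
                       for succs in mdp[s].values())
    return x

# From state 0, action 'a' reaches the target 2 directly with
# probability 0.9; action 'b' detours via state 1 and only achieves 0.5.
mdp = {
    0: {'a': [(2, 0.9), (3, 0.1)], 'b': [(1, 1.0)]},
    1: {'a': [(2, 0.5), (3, 0.5)]},
    2: {'loop': [(2, 1.0)]},   # target (absorbing)
    3: {'loop': [(3, 1.0)]},   # sink (absorbing)
}
print(max_reach_prob(mdp, {2})[0])  # -> 0.9
```

In practice the iteration is stopped by a (sound) convergence criterion rather than a fixed iteration count; the fixed count here only keeps the sketch short.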

Hierarchical Structure. Despite various successes, the state space explosion remains a significant challenge to the model-based analysis of MDPs. To overcome this challenge, some approaches exploit symmetries or the parallel composition of a system. Other approaches exploit that typically not all paths through a system are equally likely and thus aim to find an essential or critical subsystem. While we exploit related ideas—a detailed comparison is given in the related work, cf. Sect. 7—our approach is fundamentally different and instead exploits a hierarchical decomposition natural in many system models. This decomposition is captured naturally by probabilistic programs (over discrete bounded variables) with non-nested subroutines, where some subroutines are called repeatedly with similar arguments. Figure 1 shows an example, which we use to demonstrate our approach in Sect. 2. More generally, we are interested in systems with an overall task that is achieved by a suitable combination of a limited number of sub-tasks. Such a setting occurs naturally, e.g., (i) in robotics, when multiple rooms on a floor need to be inspected, or (ii) in routing, when multiple packets need to be routed sequentially. The underlying problem structure is also exploited in hierarchical planning [5, 19, 30], where the goal is to find a good but not necessarily optimal policy (and induced value). We combine insights from hierarchical planning with an abstraction-refinement perspective and then construct an anytime algorithm with strict guarantees on the result.

Fig. 1.
figure 1

Simplified example for sending a token over an unreliable channel.

Local Model-Based Analysis. An adequate operational model for the model-based analysis of hierarchical systems is given by a hierarchical MDP, whose state space can be partitioned into subMDPs. Abstractly, one can represent a hierarchical MDP by the collection of subMDPs and a macro-level MDP [19] in which the probabilities of outgoing transitions at a state are described by a corresponding subMDP, cf. Sect. 3.2. In this paper, we focus on hierarchical MDPs where the policies that are optimal in (only) a subMDP are optimal (partial) policies in the hierarchical MDP. More intuitively, we can solve the subMDPs individually, i.e., the solution (w.r.t. the fixed measure) for a subMDP is part of the globally optimal solution. While this assumption is restrictive, it is satisfied in various interesting settings. The assumption allows us to analyse subMDPs out-of-context, i.e., we can first analyse the subMDPs and then construct the correct macro-MDP, that is, extract transition probabilities and rewards from the subMDP analysis. This approach already reduces the peak memory consumption and allows for additional speed-ups if the same subMDP occurs multiple times.

Epistemic Uncertainty During Computation. The key insight to accelerate the outlined approach further is to avoid analysing all subMDPs precisely, while still providing sound guarantees on the obtained results. To see this, consider that even before analysing the subMDPs, we can analyse an uncertain variant of the macro-level MDP in which we do not yet know the associated transition probabilities and rewards but instead only know intervals. We may then do two things: First, we can identify the subMDPs which are most critical, i.e., where replacing the interval by a concrete value yields the most benefit. Second, and more importantly, we can analyse a set of subMDPs and refine the associated uncertainties, i.e., tighten the associated intervals. To support the analysis of sets of subMDPs, we observe that these subMDPs are often slight variations of each other. In this paper, we represent them as parameterised instances of a template that we define using parametric MDPs (pMDPs). The resulting intervals can be used to create an (interval-valued version of the) macro-level MDP. Analysing this MDP gives bounds on the expected reward in the hierarchical MDP, and the bounds can be refined by analysing the subMDPs more precisely.

Contributions. In a nutshell, we explicitly allow for uncertainty during the solving process to speed up the analysis of hierarchical MDPs. Concretely, we contribute a scalable approach to solve hierarchical MDPs with many different subMDPs, in particular when these subMDPs are similar, but not the same. The approach resembles an abstraction-refinement loop where we abstract the hierarchical MDP in two layers and then refine the analysis of the lower layer to get a refined representation of the complete MDP. In every step, we can provide absolute error bounds. Our approach interprets the different subMDPs as a form of uncertainty. The efficient analysis originates from progress made in the analysis of uncertain (or parametric) MDPs, and brings that progress to a novel setting. The empirical evaluation with a prototype called level-up shows the efficacy of the approach.

2 Overview

We clarify the approach and its applicability with a motivating example that drastically abstracts a token passing process where the channel quality varies [12].

Setting. Consider the protocol in Fig. 1a, which sends a token N times via a channel. That channel successfully transmits packets with probability p, where p varies over time. The subroutine takes some amount of time t, depending on p. Specifically, in the model, we alternate between accumulating the required time and updating the channel quality for N token transmissions, and then return the accumulated time. We aim to compute the expected return value. For the subroutine, we assume that sending a token is repeated until an acknowledgement is received, which is abstractly modelled in Fig. 1b and corresponds to the small Markov chain in Fig. 2a. First, the token must be sent successfully (\(s_0 \rightarrow s_1\)); then we start sending acknowledgements. The process terminates (\(s_1 \rightarrow s_2\)) once an acknowledgement is received. The complete protocol from Fig. 1, including the subroutine, is reflected by the large Markov chain in Fig. 2b, which repeats the small Markov chain (with different probabilities). This model may be analysed with standard tools, but for large N (and larger subroutines), the state space explosion must be alleviated.
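For intuition, the retry loop of Fig. 2a can be solved by a small fixed-point computation. The sketch below assumes each step of the chain costs one time unit, so that the expected time from \(s_0\) works out to \(2/p\):

```python
# Expected time of the retry loop in Fig. 2a, assuming each step takes
# one time unit: the token is resent until success (prob. p), then
# acknowledgements are resent until one is received (prob. p).
# The equations E(s1) = 1 + (1-p)*E(s1) and
# E(s0) = 1 + (1-p)*E(s0) + p*E(s1) yield E(s0) = 2/p.

def expected_time(p, iterations=10000):
    """Fixed-point iteration E <- r + P*E on the three-state chain."""
    e0, e1 = 0.0, 0.0            # E(s2) = 0 at the target
    for _ in range(iterations):
        e1 = 1.0 + (1.0 - p) * e1
        e0 = 1.0 + (1.0 - p) * e0 + p * e1
    return e0

print(expected_time(0.5))        # converges to 2 / 0.5 = 4.0
print(expected_time(8 / 25))     # converges to 2 / (8/25) = 6.25
```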

Fig. 2.
figure 2

Ingredients for hierarchical MDPs with the Example from Fig. 1. Annotations reflect subMDPs within the macro-MDPs in Fig. 3.

Macro-MDPs and Enumeration. We thus suggest to abstract the hierarchical model into the macro-level MDP in Fig. 3a. Here, every state corresponds to an invocation of the subprocess. The reward at the states corresponds to the expected reward for the complete subprocess. Thus, naively, one may construct the macro-MDP, analyse all (reachable) subMDPs independently, annotate the macro-MDP states with the appropriate rewards, and finally analyse the macro-MDP to obtain a result of \(\approx \)12.3. This approach avoids representing the complete hMDP in memory, but it is still restricted to analysing systems with a limited number of subMDPs.

Our Approach. We improve scalability by constructing a parameterized macro-MDP. Reconsider the rewards for Fig. 3a. The values can be computed via the graph in Fig. 3d, where for each value of p (x-axis) we plot the corresponding expected reward \(\mathbb {E}\) (y-axis), obtained by analysing the subMDP in Fig. 2a. Intuitively, in our abstraction, we annotate the rewards with lower and upper bounds rather than exact values. Therefore, we compute bounds on the rewards by selecting an interval of values \(p \in [8/25, 25/32]\), as shown in Fig. 3e. Conceptually, this means that we analyse a set of subMDPs at once, namely all subMDPs with \(p \in [8/25, 25/32]\). Annotating the corresponding expected rewards, in this case \([64/25, 25/4]\), then yields the macro-MDP in Fig. 3b. Analysis of this MDP yields that the overall expected time is in [7.68, 18.75]. We refine these bounds by analysing subsets of the subMDPs. We may split the values for p into two sets \([8/25, 2/5]\) and \([1/2, 25/32]\). Then, we obtain two corresponding intervals on the expected time in the subMDP, as shown in Fig. 3f. Model checking the associated macro-MDP, in Fig. 3c, bounds the expected time by [10.12, 14.25]. Technically, we realize this reasoning using parameter lifting [33].
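The interval computation can be illustrated as follows. We use \(E(p) = 2/p\) as a stand-in for the subMDP analysis (the retry chain, under the assumption of one time unit per step); since E is monotonically decreasing in p, bounds over an interval follow from evaluating its endpoints. Parameter lifting [33] generalizes this reasoning to non-monotone, multi-parameter models.

```python
from fractions import Fraction as F

# Interval abstraction sketch.  E(p) = 2/p stands in for the subMDP
# analysis of the retry chain; it is decreasing in p, so bounds over
# an interval of channel qualities come from the interval's endpoints.

def reward_interval(p_lo, p_hi):
    expect = lambda p: F(2) / p          # E(p) = 2/p, decreasing in p
    return (expect(p_hi), expect(p_lo))  # (lower, upper) bound

lo, hi = reward_interval(F(8, 25), F(25, 32))
print(lo, hi)        # 64/25 and 25/4, i.e. the interval [2.56, 6.25]

# Accumulating this reward over N = 3 macro-steps (consistent with the
# bounds reported above) gives [3 * 64/25, 3 * 25/4] = [7.68, 18.75].
print(float(3 * lo), float(3 * hi))
```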

Fig. 3.
figure 3

Visualising the computation of expected rewards for the hMDP from Fig. 2b using a macro-MDP and interval-based abstractions.

Supported Extensions. For conciseness, this example is necessarily simple. Our approach allows nondeterminism, i.e., action choices, in the macro-MDP and in the subMDPs. The subMDPs may have multiple outgoing transitions, but this must be combined with a restricted type of nondeterminism in the subMDP: if multiple outgoing transitions are present, the macro-MDP has transition probabilities that depend on the subMDPs. We present a useful extension for reachability probabilities; see the discussion at the end of Sect. 3.3.

More Examples. The key ingredient of models where the approach excels is a repetitive task whose characteristics depend on some global state. Two variations are the expected energy consumption of a robot with slowly degrading components (which can, e.g., be improved by maintenance), and job scheduling with a periodically changing distribution of tasks (e.g., day vs. night).

3 Formal Problem Statement

We formalize MDPs and hierarchical MDPs (hMDPs) to pose the problem statement, then identify a subclass of hMDPs which we call local-policy hMDPs and restrict our problem to computing optimal expected rewards in local-policy hMDPs. Furthermore, we introduce parametric MDPs, as they are key to the abstraction-refinement procedure later in the paper.

3.1 Background

Definition 1 (Parametric MDP)

A parametric MDP (pMDP) is a tuple \(\mathcal {M} = \langle S_\mathcal {M}, A_\mathcal {M}, \iota _\mathcal {M}, \vec {x}, P_\mathcal {M}, r_\mathcal {M}, T_\mathcal {M}\rangle \) where \(S_\mathcal {M}\) is a finite set of states, \(A_\mathcal {M}\) is a finite set of actions, \(\iota _\mathcal {M} \in S_\mathcal {M}\) is the initial state, \(\vec {x} = \langle x_0, \ldots , x_n \rangle \) is a vector of parameters, \(P_\mathcal {M}:S_\mathcal {M} \times A_\mathcal {M} \times S_\mathcal {M} \rightarrow \mathbb {Q}[\vec {x}]\) are the transition probabilities, \(r_\mathcal {M}:S_\mathcal {M} \rightarrow \mathbb {Q}[\vec {x}]\) are the state rewards, and \(T_\mathcal {M}\) is a set of target states.

We drop the subscripts whenever possible. MDPs are parametric if \(\vec {x} \ne \langle \rangle \) and parameter-free otherwise. We omit parameters for parameter-free MDPs. We recap some standard notions on pMDPs (and MDPs):

For a (parameter) valuation \(u \in \mathbb {R}^{\vec {x}}\), the instantiation \(\mathcal {M}[u]\) globally substitutes \(P_\mathcal {M}(s,a,s')\) with \(P_\mathcal {M}(s,a,s')(u)\) and \(r_\mathcal {M}(s)\) with \(r_\mathcal {M}(s)(u)\). A valuation u is well-defined if \(\mathcal {M}[u]\) constitutes an MDP, i.e., if \(\sum _{s'} P_\mathcal {M}(s,\alpha ,s')(u) \in \{0, 1\}\) and \(r_\mathcal {M}(s)(u) \ge 0\) for each \(s \in S\), \(\alpha \in A\). We denote the set of all well-defined valuations with \(U_\mathcal {M}\). The set \(\textsf {Act}(s)\) denotes the enabled actions at state s, \(\textsf {Act}(s) = \{ \alpha \mid \sum _{s'} P_\mathcal {M}(s,\alpha ,s') \ne 0 \}\). If \(|\textsf {Act}(s)| = 1\) for every \(s \in S\), then the (parametric) MDP is a (parametric) Markov chain (MC). A path \(\pi \) is an (in)finite sequence of states \(s_0 \mathop {\rightarrow }\limits ^{\alpha _0} s_1\ldots \), with \(s_i \in S\), \(\alpha _i \in \textsf {Act}(s_i)\), and \(P(s_i, \alpha _i,s_{i+1}) \ne 0\). For finite \(\pi \), \(\textsf {last}(\pi )\) denotes the last state of \(\pi \). We use \([s \rightarrow \lozenge T]\) to denote the set of finite paths that start in s and reach T only at the end. The reward \(r(\pi )\) along a finite path \(\pi \) is the sum of the state rewards, \(r(\pi ) :=\sum _i r(s_i)\).
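The instantiation and well-definedness conditions can be sketched programmatically. The model below is a hypothetical single-parameter pMDP; its polynomials are represented as Python callables over a valuation dictionary.

```python
# Sketch of pMDP instantiation: transition probabilities and rewards
# are functions of a parameter valuation u; M[u] substitutes u, and u
# is well-defined if every enabled row sums to 1 (or 0), probabilities
# lie in [0, 1], and rewards are nonnegative.  Toy model, one parameter x.

P = {('s0', 'a', 's1'): lambda u: u['x'],
     ('s0', 'a', 's0'): lambda u: 1 - u['x'],
     ('s1', 'a', 's1'): lambda u: 1.0}
r = {'s0': lambda u: 2 * u['x'], 's1': lambda u: 0.0}
STATES, ACTIONS = ('s0', 's1'), ('a',)

def instantiate(u):
    """The instantiation M[u]: substitute the valuation u everywhere."""
    return ({k: f(u) for k, f in P.items()}, {s: f(u) for s, f in r.items()})

def well_defined(u, eps=1e-9):
    Pu, ru = instantiate(u)
    for s in STATES:
        for a in ACTIONS:
            row = sum(Pu.get((s, a, t), 0.0) for t in STATES)
            if not (abs(row) < eps or abs(row - 1.0) < eps):
                return False
    return (all(-eps <= p <= 1 + eps for p in Pu.values())
            and all(v >= -eps for v in ru.values()))

print(well_defined({'x': 0.3}))   # True
print(well_defined({'x': -0.1}))  # False: negative probability and reward
```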

Specifications. We consider indefinite-horizon expected rewards, i.e., the expected accumulated reward until reaching the target states. We refer to [3, 32] for a formal treatment and only introduce notation. The unique probability measure \(Pr\) for a set of paths in a parameter-free Markov chain \(\mathcal {M}\) can be defined using the usual cylinder set construction. We define \(Pr_\mathcal {M}(s \rightarrow \lozenge T)\), the probability to reach a state in T from s, as \(\int _{\pi \in [s \rightarrow \lozenge T]} Pr(\pi ) d\pi \). We then define the expected reward until hitting T, \(\textsf {ER}_{\mathcal {M}}(s \rightarrow \lozenge T) = \int _{\pi \in [s \rightarrow \lozenge T]} Pr(\pi ) \cdot r(\pi ) d\pi \). In both definitions, if s is the initial state, we simply write \(\ldots (\lozenge T)\). For technical conciseness, we make the standard assumption that target states are reached with probability 1, which ensures that the integral exists and is finite. (Arbitrary) reachability probabilities can nevertheless be modelled using rewards.

Policies. In pMDPs, we resolve nondeterminism with policies. In this paper, it suffices to consider memoryless policies \(\sigma :S \rightarrow A\). The set of such policies is denoted \(\varSigma (\mathcal {M})\); we omit \(\mathcal {M}\) if it is clear from the context. It is helpful to also consider partial policies \(\hat{\sigma }:S \nrightarrow A\). For a pMDP \(\mathcal {M}\) and a (partial) policy \(\hat{\sigma }\), the induced dynamics are described by the induced pMDP \(\mathcal {M}[\hat{\sigma }]\), defined as \(\langle S_\mathcal {M}, A_\mathcal {M}, \iota _\mathcal {M}, \vec {x}, P, r_\mathcal {M}, T_\mathcal {M}\rangle \), where the transition probabilities are given as

$$\begin{aligned} P(s,\alpha , s') = {\left\{ \begin{array}{ll} P_\mathcal {M}(s,\alpha ,s') &{} \text {if } \hat{\sigma }(s) = \alpha , \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

If \(\sigma \) is total (not partial), then \(\mathcal {M}[\sigma ]\) is an MC. We define the maximal expected reward \(\textsf {ER}_{\mathcal {M}}^\text {max}(\lozenge T) = \max _{\sigma \in \varSigma } \textsf {ER}_{\mathcal {M}[\sigma ]}(\lozenge T)\), and say that a policy \(\sigma \) is optimal if \(\textsf {ER}_{\mathcal {M}}^\text {max}(\lozenge T) = \textsf {ER}_{\mathcal {M}[\sigma ]}(\lozenge T)\).

Regions and Parametric Model Checking. A set of valuations \(R \subseteq \mathbb {R}^{\vec {x}}\) is called a (rectangular) region if \(R = \{ u \mid u^{-}\le u \le u^{+}\}\) for adequate bounds \(u^{-}, u^{+}\in \mathbb {R}^{\vec {x}}\) and using pointwise inequalities, i.e., R is a Cartesian product of intervals of parameter values. We also denote this region with \([[u^{-},u^{+}]]\). For regions, we may compute a lower bound on \(\min _{u \in R} \textsf {ER}_{\mathcal {M}[u]}^\textsf {max}(\lozenge T)\) and an upper bound on \(\max _{u \in R} \textsf {ER}_{\mathcal {M}[u]}^\textsf {max}(\lozenge T)\) via parameter lifting [33, 36].

3.2 Hierarchical MDPs

We concentrate on solving hierarchical MDPs (hMDPs). We assume that hMDPs are parameter-free and that their topology has some additional known structure.

Definition 2 (Hierarchical MDPs)

An MDP \(\mathcal {M}\) with a partitioning of its states \(S_\mathcal {M}= \bigcup \mathbf {S}_i\) is a hierarchical MDP if, for all i,

  • there exists a unique \(s^i_\iota \in \mathbf {S}_i\) such that \(s^i_\iota = \iota _\mathcal {M}\text { or } \textsf {pred}_\mathcal {M}(s^i_\iota ) \not \subseteq \mathbf {S}_i\), and

  • \(\text {for all } s \in \mathbf {S}_i \setminus \{ s^i_\iota \}\), it holds that \(s \ne \iota _\mathcal {M}\text { and } \textsf {pred}_\mathcal {M}(s) \subseteq \mathbf {S}_i.\)

The state \(s^i_\iota \) is called the entry state, which we denote \(\textsf {entry}_i\). States with \(\textsf {succ}_{\mathcal {M}}(s) \cap \mathbf {S}_i = \emptyset \) are called exit states. The set \(\textsf {succ}(i) :=\textsf {succ}_{\mathcal {M}}(\mathbf {S}_i) \setminus \mathbf {S}_i\) contains the successor states of the partition i. Let \(Y = \max _i |\textsf {succ}(i)|\). By adding auxiliary states, we can assume that \(|\textsf {succ}(i)| = Y\) for all i. We call partitions with \(|\mathbf {S}_i| = 1\) trivial. We use \(\mathbb {I}:=\{ i \mid |\mathbf {S}_i| > 1 \}\) to denote the indices of the nontrivial partitions. We remark that every MDP can be considered as an hMDP with only trivial partitions.
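The conditions of Definition 2 can be checked on the underlying graph; a sketch follows, with an illustrative graph and partition.

```python
# Sketch: checking the hMDP condition of Definition 2 on a directed
# graph -- each partition must have a unique entry state (the initial
# state, or the only state with a predecessor outside the partition);
# all other states are entered only from within their partition.

def entry_states(edges, block, init):
    """States of `block` that are initial or have an outside predecessor."""
    preds = {t: set() for t in block}
    for s, t in edges:
        if t in block:
            preds[t].add(s)
    return [s for s in block if s == init or not preds[s] <= block]

def is_hierarchical(edges, partition, init):
    return all(len(entry_states(edges, block, init)) == 1
               for block in partition)

# Toy graph with partition {0} | {1, 2} | {3}: state 1 is the sole
# entry of its block; 2 is reached only from inside the block.
edges = [(0, 1), (1, 2), (2, 2), (2, 3)]
print(is_hierarchical(edges, [{0}, {1, 2}, {3}], init=0))  # True

# Adding an edge 0 -> 2 gives the block {1, 2} two entry states.
print(is_hierarchical(edges + [(0, 2)], [{0}, {1, 2}, {3}], init=0))  # False
```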

(Problem statement box: given a hierarchical MDP \(\mathcal {M}\) with target states \(T_\mathcal {M}\), compute \(\textsf {ER}_{\mathcal {M}}^\textsf {max}(\lozenge T_\mathcal {M})\).)

The naive solution to this problem is to ignore the hierarchical structure and solve the MDP monolithically. In this paper, we contribute methods that actively exploit the structure of the hierarchical MDPs with \(|\mathbb {I}| \gg 1\). We will make an additional assumption on the structure of the hierarchical MDP.

3.3 Optimal Local Subpolicies and Beyond

Intuitively, we want to ensure that the optimal policy within a partition can be computed locally, i.e., on the partition without taking the complete MDP into account. To this end, each partition within the MDP can be considered as an individual MDP. In particular, each \(\mathbf {S}_i\) induces a subMDP as follows:

Definition 3 (subMDP)

Given a hierarchical MDP \(\mathcal {M}\) and partition \({\textbf {S}}_i\), the corresponding subMDP is an MDP \(\mathcal {M}_i :=\langle S_i :={\textbf {S}}_i \cup \textsf {succ}_{\mathcal {M}}({\textbf {S}}_i) \cup \{ \bot \}, A_\mathcal {M}\cup \{ \alpha _\bot \}, \iota :=\textsf {entry}_{i}, P_i, r_i, G_i \rangle \) with \(P_i\) defined by

$$\begin{aligned} P_i(s,\alpha ,s') :={\left\{ \begin{array}{ll} P_\mathcal {M}(s,\alpha , s') &{} \text { if } s \in {\textbf {S}}_i \text { and }\alpha \in A_\mathcal {M}, \\ 1 &{} \text {else if } s \not \in {\textbf {S}}_i, \alpha = \alpha _\bot , \text { and } s' = \bot \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

\(r_i\) is defined as \(r_i(s) = r_\mathcal {M}(s)\) if \(s\in {\textbf {S}}_i\), \(r_i(s) = 0\) otherwise, and \(G_i :=\{ \bot \}\).

Thus, for every partition of the hierarchical MDP, the corresponding subMDP additionally contains the successor states and a unique bottom state, which serves as the target state and simplifies our construction later.
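The construction of Definition 3 can be sketched as follows; transition functions are plain dictionaries, and the names are illustrative.

```python
# Sketch of the subMDP construction (Definition 3): states of the
# partition keep their transitions; every successor outside the
# partition, and the bottom state itself, gets a single fresh action
# leading with probability 1 to the absorbing target state BOT.

BOT = '_bot'

def sub_mdp(P, block):
    """P: dict (state, action, successor) -> probability.
    Returns the transition function P_i of the subMDP M_i."""
    outside = {t for (s, _, t) in P if s in block and t not in block}
    P_i = {(s, a, t): p for (s, a, t), p in P.items() if s in block}
    for s in outside | {BOT}:
        P_i[(s, 'a_bot', BOT)] = 1.0
    return P_i

# A block {s0, s1} whose only outside successor is u.
P = {('s0', 'a', 's1'): 0.5, ('s0', 'a', 's0'): 0.5,
     ('s1', 'b', 'u'): 1.0}
P_i = sub_mdp(P, block={'s0', 's1'})
print(P_i[('u', 'a_bot', BOT)])    # 1.0
print(P_i[(BOT, 'a_bot', BOT)])    # 1.0 (BOT is absorbing)
```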

Likewise, we can (de)compose memoryless policies for the hierarchical MDP as a union of policies on the individual subMDPs. We do this only for nontrivial partitions. Let \(\sigma _i :S_i \rightarrow A\) denote memoryless policies for \(\mathcal {M}_i\) and \(\sigma '_i\) the restriction of \(\sigma _i\) to \({\textbf {S}}_i\); then \(\left( \bigsqcup _{\mathbb {I}} \sigma _{i } \right) :S \nrightarrow A\) is the unique partial policy such that

$$\begin{aligned} \big ( \bigsqcup _{\mathbb {I}} \sigma _{i } \big )(s) :=\sigma '_i(s) \text { if } s \in {\textbf {S}}_i, i \in \mathbb {I}\quad \text { and }\quad \big ( \bigsqcup _{\mathbb {I}} \sigma _{i } \big )(s) :=\bot \text { otherwise. } \end{aligned}$$

Intuitively, we want that the union of locally optimal policies, a partial policy, can be completed to a total policy that is optimal.

Definition 4 (Optimal local subpolicies)

Given a hierarchical MDP \(\mathcal {M}\) with target states T and optimal policies \(\sigma _i \in \varSigma (\mathcal {M}_i)\) for all \(i \in \mathbb {I}\), the hierarchical MDP has optimal local subpolicies if, for \(\hat{\sigma }= \bigsqcup _\mathbb {I}\sigma _i\), it holds that \(\textsf {ER}_{\mathcal {M}[\hat{\sigma }]}^\textsf {max}(\lozenge T) = \textsf {ER}_{\mathcal {M}}^\textsf {max}(\lozenge T)\).

That is, if we collect (locally) optimal policies \(\sigma _i\) and apply them to \(\mathcal {M}\), we obtain the MDP \(\mathcal {M}[\left( \bigsqcup _{\mathbb {I}} \sigma _{i } \right) ]\). In that MDP, we can pick an optimal policy, and together with \(\left( \bigsqcup _{\mathbb {I}} \sigma _{i } \right) \) this constitutes an optimal and total policy for \(\mathcal {M}\).

(Refined problem statement box: compute \(\textsf {ER}_{\mathcal {M}}^\textsf {max}(\lozenge T)\) for hMDPs with optimal local subpolicies.)

Roughly, the idea now becomes that rather than solving one large MDP with |S| states, we solve \(|\mathbb {I}|\) MDPs with \(|S|/|\mathbb {I}|\) states each and one MDP with \(|\mathbb {I}|\) states (assuming equally sized and only nontrivial partitions).

The assumption is restrictive, but not unreasonable: a subroutine may not have any nondeterminism, or a finished task may have no influence on any future task. The following proposition, while obvious, formalizes this:

Proposition 1 (Sufficient criterion)

Let \(\mathcal {M}\) be a hierarchical MDP. The MDP has optimal local subpolicies, if for each \(i \in \mathbb {I}\) either

  • there is a single successor for the partition, i.e., \(|\textsf {succ}_{\mathcal {M}}({\textbf {S}}_i) \setminus {\textbf {S}}_i|=1\), or

  • there are no choices, i.e., \(|\textsf {Act}(s)| = 1\) for all \(s \in {\textbf {S}}_i\).

Beyond Optimal Local Subpolicies. The efficiency of our approach is partly due to the assumption in Definition 4. We observe that adapting this definition allows for a spectrum of specific yet useful cases. In particular, say that our system describes a protocol in which we must optimize the probability to satisfy N tasks, each of which may fail – the subMDPs will then have two successor states. Often, it is then easy to see (and model) that a locally optimal policy will aim to satisfy each task and that, thus, the locally optimal policy maximizes the probability to reach the corresponding successor state. Then, by adapting the target states in Definition 3 to be the successor state where the task is successful, the notion of an optimal policy—and thus of an optimal local subpolicy—changes. These changes are minimal, and everything that follows below is easily adapted to this setting, as demonstrated by the prototypical implementation.

4 Solving hMDPs with Abstraction-Refinement

In this section, we consider hMDPs with optimal local subpolicies. We step-wise develop a sketch of an anytime algorithm that provides lower and upper bounds on the expected reward in this hMDP. In Sect. 4.1, we introduce an alternative representation of our problem that formalizes the idea of individually computing subMDPs. We then formalize the ideas that allow to construct an anytime algorithm in Sect. 4.2. In Sect. 4.3, we introduce the abstract requirements for analysing sets of subMDPs into the algorithm, and finally, in Sect. 4.4 we introduce a method that realises this using pMDPs.

4.1 The Macro-MDP Formulation

We adapt macro-MDPs [5], which summarize the subMDPs by single states.

Definition 5 (Macro-MDP)

Let \(\mathcal {M}\) be a hMDP with n non-trivial \({\textbf {S}}_i\) partitions and \(S_\mathcal {M}\) partitioned as \(S_\mathcal {M}= \bigcup {\textbf {S}}_i \cup S'\). The macro-MDP is defined as \(\mu (\mathcal {M}) :=\langle S' \cup \{ \textsf {entry}_i \mid 1 \le i \le n \}, A_\mathcal {M}, \iota _\mathcal {M}, \emptyset , P, r, T_\mathcal {M}\rangle \) with P and r given by

$$ P(s, \alpha , s') = {\left\{ \begin{array}{ll} \textsf {Pr}_{\mathcal {M}_i[\sigma _i]}(\lozenge \{s'\}) &{}\text {if } s \in {\textbf {S}}_i, \\ P_\mathcal {M}(s,\alpha ,s') &{} \text {otherwise,} \end{array}\right. } \quad r(s) = {\left\{ \begin{array}{ll} \textsf {ER}_{\mathcal {M}_i}^\textsf {max}(\lozenge \{\bot \}) &{}\text {if } s \in {\textbf {S}}_i, \\ r_\mathcal {M}(s) &{} \text {otherwise.} \end{array}\right. } $$

where \(\mathcal {M}_i\) is the corresponding subMDP (see Definition 3) and \(\sigma _i\) is an arbitrary but fixed optimal policy, i.e., a policy such that \(\textsf {ER}_{\mathcal {M}_i[\sigma _i]}(\lozenge G_i) = \textsf {ER}_{\mathcal {M}_i}^\textsf {max}(\lozenge G_i)\).

Intuitively, we replace the transitions within \({\textbf {S}}_i\) by a ‘big-step semantics’ that aggregates the transitions within \({\textbf {S}}_i\) into single transitions, such that the probability to reach any successor matches the probability to do so within \({\textbf {S}}_i\) under a specific (optimal) policy. Likewise, the expected reward matches the expected reward collected within \({\textbf {S}}_i\).

Remark 1

To define a unique macro-MDP, we can take the lexicographically smallest policy \(\sigma _i\) among the optimal policies. Furthermore, we observe that for the cases covered by Proposition 1, it is not necessary to compute \(\sigma _i\) at all: Either there is a single successor—implying \(\textsf {Pr}_{\mathcal {M}_i[\sigma _i]}(\lozenge \{s'\}) = 1\) for any \(\sigma _i\)—or \(|\varSigma (\mathcal {M}_i)|=1\).

The following theorem formalises that, given the assumptions, taking the big-step semantics is adequate when optimizing for an expected reward.

Theorem 1

Let \(\mathcal {M}\) be a hMDP with optimal local subpolicies and let \(\mu (\mathcal {M})\) be the corresponding macro-MDP. Then: \(\textsf {ER}_{\mu (\mathcal {M})}^\textsf {max}(\lozenge T) = \textsf {ER}_{\mathcal {M}}^\textsf {max}(\lozenge T)\).

The important ingredient is the optimal local subpolicies, which ensure that we aggregate behavior within the partitions by behavior that agrees with a (globally) optimal policy. We give a proof in the appendix.

Naive Algorithm. Algorithmically, we first compute \(\textsf {ER}_{\mathcal {M}_i}^\textsf {max}(\lozenge G_i)\) and the associated policy \(\sigma _i\), and then compute the reachability probabilities in the induced Markov chain. We collect these results in a vector \(\textsf {res}_i\), which is helpful to construct the macro-MDP. To clarify further constructions in this paper, we make \(\textsf {res}_i\) explicit. Recall that \(|\textsf {succ}(i)| = Y\) for all i.

Definition 6 (Results for subMDP)

Let \(\mathcal {M}_i\) be a subMDP for the partition \({\textbf {S}}_i\) of a hMDP \(\mathcal {M}\). Let \(\textsf {succ}_{\mathcal {M}}({\textbf {S}}_i)\) be ordered. We define \(\textsf {res}_i \in \mathbb {R}^{Y+1}\) s.t.

$$\begin{aligned} \textsf {res}_i(j) :=\textsf {Pr}_{\mathcal {M}_i[\sigma _i]}(\lozenge \{\textsf {succ}_{\mathcal {M}}({\textbf {S}}_i)_j \}) \text { for }0 \le j < Y \text { and } \textsf {res}_i(Y) :=\textsf {ER}_{\mathcal {M}_i}^\textsf {max}(\lozenge G_i), \end{aligned}$$

where \(\sigma _i\) is an arbitrary but fixed policy such that \(\textsf {ER}_{\mathcal {M}_i[\sigma _i]}(\lozenge G_i) = \textsf {ER}_{\mathcal {M}_i}^\textsf {max}(\lozenge G_i)\).

This allows us to reformulate the macro-MDP; in particular, the following two identities hold:

$$\begin{aligned} P(s, \alpha , s') {=} {\left\{ \begin{array}{ll} \textsf {res}_i(j) &{}\text {if } s \in {\textbf {S}}_i \text { and } \\ {} &{} ~s' = \textsf {succ}_{\mathcal {M}}({\textbf {S}}_i)_j \\ P_\mathcal {M}(s,\alpha ,s') &{} \text {otherwise,} \end{array}\right. } \quad r(s) {=} {\left\{ \begin{array}{ll} \textsf {res}_i(Y) &{}\text {if } s \in {\textbf {S}}_i, \\ r_\mathcal {M}(s) &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

The identities show that the macro-MDP can be constructed by precomputing the necessary result vectors.

(Algorithm box: the naive enumeration of all subMDPs, as outlined above.)

This rather naive algorithm already limits memory consumption and may exploit similarities between subMDPs during the analysis, e.g., based on the structure discussed in Sect. 4.4. It performs well if the number \(|\mathbb {I}|\) of subMDPs is sufficiently small. We are interested in methods that allow for larger \(|\mathbb {I}|\) or larger subMDPs. In particular, we want to avoid analysing all subMDPs individually.
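The naive algorithm can be sketched for the running example, where each subMDP is the retry chain with expected time \(2/p_i\) and a single successor; the sequence of channel qualities below is hypothetical, chosen only for illustration.

```python
# Sketch of the naive enumeration algorithm: analyse every subMDP
# individually, collect the result vectors res_i, and solve the
# annotated macro-MDP.  Here each subMDP is the retry chain of
# Fig. 2a with success probability p_i; it has a single successor
# (macro transition probability 1) and expected time 2/p_i.

def solve_sub_mdp(p, iterations=10000):
    """res_i for the retry chain: fixed point of E <- r + P*E."""
    e0, e1 = 0.0, 0.0
    for _ in range(iterations):
        e1 = 1.0 + (1.0 - p) * e1
        e0 = 1.0 + (1.0 - p) * e0 + p * e1
    return (1.0, e0)            # (reach. prob. of successor, reward)

def solve_macro(ps):
    """The macro-MDP is a chain of |ps| subroutine calls, so the
    expected reward is the sum of the annotated state rewards."""
    return sum(solve_sub_mdp(p)[1] for p in ps)

# A hypothetical sequence of three channel qualities.
print(solve_macro([25 / 32, 1 / 2, 8 / 25]))  # ~ 2.56 + 4.0 + 6.25 = 12.81
```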

4.2 The Uncertain Macro-MDP Formulation

Uncertainty Before Computation. We now introduce a method that provides bounds on the expected reward after individually analysing only a subset of the subMDPs. Before computing the individual probabilities in \(\mathcal {M}_i\), we are uncertain about the probabilities and rewards in the MDP \(\mu (\mathcal {M})\). Under this uncertainty, we may not be able to compute \(\textsf {ER}_{\mu (\mathcal {M})}^\textsf {max}(\lozenge T)\) precisely. However, we may solve the problem statement by bounding the expected reward. Thus, the goal is to compute values \(\textsf {lb}, \textsf {ub}\) s.t.

$$\begin{aligned} \textsf {lb}\le \textsf {ER}_{\mathcal {M}}^\textsf {max}(\lozenge T) = \textsf {ER}_{\mu (\mathcal {M})}^\textsf {max}(\lozenge T) \le \textsf {ub}. \end{aligned}$$

Uncertain Macro-MDPs. We capture the a-priori uncertainty about the subMDP results in an uncertain macro-MDP, a particularly shaped parametric MDP.

Definition 7 (Uncertain macro-MDP)

Let \(\mathcal {M}\) be a hMDP with n nontrivial partitions \({\textbf {S}}_i\) and \(S_\mathcal {M}\) partitioned as \(S_\mathcal {M}= \bigcup {\textbf {S}}_i \cup S'\). The uncertain macro-MDP is defined as \(\nu (\mathcal {M}) :=\langle S' \cup \{ \textsf {entry}_i \mid 1 \le i \le n \}, A_\mathcal {M}, \iota _\mathcal {M}, \vec {x}, P, r, T_\mathcal {M}\rangle \) with parameters \(\vec {x} :=\{ p_{i,j}, q_i \mid 1 \le i \le n, 1 \le j \le Y \}\), with Y as in Sect. 3.2. P and r are given by

$$ P(s, \alpha , s') :={\left\{ \begin{array}{ll} p_{i,j} &{}\text {if } s \in {\textbf {S}}_i \text { and } \\ {} &{} ~s' = \textsf {succ}_{\mathcal {M}}({\textbf {S}}_i)_j, \\ P_\mathcal {M}(s,\alpha , s') &{} \text {otherwise,} \end{array}\right. } \quad r(s) :={\left\{ \begin{array}{ll} q_i &{}\text {if } s \in {\textbf {S}}_i, \\ r_\mathcal {M}(s) &{} \text {otherwise.} \end{array}\right. } $$

Remark 2

Whenever \(\mathcal {M}_i\) and \(\mathcal {M}_{i'}\) are isomorphic, we may reduce the parameters and replace each occurrence of \(p_{i',j}\) with \(p_{i,j}\) and each occurrence of \(q_{i'}\) with \(q_i\).

The uncertain macro-MDP can be instantiated to coincide with the macro-MDP by setting the parameters accordingly.

Theorem 2

Let \(\mathcal {M}\) be a hMDP, \(\mu (\mathcal {M})\) the associated unique macro-MDP, and \(\nu (\mathcal {M})\) the associated uncertain macro-MDP with parameters \(p_{i,j}\) and \(q_i\). Let \(u^*\) be a parameter valuation with \(u^*(p_{i,j})= \textsf {res}_i(j)\) and \(u^*(q_i)= \textsf {res}_i(Y)\) for all ij. Then:

$$ \nu (\mathcal {M})[u^*] = \mu (\mathcal {M}) $$

Proof sketch. The constructions of the uncertain macro-MDP and the macro-MDP differ only in the assignment of probabilities and rewards. We set \(u^*\) as in the characterisation in (1), and thus the equality follows.    \(\square \)

Computing Bounds. Assume for now that we can derive some (trivial) sound bounds on the result vector for any subMDP \(\mathcal {M}_i\).

Definition 8 (Sound bounds on results)

For \(\mathcal {M}_i\), the vectors \(\textsf {lbres}_i\) and \(\textsf {ubres}_i\) are sound bounds if the following pointwise inequality holds

$$\begin{aligned} \textsf {lbres}_i \le \textsf {res}_i \le \textsf {ubres}_i. \end{aligned}$$

These bounds on properties in the subMDP correspond to bounds on the parameters of the uncertain macro-level MDP \(\nu (\mathcal {M})\). Let us formalize this idea.

Definition 9 (Suitable parameter region)

Let \(u^*\) be as in Theorem 2. The bounds \(u^{-}, u^{+}\) are suitable if \(u^{-}\le u^* \le u^{+}\). For suitable \(u^{-}, u^{+}\), the region \([[u^{-}, u^{+}]]\) is called suitable.

Using this notion, sound bounds \(\textsf {lbres}_i\) and \(\textsf {ubres}_i\) thus yield suitable bounds \(u^{-}(x), u^{+}(x)\) for all \(x \in \bigcup _j p_{i,j} \cup \{ q_i \}\). Combined, the sound bounds for every \(i\) yield a suitable region. Formally:

Fig. 4.
figure 4

Analysing hMDPs via uncertain macro-MDPs with individual refinement.

Lemma 1

Given sound bounds \(\textsf {lbres}_i, \textsf {ubres}_i\) for each \(i\), there exists a trivial mapping \(\textsf {Reg}\) s.t. \(\textsf {Reg}(\textsf {lbres}_1, \ldots , \textsf {lbres}_n, \textsf {ubres}_1, \ldots , \textsf {ubres}_n)\) is a suitable region.
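The mapping \(\textsf {Reg}\) is straightforward to realize. The following sketch (the function name, the vector layout, and the parameter naming scheme are illustrative, not the paper's implementation) assembles a region, i.e., one interval per parameter, from per-subMDP bound vectors:

```python
def reg(bounds):
    """bounds maps each subMDP index i to a pair (lbres_i, ubres_i); each
    vector is indexed by j = 0, ..., Y-1 for the successor probabilities,
    and position Y holds the bound on the expected reward.  Returns a
    region as a map from parameter names to intervals."""
    region = {}
    for i, (lb, ub) in bounds.items():
        Y = len(lb) - 1
        for j in range(Y):
            region[f"p_{i}_{j}"] = (lb[j], ub[j])  # interval for p_{i,j}
        region[f"q_{i}"] = (lb[Y], ub[Y])          # interval for q_i
    return region

# Two subMDPs, each with one successor probability and a reward entry.
region = reg({1: ([0.2, 1.0], [0.9, 3.0]), 2: ([0.5, 0.5], [0.5, 2.0])})
# e.g., region["p_1_0"] == (0.2, 0.9) and region["q_2"] == (0.5, 2.0)
```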

With a suitable region, we can apply verification to the parametric MDP.

Lemma 2

Let R be a suitable region. Then:

$$\begin{aligned} \min _{u \in R} \textsf {ER}_{\nu (\mathcal {M})[u]}^\textsf {max}(\lozenge T) \le \textsf {ER}_{\mathcal {M}}^\textsf {max}(\lozenge T) \le \max _{u \in R} \textsf {ER}_{\nu (\mathcal {M})[u]}^\textsf {max}(\lozenge T). \end{aligned}$$

Proof sketch. We observe that the inequalities follow from the fact that \(u^* \in R\) with \(u^*\) as in Theorem 2. By that theorem, \(\textsf {ER}_{\nu (\mathcal {M})[u^*]}^\textsf {max}(\lozenge T) = \textsf {ER}_{\mu (\mathcal {M})}^\textsf {max}(\lozenge T)\). The statement then follows from Theorem 1.    \(\square \)

From the bounds that we can compute using a suitable region R, we then set \(\textsf {lb}\) and \(\textsf {ub}\) for Eq. (2):

$$\begin{aligned} \textsf {lb}:=\min _{u \in R} \textsf {ER}_{\nu (\mathcal {M})[u]}^\textsf {max}(\lozenge T) \quad \text { and }\quad \textsf {ub}:=\max _{u \in R} \textsf {ER}_{\nu (\mathcal {M})[u]}^\textsf {max}(\lozenge T). \end{aligned}$$
Computationally, we may use parameter lifting [33] to find these values.
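To build intuition for such region checking, consider the special case of a single uncertain branching probability with two successors: the value is linear in the parameter, so the inner optimum is attained at an interval endpoint. The toy sketch below (hypothetical names; real parameter lifting as in [33] is considerably more general) computes lower and upper reachability bounds by value iteration:

```python
def interval_vi(trans, goal, lower=True, iters=100):
    """trans maps a state to (succ_a, succ_b, lo, hi): with probability
    p in [lo, hi] we move to succ_a, otherwise to succ_b.  Returns a
    per-state bound on the probability of reaching goal."""
    states = set(trans) | goal
    for a, b, _, _ in trans.values():
        states |= {a, b}
    v = {s: (1.0 if s in goal else 0.0) for s in states}
    for _ in range(iters):
        for s, (a, b, lo, hi) in trans.items():
            if s in goal:
                continue
            # the optimum over p in [lo, hi] is attained at an endpoint
            cand = [p * v[a] + (1 - p) * v[b] for p in (lo, hi)]
            v[s] = min(cand) if lower else max(cand)
    return v

# Toy instance: from s0 we reach goal with some p in [0.4, 0.7], else a sink.
trans = {"s0": ("goal", "sink", 0.4, 0.7)}
lb = interval_vi(trans, {"goal"}, lower=True)["s0"]   # 0.4
ub = interval_vi(trans, {"goal"}, lower=False)["s0"]  # 0.7
```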

Refinement Loop. The complete anytime algorithm is summarized in Fig. 4. We start with an hMDP \(\mathcal {M}\) and extract the uncertain macro-MDP \(\nu (\mathcal {M})\) and the subMDPs \(\{\mathcal {M}_i\}\)Footnote 4. Furthermore, we compute (trivial) sound bounds \(\textsf {lbres}_i \le \textsf {res}_i \le \textsf {ubres}_i\). This yields a suitable region \([[u^{-}, u^{+}]] = \textsf {Reg}(\textsf {lbres}_1, \textsf {ubres}_1, \ldots )\). Then, we may at any time compute the bounds \(\textsf {lb}, \textsf {ub}\) on the expected reward in the hMDP \(\mathcal {M}\) by analysing \(\nu (\mathcal {M})\) on the region \([[u^{-}, u^{+}]]\). To tighten these bounds, we must first refine the suitable region. To this end, we analyse individual subMDPs \(\mathcal {M}_i\) and compute \(\textsf {res}_i\) and thus \(u^*(x)\) for \(x \in \bigcup _j p_{i,j} \cup \{ q_i \}\). This refines the suitable bounds such that \(u^{-}(x) = u^*(x) = u^{+}(x)\) for \(x \in \bigcup _j p_{i,j} \cup \{ q_i \}\). We call this refinement individual refinement. The new region is suitable and Theorem 2 ensures correctness of the refinement. As there are only finitely many subMDPs, we obtain \(\textsf {lb}= \textsf {ub}\) after finitely many steps.
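The loop admits a compact sketch (a minimal illustration with stubbed model-checking calls; in a real implementation, these calls would be delegated to a model checker such as Storm):

```python
def refine_individually(sub_ids, solve_submdp, check_macro, eps=0.0):
    """solve_submdp(i) yields the exact result for subMDP i; check_macro(res)
    model-checks the uncertain macro-MDP given the results known so far
    (None = only trivial bounds) and returns (lb, ub)."""
    res = {i: None for i in sub_ids}
    lb, ub = check_macro(res)
    for i in sub_ids:                 # individual refinement, one subMDP each
        res[i] = solve_submdp(i)      # collapses this subMDP's bounds
        lb, ub = check_macro(res)
        if ub - lb <= eps:            # anytime: stop once the gap is small
            break
    return lb, ub

# Toy macro-analysis: the value is the sum of subMDP values; an unsolved
# subMDP contributes the trivial interval [0, 1] to the bounds.
vals = {0: 0.25, 1: 0.5}
check = lambda res: (sum(v or 0.0 for v in res.values()),
                     sum(1.0 if v is None else v for v in res.values()))
lb, ub = refine_individually([0, 1], vals.get, check)
# lb == ub == 0.75 once both subMDPs are solved exactly
```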

figure f

4.3 Set-Based SubMDP Analysis

Next, we aim to provide an alternative refinement procedure that analyses a set of subMDPs at once, i.e., that refines the suitable bounds for a set of parameters at once. We denote the set of goal states for all subMDPs as GFootnote 5.

Adequate Abstractions. We aim to compute sound bounds on the results for a set of subMDPs such that the bounds are sound for every individual subMDP in this set. We generalize Definition 8 as follows: The (lower and upper) bounds \(\textsf {lbres}_{I}, \textsf {ubres}_{I}\) are sound, if they are sound (lower and upper) bounds for every \(\textsf {res}_i\), \(i \in I\).

Lemma 3

Let \(\textsf {lbres}_I\) satisfy the following inequalities for \(0 \le j < Y\):

$$\begin{aligned} \textsf {lbres}_{I}(Y) \le \min _{i} \textsf {ER}_{\mathcal {M}_i}^\textsf {max}(\lozenge G)\quad \text { and }\quad \textsf {lbres}_I(j) \le \min _i \min _\sigma \textsf {Pr}_{\mathcal {M}_i[\sigma ]}(\lozenge G).\end{aligned}$$

Then, \(\textsf {lbres}_I\) is a sound lower bound.

Proof sketch. We must show \(\textsf {lbres}_I \le \textsf {res}_i\) for each \(i \in I\). By definition, for each \(0 \le j \le Y\), \(\textsf {lbres}_I(j) \le \min _{i' \in I} \textsf {res}_{i'}(j)\) and trivially \(\min _{i' \in I} \textsf {res}_{i'}(j) \le \textsf {res}_i(j)\).    \(\square \)
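The argument of the proof sketch can be illustrated directly: given the exact result vectors, elementwise minima and maxima yield sound set-based bounds. (In practice, the \(\textsf {res}_i\) are of course unknown; the point of Sect. 4.4 is to bound these extrema without computing each \(\textsf {res}_i\). The helper name below is illustrative.)

```python
def set_based_bounds(res_vectors):
    """Elementwise minima/maxima over the result vectors of the subMDPs
    in I give sound set-based bounds lbres_I and ubres_I."""
    lb = [min(col) for col in zip(*res_vectors)]
    ub = [max(col) for col in zip(*res_vectors)]
    return lb, ub

# Result vectors of two subMDPs: two successor probabilities and a reward.
lb, ub = set_based_bounds([[0.2, 0.8, 1.5], [0.4, 0.6, 2.0]])
# lb == [0.2, 0.6, 1.5] and ub == [0.4, 0.8, 2.0]
```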

We omit the analogous statement for \(\textsf {ubres}\)Footnote 6. In Sect. 4.4, we discuss a particular approach to obtain these bounds, i.e., the right-hand sides of the inequalities in Eq. (5). Here, we update the algorithm sketch to handle this alternative refinement.

Remark 3

We cannot compute the optimal policy \(\sigma _i\) for the subMDP \(\mathcal {M}_i\) in this setting. Thus, we must compute probability bounds for all policies, which may make these bounds weak. Some optimizations are possible, as some actions can in fact be excluded. More important, however, is that in the cases covered by Proposition 1, the policy \(\sigma _i\) is irrelevant.

Updated Algorithm. We update the loop from Fig. 4: rather than refining using a single \(i\), we refine using a set \(I\). Instead of \(\textsf {res}_i\), we use Lemma 3 to compute sound bounds \(\textsf {lbres}_I, \textsf {ubres}_I\) and call this set-based refinement. We may set \(\textsf {lbres}_i = \textsf {lbres}_I\) for each \(i \in I\). Then, we can compute a new suitable region via Lemma 1. With the suitable region, we can still utilise Eq. (4) to compute an approximation \([\textsf {lb}, \textsf {ub}]\). However, for completeness we must ensure that if \(|I|=1\), the upper and lower bounds coincide, i.e., \(\textsf {lbres}_{\{i\}} = \textsf {ubres}_{\{ i\}}\) for every \(i\). This can be ensured by using individual subMDP refinement when \(|I|=1\).

figure g

We now first discuss the set-based analysis of multiple subMDPs \(\mathcal {M}_i\). We clarify the realization of the loop box in Sect. 5.

Fig. 5.
figure 5

Analysing hMDPs with set-based refinement on templated subMDPs.

4.4 Templates for Set-Based subMDP Analysis

We present an instance of set-based subMDP analysis where the subMDPs can be described as instantiations of parametric MDPs.

Parametric Templates. We observe that the subMDPs are often similar: e.g., they may model sending a file over a channel or exploring a room, under varying conditions. We capture this similarity as follows: Let \(\{ \mathcal {T}_1, \ldots \mathcal {T}_m \}\) be a set of parametric MDPs, where we call each pMDP a template. In particular, for a hierarchical MDP \(\mathcal {M}\) with partitioning \({\textbf {S}}_1, \ldots {\textbf {S}}_n\) and corresponding subMDPs \(\mathcal {M}_1,\ldots , \mathcal {M}_n\), a subMDP \(\mathcal {M}_i\) is an instantiation of template \(\mathcal {T}_j\) with parameter instantiation \(v\)Footnote 7, if \(\mathcal {M}_i = \mathcal {T}_{j}[v]\). For a concise description, this paper considers hMDPs over a single template \(\mathcal {T}\) and, for any \(I \subseteq \mathbb {I}\), we denote by \(V_I :=\{ v_i \mid i \in I \}\) the finite (multi)set of parameter instantiations for the pMDP \(\mathcal {T}\) such that \(\mathcal {T}[v_i] = \mathcal {M}_i\).

Abstractions from Templates. In terms of the templates, Lemma 3 requires us to bound the expected rewards \(\textsf {ER}_{\mathcal {T}[v]}^\textsf {max}(\lozenge G)\) for all \(v\in V_I\). We realize this by defining the smallest region \(\textsf {toRegion}(V_I) \supseteq V_I\). For this region, we obtain a lower bound on these expected rewards by computing the minimal maximal reward over \(\textsf {toRegion}(V_I)\). That is:

$$\begin{aligned} \textsf {lbres}_I(Y) :=\min _{v \in \textsf {toRegion}(V_I)} \textsf {ER}_{\mathcal {T}[v]}^\textsf {max}(\lozenge G) \quad \le \quad \min _i \textsf {ER}_{\mathcal {M}_i}^\textsf {max}(\lozenge G). \end{aligned}$$

We handle the probabilities in the same way, additionally taking into account the quantification over the policies. Following Lemma 3, these bounds are sound. Upper bounds are handled analogously. Computationally, we again use parameter lifting [33] to find these bounds. Refinement is straightforward: whenever we split \(I\) (equivalently, \(V_I\)), we can compute (potentially) smaller regions \(\textsf {toRegion}(V_I)\).
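A minimal sketch of \(\textsf {toRegion}\) (treating valuations as dictionaries from parameter names to values; the function name mirrors the paper's notation, the rest is illustrative):

```python
def to_region(valuations):
    """Smallest axis-aligned region (one interval per parameter) that
    encloses a finite set of parameter valuations."""
    params = valuations[0].keys()
    return {x: (min(v[x] for v in valuations),
                max(v[x] for v in valuations)) for x in params}

# Three instantiations of a template with parameters p and q.
V_I = [{"p": 0.1, "q": 0.7}, {"p": 0.3, "q": 0.5}, {"p": 0.2, "q": 0.9}]
region = to_region(V_I)   # {"p": (0.1, 0.3), "q": (0.5, 0.9)}
```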

In Fig. 5, we depict our method. In contrast to Fig. 4, we pass the template \(\mathcal {T}\) rather than the individual subMDPs. Furthermore, we now compute initial sound bounds via the analysis of the template (i.e., of \(V_I\)) and must pass the mapping from I to \(V_I\) to clarify the shape of the subMDPs.

figure h
figure i

5 Implementing the Abstraction-Refinement Loop

Algorithm 1 outlines a basic implementation of the idea sketched in Fig. 5. We detail this implementation and then discuss an essential improvement.

We construct \(\nu (\mathcal {M})\), \(\mathcal {T}\), and the (implicit) mapping \(V :\mathbb {I}\rightarrow V_\mathbb {I}\) that maps subMDPs to instantiations of \(\mathcal {T}\) from a suitable high-level representation. We initialize a priority queue with triples that represent sets of template instantiations: index sets \(I\) such that \(V_I :=\{ v_i :=V(i) \mid i \in I\}\) contains all valuations \(v\) for which \(\mathcal {T}[v]\) is a subMDP of \(\mathcal {M}\). For each triple, we initially store bounds reflecting \(\textsf {lbres}_I\) and \(\textsf {ubres}_I\) as well as weights for the computation of the priority (see below). Initially, we assume that \(\textsf {lb}=0\) and \(\textsf {ub}= \infty \); we count the number of iterations in \(\#\text {iter}\). \(\text {Res}\) is a map for storing result vectors. The algorithm now refines \(\textsf {lb}\) and \(\textsf {ub}\) until the gap between them is sufficiently small.

The main loop now iteratively refines \(\textsf {lb},\textsf {ub}\) by first refining \(\textsf {lbres}_I\) and \(\textsf {ubres}_I\), splitting \(I\) and model checking \(\mathcal {T}\) w.r.t. successively smaller regions \(\textsf {toRegion}(V_I)\) (l. 5-11): To this end, we take a set R from the queue. If \(R.I =\{ i \}\) is a singleton, we compute \(\textsf {lbres}_{R.I} = \textsf {res}_{i} = \textsf {ubres}_{R.I}\) and store this result. Otherwise, we apply model checking to the pMDP \(\mathcal {T}\) w.r.t. the region representation of R.I. We then split R.I into (here) two subsets. For splitting \(I\), we use the geometric interpretation of \(\textsf {toRegion}(V_I)\) as a subset of \(\mathbb {R}^{|\vec {y}|}\) and split along one of the axes into two equally large subsets. Every k (we use \(k=8\)) iterations, we analyse the macro-MDP (l. 12-15). We extract the proper bounds \(\textsf {lbres}_i, \textsf {ubres}_i\): from \(\text {Res}[i]\) if possible, and otherwise from Q using \(R.\text {bounds}\) for the R with \(i \in R.I\). Then, via \(\textsf {Reg}(\textsf {lbres}_1, \textsf {ubres}_1,\ldots )\) from Lemma 1, we compute a suitable region \(R'\). We analyse the uncertain macro-MDP to obtain \(\textsf {lb}\) and \(\textsf {ub}\) in accordance with Eq. (4).
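The geometric split can be sketched as follows (a simplified illustration under the same dictionary representation of valuations as before; we split at the midpoint of the widest axis of the enclosing region, which yields two geometrically equal halves):

```python
def split(valuations):
    """Split a set of valuations into two subsets at the midpoint of the
    widest axis of their enclosing region."""
    params = valuations[0].keys()
    lo = {x: min(v[x] for v in valuations) for x in params}
    hi = {x: max(v[x] for v in valuations) for x in params}
    axis = max(params, key=lambda x: hi[x] - lo[x])   # widest dimension
    mid = (lo[axis] + hi[axis]) / 2
    left = [v for v in valuations if v[axis] <= mid]
    right = [v for v in valuations if v[axis] > mid]
    return left, right

V = [{"p": 0.1, "q": 0.4}, {"p": 0.6, "q": 0.45}, {"p": 0.9, "q": 0.5}]
left, right = split(V)    # widest axis is p; split at p = 0.5
```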

Finally, we discuss the priority function: if we naively assume a priori that each subMDP contributes an equal amount to the overall minimal expected reward in the hMDP (i.e., all weights are one), then the priority function \(|R.\text {bounds}| \cdot \sum _{v \in V_{R.I}} \text {R.weights}(v)\) computes priorities that correlate with how much computing \(\textsf {res}_i\) for all \(i \in R.I\) would reduce the gap between \(\textsf {lb}\) and \(\textsf {ub}\).

Termination and Correctness Argument. Algorithm 1 terminates: we split in such a way that \(\max _{I \in Q} |I|\) monotonically decreases. Thus, eventually Q is empty and \(\text {Res}\) contains results for all subMDPs. Then, \(R'\) is a point region and checking \(\nu (\mathcal {M})\) with this point region ensures that \(\textsf {lb}= \textsf {ub}\). Correctness follows as \(R'\) is always suitable, see Eq. (4).

Computing Expected Visits. Based on our empirical evaluation, we added one crucial improvement: while the algorithm above assumed that all subMDPs (or states in the macro-MDP) are equally important, that assumption is generally inadequate. Roughly, only states reached by the optimal policy contribute at all (provided the bounds are tight enough to identify these states). The reachable states are weighted by the expected number of visits of these states. We compute an approximation of this expected number of visits by computing the currently optimizing policy (a by-product of l. 13) and the center of \(R'\); this yields an MC for which we can compute the expected number of visits by a standard equation system [32]. Additionally, we update the weights for the regions in the queue based on these new results. We remark that this also makes the priority function more useful.
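The expected number of visits satisfies the standard equation system \(x = e_\text {init} + Q^T x\), where \(Q\) is the transient part of the transition matrix. A minimal sketch (a toy iterative solver; a model checker would solve the linear system directly, cf. [32]):

```python
def expected_visits(P, init, absorbing, iters=200):
    """Approximate the expected number of visits per state of an absorbing
    Markov chain by unrolling x = e_init + Q^T x.
    P is a row-stochastic matrix given as a list of lists."""
    n = len(P)
    x = [1.0 if s == init else 0.0 for s in range(n)]
    for _ in range(iters):
        x = [(1.0 if s == init else 0.0)
             + sum(P[t][s] * x[t] for t in range(n) if t not in absorbing)
             for s in range(n)]
    return x

# Toy chain: state 0 loops with probability 0.5, else moves to the
# absorbing state 1, so state 0 is visited 1/(1-0.5) = 2 times on average.
P = [[0.5, 0.5], [0.0, 1.0]]
x = expected_visits(P, init=0, absorbing={1})
```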

Interleaving Individual Refinement. Furthermore, subMDPs for which the expected number of visits is largeFootnote 8 are analysed individually (and their points are removed from the regions in the queue). This optimization reduces the need to split the corresponding regions until we obtain tight bounds.

6 Experiments

Implementation. We implemented level-upFootnote 9, a prototype on top of the python bindings for Storm [20]. level-up analyses hierarchical MDPs by taking two MDPs, each provided as a probabilistic program description in the PRISM format: one MDP that encodes the (uncertain) macro-MDP and one that describes the parametric template for the subMDPs. The parameter instance of a subMDP can be deduced as a function of the high-level variable assignment of the macro-MDP states. For technical reasons, the prototype currently supports subMDPs with one or two successor states – arguably the setting in which we expect our prototype to perform best. For subMDPs with a single successor state, the uncertain macro-MDP may be represented as a (parameter-free) MDP with interval-valued rewards. For two successors, we include support for the extension of Sect. 3.3, where the successor aims to optimize reaching a fixed successor state.

Table 1. Benchmark statistics, runtimes of the approaches, and details for Algorithm 1.

Setup. We investigate the scalability and the quality of the approximation over time. To this end, we run our prototype on a MacBook (2020, M1) with an 8 GB RAM limit. We compare the enumerative baseline from Sect. 4.1 with Algorithm 1; both exploit the hierarchical nature of the MDP. We qualitatively compare to standard model checking on the flat MDP, see below. We use a collection of benchmarks reflecting networks, job schedulers, and robots.

Results. We consider instances that we summarize in Table 1. In particular, we give the benchmark name and instance for reference, the approximate number of states in the hierarchical MDP (computed from the macro-MDP and the subMDPs), the number of nontrivial partitions, and the number of states and actions in the (uncertain) macro-MDP and subMDPs, respectively. Then, we give the time \(t_\text {init}\) (in seconds) to set up the data structures from the high-level representation. We highlight that a flat representation of all our benchmarks has at least \(10^7\) states, often more. As a reference, we present the performance of the enumerative baseline from Sect. 4.1. The performance of this approach is positive, as it enables the verification of huge MDPs. A TO indicates \({>}1200\) s. To scale to either larger subMDPs or more subMDPs, we use the abstraction-refinement loop. To reflect its anytime nature, we list three run times for terminating when \(\eta \cdot \textsf {ub}\le \textsf {lb}\) with \(\eta \in \{ 0.5, 0.9, 0.95 \}\), respectively. The largest time faster than the enumerative baseline is highlighted (further to the right is better for the abstraction-refinement). For \(\eta =0.95\), we give details: the number of iterations (iter), the number of individual refinements based on the improvement from Sect. 5, and the fraction of time spent on model checking the uncertain macro-MDPs \(\%_\text {um}\), the set-refinements \(\%_\text {sr}\), and the individual refinements \(\%_\text {ir}\), respectively.

Discussion. Before we discuss details of the results, let us clarify that exploiting the hierarchical structure is essential. MDPs with \({\approx }10^8\) states are at the limit of what fits in around 8 GB of memoryFootnote 10. Symbolic methods based on MTBDDs easily scale beyond these sizes, but—noting that the subMDPs are all slightly different—the models we consider lack the symmetry necessary to make MTBDDs compact. Thus, support for hierarchical MDPs is a necessary step forward.

Regarding the abstraction-refinement: while a larger study may be necessary, we can start with two general observations. First, the abstraction-refinement loop is significantly faster for \(\eta \le 0.9\); as \(\eta \rightarrow 1\), coarse abstractions are insufficient. Second, the efficiency of the abstraction-refinement heavily depends on the particular structure. That being said, the approach outperforms the enumerative approach, especially for \(\eta = 0.9\), by up to more than an order of magnitude. This happens even if \(\mathbb {I}\) is rather small, or if, e.g., \(\mathcal {T}\) is small. We furthermore observe that for large \(\mathbb {I}\), the bookkeeping in python becomes a bottleneck. We consider these observations promising: we left many options for further optimizations and tweaking towards particular examples on the table. However, for models where most time is spent on model checking the macro-level MDP, the approach is less suitable. We furthermore conjecture that tailored algorithms may exploit some of these dimensions, e.g., when the macro-MDP or the subMDPs are indeed MCs or perhaps acyclic, depending on the number of parameters and their influence [36], or based on the relative weight of the uncertain rewards compared to the rewards in the macro-MDP.

7 Related Work

In the model-free reinforcement learning (RL) setting, hierarchical models are popular; an excellent, recent survey is given in [29]. Our work generalizes solution techniques for hierarchical MDPs that assume that all subMDPs are the same. In RL, this assumption is treated liberally, and the methods provide only weak error bounds. In contrast, our model-based approach provides error bounds in every step, and the error disappears in finitely many steps.

Hierarchical abstractions are used to analyse large MDPs in [5]. There, the goal is to find a policy that almost optimizes the reward. Rather than imposing a hierarchy upfront, the algorithm aims to find a hierarchy and define the goal states of the subMDPs such that the model admits local policies. Instead, our solution can find the optimal policy and in particular gives strict error bounds, at the cost of requiring a high-level model that induces the hierarchy. A symbolic approach for continuous MDPs, where the transition probabilities are the result of an associated LP, has recently been discussed in [24]. A hierarchical SCC-decomposition [1] aims to accelerate the process of solving a (given, monolithic) Markov chain. The computation of reward-bounded properties [18] generalizes topological value iteration; their notion of episodes mildly resembles a hierarchical approach, but no uncertainty is assumed or used in the approach. The probabilistic model checker PAT [35] analyses a hierarchical probabilistic timed automaton given as a process algebra term; the hierarchy is not exploited in the solving process.

While symbolic approaches, often based on decision diagrams, exploit the structure of the transition system by compressing the data structures, abstractions aim to yield smaller systems whose analysis approximates the sought-for values. Abstraction-refinement without an imposed hierarchy is explored in [16, 21, 25]: refinement amounts to considering a better approximation of the state space. In contrast, we impose the hierarchy, the abstraction amounts to an imprecise analysis of this fixed state space, and we refine by analysing the state space more precisely (by means of analysing subMDPs at a greater level of detail). Contract-based abstractions (in probabilistic systems) are used to decompose the analysis of systems given by parallel running subsystems [14, 28, 38]. Partial exploration and bounded model checking approaches focus on the most critical paths, i.e., the paths where most of the probability mass lies [7, 23, 26], but these approaches generally do not exploit the hierarchical and repetitive structure. The observation that many parts of the system are not critical allows us to weigh the potential benefit of refining the intervals in various parts of the macro-MDP.

Parametric MDPs are commonly used to model and analyse the effects of uncertainty in the precise transitions [15, 23, 31]. The methods presented in [13, 22] exploit a repetitive structure in parametric MCs to accelerate the construction of closed form solutions and are not applicable to MDPs. Parametric models have been used to support the design of systems [2, 8] or their adaption [6, 9], to find policies for partially observable systems [11], to analyse Bayesian networks [34], and to speed up the analysis of, e.g., software product lines [10, 37]. On top of technical differences, none of these approaches uses a hierarchical decomposition of an MDP or uses the results of the analysis in the analysis of a larger MDP.

8 Conclusion

This paper presents a first verification approach that exploits a specific hierarchical structure, natural in many models, to accelerate analysing the underlying MDP. An essential ingredient is to separate the two levels of the hierarchy. Then, when analysing the (top-level) macro-MDP, we may treat subMDPs that have not yet been analysed as epistemic uncertainty. Analysis techniques for uncertain (more precisely: parametric) MDPs then enable an online approximation loop that incrementally removes uncertainty in a targeted fashion by analysing more and more subMDPs (more) precisely. Three clear directions for future work are to (i) consider an approach that lifts the restriction to locally-optimal policies, (ii) investigate the applicability to a richer set of temporal properties, and (iii) allow automatic detection of partitions in, e.g., the Prism language.