
1 Introduction

MDP Model Checking and Value Iteration. Markov decision processes (MDPs) are the standard model for sequential decision making in stochastic settings. A standard question in the verification of MDPs is: what is the maximal probability that an error state is reached? MDP model checking is an active topic in the formal verification community. Value iteration (VI) [44] is an iterative and approximate method whose performance in MDP model checking is well-established [11, 29, 30]. Several extensions with soundness have been proposed; in addition to under-approximations, they also provide over-approximations with a desired precision [4, 24, 30, 43, 46], so that an approximate answer comes with an error bound. These sound algorithms are implemented in mature model checkers such as Prism [37], Modest [27], and Storm [32].

Compositional Model Checking. Even with these state-of-the-art algorithms, it is a challenge to model check large MDPs efficiently with high precision. Experiments show that MDPs with more than \(10^8\) states are too large for these algorithms [35, 53, 54]—they simply do not fit in memory. However, such large MDPs often arise as models of complicated stochastic systems, e.g. in the domains of networking and robotics. Furthermore, even small models may be numerically challenging to solve due to their structure [4, 24, 29].

Compositional model checking is a promising approach to tackling this scalability challenge. Given a compositional structure of a target system, compositional model checking executes a divide-and-conquer algorithm that avoids loading the entire state space at once, often solving the above memory problem. Moreover, reusing the model checking results for components can lead to speed-ups by orders of magnitude. Although finding a suitable compositional structure for a given “monolithic” MDP is still an open problem, many systems come with such a compositional structure a priori. For example, such compositional structures are often assumed in robotics and referred to as hierarchical models [5, 23, 31, 35, 40, 48, 51].

Fig. 1. Open MDPs \(\mathcal {A}\) and \(\mathcal {B}\).

Recently, string diagrams of MDPs were introduced for compositional model checking [53, 54]; the current paper adopts this formalism. There, MDPs are extended with (open) entrances and exits (Fig. 1), and they are composed by sequential composition and sum \(\oplus \). See Fig. 2, where the right-hand sides are simple juxtapositions of graphs (wires get connected in sequential composition). This makes the formalism focus on sequential (as opposed to parallel) composition. This restriction eases the design of compositional algorithms; yet the formalism is rich enough to capture the compositional structures of many system models.

Current Work: Compositional Value Iteration. In this paper, we present a compositional value iteration (CVI) algorithm that solves reachability probabilities of string diagrams of MDPs, operating in a divide-and-conquer manner along compositional structures. Our approximate VI algorithm comes with soundness—it produces error bounds—and exploits compositionality for efficiency.

Specifically, for soundness, we lift the recent paradigm of optimistic value iteration (OVI) [30] to the current compositional setting. We use it both for local (component-level) model checking and—in one of the two global VI stopping criteria that we present—for providing a global over-approximation.

For efficiency, firstly, we adopt a top-down compositional approach where each component is model-checked repeatedly, each time on a different weight \(\textbf{w}\), in a by-need manner. Secondly, in order to suppress repetitive computation on similar weights, we introduce a novel technique of Pareto caching that allows “approximate reuse” of model checking results. This closely relates to multi-objective probabilistic model checking [17, 20, 45], without the explicit goal of building Pareto curves. Our Pareto caching also leads to another (sound) global VI stopping criterion that is based on the approximate bottom-up approach [54].

Our algorithm is approximate (unlike the exact one in [53]), and top-down (unlike the bottom-up approximate one in [54]). Experimental evaluation demonstrates its performance thanks to the combination of these two features.

Contributions and Organization.  We start with an overview (Sect. 2) that presents graphical intuitions. After formalizing the problem setting in Sect. 3, we move on to describe our technical contributions:

  • compositional value iteration for string diagrams of MDPs where VI is run in a top-down and thus by-need manner (Sect. 4.2),

  • the Pareto caching technique for reusing results for components (Sect. 5.2),

  • two global stopping criteria that ensure soundness (Sect. 6).

We evaluate and discuss our approach through experiments (Sect. 7), review related work (Sect. 8), and conclude this paper (Sect. 9).

Fig. 2. Sequential composition and sum \(\mathcal {A}\oplus \mathcal {B}\) of open MDPs. The framework is bidirectional (edges can be left- and right-ward); thus loops can arise under sequential composition.

Notations. For a natural number m, we write [m] for \(\{1, \dots , m\}\). For a set X, we write \(\mathcal {D}(X)\) for the set of distributions on X. For sets X, Y, we write \(X\uplus Y\) for their disjoint union and \(f:X\rightharpoonup Y\) for a partial function f from X to Y.

2 Overview

This section illustrates our take on CVI with so-called Pareto caches using graphical intuitions. We describe MDPs as string diagrams over so-called open MDPs (oMDPs) [53]. Open MDPs, such as \(\mathcal {A}, \mathcal {B}\) in Fig. 1, extend MDPs with open ends (entrances and exits). We use two operations, sequential composition and sum \(\oplus \); see Fig. 2. That figure also illustrates the bidirectional nature of the formalism: arrows can point left and right; thus acyclic MDPs can create cycles when composed. String diagrams come from category theory (see [53]) and are used in many fields of computer science [8, 9, 22, 52].

2.1 Approximate Bottom-Up Model Checking

The first compositional model checking algorithm for string diagrams of MDPs, given in [53], is exact. Subsequently, an approximate compositional model checking algorithm was proposed in [54]. It is the basis of our algorithm, and we review it here. Consider, for illustration, the sequential composition in Fig. 3, where the exit \(o_{3}\) is the target. The algorithm from [54] proceeds in the following bottom-up manner.

Fig. 3.

First Step: Model Checking Each Component. Firstly, model checking is conducted for component oMDPs \(\mathcal {A}\) and \(\mathcal {B}\) separately, which amounts to identifying an optimal scheduler for each. At this point, however, it is unclear what constitutes an optimal scheduler:

Example 1

In the oMDP \(\mathcal {A}\) in Fig. 3, suppose the reachability probabilities \(\bigl (\,\textrm{RPr}^{\sigma _{1}}(i_1\rightarrow o_1),\,\textrm{RPr}^{\sigma _{1}}(i_1\rightarrow o_2)\bigr )\) are (0.2, 0.7) under a scheduler \(\sigma _{1}\), and (0.6, 0.2) under another scheduler \(\sigma _{2}\). One cannot tell which scheduler (\(\sigma _{1}\) or \(\sigma _{2}\)) is better for the global objective (i.e. reaching \(o_{3}\) in the composite) since \(\mathcal {B}\) is a black box.

Concretely, the context of \(\mathcal {A}\) is unknown. Therefore, we have to compute all candidate optimal schedulers instead of a single one. This set is given, for each component \(\mathcal {C}\) and each entrance i of \(\mathcal {C}\), by

$$\begin{aligned} \bigl \{\,\text {schedulers }\sigma \,\big |\, \bigl (\textrm{RPr}^{\sigma }(i\rightarrow o)\bigr )_{o: \mathcal {C}\text { 's exit}} \text { is Pareto optimal}\,\bigr \}. \end{aligned}$$
(1)
Fig. 4. Pareto-optimal points.

Here, Pareto optimality is the usual notion from multi-objective model checking (e.g. [17, 41]): it means that there is no scheduler \(\sigma '\) that dominates \(\sigma \), in the sense that \(\textrm{RPr}^{\sigma }(i\rightarrow o) \le \textrm{RPr}^{\sigma '}(i\rightarrow o)\) holds for each o and \(<\) holds for some o. The two points from the example are plotted in Fig. 4.
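This dominance check is mechanical; the following minimal sketch (our own illustration, using the two points of Ex. 1) spells it out:

```python
# Dominance between points in [0, 1]^O, as used for Pareto optimality:
# p' dominates p iff p'(o) >= p(o) for every exit o, with > for some o.
def dominates(p_prime, p):
    return all(a >= b for a, b in zip(p_prime, p)) and any(
        a > b for a, b in zip(p_prime, p))

# The two points from Ex. 1: neither dominates the other, so both
# schedulers remain candidates for Pareto optimality.
p1, p2 = (0.2, 0.7), (0.6, 0.2)
```

Since neither point dominates the other, both \(\sigma _{1}\) and \(\sigma _{2}\) must be kept in the set (1).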

Fig. 5. Approximations \((L_{i_1}, U_{i_1})\).

The Pareto curve—the set of points \(\bigl (\textrm{RPr}^{\sigma }(i\rightarrow o)\bigr )_{o}\) for the Pareto-optimal schedulers \(\sigma \) in (1)—will look like the dashed blue line in Fig. 4. The solid blue line is realizable by convex combinations of the schedulers \(\sigma _1\) and \(\sigma _2\); it always lies below the Pareto curve.

The algorithm in [54] computes guaranteed under- and over-approximations (L, U) of the Pareto-optimal points (1) for every open MDP. See Fig. 5; the green area indicates the under-approximation, and the red area is the complement of the over-approximation, so that all Pareto-optimal points are guaranteed to lie in the gap (white) between them. These approximations are obtained by repeated application of (optimistic) value iteration on the open MDPs, a standard approach for verifying MDPs based on [20, 47]. We formalize these notions in Sect. 5.1.

Second Step: Combination along Sequential Composition. The second (inductive) step of the bottom-up algorithm in [54] is to combine the results of the first step—approximations as in Fig. 5 and the corresponding (near-)optimal schedulers (1) for each component \(\mathcal {C}\)—along the operations in a string diagram.

Here we describe this second step through the example in Fig. 3. It computes reachability probabilities

$$\begin{aligned} \textrm{RPr}^{\sigma ,\tau }(i_{1}\rightarrow o_{3}) = {}&\textrm{RPr}^{\sigma }(i_{1}\rightarrow o_{1}) \cdot \textrm{RPr}^{\tau }(i_{2}\rightarrow o_{3})\\ &+ \textrm{RPr}^{\sigma }(i_{1}\rightarrow o_{2}) \cdot \textrm{RPr}^{\tau }(i_{3}\rightarrow o_{3}) \end{aligned}$$
(2)

for each combination of Pareto-optimal schedulers \(\sigma \) (for \(\mathcal {A}\)) and \(\tau \) (for \(\mathcal {B}\)), in order to find which combinations of \(\sigma ,\tau \) are Pareto optimal for the composite.

The equality (2)—called the decomposition equality in [53]—enables compositional reasoning on Pareto-optimal points and on their approximations: Pareto-optimal schedulers for the composite can be computed from those for \(\mathcal {A}\) and \(\mathcal {B}\). This compositional reasoning can be exploited for performance. In particular, when the same component \(\mathcal {A}\) occurs multiple times in a string diagram \(\mathbb {D}\), the model checking result of \(\mathcal {A}\) can be reused multiple times.
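To make the second step concrete, here is a small sketch that evaluates the decomposition equality (2) for every scheduler combination; the values for \(\mathcal {A}\) are those of Ex. 1, while the values for \(\mathcal {B}\)'s schedulers are hypothetical numbers of our own:

```python
# Combining component results along sequential composition via the
# decomposition equality (2). Since o3 is the only global exit, Pareto
# optimality for the composite reduces to maximizing a single value.
scheds_A = {"sigma1": (0.2, 0.7), "sigma2": (0.6, 0.2)}  # (i1->o1, i1->o2)
scheds_B = {"tau1": (0.8, 0.3), "tau2": (0.4, 0.9)}      # (i2->o3, i3->o3)

def compose(p_a, p_b):
    # Equality (2): RPr(i1 -> o3) of the composite.
    return p_a[0] * p_b[0] + p_a[1] * p_b[1]

values = {(sa, sb): compose(pa, pb)
          for sa, pa in scheds_A.items()
          for sb, pb in scheds_B.items()}
best = max(values, key=values.get)
```

With these illustrative numbers, the best combination pairs \(\sigma _{1}\) with the second scheduler of \(\mathcal {B}\), which is exactly the kind of context dependence that Ex. 1 points out.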

2.2 Key Idea I: From Bottom-Up to Top-Down

The bottom-up approach computes the Pareto curves independently of the context of each open MDP. Our first key idea is to move from bottom-up to top-down, a direction followed by other compositional techniques too; see Sect. 8.

Fig. 6.

For illustration, consider the sequential composition in Fig. 6, which concretizes the component \(\mathcal {B}\) of Fig. 3. For this \(\mathcal {B}\), it follows that \(\textrm{RPr}(i_{2}\rightarrow o_{3})=0.8\) and \(\textrm{RPr}(i_{3}\rightarrow o_{3})=0.3\). Therefore the equality (2) boils down to

$$\begin{aligned} \textrm{RPr}^{\sigma }(i_{1}\rightarrow o_{3}) = 0.8\cdot \textrm{RPr}^{\sigma }(i_{1}\rightarrow o_{1}) + 0.3\cdot \textrm{RPr}^{\sigma }(i_{1}\rightarrow o_{2}). \end{aligned}$$
(3)

The equation (3) is a significant simplification compared to (2):

  • in (2), since the weight \(\bigl (\textrm{RPr}^{\tau }(i_{2}\rightarrow o_{3}),\textrm{RPr}^{\tau }(i_{3}\rightarrow o_{3})\bigr )\) is unknown, we must compute multidimensional Pareto curves as in Figs. 4 and 5;

  • in (3), since the weight is known to be (0.8, 0.3), we can solve the equation using standard single-objective model checking.

Exploiting this simplification is our first key idea. We introduce a systematic procedure for deriving weights (such as (0.8, 0.3) above) that uses the context of an oMDP, i.e., it goes top-down along the string diagram. The procedure works for bi-directional sequential composition (and thus for loops, cf. Fig. 2), not only for the uni-directional one in Fig. 6. In the procedure, we first examine the context of a component \(\mathcal {C}\), approximate a weight \(\textbf{w}\) for \(\mathcal {C}\), and then compute maximum weighted reachability probabilities in \(\mathcal {C}\). We formalize the approach in Sect. 4.2.
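The reduction to a single-objective problem can be sketched in a few lines: once the context supplies a weight such as (0.8, 0.3), standard value iteration on the component suffices. The tiny oMDP below is a hypothetical stand-in for \(\mathcal {A}\), not a model from the paper:

```python
# Single-objective VI on a component once the exit weight w is known.
# State "i1" has two actions; exits are sinks whose value is the weight.
P = {  # state -> action -> list of (successor, probability)
    "i1": {"a": [("o1", 0.5), ("o2", 0.5)],
           "b": [("o1", 0.9), ("i1", 0.1)]},
}
EXIT_WEIGHT = {"o1": 0.8, "o2": 0.3}  # the weight w suggested by the context

def weighted_vi(P, exit_weight, iters=100):
    f = {s: 0.0 for s in P}
    f.update(exit_weight)  # exits keep their weight throughout
    for _ in range(iters):
        for s in P:  # Bellman update, Gauss-Seidel style
            f[s] = max(sum(p * f[t] for t, p in succ)
                       for succ in P[s].values())
    return f

f = weighted_vi(P, EXIT_WEIGHT)  # f["i1"] approaches 0.8 (via action "b")
```

No Pareto curve is drawn here: the known weight turns the multi-objective component analysis into one maximum weighted reachability query.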

The potential performance advantage over the bottom-up algorithm in [54] is apparent from Fig. 6. Specifically, the bottom-up algorithm draws a complete picture of the Pareto-optimal points (such as Fig. 5) once and for all, but a large part of this complete picture may never be used. In contrast, the top-down algorithm draws the picture in a by-need manner, for a weight \(\textbf{w}\) only when that weight is suggested by the context.

Fig. 7. Top-down approximation.

The top-down approximation of Pareto-optimal points is illustrated in Fig. 7. Here a weight \(\textbf{w}\) is the normal vector of the blue lines; the figure shows a situation after considering two weights.

Fig. 8. An example string diagram.

2.3 Key Idea II: Pareto Caching

Our second key idea (Pareto caching) arises when we combine the previous idea (top-down compositionality) with the key advantage of the bottom-up approach [54], namely exploiting duplicates. For motivation, consider the string diagram in Fig. 8, where we designate the multiple occurrences of \(\mathcal {A}\) by \(\mathcal {A}_{1}, \mathcal {A}_{2}, \mathcal {A}_{3}\), from left to right, for distinction.

Let us run the top-down algorithm. The component \(\mathcal {E}\) suggests the weight (0.8, 0.3) for the two exits of \(\mathcal {A}_{3}\), and \(\mathcal {D}\) suggests the weight (0.2, 0.7) for the exits of \(\mathcal {A}_{2}\). Since \(\mathcal {A}_{2}\) and \(\mathcal {A}_{3}\) are identical, the weighted optimization results for these two weights can be combined, leading to a picture like Fig. 7.

Now, in Fig. 8, we go on to the component \(\mathcal {B}\). It suggests the weight (0.75, 0.3).

  • In the bottom-up approach [54], performance advantages are brought by exploiting duplicates, that is, by reusing the model checking result of a component \(\mathcal {C}\) for its multiple occurrences.

  • Therefore, also here, we wish to use the previous analysis results for \(\mathcal {A}\)—for the weights (0.8, 0.3) and (0.2, 0.7)—for the weight (0.75, 0.3).

  • Intuitively, (0.75, 0.3) seems close enough to (0.8, 0.3), suggesting that we can use the previously obtained result for (0.8, 0.3).

But this raises the following questions: what does it mean for two weights to be “close enough”? Is (0.75, 0.3) really closer to (0.8, 0.3) than to (0.2, 0.7)? Can we bound the errors—much like in Sect. 2.1—that arise from this “approximate reuse”?

Fig. 9. Pareto caching.

In Sect. 5.2, we use the existing theory of Pareto curves in multi-objective model checking [17, 20, 45] to answer these questions. Intuitively, the previous analysis result (red and green regions) is queried on a new weight \(\textbf{w}\) (the normal vector of the blue lines), as illustrated in Fig. 9. We call this answering of weighted reachability queries based on the Pareto curve Pareto caching. The technique can avoid many invocations of VI for computing the weighted reachability for \(\textbf{w}\). The distance between the under- and over-approximations computed this way can be large; if so (a “cache miss”), we run VI again for the weight \(\textbf{w}\).
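For intuition, a cache query can be sketched as follows for a component with two exits. We assume (our own data layout, not the paper's) that each cache entry stores a queried weight \(w_i\), a point \(p_i\) achieved by some scheduler, and a sound upper bound \(u_i\) on the optimum of \(w_i\cdot x\); any achieved point then bounds a new query from below, and the halfspaces \(w_i\cdot x\le u_i\) bound it from above:

```python
# A 2D Pareto-cache query: lower bound from cached achieved points,
# upper bound by maximizing w . x over the polytope cut out by the cached
# halfspaces w_i . x <= u_i inside the box [0, 1]^2 (vertex enumeration).
from itertools import combinations

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def cache_query(cache, w):
    lower = max(dot(w, p) for (_, p, _) in cache)
    cons = [(wi, ui) for (wi, _, ui) in cache]
    cons += [((1, 0), 1.0), ((0, 1), 1.0), ((-1, 0), 0.0), ((0, -1), 0.0)]
    verts = []
    for (a, ua), (b, ub) in combinations(cons, 2):
        det = a[0] * b[1] - a[1] * b[0]
        if abs(det) < 1e-12:
            continue  # parallel constraints
        x = ((ua * b[1] - ub * a[1]) / det, (ub * a[0] - ua * b[0]) / det)
        if all(dot(c, x) <= uc + 1e-9 for c, uc in cons):
            verts.append(x)
    upper = max(dot(w, v) for v in verts)
    return lower, upper  # a cache hit iff upper - lower is small enough

# Illustrative cache entries, then the new query weight (0.75, 0.3)
# from the discussion above (all numbers are ours).
CACHE = [((0.8, 0.3), (0.6, 0.2), 0.55), ((0.2, 0.7), (0.2, 0.7), 0.54)]
lower, upper = cache_query(CACHE, (0.75, 0.3))
```

If the gap between the two bounds exceeds the required precision, this counts as a cache miss and a fresh VI run for \(\textbf{w}\) is triggered.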

2.4 Global Stopping Criteria (GSCs)

On top of these two key ideas, we provide two global stopping criteria (GSCs) in Sect. 6: one is based on ideas from OVI [30], and the other is a symbiosis of the Pareto caches with the bottom-up approach. Although ensuring the termination of our algorithm in finitely many steps with our GSCs remains future work, we show that our GSCs are sound, that is, their output satisfies a given precision upon termination.

3 Formal Problem Statement

We recall (weighted) reachability in Markov decision processes (MDPs) and formalize string diagrams as their compositional representation. Together, these form the formal basis for the problem statement already introduced above.

3.1 Markov Decision Process (MDP)

Definition 3.1

(MDP). An MDP \(\mathcal {M} = (S, A, P)\) is a tuple with a finite set S of states, a finite set A of actions, and a probabilistic transition function \(P:S\times A \rightharpoonup \mathcal {D}(S)\) (which is a partial function, cf. notations in Sect. 1).

A (finite) path (on \(\mathcal {M}\)) is a finite sequence \(\pi = \pi _1 \pi _2 \cdots \pi _n\) of states. We write \(\textsf{FPath}_{\mathcal {M}}\) for the set of finite paths on \(\mathcal {M}\). A memoryless scheduler \(\sigma \) is a function \(\sigma :S\rightarrow \mathcal {D}(A)\); in this paper, memoryless schedulers suffice [20, 44]. We say \(\sigma \) is deterministic memoryless (DM) if for each \(s\in S\), \(\sigma (s)\) is Dirac. We also write \(\sigma :S\rightarrow A\) for a DM scheduler \(\sigma \). The set of all memoryless schedulers on \(\mathcal {M}\) is \(\varSigma ^{\mathcal {M}}\), and the set of all DM schedulers on \(\mathcal {M}\) is \(\varSigma _\text {d}^{\mathcal {M}}\).

For a memoryless scheduler \(\sigma \) and a target state \(t\in S\), the reachability probability \(\textrm{RPr}^{\mathcal {M}, \sigma ,t}(s)\) from a state s is given by \(\textrm{RPr}^{\mathcal {M}, \sigma ,t}(s) = \sum _{\pi \in \textsf{FPath}_{\mathcal {M}}(t)} \textrm{Pr}^{\mathcal {M}}_{\sigma ,s}(\pi )\), where (i) the set \(\textsf{FPath}_{\mathcal {M}}(t)\subseteq \textsf{FPath}_{\mathcal {M}}\) consists of those paths \(\pi = \pi _1\cdots \pi _n\) that reach t exactly at their last state, i.e. \(\pi _n = t\) and \(\pi _j\ne t\) for \(j<n\), and (ii) the probability \(\textrm{Pr}^{\mathcal {M}}_{\sigma ,s}(\pi )\) is defined by \(\textrm{Pr}^{\mathcal {M}}_{\sigma ,s}(\pi ) = \prod _{j=1}^{n-1}\sum _{a} \sigma (\pi _j)(a)\cdot P(\pi _j, a)(\pi _{j+1})\) (the sum ranging over those actions a for which \(P(\pi _j, a)\) is defined) if \(\pi _1 = s\), and \(\textrm{Pr}^{\mathcal {M}}_{\sigma ,s}(\pi ) = 0\) otherwise.
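As a small illustration (with a hypothetical two-state MDP of our own), the path probability is the product, over consecutive states, of the scheduler-weighted transition probabilities:

```python
# Path probability Pr_{sigma, s}(pi) under a memoryless randomized
# scheduler: product over steps of sum_a sigma(state)(a) * P(state, a)(next).
P = {("s", "a"): {"s": 0.5, "t": 0.5},   # P is partial: only these
     ("s", "b"): {"t": 1.0}}             # state-action pairs are defined
sigma = {"s": {"a": 0.5, "b": 0.5}}      # a memoryless scheduler

def path_prob(path, s0):
    if path[0] != s0:
        return 0.0  # the path does not start in s0
    prob = 1.0
    for cur, nxt in zip(path, path[1:]):
        prob *= sum(w * P.get((cur, a), {}).get(nxt, 0.0)
                    for a, w in sigma.get(cur, {}).items())
    return prob
```

For example, the path s s t gets probability \((0.5\cdot 0.5)\cdot (0.5\cdot 0.5 + 0.5\cdot 1.0) = 0.1875\).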

Towards our compositional approach for a reachability objective, we must generalize the objective to a weighted reachability probability objective: we want to compute the weighted sum—with respect to a certain weight vector \(\textbf{w}\)—over reachability probabilities to multiple target states. The standard reachability probability problem is a special case of this weighted reachability problem using a suitable unit vector \(\textbf{e}\) as the weight \(\textbf{w}\).

Definition 3.2

(weighted reachability probability). Let \(\mathcal {M}\) be an MDP, and T be a set of target states. A weight \(\textbf{w}\) on T is a vector \(\textbf{w}\in [0, 1]^{T}\).

Let s be a state, and \(\sigma \) be a scheduler. The weighted reachability probability \(\textrm{WRPr}^{\mathcal {M},\sigma , T}(\textbf{w}, s)\in [0, 1]\) from s to T over \(\sigma \) with respect to a weight \(\textbf{w}\) is defined naturally by a weighted sum, that is, \(\textrm{WRPr}^{\mathcal {M},\sigma , T}(\textbf{w}, s) = \sum _{t\in T}\textbf{w}(t)\cdot \textrm{RPr}^{\mathcal {M},\sigma ,t}(s)\). We write \(\textrm{WRPr}^{\mathcal {M}, T}_{\textrm{max}}(\textbf{w}, s)\) for the maximum weighted reachability probability \( \sup _{\sigma } \textrm{WRPr}^{\mathcal {M},\sigma , T}(\textbf{w}, s) \). (The supremum is realizable; see e.g. [28].)

3.2 String Diagram of MDPs

Definition 3.3

(oMDP). An open MDP (oMDP) \(\mathcal {A} = (M, \textsf{IO})\) is a pair consisting of an MDP M and open ends \(\textsf{IO}=(I_{\textbf{r}}, I_{\textbf{l}}, O_{\textbf{r}}, O_{\textbf{l}})\), where \(I_{\textbf{r}}, I_{\textbf{l}}, O_{\textbf{r}}, O_{\textbf{l}}\subseteq S\) are pairwise disjoint and each of them is totally ordered. The states in \(I = I_{\textbf{r}}\uplus I_{\textbf{l}}\) are the entrances, and the states in \(O = O_{\textbf{r}}\uplus O_{\textbf{l}}\) are the exits, respectively. We often use superscripts to designate the oMDP \(\mathcal {A}\) in question, such as \(I^{\mathcal {A}}\) and \(O^{\mathcal {A}}\).

We write \(\textrm{arity}(\mathcal {A}):(m_{\textbf{r}}, m_{\textbf{l}})\rightarrow (n_{\textbf{r}}, n_{\textbf{l}})\) for the arities of \(\mathcal {A}\), where \(m_{\textbf{r}} = |I_{\textbf{r}}|\), \(m_{\textbf{l}} = |O_{\textbf{l}}|\), \(n_{\textbf{r}} = |O_{\textbf{r}}|\), and \(n_{\textbf{l}} = |I_{\textbf{l}}|\). We assume that every exit s is a sink state, that is, P(s, a) is undefined for any \(a\in A\). We can naturally lift the definitions of schedulers and weighted reachability probabilities from MDPs to oMDPs; we will be particularly interested in the following instances: 1) the weighted reachability probability from a chosen entrance i to the set \(O^{\mathcal {A}}\) of all exits; and 2) the maximum weighted reachability probability from i to \(O^{\mathcal {A}}\) weighted by \(\textbf{w}\).

We define string diagrams of MDPs [53] syntactically, as syntactic trees whose leaves are oMDPs and non-leaf nodes are algebraic operations. The latter are syntactic operations and they are yet to be interpreted.

Definition 3.4

(string diagram of MDPs). A string diagram \(\mathbb {D}\) of MDPs is a term adhering to the grammar \(\mathbb {D}\,{::}{=}\,\textsf{c}_{\mathcal {A}} \,\mid \, \mathbb {D}\fatsemi \mathbb {D}\,\mid \,\mathbb {D}\oplus \mathbb {D}\), where \(\textsf{c}_{\mathcal {A}}\) is a constant designating an oMDP \(\mathcal {A}\), and \(\fatsemi \) and \(\oplus \) denote sequential composition and sum, respectively.

The above syntactic operations are interpreted by the semantic operations below. The following definitions explicate the graphical intuition in Fig. 2.

Definition 3.5

(sequential composition \(\fatsemi \)). Let \(\mathcal {A}\), \(\mathcal {B}\) be oMDPs, \(\textrm{arity}(\mathcal {A}) = (m_{\textbf{r}}, m_{\textbf{l}}) \rightarrow (l_{\textbf{r}}, l_{\textbf{l}})\), and \(\textrm{arity}(\mathcal {B}) = (l_{\textbf{r}}, l_{\textbf{l}}) \rightarrow (n_{\textbf{r}}, n_{\textbf{l}})\). Their sequential composition \(\mathcal {A}\fatsemi \mathcal {B}\) is the oMDP \((M, \textsf{IO}')\) where \(\textsf{IO}' = (I_{\textbf{r}}^{\mathcal {A}}, I_{\textbf{l}}^{\mathcal {B}}, O_{\textbf{r}}^{\mathcal {B}}, O_{\textbf{l}}^{\mathcal {A}})\), and P is

$$\begin{aligned} P(s, a, s') := {\left\{ \begin{array}{ll} P^{\mathcal {D}}(s, a, s') &{}\text {if } \mathcal {D} \in \{\mathcal {A}, \mathcal {B}\},\ s, s'\in S^{\mathcal {D}},\ a\in A^{\mathcal {D}},\\ P^{\mathcal {A}}(s, a, o_{\textbf{r}, i}^{\mathcal {A}}) &{}\text {if } s\in S^{\mathcal {A}},\ a\in A^{\mathcal {A}},\ s' = i_{\textbf{r}, i}^{\mathcal {B}} \text { for some } 1 \le i \le l_{\textbf{r}},\\ P^{\mathcal {B}}(s, a, o_{\textbf{l}, i}^{\mathcal {B}}) &{}\text {if } s\in S^{\mathcal {B}},\ a\in A^{\mathcal {B}},\ s' = i_{\textbf{l}, i}^{\mathcal {A}} \text { for some } 1 \le i \le l_{\textbf{l}},\\ 0 &{}\text {otherwise.} \end{array}\right. } \end{aligned}$$

Definition 3.6

(sum \(\oplus \)). Let \(\mathcal {A}, \mathcal {B}\) be oMDPs. Their sum \(\mathcal {A}\oplus \mathcal {B}\) is the oMDP \((M, \textsf{IO}')\) where \(\textsf{IO}' = (I_{\textbf{r}}^{\mathcal {A}}\uplus I_{\textbf{r}}^{\mathcal {B}}, I_{\textbf{l}}^{\mathcal {A}}\uplus I_{\textbf{l}}^{\mathcal {B}}, O_{\textbf{r}}^{\mathcal {A}}\uplus O_{\textbf{r}}^{\mathcal {B}}, O_{\textbf{l}}^{\mathcal {A}}\uplus O_{\textbf{l}}^{\mathcal {B}})\), \(M = (S^{\mathcal {A}} \uplus S^{\mathcal {B}}, A^{\mathcal {A}} \uplus A^{\mathcal {B}}, P)\), and P is given by \(P(s, a, s') := P^{\mathcal {D}}(s, a, s')\) if \(\mathcal {D} \in \{\mathcal {A}, \mathcal {B}\}\), \(s\in S^{\mathcal {D}}\), \(a\in A^{\mathcal {D}}\), and \(s'\in S^{\mathcal {D}}\), and \(P(s, a, s') := 0\) otherwise.

Definition 3.7

(operational semantics \(\llbracket \mathbb {D} \rrbracket \)). Let \(\mathbb {D}\) be a string diagram of MDPs. The operational semantics \(\llbracket \mathbb {D} \rrbracket \) is the oMDP which is inductively defined by Defs 3.5 and 3.6, with the base case \(\llbracket \textsf{c}_{\mathcal {A}} \rrbracket =\mathcal {A}\). Here we assume that every string diagram \(\mathbb {D}\) has matching arities so that compositions are well-defined. We call \(I^{\llbracket \mathbb {D} \rrbracket }\) and \(O^{\llbracket \mathbb {D} \rrbracket }\) global entrances and global exits of \(\mathbb {D}\), respectively.

Fig. 10. \(\llbracket \mathbb {D} \rrbracket \) in Ex. 3.10.

For describing the occurrences of oMDPs and their duplicates in a string diagram \(\mathbb {D}\), we formally define nominal components \(\mathop {\textrm{nCP}}(\mathbb {D})\) and components \(\mathop {\textrm{CP}}(\mathbb {D})\). The latter is used for graph-theoretic operations in our compositional VI (CVI) (Algorithm 1), while the former is used for Pareto caching (Sect. 5.2). Examples are provided later in Ex. 3.10.

Definition 3.8

(\(\mathop {\textrm{nCP}}(\mathbb {D})\), \(\mathop {\textrm{CP}}(\mathbb {D})\)). The set \(\mathop {\textrm{nCP}}(\mathbb {D})\) of nominal components is the set of constants occurring in \(\mathbb {D}\) (as a term). The set \(\mathop {\textrm{CP}}(\mathbb {D})\) of components is inductively defined by \(\mathop {\textrm{CP}}(\textsf{c}_{\mathcal {A}}) := \{\mathcal {A}\}\), and \(\mathop {\textrm{CP}}(\mathbb {D}_{1}\star \mathbb {D}_{2}) := \mathop {\textrm{CP}}(\mathbb {D}_{1})\uplus \mathop {\textrm{CP}}(\mathbb {D}_{2})\) for \(\star \in \{\fatsemi , \oplus \}\) (sequential composition and sum); here we count multiplicities, unlike \(\mathop {\textrm{nCP}}(\mathbb {D})\).

We introduce local open ends of string diagrams, in contrast to global open ends defined in Def. 3.7.

Definition 3.9

(\(I_{\textrm{lc}}(\mathbb {D})\), \(O_{\textrm{lc}}(\mathbb {D})\) (local)). The sets \(I_{\textrm{lc}}(\mathbb {D})\) and \(O_{\textrm{lc}}(\mathbb {D})\) of local entrances and exits of \(\mathbb {D}\) are given by \(I_{\textrm{lc}}(\mathbb {D}) := \biguplus _{\mathcal {A}\in \mathop {\textrm{CP}}(\mathbb {D})} I^{\mathcal {A}}\) and \(O_{\textrm{lc}}(\mathbb {D}) := \biguplus _{\mathcal {A}\in \mathop {\textrm{CP}}(\mathbb {D})} O^{\mathcal {A}}\), respectively. Clearly we have \(I^{\llbracket \mathbb {D} \rrbracket }\subseteq I_{\textrm{lc}}(\mathbb {D})\), \(O^{\llbracket \mathbb {D} \rrbracket }\subseteq O_{\textrm{lc}}(\mathbb {D})\).

Example 3.10

Let \(\mathbb {D}\) be a string diagram over \(\mathcal {A}\) and \(\mathcal {B}\) from Fig. 1 in which \(\mathcal {A}\) occurs twice. The oMDP \(\llbracket \mathbb {D} \rrbracket \) is shown in Fig. 10. Then \(\mathop {\textrm{nCP}}(\mathbb {D})=\{\textsf{c}_{\mathcal {A}},\textsf{c}_{\mathcal {B}} \}\), while \(\mathop {\textrm{CP}}(\mathbb {D})=\{\mathcal {A}_{1}, \mathcal {A}_{2}, \mathcal {B}\}\) with subscripts added for distinction. We have \(I^{\llbracket \mathbb {D} \rrbracket }=\{i^{\mathcal {A}_{1}}_{1}\}\) and \(O^{\llbracket \mathbb {D} \rrbracket }=\{o^{\mathcal {A}_{1}}_{1}, o^{\mathcal {B}}_{2}\}\), and \(I_{\textrm{lc}}(\mathbb {D})= \{ i^{\mathcal {A}_{1}}_{1}, i^{\mathcal {A}_{1}}_{2}, i^{\mathcal {A}_{2}}_{1}, i^{\mathcal {A}_{2}}_{2}, i^{\mathcal {B}}_{1} \} \) and \(O_{\textrm{lc}}(\mathbb {D})= \{ o^{\mathcal {A}_{1}}_{1}, o^{\mathcal {A}_{1}}_{2}, o^{\mathcal {A}_{2}}_{1}, o^{\mathcal {A}_{2}}_{2}, o^{\mathcal {B}}_{1}, o^{\mathcal {B}}_{2} \} \). Note also that \(O_{\textrm{lc}}(\mathbb {D})\) does not suppress the exits removed in sequential composition, such as \(\{ o^{\mathcal {A}_{1}}_{2}, o^{\mathcal {A}_{2}}_{1}, o^{\mathcal {A}_{2}}_{2}, o^{\mathcal {B}}_{1} \}\).


We remark that as a straightforward extension, we can also extract a scheduler that achieves the under-approximation.

4 VI in a Compositional Setting

We recap value iteration (VI) [3, 44] and its extension to optimistic value iteration (OVI) [30] before presenting our compositional VI (CVI).

4.1 Value Iteration (VI) and Optimistic Value Iteration (OVI)

VI relies on the characterization of maximum reachability probabilities as a least fixed point (lfp): specifically, the lfp \(\mu \varPhi _{\mathcal {M},T}\) of the Bellman operator \(\varPhi _{\mathcal {M},T}\), an operator on the set \([0, 1]^S\) that intuitively returns the \((t+1)\)-step reachability probabilities given the t-step reachability probabilities. A formal treatment can be found in [55, Appendix B]. Then the Kleene sequence \(\bot \le \varPhi _{\mathcal {M},T}(\bot )\le \varPhi _{\mathcal {M},T}^{2}(\bot )\le \cdots \) gives a monotonically increasing sequence that converges to the lfp \(\mu \varPhi _{\mathcal {M},T}\), where \(\bot \) is the least element. This also applies to weighted reachability probabilities.

While VI gives guaranteed under-approximations, it does not say how close the current approximation is to the solution \(\mu \varPhi _{\mathcal {M},T}\). The capability of providing guaranteed over-approximations as well is called soundness in VI, and many techniques come with soundness [24, 30, 43, 46]. Soundness is useful for stopping criteria: one can fix an error bound \(\eta \in [0, 1]\), and VI can terminate when the distance between the under- and over-approximations is at most \(\eta \).

Among sound VI techniques, in this paper we focus on optimistic VI (OVI) due to its proven performance record [11, 29]. We use OVI in many places, specifically for 1) stopping criteria for local VIs in Sect. 4.2, 2) caching heuristics in Sect. 5.2, and 3) a stopping criterion for global (compositional) VI in Sect. 6.

The main steps of OVI proceed as follows: 1) a VI iteration produces an under-approximation l for every state; 2) we heuristically pick an over-approximation candidate u, for example \(u := l + \varepsilon \) for some small \(\varepsilon > 0\); and 3) we verify the candidate u by checking whether \(\varPhi _{\mathcal {M},T}(u)\le u\). If this holds, then by the Park induction principle [42], u is guaranteed to over-approximate the lfp \(\mu \varPhi _{\mathcal {M},T}\). If it does not, we refine l and u and try again. See [30] for details.
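The three steps can be sketched compactly. The following is our own minimal rendition of the OVI loop (with a fixed candidate guess \(u = l + \varepsilon \); the real algorithm [30] refines the guess heuristically), on a toy MDP encoded as state \(\rightarrow \) action \(\rightarrow \) successor distribution:

```python
# Optimistic VI sketch: run VI phases on the lower bound l, guess an
# optimistic candidate u, and verify it by Park induction Phi(u) <= u.
def phi(P, targets, f):
    g = dict(f)
    for s, acts in P.items():  # Bellman operator on non-target states
        g[s] = max(sum(p * f[t] for t, p in dist.items())
                   for dist in acts.values())
    for t in targets:
        g[t] = 1.0
    return g

def ovi(P, targets, states, eps=1e-6):
    l = {s: 0.0 for s in states}
    for t in targets:
        l[t] = 1.0
    while True:
        for _ in range(50):                    # a VI phase improving l
            l = phi(P, targets, l)
        u = {s: min(1.0, v + eps) for s, v in l.items()}  # optimistic guess
        pu = phi(P, targets, u)
        if all(pu[s] <= u[s] + 1e-12 for s in states):    # Park induction
            return l, u  # sound: l <= lfp <= u pointwise
        # otherwise: keep iterating (in general, eps may need adjusting)

# Toy MDP: from "s", one action reaches the target "t" or a sink "dead".
P = {"s": {"a": {"t": 0.5, "dead": 0.5}}}
l, u = ovi(P, {"t"}, {"s", "t", "dead"})
```

On this toy model the first candidate already passes the Park induction check, yielding sound bounds around the true value 0.5 for state s.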

4.2 Going Top-Down in Compositional Value Iteration

We move on to formalize the story of Sect. 2.2. Algorithm 1 is a prototype of our proposed algorithm, where compositional VI is run in a top-down manner. It will be combined with Pareto caching (Sect. 5.2) and the stopping criteria introduced in Sect. 6. A high-level view of Algorithm 1 is the iteration of the following operations: 1) running local VI in each component oMDP, and 2) propagating its result along sequential composition, from an entrance of a succeeding component, to the corresponding exit of a preceding component. See Fig. 11 for illustration. The algorithm maintains two main constructs: functions \(g:I_{\textrm{lc}}(\mathbb {D})\rightarrow [0, 1]\) and \(h:O_{\textrm{lc}}(\mathbb {D})\rightarrow [0, 1]\) that assign values to local entrances and exits, respectively. They are analogues of the value function \(f:S\rightarrow [0, 1]\) in (standard) VI (Sect. 4.1); g and h get iteratively increased as the algorithm proceeds.

Algorithm 1.
Fig. 11. An overview of Algorithm 1. In the MDP \(\llbracket \mathbb {D} \rrbracket \), the exit \(o^{\mathcal {A}}_{1}\) and the entrance \(i^{\mathcal {B}}_{1}\) are merged in sequential composition (Def. 3.5); here they are distinguished, much like in Def. 3.9. Numbers in red are the values of h; those in blue are the values of g.

Lines 4–12 are the main VI loop, where we combine local VI (over each component \(\mathcal {A}\)) and propagation along sequential composition. The algorithm \(\texttt {LocalVI}\) takes the target oMDP \(\mathcal {A}\) and its “local weight” as arguments; the latter is the restriction \(h|_{O^{\mathcal {A}}}:O^{\mathcal {A}}\rightarrow [0, 1]\) of the function \(h:O_{\textrm{lc}}(\mathbb {D})\rightarrow [0, 1]\). Any VI algorithm will do for \(\texttt {LocalVI}\); we use OVI as announced in Sect. 4.1. The result of local VI is a function \(g_{\mathcal {A}}:I^{\mathcal {A}}\rightarrow [0, 1]\) for values over entrances of \(\mathcal {A}\). These get patched up to form \(g:I_{\textrm{lc}}(\mathbb {D})\rightarrow [0, 1]\) in line 11. The function \(\coprod _{\mathcal {A}\in \mathop {\textrm{CP}}(\mathbb {D})}g_{\mathcal {A}}\) is defined by obvious case-distinction: it returns \(g_{\mathcal {A}}(i)\) for a local entrance \(i\in I^{\mathcal {A}}\). Recall from Def. 3.9 that \(I_{\textrm{lc}}(\mathbb {D})=\biguplus _{\mathcal {A}\in \mathop {\textrm{CP}}(\mathbb {D})}I^{\mathcal {A}}\). In line 12, the values at entrances are propagated to the connected exits.

On \(\texttt {PropagateSeqComp}\) in line 12: its graphical intuition is in Fig. 11c; here are some details. We first note that the set \(O_{\textrm{lc}}(\mathbb {D})\) of local exits is partitioned into 1) global exits (i.e. those in \(O^{\llbracket \mathbb {D} \rrbracket }\)) and 2) local exits that are removed by sequential composition. Indeed, by examining Defs 3.5 and 3.6, we see that sequential composition is the only operation that removes local exits, and the local exits that are not removed eventually become global exits. It is also clear (Def. 3.5) that each local exit o removed in sequential composition has a corresponding local entrance \(i_o\). Using these, we define the updated function \(h:O_{\textrm{lc}}(\mathbb {D})\rightarrow [0, 1]\) as follows: \(h(o)=w_{o}\) if o is a global exit (much like line 2); \(h(o)=g(i_o)\) otherwise.
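The interplay of \(\texttt {LocalVI}\) and \(\texttt {PropagateSeqComp}\) can be sketched on a uni-directional chain of two components. The data layout and the plain-VI stand-in for \(\texttt {LocalVI}\) below are simplifications of ours; Algorithm 1 uses OVI and also handles bidirectional wires:

```python
# A simplified CVI loop: h assigns values to local exits, g to local
# entrances. Each round runs local VI per component with h restricted to
# its exits as the weight, then propagates entrance values backwards to
# the connected exits.
COMPONENTS = {
    "A": {"P": {"i1": {"a": [("o1", 0.5), ("o2", 0.5)],
                       "b": [("o1", 0.9), ("i1", 0.1)]}},
          "entrances": ["i1"], "exits": ["o1", "o2"]},
    "B": {"P": {"i2": {"a": [("o3", 0.8), ("o4", 0.2)]},
                "i3": {"a": [("o3", 0.3), ("o4", 0.7)]}},
          "entrances": ["i2", "i3"], "exits": ["o3", "o4"]},
}
CONNECT = {"o1": "i2", "o2": "i3"}       # sequential-composition wiring
W = {"o3": 1.0, "o4": 0.0}               # weight on the global exits

def local_vi(comp, h, iters=200):        # stands in for LocalVI (plain VI)
    f = {s: 0.0 for s in comp["P"]}
    f.update({o: h[o] for o in comp["exits"]})
    for _ in range(iters):
        for s in comp["P"]:
            f[s] = max(sum(p * f[t] for t, p in succ)
                       for succ in comp["P"][s].values())
    return {i: f[i] for i in comp["entrances"]}

h = {o: W.get(o, 0.0) for c in COMPONENTS.values() for o in c["exits"]}
g = {}
for _ in range(20):                      # global CVI rounds
    for comp in COMPONENTS.values():     # local VI on every component
        g.update(local_vi(comp, h))
    for o, i in CONNECT.items():         # PropagateSeqComp
        h[o] = g[i]
```

In this toy chain, the values g(i2) = 0.8 and g(i3) = 0.3 are propagated to h(o1) and h(o2), after which the local VI on the first component converges to g(i1) = 0.8.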

Theorem 4.1

Algorithm 1 satisfies the following properties:

  1. 1.

    (Guaranteed under-approximation) For the output f of Algorithm 1, we have \(f(i)\le \textrm{WRPr}^{\llbracket \mathbb {D} \rrbracket }_{\textrm{max}}(\textbf{w}, i)\) for each \(i\in I^{\llbracket \mathbb {D} \rrbracket }\).

  2.

    (Convergence) Assume that \(\texttt {GlobalStoppingCriterion}\) is \({\textbf {false}}\). Algorithm 1 converges to the optimal value, that is, f converges to \( \textrm{WRPr}^{\llbracket \mathbb {D} \rrbracket }_{\textrm{max}}(\textbf{w}, I^{\llbracket \mathbb {D} \rrbracket })\).    \(\square \)

The correctness of the under-approximation of Algorithm 1 follows easily from that of (non-compositional, asynchronous) VI. The convergence relies on the fact that line 6 of Algorithm 1 iterates over all components.

5 Pareto Caching in Compositional VI

In our formulation of Algorithm 1, there is no explicit notion of Pareto curves. However, in line 6, we do (implicitly) compute under-approximations of points on the Pareto curves. Here we first recap approximate Pareto curves; we then show how we conduct Pareto caching, the key idea sketched in Sect. 2.3.

5.1 Approximating Pareto Curves

We formalize the Pareto curves illustrated in Sect. 2. For details, see [17, 20, 41, 45]. Model checking oMDPs is a multi-objective problem that concerns the trade-offs between the reachability probabilities to the individual exits.

Definition 5.1

(Pareto curve for an oMDP [54]). Let \(\mathcal {A}\) be an oMDP, and i be a (chosen) entrance. Let \(\textbf{p},\textbf{p}' \in [0, 1]^{O^{\mathcal {A}}}\). The relation \(\preceq \) between them is defined by \(\textbf{p}\preceq \textbf{p}'\) if \(\textbf{p}(o)\le \textbf{p}'(o)\) for each \(o\in O^{\mathcal {A}}\). When \(\textbf{p}\prec \textbf{p}'\) (i.e. \(\textbf{p}\preceq \textbf{p}'\) and \(\textbf{p}\ne \textbf{p}'\)), we say \(\textbf{p}'\) dominates \(\textbf{p}\). Let \(\sigma \) be a scheduler for \(\mathcal {A}\). We define the point realized by \(\sigma \), denoted by \(\textbf{p}^{\sigma }_{i}\), by \(\textbf{p}^{\sigma }_{i}(o) = \textrm{RPr}^{\sigma }(i\rightarrow o)\), the reachability probability from i to o under \(\sigma \).

The set \(\textsf{Ach}^{\sigma }_{i}\) of points achievable by \(\sigma \) is \(\{\textbf{p}\in [0, 1]^{O^{\mathcal {A}}}\mid \textbf{p}\preceq \textbf{p}^{\sigma }_{i}\}\). The set of achievable points is given by \(\textsf{Ach}_{i} = \bigcup _{\sigma }\textsf{Ach}^{\sigma }_{i}\). The Pareto curve \(\textsf{Pareto}_{i}\subseteq [0, 1]^{O^{\mathcal {A}}}\) is the set of maximal elements in \(\textsf{Ach}_{i}\) wrt. \(\preceq \). We say a scheduler \(\sigma \) is Pareto-optimal if \( \textbf{p}^{\sigma }_{i} \in \textsf{Pareto}_{i} \).
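As a small illustration of Def. 5.1, the Pareto curve of a finite set of realized points consists of the points not dominated by any other; the two-exit points below are made up for the sketch.

```python
def leq(p, q):
    # the componentwise order p ⪯ q from Def. 5.1
    return all(a <= b for a, b in zip(p, q))

def pareto_front(points):
    # keep exactly the points that no other point dominates
    return [p for p in points
            if not any(leq(p, q) and p != q for q in points)]

# illustrative points realized by schedulers of a two-exit oMDP
pts = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.8)]
print(pareto_front(pts))   # [(0.9, 0.1), (0.5, 0.5), (0.1, 0.8)]
```

Here (0.4, 0.4) is dropped because (0.5, 0.5) dominates it; the remaining points are pairwise incomparable.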

The set \(\textsf{Ach}_{i}\) is convex, downward closed, and finitely generated by DM schedulers; it follows that, for our target problem, Pareto-optimal DM schedulers suffice. This is illustrated in Fig. 9, where the weight \(\textbf{w}\) is the normal vector of the blue lines, and the maximum is achieved by a generating point of \(\textsf{Ach}_{i}\).

The last observations are formally stated as follows.

Proposition 5.2

([17, 20, 45]). For any entrance \(i\in I\), the set \(\textsf{Ach}_{i}\) of achievable points is finitely generated by DM schedulers, that is, \( \textsf{Ach}_{i} = \textsf{DwConvCl}(\textsf{Ach}^{\varSigma _\text {d}^{\mathcal {A}}}_{i}) \). Here, \(\textsf{DwConvCl}(X)\) denotes the downward and convex closed set generated by \(X\subseteq \mathbb R^n\), and \(\textsf{Ach}^{\varSigma _\text {d}^{\mathcal {A}}}_{i}\) is given by \(\{\textbf{p}^{\sigma }_{i}\mid \sigma \in \varSigma _\text {d}^{\mathcal {A}}\}\), where \(\varSigma _\text {d}^{\mathcal {A}}\) is the set of DM schedulers.

Proposition 5.3

([17, 20, 45]). Given a weight \(\textbf{w}\in [0, 1]^{O^{\mathcal {A}}}\) and an entrance i, there is a scheduler \(\sigma \) such that \( \textrm{WRPr}^{\mathcal {A},\sigma }(\textbf{w}, i) = \textrm{WRPr}^{\mathcal {A}}_{\textrm{max}}(\textbf{w}, i)\). Moreover, this \(\sigma \) can be chosen to be DM and Pareto-optimal.

We now formulate sound approximations of Pareto curves, which are the foundation of our Pareto caching (and of a global stopping criterion in Sect. 6).

Definition 5.4

(sound approximation [54]). Let i be an entrance. An under-approximation \(L_i\) of the Pareto curve \(\textsf{Pareto}_{i}\) is a downward closed subset \(L_i\subseteq \textsf{Ach}_{i}\); an over-approximation is a downward closed superset \(U_i\supseteq \textsf{Ach}_{i}\). A pair \((L_i,U_i)\) is called a sound approximation of the Pareto curve \(\textsf{Pareto}_{i}\). In this paper, we focus on \(L_i\) and \(U_i\) that are finitely generated, i.e. the convex and downward closures of some finite generators \(L_i^{\textrm{g}}, U_i^{\textrm{g}}\subseteq [0, 1]^{O^{\mathcal {A}}}\), respectively. A sound approximation of an oMDP \(\mathcal {A}\) is a pair (L, U), where \(L=(L_{i})_{i\in I^{\mathcal {A}}}\), \(U=(U_{i})_{i\in I^{\mathcal {A}}}\), and \((L_i, U_i)\) is a sound approximation for each entrance i.

5.2 Pareto Caching

We go on to formalize our second key idea, Pareto caching, outlined in Sect. 2.3.

In Def. 5.5, an index \(\textsf{c}_{\mathcal {A}}\in \mathop {\textrm{nCP}}(\mathbb {D})\) is a nominal component that ignores multiplicities, since we want to reuse results for different occurrences of \(\mathcal {A}\).

Definition 5.5

(Pareto cache). Let \(\mathbb {D}\) be a string diagram of MDPs. A Pareto cache \(\textbf{C}\) is an indexed family \(\big ((L^{\mathcal {A}}, U^{\mathcal {A}})\big )_{\textsf{c}_{\mathcal {A}}\in \mathop {\textrm{nCP}}(\mathbb {D})}\), where \((L^{\mathcal {A}}, U^{\mathcal {A}})\) is a sound approximation (Def. 5.4) for each nominal component \(\textsf{c}_{\mathcal {A}}\).

As announced in Sect. 2.3, a Pareto cache \(\textbf{C}\) (its component \((L^{\mathcal {A}}, U^{\mathcal {A}})\), to be precise) gets queried on a weight \(\textbf{w}\in [0, 1]^{O^{\mathcal {A}}}\). What to return is not trivial, however, since the specific weight \(\textbf{w}\) may not have been used before in constructing \(\textbf{C}\). The query is answered as depicted in Fig. 9, by finding an extremal point where \(L^{\mathcal {A}}\) intersects a plane whose normal vector is \(\textbf{w}\).

Definition 5.6

(cache read). Assume the above setting, and let i be an entrance of interest. The cache read \((L^{\mathcal {A}}_{i}(\textbf{w}), U^{\mathcal {A}}_{i}(\textbf{w}))\in [0, 1]^{2}\) on \(\textbf{w}\) at i is defined by \(L^{\mathcal {A}}_{i}(\textbf{w}) = \sup _{\textbf{p}\in L^{\mathcal {A}}_{i}}\textbf{w}\cdot \textbf{p}\) and \(U^{\mathcal {A}}_{i}(\textbf{w}) = \sup _{\textbf{p}\in U^{\mathcal {A}}_{i}}\textbf{w}\cdot \textbf{p}\).

Recall from Sect. 5.1 that we can assume \(L_{i}\) and \(U_{i}\) are finitely generated as convex and downward closures. It follows [20, 45] that each supremum above is realized by some generating point, much like in Prop. 5.3, easing computation.
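A cache read thus reduces to finitely many dot products over the generators. A minimal sketch, with made-up two-exit generators:

```python
def cache_read(L_gen, U_gen, w):
    # each supremum over a finitely generated L_i / U_i is attained at a
    # generating point, so a query is a max over dot products
    score = lambda gens: max(sum(a * b for a, b in zip(w, p)) for p in gens)
    return score(L_gen), score(U_gen)

L_gen = [(0.5, 0.3), (0.2, 0.6)]       # achievable points generating L_i
U_gen = [(0.55, 0.35), (0.25, 0.65)]   # generators of the over-approximation U_i
lo, hi = cache_read(L_gen, U_gen, (1.0, 1.0))
print(lo, hi)   # approximately 0.8 0.9
```

The pair (lo, hi) brackets the optimal weighted reachability for this weight, which is exactly what Algorithm 2 compares against the bound \(\eta\).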


We complement Algorithm 1 by Algorithm 2, which introduces our Pareto caching. Specifically, for the weight \(h|_{O^{\mathcal {A}}}\) in question, we first compute the error \(\max _{i\in I^{\mathcal {A}}}U^{\mathcal {A}}_i(h|_{O^{\mathcal {A}}}) - L^{\mathcal {A}}_i(h|_{O^{\mathcal {A}}})\) of the Pareto cache \(\textbf{C}= \big ((L^{\mathcal {A}}, U^{\mathcal {A}})\big )_{\textsf{c}_{\mathcal {A}}}\) with respect to this weight. The error can be greater than a prescribed bound \(\eta \)—we call this a cache miss—in which case we run OVI locally for \(\mathcal {A}\) (line 9). When the error is no greater than \(\eta \)—we call this a cache hit—we use the cache read (Def. 5.6), sparing OVI on a component \(\mathcal {A}\in \mathop {\textrm{CP}}(\mathbb {D})\). In the case of a cache miss, the result \((l,\sigma )\) of local OVI (line 9) is also used to update the Pareto cache \(\textbf{C}\) (line 10); see below.
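The hit/miss logic can be sketched as follows; `Cache`, `run_ovi`, and the concrete values are illustrative stand-ins, not Storm's API.

```python
class Cache:
    # toy Pareto cache mapping (component, weight) to (lower, upper)
    def __init__(self):
        self.store = {}
    def read(self, A, w):
        return self.store.get((A, w), (0.0, 1.0))   # trivial sound bounds
    def update(self, A, w, l, sigma):
        self.store[(A, w)] = (l, l + 1e-6)          # tightened after OVI

def run_ovi(A, w):
    return 0.42, "sigma0"   # stand-in OVI result: (value, scheduler)

def local_vi_with_cache(A, w, cache, eta, run_ovi):
    lo, hi = cache.read(A, w)
    if hi - lo <= eta:           # cache hit: reuse the read, skip OVI
        return lo
    l, sigma = run_ovi(A, w)     # cache miss: run OVI locally (line 9)
    cache.update(A, w, l, sigma) # update the cache with the result (line 10)
    return l

cache, eta = Cache(), 1e-5
v1 = local_vi_with_cache("A", (1.0,), cache, eta, run_ovi)  # miss -> OVI
v2 = local_vi_with_cache("A", (1.0,), cache, eta, run_ovi)  # hit -> cached
print(v1, v2)   # 0.42 0.42
```

The second call is answered from the cache alone, which is precisely the saving that repeated occurrences of a component make possible.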

Using a Pareto cache may prevent the execution of local VI on every component, which can jeopardize the convergence of Algorithm 1; see Thm. 4.1. A simple remedy is to eventually disregard the Pareto cache.

Updating the Cache. Pareto caches get incrementally updated using the results of weighted-reachability computations for different weights \(\textbf{w}\). We build upon the data structures in [20, 45]. Notable is the asymmetry between the under- and over-approximations \((L_{i},U_{i})\): we obtain 1) a point in \(L_{i}\) and 2) a plane that bounds \(U_{i}\).

We update the cache after running OVI on a weight \(\textbf{w}\in [0, 1]^{O^{\mathcal {A}}}\), which approximately computes the optimal weighted reachability to exits \(o\in O^{\mathcal {A}}\). That is, it returns \(l,u\in [0, 1]\) such that

$$\begin{aligned} l\;\le \; \textstyle \sup _{\sigma } \bigl (\,\textbf{w}\cdot \bigl (\, \textrm{RPr}^{\sigma }(i\rightarrow o) \,\bigr )_{o\in O^{\mathcal {A}}}\,\bigr ) \;\le \; u. \end{aligned}$$
(4)

Here i is any entrance and \(\textrm{RPr}^{\sigma }(i\rightarrow o)\) is the probability \(\textrm{RPr}^{\mathcal {A},\sigma ,\{o\}}(i)\) in Sect. 3.1.

What are the “graphical” roles of l and u in the Pareto curve? The role of u is easier: it follows from (4) that any achievable reachability vector \( \bigl ( \textrm{RPr}^{\sigma }(i\rightarrow o) \bigr )_{o} \) resides under the plane \(\{\textbf{p}\mid \textbf{w}\cdot \textbf{p}= u\}\). This plane thus bounds an over-approximation \(U_{i}\). Exploiting l takes some computation. By (4), the existence of a good scheduler \(\sigma \) is guaranteed, but this alone does not carry any graphical information, e.g. in Fig. 9. We have to go constructive: we extract a near-optimal DM scheduler \(\sigma _{0}\) (we can do so in VI) and use this fixed \(\sigma _{0}\) to compute \( \bigl ( \textrm{RPr}^{\sigma _{0}}(i\rightarrow o) \bigr )_{o} \). This way we can plot an achievable point—a corner point in Fig. 9—in \(L_i\).
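The asymmetric update can be sketched as follows; `reach_under` stands in for model checking the Markov chain induced by the fixed scheduler, and all names and numbers are illustrative.

```python
def update_cache(U_halfspaces, L_points, w, u, sigma0, reach_under, exits):
    # upper bound u contributes a bounding halfspace {p | w . p <= u} to U_i
    U_halfspaces.append((w, u))
    # lower bound: re-evaluate the extracted near-optimal DM scheduler
    # sigma0 per exit to obtain a concrete achievable point for L_i
    point = tuple(reach_under(sigma0, o) for o in exits)
    L_points.append(point)

exits = ("o1", "o2")
U_hs, L_pts = [], []
update_cache(U_hs, L_pts, (0.5, 0.5), 0.45, "sigma0",
             lambda s, o: {"o1": 0.5, "o2": 0.3}[o], exits)
print(U_hs, L_pts)   # [((0.5, 0.5), 0.45)] [(0.5, 0.3)]
```

Note the asymmetry: the over-approximation grows by intersecting halfspaces, while the under-approximation grows by adding corner points, matching the description above.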

6 Global Stopping Criteria (GSC)

We present the last missing piece, namely global stopping criteria (GSC for short, line 4 of Algorithm 1). A GSC has to ensure that the computed under-approximation f is \(\epsilon \)-close to the exact reachability probability. We provide two criteria, called optimistic and bottom-up.

Optimistic GSC (Opt-GSC). The challenge in adapting the idea of OVI (see Sect. 4.1) to CVI is to define a suitable Bellman operator for CVI; once such an operator is defined, the idea of OVI applies immediately. For simplicity, we assume that CVI solves each local component exactly (line 6 in Algorithm 1) without Pareto caching; this can be done, for example, by policy iteration [29]. Then, CVI (without Pareto caching and a global stopping criterion) on \(\mathbb {D}\) coincides with (non-compositional) VI on a suitable shortcut MDP [54] of \(\mathbb {D}\). Intuitively, a shortcut MDP summarizes a Pareto-optimal scheduler by a single action from a local entrance to an exit; see [55, Appendix C] for the definition. Thus, we can regard the standard Bellman operator on the shortcut MDP as the Bellman operator for CVI, and define Opt-GSC as standard OVI based on this characterization. CVI with Opt-GSC (and Pareto caching) actually uses local under-approximations (not exact solutions) for obtaining a global under-approximation (lines 7 and 9 in Algorithm 2), where the desired soundness property still holds. See [55, Appendix C] for more details.
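The OVI-style check underlying Opt-GSC can be sketched as follows, with a toy one-state operator standing in for the shortcut-MDP Bellman operator.

```python
def optimistic_check(f, eps_guess, bellman):
    # OVI recipe: guess an upper candidate u = f + eps_guess and verify
    # that the Bellman operator maps u below itself; if so, u is an
    # inductive (hence sound) upper bound on the fixed point
    u = {s: min(1.0, v + eps_guess) for s, v in f.items()}
    bu = bellman(u)
    return all(bu[s] <= u[s] for s in u)

# toy one-state "shortcut MDP" operator with fixed point 0.5
bellman = lambda v: {"s": 0.5 * v["s"] + 0.25}
f = {"s": 0.49}                              # current under-approximation
print(optimistic_check(f, 0.02, bellman))    # True: 0.51 is verified
print(optimistic_check(f, 0.0, bellman))     # False: 0.49 is no upper bound
```

When the check succeeds, the gap between f and the verified candidate witnesses \(\epsilon\)-closeness, which is exactly what the GSC must certify.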

Bottom-up GSC (BU-GSC). We obtain another global stopping criterion by composing the Pareto caches computed in Algorithm 2 for each component \(\mathcal {A}\), in the bottom-up manner of [54] (outlined in Sect. 2.1). Specifically, 1) Algorithm 2 produces an over-approximation \(U^{\mathcal {A}}\) of the Pareto curve of each component \(\mathcal {A}\); 2) we combine \((U^{\mathcal {A}})_{\mathcal {A}}\) along sequential composition and sum \(\oplus \) to derive an over-approximation U of the global Pareto curve; and 3) this U is queried on the weight \(\textbf{w}\) in question (i.e. the input of CVI) to obtain an over-approximation u of the weighted reachability probability. BU-GSC checks whether this over-approximation u is close enough to the under-approximation l derived from g in Algorithm 1.

Correctness. CVI (Algorithm 1 with Pareto caching under either GSC) is sound. The proof is in [55, Appendix C].

Theorem 6.1

(\(\epsilon \)-soundness of CVI). Given a string diagram \(\mathbb {D}\), a weight \(\textbf{w}\), and \(\epsilon \in [0, 1]\), if CVI terminates, then the output f satisfies

$$\begin{aligned} f(i)\le \textrm{WRPr}^{\llbracket \mathbb {D} \rrbracket }_{\textrm{max}}(\textbf{w}, i) \le f(i)+\epsilon , \end{aligned}$$

for each \(i\in I^{\llbracket \mathbb {D} \rrbracket }\).

Our algorithm currently comes with no termination guarantee; this is future work. Termination of VI (with soundness) is a tricky problem: most known termination proofs exploit the uniqueness of a fixed point of the Bellman operator, which must be algorithmically enforced e.g. by eliminating end components [10, 24]. In the current compositional setting, end components can arise by composing components, so detecting them is much more challenging.

7 Empirical Evaluation

In this section, we compare the scalability of our approaches both among each other and in comparison with some existing baselines. We discuss the setup, give the results, and then give our interpretation of them.

Approaches. We examine our three main algorithms: Opt-GSC with either exact caching (\(\textbf{OCVI}^{e}\)) or Pareto caching, and BU-GSC with Pareto caching (Symb). BU-GSC needs a Pareto cache, so we cannot run it with an exact cache. We compare our approaches against two baselines: a monolithic algorithm (Mono) building the complete MDP \(\llbracket \mathbb {D} \rrbracket \), and the bottom-up approach (BU) as explained in [54]. We additionally use two virtual approaches that employ a perfect oracle to select the fastest of the specified algorithms: baselines is the best of the two baselines, while novel is the best of the three new algorithms. All algorithms are built on top of the probabilistic model checker Storm [32], which is primarily used for model building, (O)VI on component MDPs, and operating on Pareto curves.

Setup. We run all benchmarks on a single core of an AMD Ryzen TRP 5965WX, with a 900 s time-out and a 16 GB memory limit. We use all (scalable) benchmark instances from [54]. While these benchmarks are synthetic, they reflect typical structures found in network protocols and high-level planning domains. We require an overall precision of \(10^{-4}\), run the Pareto cache with an acceptance precision of \(10^{-5}\), and solve the LPs in the upper-bound queries for the Pareto cache with an exact LP solver and a tolerance of \(10^{-4}\). The components are reverse topologically ordered, i.e., we always first analyse component MDPs towards the end of a given MDP \(\llbracket \mathbb {D} \rrbracket \). To solve the component MDPs inside the VI, we use OVI for the lower bounds and precise policy iteration for the upper bounds. We use algorithms and data structures already present in Storm for maintaining Pareto curves [45], which use exact rational arithmetic for numerical stability. Although our implementation supports exact arithmetic throughout the code, in practice this incurs a significant performance penalty, running up to 100 times slower. For algorithms not related to maintaining the Pareto cache, we opted for 64-bit floating-point arithmetic, which is standard in probabilistic model checking [11]. Using floating-point arithmetic can produce unsound results [26]; we attempt to prevent unsound results in our benchmarks. First, we check with our setup that our results are very close (error \(<10^{-5}\)) to the exact solutions (when they could be computed). Second, we check that all results, obtained with different methods, are close to each other. We evaluate the stopping criteria after ten iterations. These choices can be adapted using our prototypical implementation; we discuss some of them at the end of the discussion below.

Fig. 12. Benchmark scatter plots, time in seconds, OoR = Out of Resources.

Results. We provide pairwise comparisons of the runtimes on all benchmarks using the scatter plots in Fig. 12; notice the log-log scale. For some of the benchmark instances, we provide detailed information in Tables 1 and 2, respectively. In Table 1, we give the identifier for the string diagram and the component MDPs, as well as the number of states in \(\llbracket \mathbb {D} \rrbracket \). Then, for each of the five algorithms, we provide the runtime t; for each algorithm maintaining Pareto points, we give the number of Pareto points stored |P|; and for the three novel VI-based algorithms, we give the time spent attempting to prove convergence (\(t_s\)). In Table 2, we focus on our three novel algorithms and the performance of the caches. We again provide identifiers for the models, and then, for each algorithm, the total time spent, the time spent on inserting and retrieving items from the cache, as well as the fraction of cache hits H and the number of total queries Q. Thus, the number of cache hits is given by \(H\cdot Q\). The full tables and more figures are given in [55, Appendix A].

Table 1. Performance for different algorithms. See Results for explanations.
Table 2. Cache access times for CVI algorithms. See Results for explanations.

Discussion. We make some observations. We notice that the CVI algorithms collectively solve more benchmarks within the time-out and speed up most benchmarks; see Fig. 12(top-l). We refer to the benchmark results in Table 1.

\(\textbf{OCVI}^{e}\) Mostly Outperforms Mono, Fig. 12(top-c). The monolithic VI as typical in Storm requires a complete model, which can be prohibitively large. However, even for medium-sized models such as Chains100-RmB, VI can run into time-outs due to slow convergence. CVI with the exact cache (and even with no cache) converges quickly, highlighting that the grouping of states helps VI to converge. On the other hand, a model such as Birooms100-RmS highlights that the harder convergence check can incur a significant overhead.

Symb Mostly Outperforms BU, Fig. 12(top-r). For many models, the top-down approach as motivated in Sect. 4.2 indeed ensures that we avoid the undirected exploration of the Pareto curves. However, if the VI repeatedly asks for weights that are not relevant for the optimal scheduler, the termination checks fail and this yields a significant overhead.

\(\textbf{OCVI}^{e}\) and Symb Both Provide Clear Added Value, Fig. 12(bot-l). Both approaches can solve benchmarks within ten seconds that the other approach does not solve within the time-out. Both approaches can save significantly on the number of iterations necessary. Symb suffers from the overhead of the Pareto cache (see below), whereas \(\textbf{OCVI}^{e}\) requires near-optimal values in all leaves, regardless of whether these leaves are important for reaching the global target. Therefore, Symb may profit from ideas from asynchronous VI and from adaptive schemes for deciding when to run the termination check.

Pareto Cache Has a Significant Overhead, Fig. 12(bot-c/r) and Table 2. We observe that the Pareto cache consistently yields an overhead: in particular, \(\textbf{OCVI}^{e}\) often outperforms its Pareto-caching counterpart. The Pareto cache is, however, essential for Symb. The overhead has three different sources. (1) More iterations: Birooms10-RmB illustrates how \(\textbf{OCVI}^{e}\) requires only 14% of the iterations of its Pareto-caching counterpart. Even with a 66% cache hit rate, this means an overhead in the number of component MDPs analysed. The main reason is that reusing approximations can delay convergence. (2) Cache retrieval: To obtain an upper bound, we must optimize over Pareto curves that contain tens of halfspaces, which is numerically not very stable. Therefore, Pareto curves in Storm are represented exactly. Solving the resulting linear program is often as slow as actually solving the component MDP, especially for small MDPs. (3) Cache insertion: Inserting lower bounds requires model checking as many Markov chains as there are exits in the open MDPs. These times are pure overhead if the lower bound is never retrieved, and can be substantial for large open MDPs.

Opportunities for Heuristics and Hyperparameters. We extensively studied variations of the presented algorithms. For example, a much higher tolerance in the Pareto cache can significantly speed up computation at the cost of not terminating on many benchmark instances; one could also investigate a per-query strategy for retrieving and/or inserting cache results.

Interpretation of Results. Mono works well on models that fit into memory and exhibit little sharing of open MDPs. BU works well when the Pareto curves of the open MDPs can be accurately approximated with few Pareto points, which, in practice, excludes open MDPs with more than 3 exits. CVI without caching and termination criteria resembles a basic kind of topological VI on the monolithic MDP. CVI can thus improve upon topological VI either via the cache or via the alternative stopping criteria. Based on the experiments, we conjecture that

  • the cache is efficient when a single reachability query is expensive (such as in the Room10 model) and the cache hit rate is high.

  • the symbiotic termination criterion (Symb) works well when some exits are not relevant for the global target, such as the Chains3500 model, in which going backwards is not productive.

  • the compositional OVI stopping criterion (Opt-GSC, with either cache) works well when the likelihood of reaching all individual open MDPs is high, as can be seen in the ChainsLoop500-Dice4 model.

8 Related Work

We group our related work into variations of value iteration, compositional verification of MDPs, and multi-objective verification.

Value Iteration. Value iteration, the standard analysis method for MDPs [29], is widely studied. In the undiscounted, indefinite-horizon case we study, value iteration requires an exponential number of iterations in theory, but in practice it typically converges much earlier. This motivates the search for sound termination criteria. Optimistic value iteration [30] is now widely adopted as the default approach [11, 29]. To accelerate VI, various asynchronous variations have been suggested that avoid operating on the complete state space in every iteration. In particular, topological VI [2, 15] and (uni-directional) sequential VI [25, 28, 36] aim to exploit an acyclic structure similar to the one found in uni-directional MDPs.

Sequentially Composed MDPs. The exploitation of compositional structure in MDPs is widely studied. In particular, the sequential composition in our paper is closely related to hierarchical compositions that capture how tasks are often composed of repetitive subtasks [5, 6, 23, 31, 35, 48, 50, 51]. While we study a fully model-based approach, Jothimurugan et al. [33] provide a compositional reinforcement learning method whose sub-goals are induced by specifications. Neary et al. [40] update the learning goals based on the analysis of the component MDPs, but do not consider the possibility of reaching multiple exits. The widespread options framework and variations such as abstract VI [34] aggregate policies into additional actions [14, 49] to speed up the convergence of value iteration; they are often applied in model-free approaches. In the context of OVI, we must converge everywhere, and the bottom-up stopping criterion is not easily lifted to a model-free setting.

Further Related Work. As a different type of compositional reasoning, assume-guarantee reasoning [7, 12, 16, 18, 19, 38, 39] is a central topic, and a compositional probabilistic framework [38] with the parallel composition \(\parallel \) is also based on Pareto curves. Extending string diagrams of MDPs with the parallel composition \(\parallel \) is challenging, but an interesting direction for future work. We also mention VI algorithms over Pareto curves for solving multi-objective simple stochastic games [1, 13]. Due to the multi-objectivity, they maintain a set of points for each state during the iteration; CVI solves single-objective oMDPs determined by weights, and thus maintains a single value for each state.

9 Conclusion

This paper investigates the verification of compositional MDPs, with a particular focus on approximating the behavior of component MDPs via a Pareto cache and on sound stopping criteria for value iteration. The empirical evaluation not only demonstrates the efficacy of the novel algorithms, but also shows the potential for further improvements using asynchronous value iteration, more efficient Pareto cache manipulation, and more powerful compositional stopping criteria.