
1 Introduction

The verification of MDPs is crucial for the design and evaluation of cyber-physical systems with sensor noise, biological and chemical processes, network protocols, and many other complex systems. MDPs are the standard model for sequential decision making under uncertainty and thus at the heart of reinforcement learning. Many dependability evaluation and safety assurance approaches rely in some form on the verification of MDPs with respect to temporal logic properties. Probabilistic model checking [4, 5] provides powerful tools to support this task.

The essential MDP model checking queries are for the worst-case probability that something bad happens (reachability) and the expected resource consumption until task completion (expected rewards). These are indefinite (undiscounted) horizon queries: they ask about the probability or expectation of a random variable up until an event—which forms the horizon—but are themselves unbounded. Many more complex properties internally reduce to solving either reachability or expected rewards. For example, if the description of something bad is given in linear temporal logic (LTL), then a product construction with a suitable automaton reduces the LTL query to reachability [6]. This paper sets out to determine the practically best algorithms for computing indefinite-horizon reachability probabilities and expected rewards; our methodology is an empirical evaluation.

MDP analysis is well studied in many fields and has led to three main types of algorithms: value iteration (VI), policy iteration (PI), and linear programming (LP) [55]. While indefinite horizon queries are natural in a verification context, they differ from the standard problem studied in, e.g., operations research, planning, and reinforcement learning. In those fields, the primary concern is to compute a policy that (often approximately) optimizes the discounted expected reward over an infinite horizon, where rewards accumulated in the future are weighted by a discount factor \(<1\) that exponentially prefers values accumulated earlier.

The lack of discounting in verification has vast implications. The Bellman operation, essentially describing a one-step backward update on expected rewards, is a contraction with discounting, but not a contraction without. This leads to significantly more complex termination criteria for VI-based verification approaches [34]. Indeed, VI runs in polynomial time for every fixed discount factor [49], and similar results are known for PI as well as LP solving with the simplex algorithm [60]. In contrast, VI [9] and PI [20] are known to have exponential worst-case behaviour in the undiscounted case.

So, what is the best algorithm for model checking MDPs? A polynomial-time algorithm exists using an LP formulation and barrier methods for its solution [12]. LP-based approaches (and their extension to MILPs) are also prominent for multi-objective model checking [21], in counterexample generation [23], and for the analysis of parametric Markov chains [16]. However, folklore tells us that iterative methods, in particular VI, are better for solving MDPs. Indeed, variations of VI are the default choice of all model checkers participating in the QComp competition [14]. This uniformity may be misleading. Indeed, for some stochastic game algorithms, using LP to solve the underlying MDPs may be preferable [3, Appendix E.4]. An application in runtime assurance preferred PI for numerical stability [45, Sect. 6]. A toy example from [34] is a famous challenge for VI-based methods. Despite the prominence of LP, the ease of encoding MDPs, and the availability of powerful off-the-shelf LP solvers, until very recently many tools did not include MDP model checking via LP solvers.

With this paper, we reconsider the PI and LP algorithms to investigate whether probabilistic model checking has focused on the wrong family of algorithms. We report the results of an extensive empirical study with two independent implementations in the model checkers Storm [42] and mcsta [37]. We find that, in terms of performance and scalability, optimistic value iteration [40] is a solid choice on the standard benchmark collection (which goes beyond competition benchmarks) but can be beaten quite considerably on challenging cases. We also emphasize the question of precision and soundness. Numerical algorithms, in particular ones that converge in the limit, are prone to delivering wrong results. For VI, the recognition of this problem has led to a series of improvements over the last decade [8, 19, 34, 40, 54, 56]. We show that PI faces a similar problem. When using floating-point arithmetic, additional issues may arise [36, 59]. Our use of various LP solvers exhibits concerning results for a variety of benchmarks. We therefore also include results for exact computation using rational arithmetic.

Limitations of this study. A thorough experimental study of algorithms requires a carefully scoped evaluation. We work with flat representations of MDPs that fit completely into memory (i.e. we ignore the state space exploration process and symbolic methods). We selected algorithms that are tailored to converge to the optimal value. We also exclude approaches that incrementally build and solve (partial or abstract) MDPs using simulation or model checking results to guide exploration: they are an orthogonal improvement and would equally profit from faster algorithms to solve the partial MDPs. Moreover, this study is on algorithms, not on their implementations. To reduce the impact of potential implementation flaws, we use two independent tools where possible. Our experiments ran on a single type of machine—we do not study the effect of different hardware.

Contributions. This paper contributes a thorough overview of how to model check indefinite-horizon properties on MDPs, making MDP model checking more accessible, but also pushing the state of the art by clarifying open questions. Our study builds upon a thorough empirical evaluation using two independent code bases; it sources benchmarks from the standard benchmark suite and recent publications, compares 10 LP solvers, and studies the influence of various prominent preprocessing techniques. The paper provides new insights and reviews folklore statements. Particular highlights are a new simple but challenging MDP family that leads to wrong results on all floating-point LP solvers (Section 2.3), a negative result regarding the soundness of PI with epsilon-precise policy evaluators (Section 4), and an evaluation on numerically challenging benchmarks that shows the limitations of value iteration in a practical setting (Section 5.3).

2 Background

We recall MDPs with reachability and reward objectives, describe solution algorithms and their guarantees, and address commonly used optimizations.

2.1 Markov Decision Processes

Let \(\textsf{D}_{\!X} {:}{=}\{\,\textsf{d}:X \rightarrow [0,1] \mid \sum _{x\in X} \textsf{d}(x) = 1\,\} \) be the set of distributions over X. A Markov decision process (MDP) [55] is a tuple \(\mathcal {M}= (\textsf{S},\textsf{A},\delta )\) with finite sets of states \(\textsf{S}\) and actions \(\textsf{A}\), and a partially defined transition function \(\delta :\textsf{S}\times \textsf{A}\rightharpoonup \textsf{D}_\textsf{S}\) such that \(\textsf{A}(s) {:}{=}\{\,a \mid (s,a) \in domain (\delta )\,\} \ne \emptyset \) for all \(s \in \textsf{S}\). \(\textsf{A}(s)\) is the set of enabled actions at state s. \(\delta \) maps enabled state-action pairs to distributions over successor states. A Markov chain (MC) is an MDP with \(|\textsf{A}(s)| = 1\) for all s. The semantics of an MDP are defined in the usual way, see, e.g. [6, Chapter 10]. A (memoryless deterministic) policy—a.k.a. strategy or scheduler—is a function \(\pi :\textsf{S}\rightarrow \textsf{A}\) that, intuitively, given the current state s prescribes what action \(a \in \textsf{A}(s)\) to play. Applying a policy \(\pi \) to an MDP induces an MC \(\mathcal {M}^{\pi }\). A path in this MC is an infinite sequence \(\rho = s_1 s_2 \ldots \) with \(\delta (s_i, \pi (s_i))(s_{i+1}) > 0\). \(\textsf{Paths}\) denotes the set of all paths and \(\mathbb {P}^\pi _s\) denotes the unique probability measure of \(\mathcal {M}^\pi \) over infinite paths starting in the state s.

A reachability objective \(\textrm{P}_{{\!}\textsf{opt}}(\textsf{T})\) with set of target states \(\textsf{T}\subseteq \textsf{S}\) and \(\textsf{opt}\in \{\textrm{max},\textrm{min}\}\) induces a random variable \(X:\textsf{Paths}\rightarrow [0,1]\) over paths by assigning 1 to all paths that eventually reach the target and 0 to all others. \(\textrm{E}_{\textsf{opt}}(\textsf{rew})\) denotes an expected reward objective, where \(\textsf{rew}:\textsf{S}\rightarrow \mathbb {Q}_{\ge 0}\) assigns a reward to each state. \(\textsf{rew}(\rho ) {:}{=}\sum _{i \ge 1} \textsf{rew}(s_i)\) is the accumulated reward of a path \(\rho = s_1 s_2 \dots \). This yields a random variable \(X:\textsf{Paths}\rightarrow \mathbb {Q}\cup \{\infty \}\) that maps paths to their reward. For a given objective and its random variable X, the value of a state \(s\in \textsf{S}\) is the expectation of X under the probability measure \(\mathbb {P}^\pi _s\) of the MC induced by an optimal policy \(\pi \) from the set of all policies \(\Pi \), formally \(\textsf{V}(s) {:}{=}\textsf{opt}_{\pi \in \Pi }\, \mathbb {E}^{\pi }_s[X]\).

2.2 Solution Algorithms

Value iteration (VI), e.g. [15], computes a sequence of value vectors converging to the optimum in the limit. In all variants of the algorithm, we start with a function \(x:~\textsf{S}\rightarrow \mathbb {Q}\) that assigns to every state an estimate of the value. The algorithm repeatedly performs an update operation to improve the estimates. After some preprocessing, this update operation has \(x = \textsf{V}\) as its unique fixpoint. Thus, value iteration converges to the value in the limit. Variants of VI include interval iteration [34], sound VI [56], and optimistic VI [40]. We do not discuss these in detail, but instead refer to the respective papers.
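To make this concrete, the following is a minimal sketch of VI for \(\textrm{P}_{{\!}\textrm{max}}(\textsf{T})\) in Python; the nested-dictionary representation of \(\delta \) is purely illustrative and not tied to any of the tools used later. It deliberately uses the naive stopping criterion whose unsoundness is discussed in Section 2.3.

```python
# Minimal value iteration sketch for P_max(T).  The MDP is given as
# delta[s][a] = {successor: probability}; `target` is the set T.
# The stopping criterion below is the naive one: a small change between
# iterations does not bound the distance to the true values.
def value_iteration(delta, target, epsilon=1e-6):
    values = {s: (1.0 if s in target else 0.0) for s in delta}
    while True:
        new_values, diff = {}, 0.0
        for s, actions in delta.items():
            if s in target:
                new_values[s] = 1.0
                continue
            # Bellman backup: best one-step lookahead over all enabled actions
            new_values[s] = max(
                sum(p * values[succ] for succ, p in dist.items())
                for dist in actions.values()
            )
            diff = max(diff, abs(new_values[s] - values[s]))
        values = new_values
        if diff < epsilon:  # naive criterion; sound variants (II, SVI, OVI) replace this test
            return values
```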

Linear programming (LP), e.g. [6, Chapter 10], encodes the transition structure of the MDP and the objective as a linear optimization problem. For every state, the LP has a variable representing an estimate of its value. Every state-action pair is encoded as a constraint on these variables, as are the target set or rewards. The LP has a unique optimal solution, in which every state's variable equals the value of that state. We provide an in-depth discussion of theoretical and practical aspects of LP in Section 3.

Policy iteration (PI), e.g. [11, Section 4], computes a sequence of policies. Starting with an initial policy, we evaluate its induced MC, improve the policy by switching suboptimal choices and repeat the process on the new policy. As every policy improves the previous one and there are only finitely many memoryless deterministic policies (a number exponential in the number of states), eventually we obtain an optimal policy. We further discuss PI in Section 4.

2.3 Guarantees

Given the stakes in many application domains, we require guarantees about the relation between an algorithm’s result \(\bar{v}\) and the true value v. First, implementations are subject to floating-point errors and imprecision [59] unless they use exact (rational) arithmetic or safe rounding [36]. This can result in arbitrary differences between \(\bar{v}\) and v. Second are the algorithm’s inherent properties: VI is an approximating algorithm that converges to the true value only in the limit. In theory, it is possible to obtain the exact result by rounding after exponentially many iterations [15]; in practice, this results in excessive runtime. Instead, for years, implementations used a naive stopping criterion that could return arbitrarily wrong results [33]. This problem’s discovery sparked the development of sound variants of VI [8, 19, 34, 40, 54, 56], including interval iteration, sound value iteration, and optimistic value iteration. A sound VI algorithm guarantees \(\varepsilon \)-precise results, i.e. \(|v - \bar{v}| \le \varepsilon \) or \(|v - \bar{v}| \le v \cdot \varepsilon \). For LP and PI, the guarantees have not yet been thoroughly investigated. Theoretically, both are exact, but implementations are often not. We discuss the problems in Sections 3 and 4.

Fig. 1. A hard MDP for all algorithms

Table 1. Correct results

The handcrafted MC of [33, Figure 2] highlights the lack of guarantees of VI: standard implementations return vastly incorrect results. We extended it with action choices to obtain the MDP \(M_n\) shown in Fig. 1 for \(n \in \mathbb {N}\), \(n \ge 2\). It has \(2n+1\) states; we compute \(\textrm{P}_{{\!}\textrm{min}}(\{\,n\,\})\) and \(\textrm{P}_{{\!}\textrm{max}}(\{\,n\,\})\). The policy that chooses action m wherever possible induces the MC of [33, Figure 2] with \(( \textrm{P}_{{\!}\textrm{min}}(\{\,n\,\}), \textrm{P}_{{\!}\textrm{max}}(\{\,n\,\}) ) = ( \frac{1}{2}, \frac{1}{2} )\). In every state s with \(0< s < n\), we added the choice of action j that jumps to n and . With that, the (optimal) values over all policies are \(( \frac{1}{3}, \frac{2}{3} )\). In VI, starting from value 0 for all states except n, initially taking j everywhere looks like the best policy for \(\textrm{P}_{\textrm{max}}\). As updated values slowly propagate, state-by-state, m becomes the optimal choice in all states except \(-n+1\). We thus layered a “deceptive” decision problem on top of the slow convergence of the original MC. For \(n = 20\), VI with Storm and mcsta deliver the incorrect results \(( 0.247, 0.500 )\). For Storm ’s PI and various LP solvers, we show in Table 1 the largest n for which they return a \(\pm \,0.01\)-correct result. For larger n, PI and all LP solvers claim \(\approx ( \frac{1}{2}, \frac{1}{2} )\) as the correct solution except for Glop and GLPK which only fail for the maximum at the given n; for the minimum, they return the wrong result at \(n \ge 29\) and 52, respectively. Sound VI algorithms and Storm ’s exact-arithmetic engine produce (\(\varepsilon \)-)correct results, though the former at excessive runtime for larger n. We used default settings for all tools and solvers.

2.4 Optimizations

VI, LP, and PI can all benefit from the following optimizations:

Graph-theoretic algorithms can be used for qualitative analysis of the MDP, i.e. finding states with value 0 or (only for reachability objectives) 1. These qualitative approaches are typically much faster than the numerical computations for quantitative analysis. Thus, we always apply them first and only run the numerical algorithms on the remaining states with non-trivial values. A sketch of the value-0 precomputation for reachability is given after this list of optimizations.

Topological methods, e.g. [17], do not consider the whole MDP at once. Instead, they first compute a topological ordering of the strongly connected components (SCCs) and then analyze each SCC individually. This can improve the runtime, as we decompose the problem into smaller subproblems. The subproblems can be solved with any of the solution methods. Note that when considering acyclic MDPs, the topological approach does not need to call the solution methods, as the resulting values can immediately be backpropagated.

Collapsing of maximal end components (MECs), e.g., [13, 34], transforms the MDP into one with equivalent values but simpler structure. After collapsing MECs, the MDP is contracting, i.e. we almost surely reach a target state or a state with value zero. VI algorithms rely on this property for convergence [34, 40, 56]. For PI and LP, simplifying the graph structure before applying the solution method can speed up the computation.

Warm starts, e.g. [26, 46], may adequately initialize an algorithm, i.e., we may provide it with some prior knowledge so that the computation has a good starting point. We implement warm starts by first running VI for a limited number of iterations and using the resulting estimate to guess bounds on the variables in an LP or a good initial policy for PI. See Sections 3 and 4 for more details.
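As promised above, here is a minimal sketch of the graph-theoretic precomputation of value-0 states for \(\textrm{P}_{{\!}\textrm{max}}(\textsf{T})\): a state has value 0 if and only if it cannot reach \(\textsf{T}\) in the underlying graph, which a backward search detects. The nested-dictionary layout of \(\delta \) is again illustrative only.

```python
# Qualitative precomputation sketch: for P_max(T), a state has value 0 iff it
# cannot reach T in the underlying graph.  We collect all states that can reach
# T via a backward search over predecessors and return the complement.
from collections import defaultdict

def prob0_states(delta, target):
    predecessors = defaultdict(set)
    for s, actions in delta.items():
        for dist in actions.values():
            for succ, prob in dist.items():
                if prob > 0:
                    predecessors[succ].add(s)
    can_reach = set(target)
    todo = list(target)
    while todo:
        for pred in predecessors[todo.pop()]:
            if pred not in can_reach:
                can_reach.add(pred)
                todo.append(pred)
    return set(delta) - can_reach  # these states get value 0 and can be removed
```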

3 Practically Solving MDPs Using Linear Programs

This section considers the LP-based approach to solving the optimal policy problem in MDPs. To the best of our knowledge, this is the only polynomial-time approach. We discuss various configurations. These configurations are combinations of the LP formulation, the choice of software, and its parameterization.

3.1 How to encode MDPs as LPs?

For objective \(\textrm{P}_{{\!}\textrm{max}}(\textsf{T})\) we formulate the following LP over variables \(x_s\), \(s \in \textsf{S}\setminus \textsf{T}\):

$$\begin{aligned} \text {minimize}\quad&\sum _{s \in \textsf{S}\setminus \textsf{T}} x_s \quad \text {s.t. } lb(s) \le x_s \le ub(s) \quad \text {and} \\ \quad&x_s \ge \sum _{s' \in \textsf{S}\setminus \textsf{T}}\delta (s,a)(s') \cdot x_{s'} + \sum _{t \in \textsf{T}} \delta (s,a)(t) \quad \text { for all } s \in \textsf{S}\setminus \textsf{T}, a \in \textsf{A}(s) \end{aligned}$$

We assume bounds \(lb(s) = 0\) and \(ub(s) = 1\) for \(s \in \textsf{S}\setminus \textsf{T}\). The unique solution \(\eta :\{\,x_s \mid s \in \textsf{S}\setminus \textsf{T}\,\} \rightarrow [0,1]\) to this LP coincides with the desired objective values \(\eta (x_s) = V(s)\). Objectives \(\textrm{P}_{{\!}\textrm{min}}(\textsf{T})\) and \(\textrm{E}_{\textsf{opt}}(\textsf{rew})\) have similar encodings: minimizing policies require maximisation in the LP and flipping the constraint relation. Rewards can be added as an additive factor on the right-hand side. For practical purposes, the LP formulation can be tweaked.
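Before turning to these tweaks, the following sketch builds and solves the basic encoding for \(\textrm{P}_{{\!}\textrm{max}}(\textsf{T})\) with an off-the-shelf solver, here SciPy's interface to HiGHS; the nested-dictionary MDP layout is illustrative, and the trivial bounds [0, 1] are used.

```python
# Sketch of the LP encoding for P_max(T), solved with scipy.optimize.linprog.
# Variables x_s exist for non-target states only.  Each constraint
#   x_s >= sum_{s' not in T} delta(s,a)(s') * x_{s'} + sum_{t in T} delta(s,a)(t)
# is rewritten into the A_ub @ x <= b_ub form expected by linprog:
#   -x_s + sum_{s' not in T} delta(s,a)(s') * x_{s'} <= -sum_{t in T} delta(s,a)(t)
import numpy as np
from scipy.optimize import linprog

def max_reachability_lp(delta, target):
    variables = [s for s in delta if s not in target]
    idx = {s: i for i, s in enumerate(variables)}
    A_ub, b_ub = [], []
    for s in variables:
        for dist in delta[s].values():
            row = np.zeros(len(variables))
            row[idx[s]] -= 1.0
            rhs = 0.0
            for succ, prob in dist.items():
                if succ in target:
                    rhs -= prob
                else:
                    row[idx[succ]] += prob
            A_ub.append(row)
            b_ub.append(rhs)
    c = np.ones(len(variables))  # minimize the sum over all variables
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0.0, 1.0)] * len(variables), method="highs")
    return {s: res.x[idx[s]] for s in variables}
```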

The choice of bounds. Any bounds that respect the unique solution will not change the answer. That is, any \(lb\) and \(ub\) with \(0 \le lb(s) \le V(s) \le ub(s)\) yield a sound encoding. While these additional bounds are superfluous, they may significantly prune the search space. We investigate trivial bounds, e.g., knowing that all probabilities are in [0, 1], bounds from a structural analysis as discussed by [8], and bounds induced by a warm start of the solver. For the latter, if we have obtained values \(V' \le V\), e.g., induced by a suboptimal policy, then \(V'(s)\) is a lower bound on the value \(x_s\), which is particularly relevant as the LP minimizes.

Equality for unique actions. Markov chains, i.e., MDPs where \(|\textsf{A}(s)| = 1\) for all states s, can be solved using linear equation systems. The LP encoding uses one-sided inequalities and the objective function to incorporate nondeterministic choices. We investigate adding the following constraints for all states with a unique action:

$$\begin{aligned} x_s \le \sum _{s' \in \textsf{S}\setminus \textsf{T}}\delta (s,a)(s') \cdot x_{s'} + \sum _{t \in \textsf{T}} \delta (s,a)(t) \quad \text { for all } s \in \textsf{S}\setminus \textsf{T}\text { with } \textsf{A}(s) = \{a\} \end{aligned}$$

These additional constraints may trigger different optimizations in a solver, e.g., some solvers use Gaussian elimination for variable elimination.

A simpler objective. The standard objective assures the solution \(\eta \) is optimal for every state, whereas most invocations require only optimality in some specific states – typically the initial state \(s_0\) or the entry states of a strongly connected component. In that case, the objective may be simplified to optimize only the value for those states. This potentially allows for multiple optimal solutions: in terms of the MDP, it is no longer necessary to optimize the value for states that are not reached under the optimal policy.

Encoding the dual formulation. Encoding the dual formulation of the LP is interesting for mixed-integer extensions of the LP, relevant for computing, e.g., policies in POMDPs [47], or when computing minimal counterexamples [58]. For LPs, due to strong duality, the internal representation in the solvers we investigated is (almost) equivalent, and all solvers support solving both the primal and the dual representation. We therefore do not consider the dual formulation further.

3.2 How to solve LPs with existing solvers?

We rely on the performance of state-of-the-art LP solvers. Many solvers have been developed and are still actively advanced; see [2] for a recent comparison on general benchmarks. We list the LP solvers that we consider for this work in Table 2. The columns summarize for each solver the type of license, whether it uses exact or floating-point arithmetic, whether it supports multithreading, and what type of algorithms it implements. We also list whether the solver is available from the two model checkers used in this study.

Methods. We briefly explain the available methods and refer to [12] for a thorough treatment. Broadly speaking, the LP solvers use one of two families of methods. Simplex-based methods rely on highly efficient pivot operations to move between vertices of the simplex of feasible solutions. Simplex can be executed either in the primal or the dual fashion, which changes the direction of progress made by the algorithm. Our LP formulation has more constraints than variables, which generally means that the dual version is preferable. Interior-point methods, often in the form of barrier methods, do not need to follow the vertices of this simplex. These methods may achieve polynomial worst-case runtime. It is generally claimed that simplex has superior average-case performance but is highly sensitive to perturbations, while interior-point methods perform more robustly.

Warm starts. LP-based model checking can use two types of warm starts: either providing a (feasible) basis point as done in [26], or providing bounds. The former, however, comes with various caveats and limitations, such as the requirement to disable preprocessing. We therefore only use warm starts via bounds as discussed above.

Multithreading. We generally see two types of parallelisation in LP solvers. Some solvers support a portfolio approach that runs different algorithms concurrently and finishes with the first one that yields a result. Other solvers parallelize the interior-point and/or simplex methods themselves.

Table 2. Available LP solvers (“intr” = interior point)

Guarantees for numerical LP solvers. All LP solvers allow tweaking of various parameters, including tolerances that govern when a point is considered feasible or optimal. The experiments in Table 1 already indicate that these tolerances do not yield absolute guarantees. A limited experiment indicated that reducing the tolerances towards zero removed some incorrect results, but not all.
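For illustration, this is how such parameters can be tightened via Gurobi's Python bindings; the parameter names are Gurobi's, the chosen values are ours and purely illustrative, and, as just noted, tightening them does not guarantee correct results.

```python
# Sketch: tightening Gurobi's numerical parameters before solving the LP.
import gurobipy as gp

model = gp.Model("mdp-lp")
model.setParam("FeasibilityTol", 1e-9)  # primal feasibility tolerance (default 1e-6)
model.setParam("OptimalityTol", 1e-9)   # dual feasibility ("optimality") tolerance
model.setParam("NumericFocus", 3)       # trade speed for numerical care
model.setParam("Method", 2)             # 2 = barrier; 0/1 = primal/dual simplex; -1 = auto
model.setParam("Threads", 4)
# ... add variables, constraints, and the objective as in the encoding of
# Section 3.1, then call model.optimize() and read each variable's X attribute.
```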

Exact solving. SoPlex supports exact computations, with a Boost library wrapping GMP rationals [22], after a floating-point arithmetic-based startup phase [27]. While this combination is beneficial for performance in most settings, it leads to crashes for the numerically challenging models. Z3 supports only exact arithmetic (also wrapping GMP numbers with their own interface). We observe that the price of converting large rational numbers may be substantial. SMT solvers like Z3 use a simplex variation [18] tailored towards finding feasible points and in an incremental fashion, optimized for problems with a nontrivial Boolean structure. In contrast, our LP formulation is easily feasible and is a pure conjunction.

4 Sound Policy Iteration

Starting with an initial policy, PI-based algorithms iteratively improve the policy based on the values obtained for the induced MC. The algorithm for solving the induced MC crucially affects the performance and accuracy of the overall approach. This section addresses the solvers available in Storm, possible precision issues, and how to utilize a warm start, while Section 5 discusses PI performance.

Markov chain solvers. To solve the induced MC, Storm can employ all linear equation solvers listed in [42] and all implemented variants of VI. In our experiments, we consider (i) the generalized minimal residual method (GMRES) [57] implemented in GMM++  [25], (ii) VI [15] with a standard (relative) termination criterion, (iii) optimistic VI (OVI) [40], and (iv) the sparse LU decomposition implemented in Eigen  [31] using either floating-point or exact arithmetic (LU\(^\textrm{X}\)). LU and LU\(^\textrm{X}\) provide exact results (modulo floating-point errors in LU) while OVI yields \(\varepsilon \)-precise results. VI and GMRES do not provide any guarantees.
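The overall loop can be sketched as follows, with the induced Markov chain solved directly as a sparse linear system \((I - P_\pi )\,x = b\) (the role played by LU above; GMRES would be the iterative analogue). The nested-dictionary layout and helper names are illustrative and not Storm's API; we assume the usual preprocessing (probability-0 states removed, MECs collapsed) so that each induced system is non-singular.

```python
# Minimal policy iteration sketch for P_max(T) with policy evaluation via a
# direct sparse solve of the induced Markov chain.
import numpy as np
from scipy.sparse import identity, lil_matrix
from scipy.sparse.linalg import spsolve

def evaluate(delta, target, policy):
    """Solve (I - P_pi) x = b for the Markov chain induced by `policy`."""
    non_target = [s for s in delta if s not in target]
    idx = {s: i for i, s in enumerate(non_target)}
    P, b = lil_matrix((len(non_target), len(non_target))), np.zeros(len(non_target))
    for s in non_target:
        for succ, prob in delta[s][policy[s]].items():
            if succ in target:
                b[idx[s]] += prob
            else:
                P[idx[s], idx[succ]] += prob
    x = spsolve((identity(len(non_target)) - P).tocsr(), b)
    values = {s: 1.0 for s in target}
    values.update({s: x[idx[s]] for s in non_target})
    return values

def policy_iteration(delta, target):
    policy = {s: next(iter(delta[s])) for s in delta}  # arbitrary initial policy
    while True:
        values = evaluate(delta, target, policy)
        improved = False
        for s in delta:
            if s in target:
                continue
            # greedy improvement: switch to the action with the best one-step backup
            q = {a: sum(p * values[t] for t, p in dist.items())
                 for a, dist in delta[s].items()}
            best = max(q, key=q.get)
            if q[best] > values[s] + 1e-12:  # small slack guards against fp noise
                policy[s], improved = best, True
        if not improved:
            return values, policy
```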

Correctness of PI. The accuracy of PI is affected by the MC solver. Firstly, PI cannot be more precise than its underlying solver: the result of PI has the same precision as the result obtained for the final MC. Secondly, inaccuracies by the solver can hide policy improvements; this may lead to premature convergence with a sub-optimal policy. We show that PI can return arbitrarily wrong results—even if the intermediate results are \(\varepsilon \)-precise:

Fig. 2. Example MDP

Consider the MDP in Fig. 2 with objective \(\textrm{P}_{{\!}\textrm{max}}(\{\,G\,\})\). There is only one nondeterministic choice, namely in state \(s_0\). The optimal policy is to pick \(\textsf{b}\), obtaining a value of 0.5. Picking \(\textsf{a}\) only yields 0.1. However, when starting from the initial policy \(\pi (s_0)=\textsf{a}\), an \(\varepsilon \)-precise MC solver may return \(0.1 + \varepsilon \) for both \(s_0\) and \(s_1\) and for \(s_2\). This solution is indeed \(\varepsilon \)-precise. However, when evaluating which action to pick in \(s_0\), we can choose \(\delta \) such that \(\textsf{a}\) seems to obtain a higher value. Concretely, we require . For every \(\varepsilon > 0\), this can be achieved by setting \(\delta < 2.5 \cdot \varepsilon \). In this case, PI would terminate with the final policy inducing a severely suboptimal value.

If every Markov chain is solved precisely, PI is correct. Indeed, it suffices to be certain that one action is better than all others. This is the essence of modified policy iteration as described in [55, Chapters 6.5 and 7.2.6]. Similarly, [46, Section 4.2] suggests to use interval iteration when solving the system induced by the current policy and stopping when the under-approximation of one action is higher than the over-approximation of all other actions.

Warm starts. PI profits from being provided a good initial policy. If the initial policy is already optimal, PI terminates after a single iteration. We can inform our choice of the initial policy by providing estimates for all states as computed by VI. For every state, we choose the action that is optimal according to the estimate. This is a good way to leverage VI’s ability to quickly deliver good estimates [40], while at the same time providing the exactness guarantees of PI.
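A warm start can then be sketched in two lines on top of the earlier VI sketch: run VI cheaply (e.g., with a loose threshold or a bounded number of iterations) and extract the greedy policy from its estimates; the helper below is illustrative, not Storm's API.

```python
# Warm-start sketch: extract the greedy policy from (imprecise) VI estimates.
def greedy_policy(delta, estimates):
    return {s: max(actions, key=lambda a: sum(p * estimates[t] for t, p in actions[a].items()))
            for s, actions in delta.items()}

# estimates = value_iteration(delta, target, epsilon=1e-2)  # cheap, imprecise run
# initial_policy = greedy_policy(delta, estimates)          # starting point for policy_iteration
```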

5 Experimental Evaluation

To understand the practical performance of the different algorithms, we performed an extensive experimental evaluation. We used three sets of benchmarks: all applicable benchmark instances from the Quantitative Verification Benchmark Set (QVBS) [41] (the qvbs set), a subset of hard QVBS instances (the hard set), and numerically challenging models from a runtime monitoring application [45] (the premise set, named for the corresponding prototype). We consider two probabilistic model checkers, Storm [42] and the Modest Toolset’s [37] mcsta. We used Intel Xeon Platinum 8160 systems running 64-bit CentOS Linux 7.9, allocating 4 CPU cores and 32 GB RAM to each experiment unless noted otherwise.

We plot algorithm runtimes in seconds in quantile plots as on the left and scatter plots as on the right of Fig. 3. The former compare multiple tools or configurations; for each, we sort the instances by runtime and plot the corresponding monotonically increasing line. Here, a point \(( x, y )\) on the line for configuration a means that the x-th fastest instance solved by a took y seconds. The latter compare two tools or configurations. Each point \(( x, y )\) is for one benchmark instance: the x-axis tool took x seconds and the y-axis tool took y seconds to solve it. The shape of points indicates the model type; the mapping from shapes to types is the same for all scatter plots and is only given explicitly in the first one in Fig. 3. Additional plots to support the claims in this section are provided in the appendix of the full version [39] of this paper.

The depicted runtimes are for the respective algorithm and all necessary and/or stated preprocessing, but do not include the time for constructing the MDP state spaces (which is independent of the algorithms). mcsta reports all time measurements rounded to multiples of 0.1 s. We summarize timeouts, out-of-memory failures, errors, and incorrect results as “n/a”. Our timeout is 30 minutes for the algorithm and 45 minutes for total runtime including MDP construction. We consider a result \(\bar{v}\) incorrect if \(|v - \bar{v}| > v \cdot 10^{-3}\) (i.e. relative error \(10^{-3}\)) whenever a reference result v is available. However, we do not flag a result as incorrect if v and \(\bar{v}\) are both below \(10^{-8}\) (relevant for the premise set). Nevertheless, we configure the (unsound) convergence threshold for VI as \(10^{-6}\) relative; among the sound VI algorithms, we include OVI with a (sound) stopping criterion of relative \(10^{-6}\) error. To only achieve the \(10^{-3}\) precision we actually test, OVI could thus be even faster than it appears in our plots. We make this difference to account for the fact that many algorithms, including the LP solvers, do not have a sound error criterion. We mark exact algorithms/solvers that use rational arithmetic with a superscript \(^\textrm{X}\). The other configurations use floating-point arithmetic (fp).
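The correctness check just described can be stated compactly; the small helper below encodes exactly this criterion (the function name is ours).

```python
# A result is flagged as incorrect if its relative error w.r.t. the reference
# value exceeds 1e-3, unless both values are below 1e-8 (relevant for premise).
def is_incorrect(result, reference, rel_err=1e-3, zero_cutoff=1e-8):
    if reference is None:  # no reference result available
        return False
    if abs(reference) < zero_cutoff and abs(result) < zero_cutoff:
        return False
    return abs(reference - result) > abs(reference) * rel_err
```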

5.1 The QVBS Benchmarks

The qvbs set comprises all QVBS benchmark instances with an MDP, Markov automaton (MA), or probabilistic timed automaton (PTA) model and a reachability or expected reward/time objective that is quantitative, i.e. not a query that yields a zero or one probability. We only consider instances where both Storm and mcsta can build the explicit representation of the MDP within 15 minutes. This yields 367 instances. We obtain reference results for 344 of them from either the QVBS database or by using one of Storm’s exact methods. We found all reference results obtained via different methods to be consistent.

Fig. 3. Comparison of LP solver runtime on the qvbs set

For LP, we have various solvers, each with various parameters, cf. Section 3. For conciseness, we first compare all available LP solvers on the qvbs set. For the best-performing solver, we then evaluate the benefit of different solver configurations. We do the same for the choice of Markov chain solution method in PI. We then study these single, reasonable setups for LP and PI in more detail.

LP solver comparison. The left-hand plot of Fig. 3 summarizes the results of our comparison of the different LP solvers. Subscripts \(_\textsf {s} \) and \(_\textsf {m} \) indicate whether the solver is embedded in Storm or mcsta, respectively. We apply no optimizations or reductions to the MDPs except for the precomputation of probability-0 states (and in Storm also of probability-1 states), and use the default settings for all solvers, with the trivial variable bounds [0, 1] and \([0, \infty )\) for probabilities and expected rewards, respectively. We include VI as a baseline. In Table 3, we summarize the results.

Table 3. LP summary

In terms of performance and scalability, Gurobi solves the highest number of benchmarks in any given time budget, closely followed by COPT. CPLEX, HiGHS, and Mosek make up a middle-class group. While the exact solver Z3 is very slow, SoPlex ’s exact mode actually competes with some fp solvers. However, the quantile plots do not tell the whole story. On the right of Fig. 3, we compare COPT and Gurobi directly: each has a large number of instances on which it is (much) better.

In terms of reliability of results, the exact solvers, as expected, produce no incorrect results; neither does the slowest fp solver, lp_solve. COPT, CPLEX, HiGHS, Mosek, and fp-SoPlex perform badly on this metric, producing more errors than VI. Interestingly, these are mostly the faster solvers, the exception being Gurobi.

Overall, Gurobi achieves the highest performance at decent reliability; in the remainder of this section, we thus use \(\textsf {Gurobi} _\textsf {s} \) whenever we apply non-exact LP.

Fig. 4. Performance impact of LP problem formulation variants (using \(\textsf {Gurobi} _\textsf {s} \))

LP solver tweaking. Gurobi can be configured to use an “auto” portfolio approach, potentially running multiple algorithms concurrently on multiple threads, a primal or a dual simplex algorithm, or a barrier method algorithm. We compared each option with 4 threads and found no significant performance difference. Similarly, running the auto method with 1, 4, and 16 threads (only here, we allocate 16 threads per experiment) also failed to bring out noticeable performance differences. Using more threads results in a few more out-of-memory errors, though. We thus fix Gurobi on auto with 4 threads.

Fig. 4 shows the performance impact of supplying Gurobi with more precise bounds on the variables for expected reward objectives using methods from [8, 51] (“bounds” instead of “simple”), of optimizing only for the initial state (“init”) instead of the sum over all states (“all”), and of using equality (“eq”) instead of less-/greater-than-or-equal (“ineq”) for unique-action states. More precise bounds yield a very small improvement at essentially no cost. Optimizing for the initial state only results in slightly better overall performance (in the “pocket” in the quantile plot around \(x = 315\) that is also clearly visible in the scatter plot). However, it also results in 2 more incorrect results on the qvbs set. Using equality for unique actions noticeably decreases performance and increases the incorrect result count by 9 instances. For all experiments that follow, we thus use the more precise bounds, but do not enable the other two optimizations.


PI methods comparison. The main choice in PI is which algorithm to use to solve the induced Markov chains. On the right, we show the performance of the different algorithms available in Storm (cf. Section 4). LU\(^\textrm{X}\) yields a fully exact PI. Interestingly, it performs better than the fp version, potentially because fp errors induce spurious policy changes. The same effect likely also hinders the use of OVI, whereas VI leads to good performance. Nevertheless, GMRES is best overall, and is thus our choice for all following experiments with non-exact PI. VI and GMRES yield 6 and 4 incorrect results, respectively. OVI and the exact methods are always correct on this benchmark set.

Best MDP algorithms for QVBS. We now compare all MDP model checking algorithms on the qvbs set: with floating-point numbers, LP and PI configured as described above, plus unsound VI, sound OVI, and the warm-start variants of PI and LP denoted “VI2PI” and “VI2LP”, respectively. Exact results are provided by rational search (RS, essentially an exact version of VI) [50], PI with exact LU, and LP with exact solvers (SoPlex and Z3). All are implemented in Storm.

Fig. 5. Comparison of MDP model checking algorithms on the qvbs set

In a first experiment, we evaluated the impact of using the topological approach and of collapsing MECs (cf. Section 2.4). The results, for which we omit plots, are that the topological approach noticeably improves performance and scalability for all algorithms, and we therefore always use it from now on. Collapsing MECs is necessary to guarantee termination of OVI, while for the other algorithms it is a potential optimization; however, we found it to have only a minimal positive performance impact overall. Since it is required by OVI and does not reduce performance, we also always use it from now on.

Fig. 5 shows the complete comparison of all the methods on the qvbs set, for fp algorithms on the left and exact solutions on the right. Among the fp algorithms, OVI is clearly the fastest and most scalable; plain VI is somewhat faster on many instances but incurs several incorrect results that diminish its appearance in the quantile plot. OVI is additionally special among these algorithms in that it is sound, i.e. provides guaranteed \(\varepsilon \)-correct results—though up to fp rounding errors, which can be eliminated following the approach of [36]. On the exact side, PI with an inexact-VI warm start works best. The scatter plot in Fig. 6(a) shows the performance impact of computing an exact instead of an approximate solution.

Fig. 6. Additional direct performance comparisons

Fig. 7. Comparison of MDP model checking algorithms on the hard subset

5.2 The Hard QVBS Benchmarks

The QVBS contains many models built for tools that use VI as the default algorithm. The other algorithms may actually be important to solve key challenging instances where VI/OVI perform badly, but this contribution could be hidden in the sea of instances that are trivial for VI. We thus zoom in on a selection of QVBS instances that appear “hard” for VI: those where VI takes longer than the prior MDP state space construction phase in both Storm and mcsta, and additionally both phases together take at least 1 s. These are 18 of the previously considered 367 instances.

In Fig. 7, we show the behaviour of all the algorithms on this hard subset. OVI again works better than VI due to the incorrect results that VI returns. We see that the performance and scalability gap between the algorithms has narrowed; although OVI still “wins”, LP in particular is much closer than on the full qvbs set. We also investigated the LP outcomes with solvers other than Gurobi: even on this set, Gurobi and COPT remain the fastest and most scalable solvers. With mcsta, in the basic configuration, they solve 16 and 17 instances, with the slowest taking 835 s and 1334 s, respectively; with the topological optimization, the numbers become 17 and 15 instances with the slowest at 1373 s and 1590 s. We show the detailed comparison of OVI and LP in Fig. 6(c), noting that there are a few instances where LP is much faster, and repeat the comparison between the best fp and exact algorithms (Fig. 6(b)).

Fig. 8. Comparison of MDP model checking algorithms on the premise set

5.3 The Runtime Monitoring Benchmarks

While the QVBS is intentionally diverse, our third set of benchmarks is intentionally focused: We study 200 MDPs from a runtime monitoring study [45]. The original problem is to compute the normalized risk of continuing to operate the system being monitored subject to stochastic noise, unobservable and uncontrollable nondeterminism, and partial state observations. This is a query for a conditional probability. It is answered via probabilistic model checking by unrolling an MDP model along an observed history trace of length \(n \in \{\,50, \ldots , 1000\,\}\) following the approach of Baier et al. [7]. The MDPs contain many transitions back to the initial state, ultimately resulting in numerically challenging instances (containing structures similar to the one of \(M_n\) in Section 2.3). We were able to compute a reference result for all instances.

Fig. 8 compares the different MDP model checking algorithms on this set. In line with the observations in [45], we see very different behaviour compared to the QVBS. Among the fp solutions on the left, LP with Gurobi terminates very quickly (under 1 s), and either produces a correct (155 instances) or a completely incorrect result (mostly 0, on 45 instances). VI behaves similarly, but is slower. OVI, in contrast, delivers no incorrect result, but instead fails to terminate on all but 116 instances. In the exact setting, warm starts using VI inherit its relative slowness and consequently do not pay off. Exact PI outperforms both exact LP solvers. In the case of exact SoPlex, out of the 112 instances it does not manage to solve, 98 are crashes likely related to a confirmed bug in its current version.

The premise set highlights that the best MDP model checking algorithm depends on the application. Here, in the fp case, LP appears best but produces unreliable (incorrect) results; the seemingly much worse OVI at least does not do so. Given the numeric challenge, an exact method should be chosen, and we show that these actually perform well here.

6 Conclusion

We thoroughly investigated the state of the art in MDP model checking, showing that there is no single best algorithm for this task. For benchmarks which are not numerically challenging, OVI is a sensible default, closely followed by PI and LP with a warm start—although using the latter two means losing soundness, as confirmed by a number of incorrect results in our experiments. For numerically hard benchmarks, PI and LP as well as computing exact solutions are more attractive, and clearly preferable in combination. Overall, although LP has the superior (polynomial) theoretical complexity, in our practical evaluation it almost always performs worse than the other (exponential) approaches. This is even though we use modern commercial solvers and tune both the LP encoding of the problem and the solvers’ parameters. While we observed the behaviour of the different algorithms and have some intuition about what makes the premise set hard, identifying and quantifying the structural properties that make a model hard is an entire research question of its own.

Our evaluation also raises the question of how prevalent MDPs that challenge VI are in practice. Aside from the premise benchmarks, we were unable to find further sets of MDPs that are hard for VI. Notably, several stochastic games (SGs) difficult for VI were found in [46]; the authors noted that using PI was better than applying VI to the SGs. However, when we extracted the induced MDPs, we found them all easy for VI. Similarly, [3] used a random generation of SGs of at most 10,000 states, many of which were challenging for the SG algorithms. Yet the same random generation modified to produce MDPs delivered only MDPs easily solved in seconds, even with drastically increased numbers of states. In contrast, Alagöz et al. [1] report that their random generation returned models where LP beat PI. However, their setting is discounted, and their description of the random generation was too superficial for us to replicate it. We note that, in several of our scatter plots, the MA instances from the QVBS (where we check the embedded MDP) appeared more challenging overall than the MDPs. We thus conclude this paper with a call for challenging MDP benchmarks—as separate benchmark sets with unique characteristics like premise, or for inclusion in the QVBS.