Optimistic Value Iteration

Markov decision processes are widely used for planning and verification in settings that combine controllable or adversarial choices with probabilistic behaviour. The standard analysis algorithm, value iteration, only provides lower bounds on infinite-horizon probabilities and rewards. Two “sound” variations, which also deliver an upper bound, have recently appeared. In this paper, we present a new sound approach that leverages value iteration’s ability to usually deliver good lower bounds: we obtain a lower bound via standard value iteration, use the result to “guess” an upper bound, and prove the latter’s correctness. We present this optimistic value iteration approach for computing reachability probabilities as well as expected rewards. It is easy to implement and performs well, as we show via an extensive experimental evaluation using our implementation within the mcsta model checker of the Modest Toolset.


Introduction
Markov decision processes (MDP, [30]) are a widely-used formalism to represent discrete-state and -time systems in which probabilistic effects meet controllable nondeterministic decisions. The former may arise from an environment or agent whose behaviour is only known statistically (e.g. message loss in wireless communication or statistical user profiles), or it may be intentional as part of a randomised algorithm (such as exponential backoff in Ethernet). The latter may be under the control of the system-then we are in a planning setting and typically look for a scheduler (or strategy, policy) that minimises the probability of unsafe behaviour or maximises a reward-or it may be considered adversarial, which is the standard assumption in verification: we want to establish that the maximum probability of unsafe behaviour is below, or that the minimum reward is above, a specified threshold. Extensions of MDP cover continuous time [11,26], and the analysis of complex formalisms such as stochastic hybrid automata [13] can be reduced to the analysis of MDP abstractions.
The standard algorithm to compute optimal (maximum or minimum) probabilities or reward values on MDP is value iteration (VI). It implicitly computes the corresponding optimal scheduler, too. It keeps track of a value for every state of the MDP, locally improves the values iteratively until a "convergence" criterion is met, and then reports the final value for the initial state as the overall result. The initial values are chosen to be an underapproximation of the true values (e.g. 0 for all states in case of probabilities or non-negative rewards). The final values are then an improved underapproximation of the true values. For unbounded (infinite-horizon) properties, there is unfortunately no (known and practical) convergence criterion that could guarantee a predefined error on the final result. Still, probabilistic model checkers such as Prism [24] report the final result obtained via simple relative or absolute global error criteria as the definitive probability. This is because, on most case studies considered so far, value iteration in fact converges fast enough that the (relative or absolute) difference between the reported and the true value approximately meets the error specified for the convergence criterion. Only relatively recently has this problem of soundness come to the attention of the probabilistic verification and planning communities [7,14,28]. First highlighted on hand-crafted counterexamples, it has by now been found to affect benchmarks and real-life case studies, too [3].
The first proposal to compute sound reachability probabilities was to use interval iteration (II [15], first presented in [14]). The idea is to perform two iterations concurrently, one starting from 0 as before, and one starting from 1. The latter improves an overapproximation of the true values, and the process can be stopped once the (relative or absolute) difference between the two values for the initial state is below the specified , or at any earlier time with a correspondingly larger but known error. Baier et al. extended interval iteration to expected accumulated reward values [3]; here, the complication is to find initial values that are guaranteed to be an overapproximation. The proposed graph-based (i.e. not numerical) algorithm in practice tends to compute conservative initial values from which many iterations are needed until convergence. More recently, sound value iteration (SVI) [31] improved upon interval iteration by computing upper bounds on-the-fly and performing larger value improvements per iteration, for both probabilities and expected rewards. However, we found SVI tricky to implement correctly; some edge cases not considered by the algorithm as presented in [31] initially caused our implementation to deliver incorrect results or diverge on very few benchmarks. Both II and SVI fundamentally depend on the MDP being contracting; this must be ensured by appropriate structural transformations, e.g. by collapsing end components, a priori. These transformations additionally complicate implementations, and increase memory requirements.
Our Contribution. We present (in Sect. 4) a new algorithm to compute sound reachability probabilities and expected rewards that is both simple and practically efficient. We first (1) perform standard value iteration until "convergence", resulting in a lower bound on the value for every state. To this we (2) apply specific heuristics to "guess", for every state, a candidate upper bound value.
Further value iterations (3) then confirm (if all values decrease) or disprove (if all values increase, or lower and upper bounds cross) the soundness of the upper bounds. In the latter case, we perform more lower bound iterations with reduced before retrying from step 2. We combine classic results from domain theory with specific properties of value iteration to show that our algorithm terminates. In problematic cases, many retries may be needed before termination, and performance may be worse than interval or sound value iteration. However, on many existing case studies, value iteration already worked well, and our approach attaches a soundness proof to its result with moderate overhead. We thus refer to it as optimistic value iteration (OVI). In contrast to II and SVI, it also works well for non-contracting MDP, albeit without a general termination guarantee. Our experimental evaluation in Sect. 5 uses all applicable models from the Quantitative Verification Benchmark Set [21] to confirm that OVI indeed performs as expected. It uses our publicly available implementations of II, SVI, and now OVI in the mcsta model checker of the Modest Toolset [20].
Related Work. In parallel to [15], the core idea behind II was also presented in [7] (later improved in [2]), embedded in a learning-based framework that manages to alleviate the state space explosion problem in models with a particular structure. In this approach, end components are statistically detected and collapsed on-thefly. II has recently been extended to stochastic games in [23], offering deflating as a new alternative to collapsing end components in MDP. Deflating does not require a structural transformation, but rather extra computation steps in each iteration applied to the states of all (a priori identified) end components.
The only known convergence criterion for pure VI was presented in [9, Sect. 3.5]: if we run VI until the absolute error between two iterations is less than a certain value α, then the computed values at that point are within α of the true values, and can in fact be rounded to the exact true values (as implemented in the rational search approach [5]). However, α cannot be freely chosen; it is a fixed number that depends on the size of the MDP and the largest denominator of the (rational) transition probabilities. The number of iterations needed is exponential in the size and the denominators. While not very useful in practice, this establishes an exponential upper bound on the number of iterations needed in unbounded-horizon VI. Additionally, Balaji et al. [4] recently showed the computations in finite-horizon value iteration to be EXPTIME-complete.
As an alternative to the iterative numeric road, guaranteed correct results (modulo implementation errors) can be obtained by using precise rational arithmetic. It does not combine too well with iterative methods like II or SVI due to the increasingly small differences between the values and the actual solution. The probabilistic model checker Storm [10] thus combines topological decomposition, policy iteration, and exact solvers for linear equation systems based on Gaussian elimination when asked to use rational arithmetic [22,Section 7.4.8].
The disadvantage is the significant runtime cost for performing the unlimitedprecision calculations, limiting such methods to relatively smaller MDP.
The only experimental evaluations using large sets of benchmarks that we are aware of compared VI with II to study the overhead needed to obtain sound    [3], and II with SVI to show the performance improvements of SVI [31]. The learning-based method with deflation of [2] does not compete against II and SVI; its aim is rather in dealing with state space explosion (i.e. memory usage). Its performance was evaluated on 16 selected small (<400 k states) benchmark instances in [2], showing absolute errors on the order of 10 −4 on many benchmarks with a 30-min timeout. SVI thus appears the most competitive technique in runtime and precision so far. Consequently, in our evaluation in Sect. 5, we compare OVI with SVI, and II for reference, using the default relative error of 10 −6 , including large and excluding clearly acyclic benchmarks (since they are trivial even for VI), with a 10-min timeout which is rarely hit.

Preliminaries
R + 0 is the set of all non-negative real numbers. We write { x 1 → y 1 , . . . } to denote the function that maps all x i to y i , and if necessary in the respective context, implicitly maps to 0 all x for which no explicit mapping is specified. Given a set S, its powerset is 2 S . A (discrete) probability distribution over S is a function μ ∈ S → [0, 1] with countable support spt(μ) def = { s ∈ S | μ(s) > 0 } and s∈spt(μ) μ(s) = 1. Dist(S) is the set of all probability distributions over S.
Markov Decision Processes (MDP) combine nondeterministic choices as in labelled transition systems with discrete probabilistic decisions as in discrete-time Markov chains (DTMC). We define them formally and describe their semantics. For s ∈ S, an element of T (s) is a transition, and a pair r, s ∈ spt(T (s)) is a branch to successor state s with reward r and probability T (s)( r, s ). Let M (s I ) be M but with initial state s I , and M 0 be M with all rewards set to zero.
Example 1. Figure 1 shows our example MDP M e . We draw transitions as lines to an intermediate node from which branches labelled with probability and reward (if not zero) lead to successor states. We omit the intermediate node and probability 1 for transitions with a single branch, and label some transitions to refer to them in the text. M e has 5 states, 7 transitions, and 10 branches. In practice, higher-level modelling languages like Modest [17] are used to specify MDP. The semantics of an MDP is captured by its paths. A path represents a concrete resolution of all nondeterministic and probabilistic choices. Formally: Π fin is the set of all finite paths starting in s I . A path is an analogous infinite sequence π, and Π is the set of all paths starting in s I . We write s ∈ π if ∃ i : s = s i , and π →G for the shortest prefix of π that contains a state in G ⊆ S, or ⊥ if π contains no such state. Let rew(⊥) def = ∞.
A scheduler (or adversary, policy or strategy) only resolves the nondeterministic choices of M . For this paper, memoryless deterministic schedulers suffice [6].
, respectively. The definition extends to sets G of goal states. Let R M G : Π → R + 0 be the random variable defined by R M G (π) = rew(π →G ) and let E M s (G) be the expected value of R M G under P M s . Then the maximum and minimum expected reward to reach G is defined as , respectively. We omit the superscripts for M when they are clear from the context. From now on, whenever we have an MDP with a set of goal states G, we assume that they have been made absorbing, i.e. for all g ∈ G we only have a self-loop:

Definition 4. An end component of M as above is a (sub-)MDP S , T , s
r, s ∈ spt(μ) then r = 0, and the directed graph with vertex set S and edge set { s, s | ∃ μ ∈ T (s): 0, s ∈ spt(μ) } is strongly connected.

Value Iteration
The standard algorithm to compute reachability probabilities and expected rewards is value iteration (VI) [30]. In this section, we recall its theoretical foundations and its limitations regarding convergence.

Theoretical Foundations
forms a complete lattice, i.e. every subset V ⊆ V has a supremum (and an infimum) in V with respect to . We Minimum and maximum reachability probabilities and expected rewards can be expressed as the least fixed point of the Bellman operator Φ : V → V given by where opt ∈ { max, min } and the choice of both S ? ⊆ S and d depends on whether we wish to compute reachability probabilities or expected rewards. In any case, the Bellman operator Φ can be shown to be Scott-continuous [1], i.e. in our case: for any subset The Kleene fixed point theorem for Scott-continuous self-maps on complete lattices [1,27] guarantees that lfpΦ, the least fixed point of Φ, indeed exists. Note that Φ can still have more than one fixed point. In addition to mere existence of lfp Φ, the Kleene fixed point theorem states that lfp Φ can be expressed by where0 ∈ V is the zero vector and Φ n (v) denotes n-fold application of Φ to v. Equation 1 is the basis of VI: the algorithm iteratively constructs a sequence of which converges to the sought-after least fixed point. This convergence is monotonic: for every n ∈ N, we have Φ n (0) Φ n+1 (0) and hence Φ n (0) lfp Φ. In particular, Φ n (0)(s I ) is an under approximation of the sought-after quantity for every n. Note that iterating Φ on any underapproximation v lfp Φ (instead of0) will still converge to lfp Φ and Φ n (v) lfp Φ will hold for any n.
Gauss-Seidel Value Iteration. Algorithm 1 shows the pseudocode of a VI implementation that uses the so-called Gauss-Seidel optimisation: Whereas standard VI needs to store two vectors v i and v i+1 , Gauss-Seidel VI stores only a single vector v and performs updates in place. This does not affect the correctness of VI, but may speed up convergence depending on the order in which the loop in line 4 considers the states in S ? . The error metric diff is used to check for convergence.
VI for Probabilities. For determining reachability probabilities, we operate on M 0 and set S ? = S \ G and d = 1. Then the corresponding Bellman operator satisfies , and VI will iteratively approximate this quantity from below. The corresponding call to Algorithm 1 is

Uniqueness of Fixed Points
lfp Φ may not be unique for two reasons: states that cannot reach G under the optimal scheduler may take any value (causing fixed points greater than lfp Φ for P min and P max ), and states in end components may take values higher than lfp Φ. The latter affects P max (higher fixed points) and E min (lower fixed points).

Example 2.
In M e of Fig. 1, s 1 and s 2 and the two transitions in-between form an end component. VI works for P min , P max , and E max with multiple fixed points: we anyway seek lfp Φ and start from a (trivial) underapproximation. For E min , (zero-reward) end components need to be collapsed: we determine the maximal end components using algorithms similar to [15,Alg. 1], then replace each of them by a single state, keeping all transitions leading out of the end component. We refer to this as the ECC transformation. However, such end components rarely occur in case studies for E min since they indicate Zeno behaviour w.r.t. to the reward. As rewards are often associated to time progress, such behaviour would be unrealistic.
To make the fixed points unique, for E max and E min we fix the values of all states in G to 0. For P min , we precompute the set S 0 min of states that reach G with minimum probability 0 using Alg. 1 of [12], then fix their values to 0. For P max , we analogously use S 0 max via Alg. 3 of [12]. For P max and E min , we additionally need to remove end components via ECC. In contrast to the precomputations, ECC changes the structure of the MDP and is thus more memory-intensive.

Convergence
VI and GSVI will not reach a fixed point in general, except for special cases such as acyclic MDP. It is thus standard to use a convergence criterion based on the difference between two consecutive iterations (lines 6 and 8) to make GSVI terminate: we either check the absolute error, i.e.
By default, probabilistic model checkers like Prism and Storm use diff rel and α = 10 −6 . Upon termination of GSVI, v is then closer to the least fixed point, but remains an underapproximation. In particular, α has, in general, no relation to the final difference between v(s I ) and P opt ( G) or E opt (G), respectively.  Table 1 show the values for v after the i-th iteration of the outer loop of a call to GSVI After the fourth iteration, GSVI terminates since the error is less than α = 0.05; at this point, we have P max ( s + ) − v(s 0 ) = 0.08 > α.
To obtain a value within a prescribed error of the true value, we can compute an upper bound in addition to the lower bound provided by VI. Interval iteration (II) [3,15] does so by performing, in parallel, a second value iteration on a second vector u that starts from a known overapproximation. For probabilities, the vector1 = { s → 1 } is a trivial overapproximation; for rewards, more involved graph-based algorithms need to be used to precompute (a very conservative) one [3]. II terminates when diff (v(s I ), u(s I )) ≤ 2 and analogously for expected rewards. However, to ensure termination, II requires a unique fixed point: u converges from above to the greatest fixed point gfp Φ, thus for every MDP where diff ((lfp Φ)(s I ), (gfp Φ)(s I )) > 2 , II diverges. For P max , we have gfp Φ(s ec ) = 1 for all s ec in end components, thus II tends to diverge when there is an end component. Sound value iteration (SVI) [31] is similar, but uses a different approach to derive upper bounds that makes it perform better overall, and that eliminates the need to precompute an initial overapproximation for expected rewards. However, SVI still requires unique fixed points.
We summarise the preprocessing requirements of VI, II, and SVI in Table 2. With unique fixed points, we can transform P min into P max by making S 0 min states absorbing and setting G to S 0 min , and P max into E max by a similar transformation adding reward 1 to entering G. Most of the literature on VI variants works in such a setting and describes the P max or E max case only. Since OVI also works with multiple fixed points, we have to consider all four cases individually.

Optimistic Value Iteration
We now present a new, practical solution to the convergence problem for unbounded reachability and expected rewards. It exploits the empirical observation that on many case studies VI delivers results which are roughly α-close to the true value-it only lacks the ability to prove it. Our approach, optimistic value iteration (OVI), extends standard VI with the ability to deliver such a proof.
The key idea is to exploit a property of the Bellman operator Φ and its Gauss-Seidel variant as in Algorithm 1 to determine whether a candidate vector is a lower bound, an upper bound, or neither. The foundation is basic domain theory: by Scott-continuity of Φ it follows that Φ is monotonic, meaning v w implies Φ(v) Φ(w). A principle called Park induction [29] for monotonic self-maps on complete lattices yields the following induction rules: For any u ∈ V, Thus, if we can construct a candidate vector u s.t. Φ(u) u, then u is in fact an upper bound on the sought-after lfpΦ. We call such a u an inductive upper bound.
Optimistic value iteration uses this insight and can be summarised as follows: 1. Perform value iteration on v until "convergence" w.r.t. α.

Heuristically determine a candidate upper bound u.
3. If Φ(u) u, then v lfp Φ u.
The resulting procedure in more detail is shown as Algorithm 2. Starting from the same initial vectors v as for VI, we first perform standard Gauss-Seidel value iteration (in line 2). We refer to this as the iteration phase of OVI. After that, vector v is an improved underapproximation of the actual probabilities or reward values. If, in some iteration, no state was blocked (line 15), then we had Φ(u) u before the start of the iteration. We thus know by Eq. 2 that the current u is an inductive upper bound for the values of all states, and the true value must be in the interval [v(s I ), u(s I )]. By property H2, our use of Φ min for u, and the monotonicity of Φ as used on v, we also know that diff (v(s I ), u(s I )) ≤ 2 , so we immediately terminate and return the interval's centre If, in some iteration, no state was unblocked (line 16), then again by Park induction we know that u gfp Φ. If we are in a situation of unique fixed points, this also means u lfp Φ, thus the current u is no upper bound: we cancel verification and go back to the iteration phase to further improve v before trying again. We do the same if v crosses u: then u(s) < v(s) ≤ (lfp Φ)(s) for some s, so this u was just another bad guess, too.
Otherwise, we do not yet know the relationship between u and lfp Φ, so we remain in the verification phase until we encounter one of the cases above, or until we exceed the verification budget of 1 α iterations (as checked by the loop condition in line 4). This budget is a technical measure to ensure termination.
Optimisation. In case the fixed point of Φ is unique, by Park induction (via Eq. 3) we know that u Φ(u) implies that u is a lower bound on lfp Φ. In such situations of single fixed points, we can-as an optimisation-additionally replace v by u before the goto in line 16.
Heuristics. OVI relies on heuristics to gain an advantage over alternative methods such as II or SVI; it cannot be better on all MDP. Concretely, we can choose 1. a stopping criterion for the iteration phase, 2. how to guess candidate upper values from the result of the iteration phase, and 3. how much to reduce α when going back from verification to iteration. Algorithm 2 shows the choices made by our implementation. We employ the standard stopping criteria used by probabilistic model checkers for VI, and the "weakest" guessing heuristics that satisfies properties H1, H2, and H3 (i.e. guessing any higher values would violate one of these properties). The only arbitrary choice is how to reduce α, which we at least halve on every retry. We experimentally found this to be a good compromise on benchmarks that we consider in Sect. 5, where (a) reducing α further causes more and potentially unnecessary iterations in GSVI (continuing to iterate when switching to the verification phase would already result in upper values sufficient for termination), and (b) reducing α less results in more verification phases (whose iterations are computationally more expensive than those of GSVI) being started before the values in v are high enough such that we manage to guess a u with lfp Φ u.
Example 4. We now use the version of Φ to compute P max and call Table 1 shows the values in v and u during this run, assuming that we use non-Gauss-Seidel iterations. The first iteration phase lasts from i = 0 to 4. At this point, u is initialised with the values shown in italics. The first verification phase needs only one iteration to realise that u is actually a lower bound (to a fixed point which is not the least fixed point, due to the uncollapsed end component). Blocked states are marked with a bar; unblocked states have a lower u-value than in the previous iteration. We resume GSVI from i = 6. The error in GSVI is again below α, which had been reduced to 0.008, during iteration i = 9. We thus start another verification phase, which immediately (in one iteration) finds the newly guessed vector u to be an upper bound, with diff (v(s 0 ), u(s 0 )) < 2 .

Termination of OVI
We showed above that OVI returns an -correct result when it terminates. We now show that it terminates in all cases except for P max with multiple fixed points. Note that this is a stronger result than what II and SVI can achieve. Let us first consider the situations where lfp Φ is the unique fixed point of Φ. First, GSVI terminates by Eq. 1. Let us now write v i and u i for the vectors u and v as they are at the beginning of verification phase iteration i. We know that v 0 u 0 . We distinguish three cases relating the initial guess u 0 to lfp Φ.
there is a state s with u 0 (s) < (lfp Φ)(s). Since we use Φ min on the upper values, it follows u i (s) ≤ u 0 (s) < (lfp Φ)(s) for all i. By Eq. 1, there must thus be a j such that v j (s) > u j (s), triggering a retry with reduced α in line 14. Such a retry could also be triggered earlier in line 16.
Due to the reduction of α and Eq. 1, every call to GSVI will further increase some values in v or reach v = lfp Φ (in special cases), and for some subsequent guess u we must have u 0 (s) < u(s). Consequently, after some repetitions of this case 1, we must eventually guess a u with lfp Φ u.  Thus, for all states s, there must be an i such that u i (s) < u i−1 (s); in consequence, there is also an iteration j where no state is blocked any more. Then the condition in line 15 will be true and OVI terminates. 3. lfp Φ u 0 but not lfp Φ ≺ u 0 , i.e. there is a state s with u 0 (s) = (lfp Φ)(s).
If there is an i where no state, including s, is blocked, then OVI terminates as above. For P min and P max , if u 0 (s) = 1, s cannot be blocked, so we can w.l.o.g. exclude such s. For other s not to be blocked in iteration i, we must have u i (s ) = (lfp Φ)(s ) for all states s reachable from s under the optimal scheduler, i.e. all of those states must reach the fixed point. This cannot be guaranteed on general MDP. Since this case is a very particular situation unlikely to be encountered in practice with our heuristics, OVI adopts a pragmatic solution: it bounds the number of iterations in every verification phase (cf. line 4). Due to property H3 of our heuristics, u 0 (s) = (lfp Φ)(s) requires v 0 (s) < (lfp Φ)(s), thus some subsequent guess u will have u(s) > u 0 (s), and eventually we must get a u with lfp Φ ≺ u, which is case 2. Since we strictly increase the iteration bound on every retry, we will eventually encounter case 2 with a sufficiently high bound for termination.
Three of the four situations with multiple fixed points reduce to the corresponding unique fixed point situation due to property H1 of our guessing heuristics: 1. For P min , recall from Sect. 3.2 that the fixed point is unique if we fix the values of all S 0 min states to 0. In OVI without preprocessing, such states are in S ? , thus they initially have value 0. Φ will not increase their values, neither will guessing due to H1, and neither will Φ min . Thus OVI here operates on a sublattice of V, where the fixed point of Φ is unique. 2. For E min , after the preprocessing steps of Table 2, we only need to fix the values of all goal states to 0. Then the argument is the same as for P min . 3. For E max , we reduce to a unique fixed point sublattice in the same way, too.
The only case where OVI may not terminate is for P max without ECC. Here, end components may cause states to be permanently blocked. However, we did not encounter this on any benchmark used in Sect. 5, so in contrast to e.g. II, OVI is still practically useful in this case despite the lack of a termination guarantee.
Example 5. We turn M e of Fig. 1 into M e by replacing the c-labelled transition from s 2 by transition { 0, we can now go from s 2 back to s 2 with probability 1 2 and to each of s + , s − with probability 1 4 . The probability-1 transition from s 2 to s 1 remains. Then Table 3 shows a run of OVI for P max with diff abs and α = 0.1. s 0 is forever blocked from iteration 6 on.

Variants of OVI
While the core idea of OVI rests on classic results from domain theory, Algorithm 2 includes several particular choices that work together to achieve good performance and ensure termination. We sketch two variants to motivate these choices.
Observe how the value of s 0 increases iff s 1 decreases and vice-versa. Thus we never encounter an inductive upper or lower bound. In Algorithm 2, we use Gauss-Seidel VI, which would not show the same effect on this model; however, if we insert another state between s 0 and s 1 that is updated last, Algorithm 2 would behave in the same alternating way. This particular u is contrived, but we could have guessed one with a similar relationship of the values leading to similar behaviour. An alternative that allows us to use Φ instead of Φ min is to change the conditions that lead to retrying and termination: We separately store the initial guess of a verification phase as u 0 , and then compare each newly calculated u with u 0 . If u u 0 , then we know that there is an i such that u = Φ i (u) u 0 .
Φ i retains all properties of Φ needed for Park induction, so this would also be a proof of lfp Φ u. The other conditions and the termination proofs can be adapted analogously. However, this variant needs ≈50 % more memory (to store an additional vector of values), and we found it to be significantly slower than Algorthm 2 and the first variant on almost all benchmark instances of Sect. 5.

Experimental Evaluation
We have implemented interval iteration (II) (using the "variant 2" approach of [3] to compute initial overapproximations for expected rewards), sound value iteration (SVI), and now optimistic value iteration (OVI) precisely as described in the previous section, in the mcsta model checker of the Modest Toolset [20], which is publicly available at modestchecker.net. It is cross-platform, implemented in C#, and built around the Modest [17] high-level modelling language. Via support for the Jani format [8], mcsta can exchange models with other tools like Epmc [18] and Storm [10]. Its performance is competitive with Storm and Prism [16]. We tried to spend equal effort performance-tuning our VI, II, SVI, and OVI implementations to avoid unfairly comparing highly-optimised OVI code with naïve implementations of the competing algorithms.
In the following, we report on our experimental evaluation of OVI using mcsta on all applicable models of the Quantitative Verification Benchmark Set (QVBS) [21]. All models in the QVBS are available in Jani and can thus be used by mcsta. Most are parameterised, and come with multiple properties of different types. Aside from MDP models, the QVBS also includes DTMCs (which are a special case of MDP), continuous-time Markov chains (CTMC, for which the analysis of unbounded properties reduces to checking the embedded DTMC), Markov automata (MA [11], on which the embedded MDP suffices for unbounded properties), and probabilistic timed automata (PTA [26], some of which can be converted into MDP via the digital clocks semantics [25]). We use all of these model types. The QVBS thus gives rise to a large number of benchmark instances: combinations of a model, a parameter valuation, and a property to check. For every model, we chose one instance per probabilistic reachability and expected-reward property such that state space exploration did not run out of memory and VI took at least 10 s where possible. We only excluded -2 models with multiple initial states (which mcsta does not yet support), -4 PTA with open clock constraints (they cannot be converted to MDP), -29 probabilistic reachability properties for which the result is 0 or 1 (they are easily solved by the graph-based precomputations and do not challenge VI), -16 instances for which VI very quickly reaches the fixed point, which indicates that (the relevant part of) the MDP is acyclic and thus trivial to solve, -3 models for which no parameter valuation allowed state space exploration to complete without running out of memory or taking more than 600 s, -7 instances where, on the largest state space we could explore, no iterative algorithm took more than 1 s (which does not allow reliable comparisons), and -the oscillators model due to its very large model files, As a result, we considered 38 instances with probabilistic reachability and 41 instances with expected-reward properties, many comprising several million states.
We ran all experiments on an Intel Core i7-4790 workstation (3.6-4.0 GHz) with 8 GB of memory and 64-bit Ubuntu Linux 18.04. By default, we request a relative half-width of = 10 −6 for the result probability or reward value, and configure OVI to use the relative-error criterion with α = 10 −6 in the iteration phase. We use a 600 s timeout ("TO"). Due to the number of instances, we show most results as scatter plots like in Fig. 3. Each such plot compares two methods in terms of runtime or number of iterations. Every point x, y corresponds to an instance and indicates that the method noted on the x-axis took x seconds or iterations to solve this instance while the method noted on the y-axis took y seconds or iterations. Thus points above the solid diagonal line correspond to instances where the x-axis method was faster (or needed fewer iterations); points above (below) the upper (lower) dotted diagonal line are where the x-axis method took less than half (more than twice) as long or as many iterations.

Comparison with VI
All methods except VI delivered correct results up to . VI offers low runtime at the cost of occasional incorrect results, and in general the absence of any guarantee about the result. We thus compare with VI separately to judge the overhead caused by performing additional verification, and possibly iteration, phases. This is similar to the comparison done for II in [3]. Figures 3 and 4 show the results. The unfilled shapes indicate instances where VI produced an incorrect result. In terms of runtime, we see that OVI does not often take more than twice as long as VI, and frequently requires less than 50% extra time. On several instances where OVI incurs most overhead, VI produces an incorrect result, indicating that they are "hard" instances for value iteration. The unfilled CTMCs where OVI takes much longer to compute probabilities are all instances of the embedded model; the DTMC on the x-axis is haddad-monmege, an adversarial model built to highlight the convergence problem of VI in [14]. The problematic cases for expected rewards include most MA instances, the two expected-reward instances of the embedded CTMC, and again haddad-monmege. In terms of iterations, the overhead of OVI is even less than in runtime.

Comparison with II and SVI
We compare the runtime of OVI with the runtime of II and that of SVI separately for reachability probabilities (shown in Fig. 5) and expected rewards (shown in Fig. 6). As shown in Table 2, OVI has almost the same requirements on precomputations as VI, while II and SVI require extra precomputations and ECC for reachability probabilities. The precomputations and ECC need extra runtime (which turned out to be negligible in some cases but significant enough to cause a timeout in others) prior to the numeric iterations. However, doing the precomputations can reduce the size of the set S ? , and ECC can reduce the size of the MDP itself. Both can thus reduce the runtime needed for the numeric iterations. For the overall runtime, we found that none of these effects dominates the other over all models. Thus sometimes it may be better to perform only the required precomputations and transformations, while on other models performing all applicable ones may lead to lower total runtime. For reachability probabilities, we thus compare OVI, II, and SVI in two scenarios: once in the default ("std") setting of mcsta that uses only required preprocessing steps (without ECC for OVI; we report the total runtime for preprocessing and iterations), and once with all of them enabled ("pre", where we report only the runtime for numeric iterations, plus the computation of initial upper bounds in case of II).
For probabilistic reachability, we see in Fig. 5 that there is no clear winner among the three methods in the "std" setting (top plots). In some cases, the extra precomputations take long enough to give an advantage to OVI, while in others they speed up II and SVI significantly, compensating for their overhead. The "pre" setting (bottom), in which all three algorithms operate on exactly the same input w.r.t. to MDP M and set S ? , however, shows a clearer picture: now OVI is faster, sometimes significantly so, than II and SVI on most instances. Expected-reward properties were more challenging for all three methods (as well as for VI, which produced more errors here than for probabilities). The plots in Fig. 6 paint a very clear picture of OVI being significantly faster for expected rewards than II (which suffers from the need to precompute initial upper bounds that then turn out to be rather conservative), and faster (though by a lesser margin and with few exceptions) than SVI.
In Fig. 7, we give a summary view combining the data from Figs. 3 to 6. For each algorithm, we plot the instances sorted by runtime, i.e. a point x, y on the line for algorithm z means that some instance took y seconds to solve via z, and there are x instances that z solves in less time. Note in particular that the times are not cumulative. The right-hand plot zooms into the left-hand one. We clearly see the speedup offered by OVI over SVI and especially II. Where the scatter plots merely show that OVI often does not obtain more than a 2× speedup compared to SVI, these plots provide an explanation: the VI line is a rough Fig. 7. Summary comparison to VI, II, and SVI, instances ordered by runtime bound on the performance that any extension of VI can deliver. Comparing the SVI and VI lines, over much of the plot's range, OVI thus cannot take less than half the runtime of SVI without outperforming VI itself.

On the Effect of and α
We also compared the four algorithms for different values of and, where applicable, α. We show a selection of the results in Fig. 8. The axis labels are of the form "algorithm, /α". On the left, we see that the runtime of OVI changes if we set α to values different from , however there is no clear trend: some instances are checked faster, some slower. We obtained similar plots for other combinations of α values, with only a slight tendency towards longer runtimes as α > . mcsta thus uses α = as a default that can be changed by the user.
In the middle, we study the impact of reducing the desired precision by setting to 10 −3 . This allows OVI to speed up by factors mostly between 1 and 2; the same comparison for SVI and II resulted in similar plots, however VI was able to more consistently achieve higher speedups. When we compare the right plot with the right-hand plot of Fig. 6, we consequently see that the overall result of our comparison between OVI and SVI does not change significantly with the lower precision, although OVI does gain slightly more than SVI.

Comparing Relative and Absolute Error
In Fig. 9, we show comparison plots for the runtime when using diff abs instead of diff rel . Requiring absolute-error-correct results may make instances with low result values much easier and instances with high results much harder. We chose = 10 −2 as a compromise, and the leftmost plot confirms that we indeed chose an that keeps the expected-reward benchmarks on average roughly as hard as with 10 −6 relative error. In the middle and right plots, we again see OVI compared with II and SVI. Compared to Fig. 6, both II and SVI gain a little, but there are no significant differences overall. Our experiments thus confirm that the relative performance of OVI is stable under varying precision requirements.

Verification Phases
On the right, we show histograms of the number of verification phases started (top, from 1 phase on the left to 20 on the right) and the percentage of iterations that are done in verification phases (bottom) over all benchmark instances (probabilities and rewards). We see that, in the vast majority of cases, we need few verification attempts, with many succeeding in the first attempt, and most iterations are performed in the iteration phases.

Conclusion
We have presented optimistic value iteration (OVI), a new approach to making non-exact probabilistic model checking via iterative numeric algorithms sound in the sense of delivering results within a prescribed interval around the true value (modulo floating-point and implementation errors). Compared to interval (II) and sound value iteration (SVI), OVI has slightly stronger termination guarantees in presence of multiple fixed points, and works in practice for max. probabilities without collapsing end components despite the lack of a guarantee. Like II, it can be combined with alternative methods for dealing with end components such as the new deflating technique of [23]. OVI is a simple algorithm that is easy to add to any tool that already implements value iteration, and it is fast, further closing the performance gap between VI and sound methods. Data Availability. A dataset to replicate our experimental evaluation is archived and available at DOI 10.4121/uuid:3df859e6-edc6-4e2d-92f3-93e478bbe8dc [19].