PAC Statistical Model Checking for Markov Decision Processes and Stochastic Games

Statistical model checking (SMC) is a technique for analysis of probabilistic systems that may be (partially) unknown. We present an SMC algorithm for (unbounded) reachability yielding probably approximately correct (PAC) guarantees on the results. We consider both the setting (i) with no knowledge of the transition function (with the only quantity required a bound on the minimum transition probability) and (ii) with knowledge of the topology of the underlying graph. On the one hand, it is the first algorithm for stochastic games. On the other hand, it is the first practical algorithm even for Markov decision processes. Compared to previous approaches where PAC guarantees require running times longer than the age of universe even for systems with a handful of states, our algorithm often yields reasonably precise results within minutes, not requiring the knowledge of mixing time or the topology of the whole model.


Introduction
Statistical model checking (SMC) [YS02a] is an analysis technique for probabilistic systems based on 1. simulating finitely many finitely long runs of the system, 2. statistical analysis of the obtained results, 3. yielding a confidence interval/probably approximately correct (PAC) result on the probability of satisfying a given property, i.e., there is a non-zero probability that the bounds are incorrect, but they are correct with probability that can be set arbitrarily close to 1. One of the advantages is that it can avoid the state-space explosion problem, albeit at the cost of weaker guarantees. Even more importantly, this technique is applicable even when the model is not known (black-box setting) or only qualitatively known (grey-box setting), where the exact transition probabilities are unknown such as in many cyber-physical systems.
Firstly, for infinite time-horizon properties we need a stopping criterion such that the infinite-horizon property can be reliably evaluated based on a finite prefix of the run yielded by simulation. This can rely on the the complete knowledge of the system (white-box setting) [YCZ10, LP08], the topology of the system (grey box) [YCZ10,HJB + 10], or a lower bound p min on the minimum transition probability in the system (black box) [DHKP16,BCC + 14].
Secondly, for Markov decision processes (MDP) [Put14] with (non-trivial) nondeterminism, [HMZ + 12] and [LP12] employ reinforcement learning [SB98] in the setting of bounded properties or discounted (and for the purposes of approximation thus also bounded) properties, respectively. The latter also yields PAC guarantees.
Finally, for MDP with unbounded properties, [BFHH11] deals with MDP with spurious non-determinism, where the way it is resolved does not affect the desired property. The general non-deterministic case is treated in [FT14,BCC + 14], yielding PAC guarantees. However, the former requires the knowledge of mixing time, which is at least as hard to compute; the algorithm in the latter is purely theoretical since before a single value is updated in the learning process, one has to simulate longer than the age of universe even for a system as simple as a Markov chain with 12 states having at least 4 successors for some state. Our contribution is an SMC algorithm with PAC guarantees for (i) MDP and unbounded properties, which runs for realistic benchmarks [HKP + 19] and confidence intervals in orders of minutes, and (ii) is the first algorithm for stochastic games (SG). It relies on different techniques from literature.

The increased practical performance rests on two pillars:
extending early detection of bottom strongly connected components in Markov chains by [DHKP16] to end components for MDP and simple end components for SG; improving the underlying PAC Q-learning technique of [SLW + 06]: (a) learning is now model-based with better information reuse instead of model-free, but in realistic settings with the same memory requirements, (b) better guidance of learning due to interleaving with precise computation, which yields more precise value estimates. (c) splitting confidence over all relevant transitions, allowing for variable width of confidence intervals on the learnt transition probabilities. 2. The transition from algorithms for MDP to SG is possible via extending the over-approximating value iteration from MDP [BCC + 14] to SG by [KKKW18].
To summarize, we give an anytime PAC SMC algorithm for (unbounded) reachability. It is the first such algorithm for SG and the first practical one for MDP.

Related work
Most of the previous efforts in SMC have focused on the analysis of properties with bounded horizon [YS02b,SVA04,YKNP06,JCL + 09,JLS12,BDL + 12]. SMC of unbounded properties was first considered in [HLMP04] and the first approach was proposed in [SVA05], but observed incorrect in [HJB + 10]. Notably, in [YCZ10] two approaches are described. The first approach proposes to terminate sampled paths at every step with some probability p term and re-weight the result accordingly. In order to guarantee the asymptotic convergence of this method, the second eigenvalue λ of the chain and its mixing time must be computed, which is as hard as the verification problem itself and requires the complete knowledge of the system (white box setting). The correctness of [LP08] relies on the knowledge of the second eigenvalue λ, too. The second approach of [YCZ10] requires the knowledge of the chain's topology (grey box), which is used to transform the chain so that all potentially infinite paths are eliminated. In [HJB + 10], a similar transformation is performed, again requiring knowledge of the topology. In [DHKP16], only (a lower bound on) the minimum transition probability p min is assumed and PAC guarantees are derived. While unbounded properties cannot be analyzed without any information on the system, knowledge of p min is a relatively light assumption in many realistic scenarios [DHKP16]. For instance, bounds on the rates for reaction kinetics in chemical reaction systems are typically known; for models in the PRISM language [KNP11], the bounds can be easily inferred without constructing the respective state space. In this paper, we thus adopt this assumption.
In the case with general non-determinism, one approach is to give the nondeterminism a probabilistic semantics, e.g., using a uniform distribution instead, as for timed automata in [DLL + 11a,DLL + 11b,Lar13]. Others [LP12,HMZ + 12,BCC + 14] aim to quantify over all strategies and produce an -optimal strategy. In [HMZ + 12], candidates for optimal strategies are generated and gradually improved, but "at any given point we cannot quantify how close to optimal the candidate scheduler is" (cited from [HMZ + 12]) and the algorithm "does not in general converge to the true optimum" (cited from [LST14] Several approaches provide SMC for MDPs and unbounded properties with PAC guarantees. Firstly, similarly to [LP08,YCZ10], [FT14] requires (1) the mixing time T of the MDP. The algorithm then yields PAC bounds in time polynomial in T (which in turn can of course be exponential in the size of the MDP). Moreover, the algorithm requires (2) the ability to restart simulations also in non-initial states, (3) it only returns the strategy once all states have been visited (sufficiently many times), and thus (4) requires the size of the state space |S|. Secondly, [BCC + 14], based on delayed Q-learning (DQL) [SLW + 06], lifts the assumptions (2) and (3) and instead of (1) mixing time requires only (a bound on) the minimum transition probability p min . Our approach additionally lifts the assumption (4) and allows for running times faster than those given by T , even without the knowledge of T .
Reinforcement learning (without PAC bounds) for stochastic games has been considered already in [LN81,Lit94,BT99]. [WT16] combines the special case of almost-sure satisfaction of a specification with optimizing quantitative objectives. We use techniques of [KKKW18], which however assumes access to the transition probabilities.

Stochastic games
A probability distribution on a finite set X is a mapping δ : X → [0, 1], such that x∈X δ(x) = 1. The set of all probability distributions on X is denoted by D(X). Now we define turn-based two-player stochastic games. As opposed to the notation of e.g. [Con92], we do not have special stochastic nodes, but rather a probabilistic transition function.
where S is a finite set of states partitioned 1 into the sets S and S of states of the player Maximizer and Minimizer 2 , respectively s 0 ∈ S is the initial state, A is a finite set of actions, Av : S → 2 A assigns to every state a set of available actions, and T : S × A → D(S) is a transition function that given a state s and an action a ∈ Av(s) yields a probability distribution over successor states. Note that for ease of notation we write T(s, a, t) instead of T(s, a)(t).
A Markov decision process (MDP) is a special case of SG where S = ∅. A Markov chain (MC) can be seen as a special case of an MDP, where for all s ∈ S : |Av(s)| = 1. We assume that SG are non-blocking, so for all states s we have Av(s) = ∅.
For a state s and an available action a ∈ Av(s), we denote the set of successors by Post(s, a) := {t | T(s, a, t) > 0}. We say a state-action pair (s, a) is an exit of a set of states T , written (s, a) exits T , if ∃t ∈ Post(s, a) : t / ∈ T , i.e., if with some probability a successor outside of T could be chosen.
We consider algorithms that have a limited information about the SG.
Definition 2 (Black box and grey box). An algorithm inputs an SG as black box if it cannot access the whole tuple, but it knows the initial state, for a given state, an oracle returns its player and available action, given a state s and action a, it can sample a successor t according to T(s, a), 3 it knows p min ≤ min s∈S,a∈Av(s) t∈Post(s,a) T(s, a, t), an under-approximation of the minimum transition probability.
When input as grey box it additionally knows the number |Post(s, a)| of successors for each state s and action a. 4 1 I.e., S ⊆ S, S ⊆ S, S ∪ S = S, and S ∩ S = ∅. 2 The names are chosen, because Maximizer maximizes the probability of reaching a given target state, and Minimizer minimizes it. 3 Up to this point, this definition conforms to black box systems in the sense of [SVA04] with sampling from the initial state, being slightly stricter than [YS02b] or [RP09], where simulations can be run from any desired state. Further, we assume that we can choose actions for the adversarial player or that she plays fairly. Otherwise the adversary could avoid playing her best strategy during the SMC, not giving SMC enough information about her possible behaviours. 4 This requirement is slightly weaker than the knowledge of the whole topology, i.e.
Post(s, a) for each s and a.
The semantics of SG is given in the usual way by means of strategies and the induced Markov chain [BK08] and its respective probability space, as follows. An infinite path ρ is an infinite sequence ρ = s 0 a 0 s 1 a 1 · · · ∈ (S × A) ω , such that for every i ∈ N, a i ∈ Av(s i ) and s i+1 ∈ Post(s i , a i ).
A strategy of Maximizer or Minimizer is a function σ : S → D(A) or S → D(A), respectively, such that σ(s) ∈ D(Av(s)) for all s. Note that we restrict to memoryless/positional strategies, as they suffice for reachability in SGs [CH12]. A pair (σ, τ ) of strategies of Maximizer and Minimizer induces a Markov chain G σ,τ with states S, s 0 being initial, and the transition function T(s)(t) = a∈Av(s) σ(s)(a) · T(s, a, t) for states of Maximizer and analogously for states of Minimizer, with σ replaced by τ . The Markov chain induces a unique probability distribution P σ,τ over measurable sets of infinite paths [BK08, Ch. 10].

Reachability objective
For a goal set Goal ⊆ S, we write ♦Goal := {s 0 a 0 s 1 a 1 · · · | ∃i ∈ N : s i ∈ Goal} to denote the (measurable) set of all infinite paths which eventually reach Goal. For each s ∈ S, we define the value in s as where the equality follows from [Mar75]. We are interested in V(s 0 ), its ε-approximation and the corresponding (ε-)optimal strategies for both players. Let Zero be the set of states, from which there is no finite path to any state in Goal. The value function V satisfies the following system of equations, which is referred to as the Bellman equations:

Bounded and asynchronous value iteration
The well known technique of value iteration, e.g. [Put14,RF91], works by starting from an under-approximation of value function and then applying the Bellman equations. This converges towards the least fixpoint of the Bellman equations, i.e. the value function. Since it is difficult to give a convergence criterion, the approach of bounded value iteration (BVI, also called interval iteration) was developed for MDP [BCC + 14,HM17] and SG [KKKW18]. Beside the under-approximation, it also updates an over-approximation according to the Bellman equations. The most conservative over-approximation is to use an upper bound of 1 for every state. For the under-approximation, we can set the lower bound of target states to 1; all other states have a lower bound of 0. We use the function INITIALIZE BOUNDS in our algorithms to denote that the lower and upper bounds are set as just described; see Algorithm 8 in Appendix A.1 for the pseudocode. Additionally, BVI ensures that the over-approximation converges to the least fixpoint by taking special care of end components, which are the reason for not converging to the true value from above. Note that, to stay in an EC in an SG, the two players would have to cooperate, since it depends on the pair of strategies. To take into account the adversarial behaviour of the players, it is also relevant to look at a subclass of ECs, the so called simple end components, introduced in [KKKW18].
Definition 4 (Simple end component (SEC) [KKKW18]). Intuitively, SECs are ECs where Minimizer does not want to use any of her exits, as all of them have a greater value than the best exit of Maximizer. Assigning any value between those of the best exits of Maximizer and Minimizer to all states in the EC is a solution to the Bellman equations, because both players prefer remaining and getting that value to using their exits [KKKW18, Lemma 1]. However, this is suboptimal for Maximizer, as the goal is not reached if the game remains in the EC forever. Hence we "deflate" the upper bounds of SECs, i.e. reduce them to depend on the best exit of Maximizer. T is called maximal simple end component (MSEC), if there is no SEC T such that T T . Note that in MDPs, treating all MSECs amounts to treating all MECs.
Algorithm 1 rephrases that of [KKKW18] and describes the general structure of all bounded value iteration algorithms that are relevant for this paper. We discuss it here since all our improvements refer to functions (in capitalized font) in it. In the next section, we design new functions, pinpointing the difference to the other papers. The pseudocode of the functions adapted from the other papers can be found, for the reader's convenience, in Appendix A. Note that to improve readability, we omit the parameters G, Goal, L and U of the functions in the algorithm.
Algorithm 1 Bounded value iteration algorithm for SG (and MDP) 1: procedure BVI(SG G, target set Goal, precision > 0) 2: INITIALIZE BOUNDS 3: repeat 4: X ← SIMULATE until LOOPING or state in Goal is hit 5: UPDATE(X) Bellman updates or their modification 6: for T ∈ FIND MSECs(X) do 7: Instead of considering the whole state space, a path through the SG is sampled by picking in every state one of the actions that look optimal according to the current over-/under-approximation and then sampling a successor of that action. This is repeated until either a target is found, or until the simulation is looping in an EC; the latter case occurs if the heuristic that picks the actions generates a pair of strategies under which both players only pick staying actions in an EC. After the simulation, only the bounds of the states on the path are updated and deflated. Since we pick actions which look optimal in the simulation, we almost surely find an -optimal strategy and the algorithm terminates [BCC + 14, Theorem 3].

Model-based
Given only limited information, updating cannot be done using T, since the true probabilities are not known. The approach of [BCC + 14] is to sample for a high number of steps and accumulate the observed lower and upper bounds on the true value function for each state-action pair. When the number of samples is large enough, the average of the accumulator is used as the new estimate for the state-action pair, and thus the approximations can be improved and the results back-propagated, while giving statistical guarantees that each update was correct. However, this approach has several drawbacks, the biggest of which is that the number of steps before an update can occur is infeasibly large, often larger than the age of the universe, see Table 1 in Section 4. Our improvements to make the algorithm practically usable are linked to constructing a partial model of the given system. That way, we have more information available on which we can base our estimates, and we can be less conservative when giving bounds on the possible errors. The shift from model-free to model-based learning asymptotically increases the memory requirements from O(|S| · |A|) (as in [SLW + 06,BCC + 14]) to O(|S| 2 · |A|). However, for systems where each action has a small constant bound on the number of successors, which is typical for many practical systems, e.g. classical PRISM benchmarks, it is still O(|S| · |A|) with a negligible constant difference.
We thus track the number of times some successor t has been observed when playing action a from state s in a variable #(s, a, t). This implicitly induces the number of times each state-action pair (s, a) has been played #(s, a) = t∈S #(s, a, t). Given these numbers we can then calculate probability estimates for every transition as described in the next subsection. They also induce the set of all states visited so far, allowing us to construct a partial model of the game. See [?, Appendix A.2] for the pseudo-code of how to count the occurrences during the simulations.

Safe updates with confidence intervals using distributed error probability
We use the counters to compute a lower estimate of the transition probability for some error tolerance δ T as follows: We view sampling t from state-action pair (s, a) as a Bernoulli sequence, with success probability T(s, a, t), the number of trials #(s, a) and the number of successes #(s, a, t). The tightest lower estimate we can give using the Hoeffding bound (see Appendix D.1) is where the confidence width c := ln(δ T ) −2#(s,a) . Since c could be greater than 1, we limit the lower estimate to be at least 0. Now we can give modified update equations: The idea is the same for both upper and lower bound: In contrast to the usual Bellman equation (see Section 2.2) we use T instead of T. But since the sum of all the lower estimates does not add up to one, there is some remaining probability for which we need to under-/over-approximate the value it can achieve. We use the safe approximations 0 and 1 for the lower and upper bound respectively; this is why in L there is no second term and in U the whole remaining probability is added. Algorithm 2 shows the modified update that uses the lower estimates; the proof of its correctness is in Appendix D.2..

Lemma 1 (UPDATE is correct).
Given correct under-and over-approximations L, U of the value function V, and correct lower probability estimates T, the underand over-approximations after an application of UPDATE are also correct.
Algorithm 2 New update procedure using the probability estimates for s ∈ X \ Goal do For all non-target states in the given set 4: Example 1. We illustrate how the calculation works and its huge advantage over the approach from [BCC + 14] on the SG from Figure 1. For this example, ignore the dashed part and let p 1 = p 2 = 0.5, i.e. we have no self loop, and an even chance to go to the target 1 or a sink 0. Observe that hence V(s 0 ) = V(s 1 ) = 0.5. Given an error tolerance of δ = 0.1, the algorithm of [BCC + 14] would have to sample for more than 10 9 steps before it could attempt a single update. In contrast, assume we have seen 5 samples of action b 2 , where 1 of them went to 1 and 4 of them to 0. Note that, in a sense, we were unlucky here, as the observed averages are very different from the actual distribution. The confidence width for δ T = 0.1 and 5 samples is ln(0.1)/ − 2 · 5 ≈ 0.48. So given that data, we get Note that both probabilities are in fact lower estimates for their true counterpart.
Assume we already found out that 0 is a sink with value 0; how we gain this knowledge is explained in the following subsections. Then, after getting only these 5 samples, UPDATE already decreases the upper bound of (s 1 , b 2 ) to 0.68, as we know that at least 0.32 of T(s 1 , b 2 ) goes to the sink.
Given 500 samples of action b 2 , the confidence width of the probability estimates already has decreased below 0.05. Then, since we have this confidence width for both the upper and the lower bound, we can decrease the total precision for (s 1 , b 2 ) to 0.1, i.e. return an interval in the order of [0.45; 0.55].
Summing up: with the model-based approach we can already start updating after very few steps and get a reasonable level of confidence with a realistic number of samples. In contrast, the state-of-the-art approach of [BCC + 14] needs a very large number of samples even for this toy example.
Since for UPDATE we need an error tolerance for every transition, we need to distribute the given total error tolerance δ over all transitions in the current partial model. For all states in the explored partial model S we know the number of available actions and can over-approximate the number of successors as 1 pmin . Thus the error tolerance for each transition can be set to This is illustrated in Example 4 in Appendix B.
Note that the fact that the error tolerance δ T for every transition is the same does not imply that the confidence width for every transition is the same, as the latter becomes smaller with increasing number of samples #(s, a).

Improved EC detection
As mentioned in the description of Algorithm 1, we must detect when the simulation is stuck in a bottom EC and looping forever. However, we may also stop simulations that are looping in some EC but still have a possibility to leave it; for a discussion of different heuristics from [BCC + 14, KKKW18] We choose to define LOOPING as follows: Given a candidate for a bottom EC, we continue sampling until we are δ T -sure (i.e. the error probability is smaller than δ T ) that we cannot leave it. Then we can safely deflate the EC, i.e. decrease all upper bounds to zero.
To detect that something is a δ T -sure EC, we do not sample for the astronomical number of steps as in [BCC + 14], but rather extend the approach to detect bottom strongly connected components from [DHKP16]. If in the EC-candidate T there was some state-action pair (s, a) that actually has a probability to exit the T , that probability is at least p min . So after sampling (s, a) for n times, the probability to overlook such a leaving transition is (1 − p min ) n and it should be smaller than δ T . Solving the inequation for the required number of samples n yields n ≥ ln(δ T ) ln(1−pmin) . Algorithm 3 checks that we have seen all staying state-action pairs n times, and hence that we are δ T -sure that T is an EC. Note that we restrict to staying state-action pairs, since the requirement for an EC is only that there exist staying actions, not that all actions stay. We further speed up the EC-detection, because we do not wait for n samples in every simulation, but we use the aggregated counters that are kept over all simulations.
We stop a simulation, if LOOPING returns true, i.e. under the following three conditions: (i) We have seen the current state before in this simulation (s ∈ X), i.e. there is a cycle. (ii) This cycle is explainable by an EC T in our current partial model. (iii) We are δ T -sure that T is an EC.
Example 2. For this example, we again use the SG from Figure 1 without the dashed part, but this time with p 1 = p 2 = p 3 = 1 3 . Assume the path we simulated is Algorithm 3 Check whether we are δ T -sure that T is an EC if s / ∈ X then 3: return false Easy improvement to avoid overhead 4: we sampled the self-loop of action b 2 . Then {s 1 } is a candidate for an EC, because given our current observation it seems possible that we will continue looping there forever. However, we do not stop the simulation here, because we are not yet δ T -sure about this. Given δ T = 0.1, the required samples for that are 6, since ln(0.1) ln(1− 1 3 ) = 5.6. With high probability (greater than (1 − δ T ) = 0.9), within these 6 steps we will sample one of the other successors of (s 1 , b 2 ) and thus realise that we should not stop the simulation in s 1 . If, on the other hand, we are in state 0 or if in state s 1 the guiding heuristic only picks b 1 , then we are in fact looping for more than 6 steps, and hence we stop the simulation.

Adapting to games: Deflating MSECs
To extend the algorithm of [BCC + 14] to SGs, instead of collapsing problematic ECs we deflate them as in [KKKW18], i.e. given an MSEC, we reduce the upper bound of all states in it to the upper bound of the bestExit of Maximizer. In contrast to [KKKW18], we cannot use the upper bound of the bestExit based on the true probability, but only based on our estimates. Algorithm 5 shows how to deflate an MSEC and highlights the difference, namely that we use U instead of U.
Algorithm 5 Black box algorithm to deflate a set of states 1: procedure DEFLATE(State set X) 2: for s ∈ X do 3: The remaining question is how to find MSECs. The approach of [KKKW18] is to find MSECs by removing the suboptimal actions of Minimizer according to the current lower bound. Since it converges to the true value function, all MSECs are eventually found [KKKW18, Lemma 2]. Since Algorithm 6 can only access the SG as a black box, there are two differences: We can only compare our estimates of the lower bound L(s, a) to find out which actions are suboptimal. Additionally there is the problem that we might overlook an exit from an EC, and hence deflate to some value that is too small; thus we need to check that any state set FIND MSECs returns is a δ T -sure EC. This is illustrated in Example 3. For a bigger example of how all the functions we have defined work together, see Example 5 in Appendix B.
Algorithm 6 Finding MSECs in the game restricted to X for black box setting 1: procedure FIND MSECs(State set X)

2:
suboptAct ← {(s, {a ∈ Av(s) | L (s, a) > L(s)} | s ∈ S ∩ X} 3: Av ← Av without suboptAct 4: G ← G restricted to states X and available actions Av 5: Example 3. For this example, we use the full SG from Figure 1, including the dashed part, with p 1 , p 2 > 0. Let (s 0 , a 1 , s 1 , b 2 , s 2 , b 1 , s 1 , a 2 , s 2 , c, 1) be the path generated by our simulation. Then in our partial view of the model, it seems as if T = {s 0 , s 1 } is an MSEC, since using a 2 is suboptimal for the minimizing state s 0 6 and according to our current knowledge a 1 , b 1 and b 2 all stay inside T . If we deflated T now, all states would get an upper bound of 0, which would be incorrect.
Thus in Algorithm 6 we need to require that T is an EC δ T -surely. This was not satisfied in the example, as the state-action pairs have not been observed the required number of times. Thus we do not deflate T , and our upper bounds stay correct. Having seen (s 1 , b 2 ) the required number of times, we probably know that it is exiting T and hence will not make the mistake.

Guidance and statistical guarantee
It is difficult to give statistical guarantees for the algorithm we have developed so far (i.e. Algorithm 1 calling the new functions from Section 3.2 to 3.4). Although we can bound the error of each function, applying them repeatedly can add up the error. Algorithm 7 shows our approach to get statistical guarantees: It interleaves a guided simulation phase (Lines 7-10) with a guaranteed standard bounded value iteration (called BVI phase) that uses our new functions (Lines 11-16).
The simulation phase builds the partial model by exploring states and remembering the counters. In the first iteration of the main loop, it chooses actions randomly. In all further iterations, it is guided by the bounds that the last BVI phase computed. After N k simulations (see below for a discussion of how to choose N k ), all the gathered information is used to compute one version of the partial model with probability estimates T for a certain error tolerance δ k . We can continue with the assumption, that these probability estimates are correct, since it is only violated with a probability smaller than our error tolerance (see below for an explanation of the choice of δ k ). So in our correct partial model, we re-initialize the lower and upper bound (Line 12), and execute a guaranteed standard BVI. If the simulation phase already gathered enough data, i.e. explored the relevant states and sampled the relevant transitions often enough, this BVI achieves a precision smaller than ε in the initial state, and the algorithm terminates. Otherwise we start another simulation phase that is guided by the improved bounds.
Algorithm 7 Full algorithm for black box setting 1: procedure BlackVI(SG G, target set Goal, precision ε > 0, error tolerance δ > 0) 2: INITIALIZE BOUNDS 3: k = 1 guaranteed BVI counter 4: S ← ∅ current partial model 5: repeat 6: k ← 2 · k 7: δ k ← δ k // Guided simulation phase 8: for N k times do 9: X ← SIMULATE 10: for T ∈ FIND MSECs( S) do 16: Choice of δ k : For each of the full BVI phases, we construct a partial model that is correct with probability (1 − δ k ). To ensure that the sum of these errors is not larger than the specified error tolerance δ, we use the variable k, which is initialised to 1 and doubled in every iteration of the main loop. Hence for the i-th BVI, k = 2 i . By setting δ k = δ k , we get that δ 2 i = δ, and hence the error of all BVI phases does not exceed the specified error tolerance.
When to stop each BVI-phase: The BVI phase might not converge if the probability estimates are not good enough. We increase the number of iterations for each BVI depending on k, because that way we ensure that it eventually is allowed to run long enough to converge. On the other hand, since we always run for finitely many iterations, we also ensure that, if we do not have enough information yet, BVI is eventually stopped. Other stopping criteria could return arbitrarily imprecise results [HM17]. We also multiply with S to improve the chances of the early BVIs to converge, as that number of iterations ensures that every value has been propagated through the whole model at least once.
Discussion of the choice of N k : The number of simulations between the guaranteed BVI phases can be chosen freely; it can be a constant number every time, or any sequence of natural numbers, possibly parameterised by e.g. k, S , ε or any of the parameters of G. The design of particularly efficient choices or learning mechanisms that adjust them on the fly is an interesting task left for future work.
We conjecture the answer depends on the given SG and "task" that the user has for the algorithm: E.g. if one just needs a quick general estimate of the behaviour of the model, a smaller choice of N k is sensible; if on the other hand a definite precision ε certainly needs to be achieved, a larger choice of N k is required.
Theorem 1. For any choice of sequence for N k , Algorithm 7 is an anytime algorithm with the following property: When it is stopped, it returns an interval for V(s 0 ) that is PAC 7 for the given error tolerance δ and some ε , with 0 ≤ ε ≤ 1.
Theorem 1 is the foundation of the practical usability of our algorithm. Given some time frame and some N k , it calculates an approximation for V(s 0 ) that is probably correct. Note that the precision ε is independent of the input parameter ε, and could in the worst case be always 1. However, practically it often is good (i.e. close to 0) as seen in the results in Section 4. Moreover, in our modified algorithm, we can also give a convergence guarantee as in [BCC + 14]. Although mostly out of theoretical interest, in Appendix D.4 we design such a sequence N k , too. Since this a-priori sequence has to work in the worst case, it depends on an infeasibly large number of simulations.
Theorem 2. There exists a choice of N k , such that Algorithm 7 is PAC for any input parameters ε, δ, i.e. it terminates almost surely and returns an interval for V(s 0 ) of width smaller than ε that is correct with probability at least 1 − δ.

Utilizing the additional information of grey box input
In this section, we consider the grey box setting, i.e. for every state-action pair (s, a) we additionally know the exact number of successors |Post(s, a)|. Then we can sample every state-action pair until we have seen all successors, and hence this information amounts to having qualitative information about the transitions, i.e. knowing where the transitions go, but not with which probability.
In that setting, we can improve the EC-detection and the estimated bounds in UPDATE. For EC-detection, note that the whole point of δ T -sure EC is to check whether there are further transitions available; in grey box, we know this and need not depend on statistics. For the bounds, note that the equations for L and U both have two parts: The usual Bellman part and the remaining probability multiplied with the most conservative guess of the bound, i.e. 0 and 1. If we know all successors of a state-action pair, we do not have to be as conservative; then we can use min t∈Post(s,a) L(t) respectively max t∈Post(s,a) U(t). Both these improvements have huge impact, as demonstrated in Section 4. However, of course, they also assume more knowledge about the model.

Experimental evaluation
We implemented the approach as an extension of PRISM-Games [CFK + 13a]. 11 MDPs with reachability properties were selected from the Quantitative Verification Benchmark Set [HKP + 19]. Further, 4 stochastic games benchmarks from Table 1. Achieved precision ε given by our algorithm in both grey and black box settings after running for a period of 30 minutes (See the paragraph below Theorem 1 for why we use ε and not ε). The first set of the models are MDPs and the second set are SGs. '-' indicates that the algorithm did not finish the first simulation phase and hence partial BVI was not called. m is the number of steps required by the DQL algorithm of [BCC + 14] before the first update. As this number is very large, we report only log10(m). For comparison, note that the age of the universe is approximately 10 26 nanoseconds. [CKJ12,SS12,CFK + 13b,CKPS11] were also selected. We ran the experiments on a 40 core Intel Xeon server running at 2.20GHz per core and having 252 GB of RAM. The tool however utilised only a single core and 1 GB of memory for the model checking. Each benchmark was ran 10 times with a timeout of 30 minutes. We ran two versions of Algorithm 7, one with the SG as a black box, the other as a grey box (see Definition 2). We chose N k = 10, 000 for all iterations. The tool stopped either when a precision of 10 −8 was obtained or after 30 minutes. In total, 16 different model-property combinations were tried out. The results of the experiment are reported in Table 1.
In the black box setting, we obtained ε < 0.1 on 6 of the benchmarks. 5 benchmarks were 'hard' and the algorithm did not improve the precision below 1. For 4 of them, it did not even finish the first simulation phase. If we decrease N k , the BVI phase is entered, but still no progress is made.
In the grey box setting, on 14 of 16 benchmarks, it took only 6 minutes to achieve ε < 0.1. For 8 these, the exact value was found within that time. Less than 50% of the state space was explored in the case of pacman, pneuli-zuck-3, rabin-3, zeroconf and cloud 5. A precision of ε < 0.01 was achieved on 15/16 benchmarks over a period of 30 minutes. Figure 2 shows the evolution of the lower and upper bounds in both the greyand the black box settings for 4 different models. Graphs for the other models as well as more details on the results are in Appendix C.

Conclusion
We presented a PAC SMC algorithm for SG (and MDP) with the reachability objective. It is the first one for SG and the first practically applicable one. Nevertheless, there are several possible directions for further improvements. For instance, one can consider different sequences for lengths of the simulation phases, possibly also dependent on the behaviour observed so far. Further, the error tolerance could be distributed in a non-uniform way, allowing for fewer visits in rarely visited parts of end components. Since many systems are strongly connected, but at the same time feature some infrequent behaviour, this is the next bottleneck to be attacked. [

A.2 SIMULATE
In the full information setting, SIMULATE works as follows: The set of best actions for a state s, given the the current U and L, is defined as: So SIMULATE can be viewed as first fixing the strategies σ, τ , such that for every state s they randomize uniformly over all actions in best U,L (s), and then sampling in the induced Markov chain G σ,τ .
Algorithm 9 Standard simulation algorithm to sample a path 1: procedure SIMULATE(lower bound function L, upper bound function U) 2: X ← ∅ Set of states visited during this simulation 3: s ← s 0 4: repeat 5: X ← X ∪ s 6: a ← sampled uniformly from best U,L (s) 7: s ← sampled according to T(s, a) 8: until s ∈ Goal or LOOPING(s, X) 9: return X ∪ {s} Add last state before returning path.
In the limited information setting, we use a map S × A × S → N to store the number of times a triple (s, a, t) has been observed. Initially, all counters are set to 0. Then the only difference between Algorithm 9 and 10 is the highlighted line, where we increment the counter of the observed triple, and the fact that we explicitly save the successor state in a variable to be able to do so.
Algorithm 10 New simulation algorithm counting occurrences 1: procedure SIMULATEcounting 2: X ← ∅ Set of states visited during this simulation 3: s ← s 0 4: repeat 5: X ← X ∪ s 6: a ← sampled uniformly from best U,L (s) 7: t ← sampled according to T(s, a)  (2) if there is positive chance to reach a state in the Markov chain G σ,τ (see Section A.2), there also is a positive chance that the simulation reaches it. This would be violated if the simulation depended on a constant number of steps that is smaller than the length of the longest path in the SG.
Correct heuristics for LOOPING are e.g. stopping a simulation after a certain length of the path is exceeded, where this length either grows with the number of simulations or is larger than the size of the partial model; or one can stop a simulation as soon as a state appears the second time in the path. This is the heuristic that we use in Algorithm 11. One could also be rigorous and check that in fact a bottom EC has been reached and there is no way to exit it. However, this check is computationally costly, and using the heuristics proved to be faster.
Algorithm 11 Heuristic to check whether we are probably looping and hence should stop the simulation 1: procedure LOOPING(State set X, state s) 2: return s / ∈ X

A.4 UPDATE
Algorithm 12 is the standard update procedure for the full information setting as used in e.g. [BCC + 14,HM17] (without the case distinction on players) and [KKKW18]. Note that we can only update states in X \ Goal, since targets are set correctly already, but updating them might be wrong (if they do not self-loop, but go somewhere else). This problem was not addressed in the other papers because of their preprocessing.

A.5 DEFLATE and FIND MSECs
This section contains the algorithms for finding MSECs and deflating them from [KKKW18]. Algorithm 13 finds all MSECs in X in the full information setting. It works by removing the suboptimal actions of Minimizer and computing MECs in the thus restricted game. Any MEC that uses only optimal actions of Minimizer is an MSEC according to L [KKKW18, Lemma 2]. Note that for convergence of the full algorithm we use that L converges towards V, so eventually all decisions of Minimizer are set correctly and we find the true MSECs.
Algorithm 13 Algorithm to find all MSECs in the game restricted to X 1: procedure FIND MSECs(State set X) 2: suboptAct ← {(s, {a ∈ Av(s) | L(s, a) > L(s)} | s ∈ S ∩ X} 3: G ← G with Av replaced by Av without suboptAct 4: return MEC(G ) Algorithm 14 shows how to adjust the upper bounds in an MSEC to ensure convergence, i.e. avoid the over-approximation being stuck at a greater fixpoint than the least. The resulting upper bound actually is sound for any set of states X given as input [KKKW18,Lemma 3].

B Additional examples
Example 4. We use the same SG as in Example 1 and a total error tolerance of δ = 0.1. We count the number of state-action pairs in our partial model as 6, but still have to over-estimate that every one of them has 1 pmin = 1 0.5 = 2 successors, so then δ T = 0.1 12 = 0.008. Note that in fact the only state-action pair in this example, where we really needed this error tolerance, is (s 1 , b 2 ). Any state-action pair that does not have 1 pmin successors reduces the true number of transitions that would need a part of the error tolerance. Hence, for realistic models we often still distribute very conservatively.
Example 5. In this example we illustrate how all our functions work together. We use the full SG from Figure 1, including the dashed part, and set p 1 = p 2 = 0.5.
Our first simulation reached 0. We stayed there until LOOPING returned true, since we looped the required number of times. {0} is a MEC in the game restricted to optimal actions of Minimizer and the states explored in the first simulation, and we already ensured it is an EC δ T -surely. Hence it is in the set returned by FIND MSECs, and U(0) is set to 0. Now we simulate enough times, such that we just realized that the upper bound of (s 1 , b 2 ) is something smaller than 1, since we know there is some probability to go to the sink 0 (similar to Example 1). Additionally, we also already know that (s 0 , a 2 ) has a positive probability of reaching the target, a higher one than (s 1 , b 2 )(by simulating paths as in Example 3).
Then T = {s 0 , s 1 } forms an MSEC, since it is Minimizer's best choice to remain in T and Maximizer is under the illusion that going to s 0 promises a value of 1, since the U(s 0 ) still is 1; hence the heuristic picks b 1 , as it already knows that b 2 yields a value smaller than 1. Thus, the next simulation loops in T until we know that it is an EC δ T -surely and thus we stop simulating. Then deflate decreases the upper bound of all states in T , i.e. of s 0 and s 1 , to U(s 1 , b 2 ), which is the best thing that Maximizer can achieve in T , and the least thing that Minimizer must allow to happen.
Then, in the next simulation, s 1 randomizes uniformly between b 1 and b 2 , and hence there is a positive chance of sampling the state-action pair (s 1 , b 2 ) and improving our probability estimate T (s 1 , b 2 ). Thus, we eventually improve our bounds for (s 1 , b 2 ).
Then the process is repeated, i.e. we are stuck in T again, which we realize immediately this time, since we can access the information from the previous simulations; then we deflate the upper bounds in T to U(s 1 , b 2 ) and after that again eventually improve the bounds for (s 1 , b 2 ). This way, we eventually are able to achieve any precision , since we know all relevant states (actually all states, in this example) and the estimate T(s 1 , b 2 ) becomes more and more precise given more and more samples.
However, as described in Section 3.5 doing it like this might add up the error, which is why we introduce the two phase approach.  We view sampling t from state-action pair (s, a) as a Bernoulli sequence, with success probability T(s, a, t) and the number of trials #(s, a). Given the number of successes #(s, a, t), we want to find a lower estimate T, such that P( T ≥ T(s, a, t)) ≤ δ T .

D.2 Proof of Lemma 1
We slightly reformulate the lemma.
Lemma 1 (UPDATE is correct). Let L and U be correct under-respectively over-approximations of the value function V, i.e. for all states s it holds that L(s) ≤ V(s) ≤ U(s); and let L and U be the new estimates after an application of UPDATE.
Assuming that T(s, a, t) ≤ T(s, a, t), UPDATE is correct, i.e. for all states s it holds that L (s) ≤ V(s) ≤ U (s) Proof. We prove, that actually for every action a ∈ Av(s) the claim holds. Then trivially it also holds for the whole state, independent of player, since if the estimates for every action are correct, also the maximum/minimum estimate is correct.
We exemplify the proof for the upper bound; the lower bound is analogous. Reasoning ( * ): If for a state t it holds that #(s, a, t) > 0, then t ∈ Post(s, a), because otherwise the state cannot have been sampled; so no summands are added. If there are states in Post(s, a) that have not been sampled yet, then T(s, a, t) = 0 as #(s, a, t) = 0; thus these states have no impact on the sum. So for every action V(s, a) ≤ U (s, a), so the new estimate is correct.

D.3 Proof of Theorem 1
Theorem 1. For any choice of sequence for N k , Algorithm 7 is an anytime algorithm with the following property: When it is stopped, it returns an interval for V(s 0 ) that is PAC for error tolerance δ and some ε , with 0 ≤ ε ≤ 1.
Proof. We proceed in two steps: First we prove, that we can continue with the assumption that our partial model is correct, as this assumption is only violated with a probability greater than the given error tolerance δ (intuitively, this is the 'probably' in the PAC-guarantee). Then we prove that under this assumption all the computations of our algorithms are correct and return sensible bounds.
Assumption 1: Every time the standard BVI (Lines 11-16) is executed, it holds that during the whole computation, for all s ∈ S, a ∈ Av(s), t ∈ Post(s, a) we have that T(s, a, t) ≤ T(s, a, t).
We now prove that Assumption 1 is only violated with probability smaller than δ, as sketched in Section 3.5.
-Every standard BVI depends on a fixed set of simulations, and hence on fixed values of the counters, so during the computation the probability estimates do not change. -Every standard BVI uses its own error tolerance δ k . As argued in Section 3.5, the sum of all these error tolerances is δ. -This error tolerance δ k is distributed over all transitions in the model as described in Section 3.2, where the number of all transitions is over-approximated. Hence the sum of the error probability in the whole partial model is bounded as follows: s∈ S,a∈Av(s),t∈Post(s,a) (using over-approximation of the number of transitions) = δ k (by definition of δ T in Line 11.) So in total, the sum of the error probability in each partial model is bounded by δ k , and the sum of all δ k is bounded by δ, so the overall error is bounded by δ. Thus, we continue with Assumption 1. It remains to prove that every time our algorithms modify the bounds, these modifications are correct. We do this by an induction as follows: For every standard BVI phase, the bounds are initialized conservatively in Line 12, so they are certainly correct in the beginning, i.e. L(s) ≤ V(s) ≤ U(s).
Assuming that L and U are correct, we now prove for each of our algorithms that modify L and U, that after the modification they still are correct.
UPDATE: Given the assumption that L and U are correct and Assumption 1, Lemma 1 proves our goal. DEFLATE: Note that the case of a target being inside the deflated state set T is handled correctly, since in that case all upper bounds remain at 1 and hence are correct over-approximations. We continue with the assumption that Goal ∩ T = ∅. We now make a case distinction on whether we know about the bestExit(T ) of Maximizer in our partial model. We know bestExit and it exists: This means there exists some state-action pair (s, a) with s ∈ S such that (s, a) exits T , and T(s, a, t) > 0 for some t / ∈ T . Note that by induction hypothesis U(t) is correct. Then the value of all states in T is set to U(s, a). By a similar argument as in the proof of Lemma 1 we get that U(s, a) ≥ U(s, a). By [KKKW18, Lemma 2] we know that U(s, a) ≥ V(u) for all u ∈ T . Thus, after deflating the upper bound still is correct.
There is no bestExit: In this case, Maximizer cannot leave T . Note that there is no target in T . Hence the value of all states is 0, and deflate sets the upper bound to 0 correctly. There is a bestExit, but we do not know it in our partial model: This case violates Assumption 1. Before deflate is called on a state set T , FIND MSECs ensures that T is an EC δ T -surely. So if there exists an exiting state-action pair (s, a), we have sampled it the number of steps required in Algorithm 3. We arrive at a contradiction by showing that the sum of all the estimated probabilities of the staying actions is larger than the actual sum of all staying probabilities, which violates Assumption 1. This is done by the following chain of equations:

D.4 Proof of Theorem 2
Theorem 2. There exists a choice of N k , such that Algorithm 7 is PAC for any input parameters ε, δ, i.e. it terminates almost surely and returns an interval for V(s 0 ) of width smaller than ε that is correct with probability at least 1 − δ. Proof.
-We use Assumption 1 as in the proof of Theorem 1. We first want to prove that the error in each state can be bounded, depending on the confidence width.
• Assume that the partial BVI has converged. For a Maximizer state s it holds that L(s) := max ≤ 1 p min confWidth since number of successors of one action can be approximated by 1 pmin and where confWidth is the greatest confidence width of the probability of any transition in the whole game. Dually, the same can be done for a state of Minimizer.
-For every state, if we have the error in our estimate to be ≤ 1 pmin confWidth, the error at the initial state can be bounded by |S| pmin confWidth.
-Dually, the error in the upper bound estimate can also be bounded by the same term. Hence, we want to ensure -From Section 3.2, we know that confWidth is given by log(δ T ) −2#(s,a) .
-Therefore, denoting #(s, a) by k, we have -In other words if every relevant transition 8 in the partial model is chosen at least k times, then we would achieve the desired confWidth leading to a solution with a precision of ε. -Let p = p |S| min be the minimum probability with which some transition can be played. We want to compute the number of samples N required so that the least likely transition is triggered at least k times.
-Let X be a random variable counting the number of times the least likely transition has been triggered. X is distributed according to a Binomial distribution with parameters N and p. The cumulative distribution function F (X ≤ k) = k i=0 N i p i (1 − p) n−i gives the probability that the least likely transition is triggered at least k times when N samples are taken.
-If we choose an N such that F (X ≥ k) ≥ 1/2, then it means that every relevant transition will be triggered at least k times with a probability of 1 2 , which means that the partial BVI will succeed with a probability of 1 2 . If the partial BVI fails, then we repeatedly try it until it succeeds. Note that hence we succeed almost surely, since the probability of succeeding in any BVI phase is 8 Note that the guiding heuristic picks the correct actions to sample the relevant parts of the state space, since by Theorem 1 the bounds are correct and by [BCC + 14, Theorem 3] simulation based asynchronous value iteration converges.