# Probabilistic black-box reachability checking (extended version)


## Abstract

Model checking has a long-standing tradition in software verification. Given a system design, it checks whether desired properties are satisfied. Unlike testing, it cannot be applied in a black-box setting. To overcome this limitation, Peled et al. introduced black-box checking, a combination of testing, model inference and model checking. The technique requires systems to be fully deterministic. For stochastic systems, statistical techniques are available. However, they cannot be applied to systems with non-deterministic choices. We present a black-box checking technique for stochastic systems that allows both non-deterministic and probabilistic behaviour. It involves model inference, testing and probabilistic model-checking. Here, we consider reachability checking, i.e., we infer near-optimal input-selection strategies for bounded reachability.

## Keywords

Model inference, Statistical model-checking, Reachability analysis, Black-box checking, Testing, Verification

## 1 Introduction

Model checking has a long-standing tradition in software verification. Given a system design, model-checking techniques determine whether requirements stated as formal properties are satisfied. These techniques and other forms of model-based verification fall short if no design is available. Model learning provides a solution to this issue. It establishes the basis for model-based verification by automatically learning automata models of black-box systems from observed data. Data used as basis for learning is usually given in the form of system traces, that is, sequences of system events, which may be partitioned into input and output events. Note that model learning is also often referred to as model inference, thus we use both terms interchangeably.

There are two main forms of model learning: *passive* learning and *active* learning. Passive learning learns from preexisting data such as system logs, while active learning actively queries the system that is examined to gain relevant information. This can for instance be done via testing. Noteworthy early examples of passive learning techniques are RPNI for regular languages [27, 41] and Alergia [11] for stochastic regular languages, which learn deterministic finite automata (DFAs) and their stochastic counterparts, respectively. Both RPNI and Alergia apply a principle called state-merging. More recent work based on this principle extends the applicability of passive model learning to timed systems [52], Moore machines [24] and to stochastic systems involving non-deterministic choices [35, 36], which we use in this article. All these approaches have in common that the models they learn depend on given sampled training data.

In contrast to this, active learning approaches rely on the possibility to query relevant information. Angluin formalised this by introducing the minimally adequate teacher framework in her seminal work on the \(L^*\) algorithm [4]. This framework assumes the existence of a teacher that is able to answer two types of queries: membership queries and equivalence queries. When a model of a software system is learned, membership queries basically check whether a given trace can be observed and equivalence queries check whether a hypothesised system model is equivalent to the system under investigation. In practice, both queries are usually implemented via testing. Since the introduction of \(L^*\), it has been adapted and extended to various types of systems like Mealy machines [37, 46], timed systems [25] and non-deterministic systems [28, 53]. There are also \(L^*\)-based learning approaches applicable for probabilistic system models [9, 19], but they place strong assumptions on the information that can be queried. These approaches are therefore unsuitable considering a testing scenario, which allows interaction with a black-box system only via testing.

In this paper, we consider such a testing scenario, in which we know the interface of a black-box system and we can gain information by testing the system. Furthermore, we assume that inputs to the system can be freely chosen and that reactions are stochastic. This makes Markov decision processes (MDPs) a well-suited choice of model type. MDPs allow for non-deterministic choices of inputs, while state transitions are stochastic, whereby outputs are produced depending on the entered state. Given such a system, we aim at generating testing strategies that produce desired outputs with high probability. For learning, we rely on an adaptation of Alergia [11] called IOAlergia [35, 36], which learns MDPs. While this learning technique is passive in general, our technique is active, as we generate new data for learning by testing. In an iterative approach, we steer the data generation based on learned models towards desired outputs to explore relevant parts of the system more thoroughly. That way, we aim at iteratively improving the accuracy of learning with respect to these outputs. This is in contrast to the application of IOAlergia in an active setting by Chen and Nielsen [13], as they aimed at actively improving the overall accuracy of learned models.

Model learning enables various forms of verification for black-box systems, such as differential equivalence testing on model-level [5, 48, 49], model-checking [20, 21] and model-based testing with learned models [1]. A particularly interesting technique combining model learning, model checking and testing is black-box checking introduced by Peled et al. [42]. This technique learns models of black-box systems in the form of DFAs on-the-fly and iteratively via \(L^*\). Whenever a hypothesis automaton model is created, the hypothesis is model checked, which may reveal a fault in the system or show that learning was incomplete and needs to be continued. If model checking does not reveal a fault, equivalence between the hypothesis and the black-box system is checked via testing. If non-equivalence is detected, the learned hypothesis is extended and learning continues.

The approach we follow is shown in Fig. 1. First, we sample system traces randomly. Then, we infer an MDP from these traces via the state-merging-based method described by Mao et al. [35, 36], which as noted above is called IOAlergia. Once we inferred a hypothesis model \({\mathcal {M}_\mathrm {h}}_1\), we use the Prism model checker [29] for a reachability analysis to find the maximal probability of reaching a state satisfying a property \(\psi \). Prism computes a probability *p* and a strategy \(s_1\) (also called adversary or scheduler) to reach \(\psi \) with *p*. Since IOAlergia infers models from system traces, the quality of the model \({\mathcal {M}_\mathrm {h}}_1\) depends on these traces. If \(\psi \) is not adequately covered, \(s_1\) inferred from \({\mathcal {M}_\mathrm {h}}_1\) may perform poorly and rarely reach \(\psi \). To account for that, we follow an incremental process. After initial random sampling, we iteratively infer models \({\mathcal {M}_\mathrm {h}}_i\) from which we infer strategies \(s_i\). To sample new traces for \({\mathcal {M}_\mathrm {h}}_{i+1}\) we select inputs randomly and based on \(s_i\), that is, we use the strategy \(s_i\) for directed testing. Selecting inputs with \(s_i\) ensures that paths relevant to \(\psi \) will be explored more thoroughly. This process is repeated until either a maximal number of rounds *n* has been executed, or a heuristic detects that the search has converged to a scheduler.

We mainly use Prism to generate strategies, but ignore the probabilities computed in the reachability analysis. Since the computations are based on possibly inaccurate learned models, the probabilities may significantly differ from the true probabilities. Strategies, however, may serve as testing strategies regardless of the accuracy of the learned models. In fact, we evaluate the final strategy generated in the process described above via directed testing of the system under test (SUT). Since the behaviour under a strategy is purely probabilistic, this is a form of Monte Carlo simulation, which is commonly used in statistical model-checking (SMC) [31]. The evaluation provides an estimation of the probability of reaching \(\psi \) with the actual SUT under strategy \(s_l\), where *l* is the last round that has been executed. By directly interacting with the SUT during evaluation, the computed estimation is an approximate lower bound for the optimal probability. In contrast to this, the reachability probabilities computed by Prism based on the learned model do not enjoy this property.

- *Learning* We rely on IOAlergia for learning MDPs. This algorithm has been developed with verification in mind and evaluated in a model-checking context [35, 36].

- *Probabilistic model-checking* We use Prism [29], a state-of-the-art probabilistic model checker, to generate strategies for bounded reachability based on learned models.

- *Testing* Directed sampling guided by a strategy is a form of model-based testing with learned models. The sampling algorithm was developed for the presented technique.

- *Statistical model-checking* We evaluate the final strategy on the SUT. As the SUT is a black-box, we cannot apply probabilistic model-checking and instead perform a Monte Carlo simulation to estimate reachability probabilities, like in SMC [31].

This article is an extended version of a conference paper presented at the *International Conference on Runtime Verification* [3]. Additional content presented in the current paper covers the heuristic check for detecting convergence, a more thorough evaluation including two new case studies and several further improvements throughout the paper.

The rest of this paper is structured as follows. In Sect. 2, we will discuss related work. Section 3 introduces preliminaries used in Sect. 4 which discusses the proposed approach. We present evaluation results in Sect. 5. Finally, we provide an outlook on future work and conclude in Sect. 6.

## 2 Related work

As discussed before, black-box checking [42] is closely related. In contrast to our technique, it considers non-stochastic systems, but more general properties. Various follow-up work demonstrates the potential of learning-based verification. Extensions, e.g., take existing models into account [26], focus on the composition of black-box and white-box components [17], or check security properties [47].

Mao et al. [34, 35, 36] also inferred probabilistic models with the purpose of model checking. In fact, we apply the model-inference technique for MDPs described by them. Wang et al. [54] apply a variant of Alergia as well and take properties into account during model inference with the goal of probabilistic model-checking. They apply automated property-specific abstraction/refinement to decrease the model-checking runtime. Nouri et al. [39] also combine stochastic learning and abstraction with respect to some property. Their goal is to improve the runtime of SMC. Notably, their approach could also be applied for black-box systems, but does not consider controllability via inputs. Further work on SMC of black-box systems can be found in [45, 55].

Although we did not adapt IOAlergia, a passive model-inference technique, we apply it in an active setting. Chen and Nielsen [13] describe active learning of MDPs based on IOAlergia. However, they do not aim at model checking, but try to reduce the required number of samples by directing sampling towards uncertainties.

We try to find optimal schedulers for MDPs. This problem has been solved in other simulation-based verification approaches as well, like in SMC. A lightweight approach for finding schedulers in SMC is described in [14, 33]. By representing schedulers efficiently, they are able to consider history-dependent schedulers and through “smart sampling” they accomplish finding near-optimal schedulers with low simulation budget. Brázdil et al. [10] presented an approach to unbounded reachability analysis via SMC. The technique is based on delayed Q-learning, a form of reinforcement learning, requiring only limited knowledge of the system (but more than our technique). Another approach using reinforcement learning for strategy inference for reachability objectives has been presented by David et al. [15]. They minimise expected cost while respecting worst-case time bounds.

Learning-based synthesis of control strategies for MDPs has also been studied by Fu and Topcu [23]. They obtain control strategies which are approximately optimal with respect to linear temporal logic (LTL) specifications. They consider transition probabilities to be initially unknown, but in contrast to our setting they assume the MDP structure to be known.

## 3 Preliminaries

We introduce background material following [22, 36], but consider only finite traces, finite paths, and *bounded* reachability, as we use a simulation-based approach. The restriction to bounded properties is also commonly found in SMC [31], which is also simulation-based and from which we apply concepts. Moreover, SMC of unbounded properties is especially challenging in a black-box setting [32].

*Basics.* Let \(\varSigma ^\mathrm {in}\) and \(\varSigma ^\mathrm {out}\) be sets of input and output symbols. An input/output string *s* is an alternating sequence of inputs and outputs, starting with an output, i.e. \(s\in \varSigma ^\mathrm {out} \times (\varSigma ^\mathrm {in} \times \varSigma ^\mathrm {out})^*\). We denote by |*s*| the number of input symbols in *s* and refer to it also as string/trace length. Given a set *S*, we denote by \( Dist (S)\) the set of probability distributions over *S*, i.e. for all \(\mu \) in \( Dist (S)\) we have \(\mu : S \rightarrow [0,1]\) such that \(\sum _{s\in S} \mu (s) = 1\). We denote the indicator function by \(\mathbf {1}_A\) which returns 1 for \(e \in A\) and 0 otherwise.

In Sect. 4, we apply two pseudo-random functions \( coinFlip \) and \( randSel \). These require an initialisation operation which takes a *seed*-value for a pseudo-random number generator and which returns implementations of both functions. The function \( coinFlip \) implements a biased coin flip and is defined for \(p \in [0,1]\) by \(\mathbb {P}( coinFlip (p) = \top ) = p\) and \(\mathbb {P}( coinFlip (p) = \bot ) = 1-p\). The function \( randSel \) takes a set as input and returns a single element of the set, whereby the element is chosen according to a uniform distribution, i.e. \(\forall e \in S: \mathbb {P}( randSel (S) = e) = \frac{1}{|S|}\).
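As a minimal sketch, the initialisation operation could be realised as follows. The name `init_random` and the use of Python's `random.Random` are assumptions made for illustration; the text only fixes the distributions of \( coinFlip \) and \( randSel \).

```python
import random
from typing import Callable, Collection, Tuple


def init_random(seed: int) -> Tuple[Callable[[float], bool], Callable[[Collection], object]]:
    """Hypothetical initialisation: returns (coin_flip, rand_sel) sharing one seeded generator."""
    rng = random.Random(seed)

    def coin_flip(p: float) -> bool:
        # rng.random() lies in [0, 1), so the result is True with probability p.
        return rng.random() < p

    def rand_sel(items: Collection) -> object:
        # Sorting by repr makes the choice depend only on the seed,
        # not on the iteration order of the given set.
        return rng.choice(sorted(items, key=repr))

    return coin_flip, rand_sel
```

Two generators created with the same seed produce the same sequence of decisions, which makes sampling runs repeatable.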

### 3.1 Markov decision processes

MDPs allow modelling reactive systems with probabilistic responses. An MDP starts in an initial state. During execution, the environment may choose and execute inputs non-deterministically upon which the system reacts according to its current state and its probabilistic transition function. For that, the system changes its state and produces an output.

### Definition 1

(*Markov decision process (MDP)*) A Markov decision process (MDP) is a tuple \(\mathcal {M} = \langle Q,\varSigma ^\mathrm {in}, \varSigma ^\mathrm {out},q_0, \delta , L\rangle \) where

- *Q* is a finite set of states,
- \(\varSigma ^\mathrm {in}\) and \(\varSigma ^\mathrm {out}\) are finite sets of input and output symbols, respectively,
- \(q_0 \in Q\) is the initial state,
- \(\delta : Q \times \varSigma ^\mathrm {in} \rightarrow Dist (Q)\) is the probabilistic transition function, and
- \(L: Q \rightarrow \varSigma ^\mathrm {out}\) is the labelling function.

The above definition requires MDPs to be input-enabled, that is, they must not block or reject inputs. Since we assume SUTs to be MDPs, this allows us to execute any input at any point in time.

We generally set \(\varSigma ^\mathrm {out} = \mathcal {P}( AP )\) where \( AP \) is a set of relevant propositions and *L*(*q*) gives the propositions that hold in state *q*. A finite path \(\rho \) through an MDP is an alternating sequence of states and inputs, i.e. \(\rho = q_0 i_1 q_1 \cdots i_{n-1} q_{n-1} i_n q_n \in Q \times (\varSigma ^\mathrm {in} \times Q)^*\). The set of all paths of an MDP \(\mathcal {M}\) is denoted by \(Path_\mathcal {M}\). A path \(\rho \) corresponds to a trace \(L(\rho ) = t\), i.e. an input/output string, with \(t = o_0 i_1 o_1 \cdots i_{n-1} o_{n-1} i_n o_n\) and \(L(q_i) = o_i\). To reason about probabilities of traces, we need a way to resolve non-determinism. To accomplish this, we introduce schedulers which are often also referred to as adversaries or strategies [36]. Schedulers basically choose the next input action (probabilistically) given a history of visited states, i.e. a path.

### Definition 2

(*Scheduler*) Given an MDP \(\mathcal {M} = \langle Q,\varSigma ^\mathrm {in}, \varSigma ^\mathrm {out},q_0, \delta , L\rangle \), a scheduler for \(\mathcal {M}\) is a function \(s: Path_\mathcal {M} \rightarrow Dist(\varSigma ^\mathrm {in})\).

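Together with a length probability \(p_l \in Dist (\mathbb {N}_0)\) over path lengths, an MDP \(\mathcal {M}\) and a scheduler *s* induce a probability distribution \(\mathbb {P}_{\mathcal {M},s}^l\) on the set of paths \(Path_\mathcal {M}\), defined by:

\[\mathbb {P}_{\mathcal {M},s}^l(q_0 i_1 q_1 \cdots i_n q_n) = p_l(n) \cdot \prod _{j=1}^{n} s(q_0 \cdots i_{j-1} q_{j-1})(i_j) \cdot \delta (q_{j-1},i_j)(q_j)\]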

Since we target reachability, we do not need general schedulers, but may restrict ourselves to *memoryless* *deterministic* schedulers [30]. A scheduler is memoryless if its choice of inputs depends only on the current state, i.e. it is a function from *Q* to \( Dist (\varSigma ^\mathrm {in})\). It is deterministic if for all \(\rho \in Path_\mathcal {M}\), there is exactly one \(i \in \varSigma ^\mathrm {in}\) such that \(s(\rho )(i) = 1\). Otherwise, it is called randomised. Example 1 describes an MDP and a scheduler for a faulty coffee machine.

Note that bounded reachability actually requires finite-memory schedulers. However, bounded reachability can be encoded as unbounded reachability by transforming the MDP model [10], at the expense of increased state space.

### Example 1

Figure 2a shows an MDP modelling a faulty coffee machine. Edge labels denote input symbols and corresponding transition probabilities, whereas output labels are placed above states. After insertion of a coin and pressing a button, the coffee machine is supposed to provide coffee. However, with a probability of 0.1 it may reset itself instead. A deterministic memoryless scheduler *s* may provide inputs \(\texttt {coin}\) and \(\texttt {but}\) in alternation, i.e. \(s(q_0) = 1_{\{\texttt {coin}\}}\), \(s(q_1) = 1_{\{\texttt {but}\}}\), and \(s(q_2) = 1_{\{\texttt {coin}\}}\). By setting \(p_l = 1_{\{2\}}\), all strings must have length 2, such that, e.g., \(\mathbb {P}^l_{\mathcal {M},s}(\rho ) = 0.9\) for \(\rho = q_0 \cdot \texttt {coin} \cdot q_1 \cdot \texttt {but} \cdot q_2\).
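The path probability in Example 1 can be checked with a short sketch. It assumes that the probability of a path is the length probability of its number of inputs, multiplied by the scheduler's probability for each chosen input and the transition probability of each step. The transition table `DELTA` is a hypothetical encoding of the coffee machine: only the transitions along the path of Example 1 are taken from the text, all other entries are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Dist = Dict[str, float]

# Hypothetical coffee machine; entries other than (q0, coin) and (q1, but) are assumed.
DELTA: Dict[Tuple[str, str], Dist] = {
    ("q0", "coin"): {"q1": 1.0},
    ("q0", "but"): {"q0": 1.0},
    ("q1", "coin"): {"q1": 1.0},
    ("q1", "but"): {"q2": 0.9, "q0": 0.1},  # resets with probability 0.1
    ("q2", "coin"): {"q1": 1.0},
    ("q2", "but"): {"q0": 1.0},
}


def alternating_scheduler(state: str) -> Dist:
    """Memoryless deterministic scheduler of Example 1: coin, but, coin."""
    return {{"q0": "coin", "q1": "but", "q2": "coin"}[state]: 1.0}


def length_two(n: int) -> float:
    """Length probability p_l that puts all mass on paths with two inputs."""
    return 1.0 if n == 2 else 0.0


def path_probability(path: List[str], delta: Dict[Tuple[str, str], Dist],
                     scheduler: Callable[[str], Dist],
                     p_l: Callable[[int], float]) -> float:
    """Probability of path [q0, i1, q1, ..., in, qn] under a memoryless scheduler."""
    states, inputs = path[0::2], path[1::2]
    prob = p_l(len(inputs))
    for j, inp in enumerate(inputs):
        prob *= scheduler(states[j]).get(inp, 0.0)             # input choice
        prob *= delta[(states[j], inp)].get(states[j + 1], 0.0)  # transition
    return prob
```

For the path \(q_0 \cdot \texttt {coin} \cdot q_1 \cdot \texttt {but} \cdot q_2\), the function multiplies \(1 \cdot 1 \cdot 1 \cdot 1 \cdot 0.9\), in line with the value stated in the example.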

### 3.2 Model inference

We infer MDPs via an adaptation of Alergia, called IOAlergia [11, 35, 36]. The technique takes input-output strings as input and constructs an input output frequency prefix tree acceptor (IOFPTA) representing the strings. An IOFPTA is a tree with edges labelled by inputs and nodes labelled by outputs. Additionally, edges are annotated with frequencies denoting how often a corresponding string was present in the sample. An IOFPTA with normalised frequencies represents a tree-shaped MDP whereby tree nodes correspond to MDP states.
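The construction of an IOFPTA from a sample can be sketched as follows. The class `Node` and the function `build_iofpta` are hypothetical names; the sketch assumes a non-empty list of traces that all start with the same initial output.

```python
from typing import Dict, List, Tuple


class Node:
    """IOFPTA node labelled with an output; edges carry an input and a frequency."""

    def __init__(self, output: str) -> None:
        self.output = output
        self.freq: Dict[Tuple[str, str], int] = {}         # (input, output) -> count
        self.children: Dict[Tuple[str, str], "Node"] = {}  # (input, output) -> child


def build_iofpta(traces: List[List[str]]) -> Node:
    """Each trace has the form [o0, i1, o1, ..., in, on]."""
    root = Node(traces[0][0])
    for trace in traces:
        node = root
        # Walk the tree along the input/output pairs and count each edge once per trace.
        for inp, out in zip(trace[1::2], trace[2::2]):
            key = (inp, out)
            node.freq[key] = node.freq.get(key, 0) + 1
            if key not in node.children:
                node.children[key] = Node(out)
            node = node.children[key]
    return root
```

Normalising the frequencies of all edges that leave a node with the same input yields the transition probabilities of the tree-shaped MDP.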

In a second step, the IOFPTA is transformed through iterated state-merging, which potentially introduces cycles. This step compares nodes in the tree and merges them if they show similar output behaviour such that it is likely that they correspond to the same state in the MDP that generated the data. IOAlergia basically views the IOFPTA as an MDP with non-normalised transition probabilities. During the operation of the algorithm, the states of the MDP are partitioned into three sets: *red* states which have been checked, *blue* states which are neighbours of red states, and uncoloured states. Initially, the only red state is the root of the IOFPTA. After initialisation, pairs of blue and red states are checked for compatibility and merged if compatible. Otherwise, the blue one is coloured red. This is repeated until all states are coloured. After normalisation of transition probabilities, IOAlergia returns an MDP.

Two states are compatible if they have the same label, their outgoing transitions are compatible and their successors are recursively compatible. Outgoing transitions are compatible if their empirical probabilities, estimated from the data, are sufficiently close to each other. In other words, we check for all inputs if the estimated probability distributions over outputs conditioned on inputs are sufficiently similar. If they are, we check recursive compatibility of successors reached by all input-output pairs. A parameter \(\epsilon _\mathrm {\textsc {Alergia}{}} \in (0,2]\) controls the significance level of a statistical test, which determines whether two empirical probabilities are sufficiently close. We represent calls to IOAlergia by \( \textsc {IOAlergia}{} (\mathcal {S},\epsilon _\mathrm {\textsc {Alergia}{}}) = \mathcal {M}\) where \(\mathcal {S}\) is a multiset of input-output strings and \(\mathcal {M}\) is a resulting MDP.
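The statistical test is not spelled out here. A common choice, used by the original Alergia algorithm, is a test based on Hoeffding bounds; the sketch below assumes that test. The function name and the handling of empty counts are assumptions.

```python
import math


def hoeffding_compatible(f1: int, n1: int, f2: int, n2: int, eps: float) -> bool:
    """Compare two empirical probabilities f1/n1 and f2/n2 (assumed Hoeffding test).

    f1, f2: how often an output was observed after an input in either state.
    n1, n2: how often that input was executed in either state.
    eps:    the Alergia parameter in (0, 2].
    """
    if n1 == 0 or n2 == 0:
        # Without observations there is no evidence against compatibility.
        return True
    bound = (math.sqrt(1 / n1) + math.sqrt(1 / n2)) * math.sqrt(0.5 * math.log(2 / eps))
    return abs(f1 / n1 - f2 / n2) < bound
```

Under this test, the bound shrinks as more observations become available, so states with many samples are merged only if their empirical distributions are close.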

Figure 1 shows an IOFPTA for the coffee machine from Example 1, but sampled with a (uniformly) randomised scheduler and a different \(p_l\). Edge labels denote inputs and associated frequencies, while outputs are placed next to nodes. At first, \(s_1\) might be merged with \(s_0\) as their successors are similar. Redirecting the \(\texttt {but}\) edge from \(s_1\) to \(s_0\) would create the self loop in the initial state.

### 3.3 Statistical model-checking

We consider step-bounded reachability. The syntax of formulas \(\phi \) is given by: \(\phi = F^{<k} \psi \text { with } \psi = \lnot \psi | \psi \wedge \psi | \psi \vee \psi | AP\), *AP* denoting an atomic proposition, and \(k \in \mathbb {N}\).

The formula \(\phi = F^{<k} \psi \) denotes that \(\psi \) should be satisfied in a state reached in less than *k* steps. We define the satisfaction of \(\phi = F^{<k} \psi \) via: a trace \(t = o_0 i_1 o_1 \cdots i_{n-1} o_{n-1} i_n o_n\) satisfies \(\phi \) denoted by \(t \models \phi \) if there is an \(i < k\) such that \(o_i \models \psi \). The evaluation of a trace *t* with respect to a formula \(\phi = F^{<k} \psi \) places restrictions on the length of *t*. In particular, we can only conclude that \(t \not \models \phi \) if *t* does not contain an \(o_i\) with \(o_i \models \psi \) and contains at least \(k-1\) steps. In other words, *t* must be long enough to determine that it does not satisfy a reachability property. To ascertain that all traces can be evaluated, we set the length probability \(p_l\) accordingly. We set for all traces: \(p_l(j) = 0\) for \(j < k - 1\).
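The evaluation of a single trace can be sketched as a three-valued check. The function name is hypothetical, outputs are modelled as sets of propositions, and \(\psi \) is passed as a predicate on outputs.

```python
from typing import Callable, FrozenSet, Optional, Sequence


def check_bounded_reach(outputs: Sequence[FrozenSet[str]], k: int,
                        psi: Callable[[FrozenSet[str]], bool]) -> Optional[bool]:
    """Evaluate F^{<k} psi on the outputs o_0, ..., o_n of a trace.

    Returns True if some o_i with i < k satisfies psi, False if the trace is
    long enough to rule this out, and None if the trace is too short.
    """
    if any(psi(o) for o in outputs[:k]):
        return True
    steps = len(outputs) - 1  # number of inputs in the trace
    if steps >= k - 1:
        return False
    return None
```

Choosing \(p_l\) such that every trace has at least \(k-1\) steps guarantees that the inconclusive case `None` never occurs.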

The composition of a scheduler *s* and an MDP \(\mathcal {M}\) behaves entirely probabilistically. In fact, it induces a discrete time Markov chain (DTMC) [22]. Hence, we can apply techniques from SMC without considering non-determinism. Furthermore, we can define the probability of satisfying a property \(\phi \) with an MDP \(\mathcal {M}\), and a scheduler *s* by \(\mathbb {P}_{\mathcal {M},s}(\phi ) = \mathbb {P}_{\mathcal {M},s}^l(\{\rho \in Path_\mathcal {M} | L(\rho ) \models \phi \})\) for an appropriate \(p_l\). Note that the value \(\mathbb {P}_{\mathcal {M},s}(\phi )\) does not depend on the actual \(p_l\) as long as \(p_l\) ensures that traces are long enough to allow reasoning about satisfaction of \(\phi \).

We associate Bernoulli random variables \(B_i\) with success probability *p* with simulations of the SUT. A realisation \(b_i\) is 1 if the corresponding sampled trace satisfies \(\phi \) and 0 otherwise. To estimate \(p = \mathbb {P}_{\mathcal {M},s}(\phi )\) we apply Monte Carlo simulation. Given *n* individual simulations, the estimate \(\hat{p}\) is the observed relative success frequency, i.e. \(\hat{p} = \sum _{i=1}^n \frac{b_i}{n}\). In order to bound the error of the estimation with a certain degree of confidence, we compute the number of required simulations based on a Chernoff bound [31, 40]. This bound guarantees that if *p* is the true probability, then the distance between \(\hat{p}\) and *p* is greater than or equal to some \(\epsilon \) with a probability of at most \(\delta \), i.e. \(\mathbb {P}(|\hat{p} - p| \ge \epsilon ) \le \delta \). The required number of simulations *n* and the parameters \(\epsilon \) and \(\delta \) are related by \(\delta =2e^{-2n\epsilon ^2}\) [40], i.e. we compute *n* by

\[n = \left\lceil \frac{\ln (2) - \ln (\delta )}{2\epsilon ^2} \right\rceil .\]
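Solving \(\delta =2e^{-2n\epsilon ^2}\) for *n* and rounding up gives the number of simulations. The following sketch, with hypothetical function names, computes this number and the Monte Carlo estimate.

```python
import math
from typing import Sequence


def required_simulations(eps: float, delta: float) -> int:
    """Smallest n with 2 * exp(-2 * n * eps**2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def estimate(results: Sequence[int]) -> float:
    """Relative success frequency of the realisations b_1, ..., b_n (each 0 or 1)."""
    return sum(results) / len(results)
```

For example, \(\epsilon = 0.01\) and \(\delta = 0.01\) require 26492 simulations, since \(\ln (200) / 0.0002 \approx 26491.6\).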

## 4 Probabilistic black-box reachability checking

1. *Create initial samples* The step collects a multiset of system traces through interaction with \(\mathcal {M}\) by uniformly choosing and executing inputs from \(\varSigma ^\mathrm {in}\).

2. *For at most* \( maxRounds \) *rounds do*

   - 2.1. *Infer model* Given the system traces sampled so far, we use IOAlergia to infer an MDP \({\mathcal {M}_\mathrm {h}}_i = \langle Q_\mathrm {h},\varSigma ^{\mathrm {in}}, {\varSigma ^\mathrm {out}}_\mathrm {h},{q_0}_\mathrm {h}, \delta _\mathrm {h}, L_\mathrm {h}\rangle \), where \(\mathrm {h}\) stands for hypothesis and \(i \in [1.\,.\, maxRounds ]\) denotes the current round.

   - 2.2. *Reachability analysis* Reachability analysis on \({\mathcal {M}_\mathrm {h}}_i\) with Prism [29]: i.e. we compute the maximum probability \(P_{{\mathcal {M}_\mathrm {h}}_i,s_i}(\phi )\) of satisfying \(\phi \) and generate the corresponding scheduler \(s_i\).

   - 2.3. *Sample with scheduler* We extend the multiset of system traces through property-directed sampling. For that, we choose some inputs with scheduler \(s_i\) and some randomly. With increasing *i*, we decrease the portion of random choices.

   - 2.4. *Check early stop* We may stop before executing \( maxRounds \) rounds if a stopping criterion is satisfied. This criterion is realised with a heuristic check for convergence. In this check, we basically determine whether several consecutive schedulers behave similarly.

3. *Evaluate* In a last step, we evaluate the most recent scheduler we have generated. For this evaluation, we sample system traces again, but avoid choosing inputs randomly. The relative frequency of satisfying \(\phi \) now gives us an estimate for the success probability of executing \(\mathcal {M}\), the black-box SUT, controlled by scheduler \(s_l\), where *l* is the last round we executed. A Chernoff bound [31, 40], which is commonly used in SMC, specifies the required number of samples.

*Create initial samples* In the first step, we sample system traces by choosing input actions randomly according to a uniform distribution. Hence, we sample with a scheduler \(s_\mathrm {unif}\) defined as follows: \(\forall q \in Q, s_\mathrm {unif}: q \mapsto \mu _\mathrm {unif}(\varSigma ^\mathrm {in})\) where \(\forall i \in \varSigma ^\mathrm {in}: \mu _\mathrm {unif}: i \mapsto \frac{1}{|\varSigma ^\mathrm {in}|}\). Sampling is further controlled by the length probability \(p_l\) and by the batch size \(n_\mathrm {batch}\), i.e. the number of traces collected at once. These parameters also affect subsequent sampling. Additionally, we set a *seed*-value for the initialisation of pseudo-random functions.

As discussed in Sect. 3, we set \(p_l(j) = 0\) for \(j < k - 1\) if *k* is the step bound of the property we test for. This would not be necessary for learning but we generally apply this constraint. The length of suffixes, i.e. the trace extensions beyond *k*, follows a geometric distribution parameterised by \(p_\mathrm {quit} \in [0,1]\). Before each step, we stop with probability \(p_\mathrm {quit}\). Hence, the number of input steps |*t*| in a trace *t* is distributed according to \(p_l(|t|) = (1-p_\mathrm {quit})^{|t| - k + 1} p_\mathrm {quit}\) for \(|t| \ge k - 1\) and \(p_l(|t|) = 0\) otherwise. Both \(p_\mathrm {quit}\) and \(n_\mathrm {batch}\) must be supplied by the user. In the following, \(\mathcal {S}_i\) denotes the multiset of traces created by the \(i\mathrm{th}\) sampling step, and \(\mathcal {S}_\mathrm {all}\) refers to the multiset of all traces. Hence, \(\mathcal {S}_\mathrm {all}\) is initially set to \(\mathcal {S}_1\), containing \(n_\mathrm {batch}\) traces distributed according to \(\mathbb {P}_{\mathcal {M},s_\mathrm {unif}}^l\), collected by random testing.
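The length distribution can be sampled by starting at \(k-1\) steps and flipping a biased coin before each further step. The sketch below uses a hypothetical function name and rejects \(p_\mathrm {quit} = 0\), for which the loop would not terminate.

```python
import random


def sample_trace_length(k: int, p_quit: float, rng: random.Random) -> int:
    """Sample |t| with p_l(|t|) = (1 - p_quit)^(|t| - k + 1) * p_quit for |t| >= k - 1."""
    if not 0 < p_quit <= 1:
        raise ValueError("p_quit must lie in (0, 1] for sampling to terminate")
    length = max(k - 1, 0)  # minimum length needed to evaluate the property
    # Before each additional step, stop with probability p_quit.
    while rng.random() >= p_quit:
        length += 1
    return length
```

With \(p_\mathrm {quit} = 1\), every trace has exactly \(k-1\) steps; smaller values produce longer suffixes on average.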

*Infer model* In this step, we use IOAlergia [35, 36] to infer an MDP \({\mathcal {M}_\mathrm {h}}_i = \langle Q_\mathrm {h},\varSigma ^{\mathrm {in}}, {\varSigma ^\mathrm {out}}_\mathrm {h},{q_0}_\mathrm {h}, \delta _\mathrm {h}, L_\mathrm {h}\rangle \), from \(\mathcal {S}_\mathrm {all} = \bigcup _{j \le i} \mathcal {S}_j\), i.e. an approximate system model. Strictly speaking, we infer an MDP with a partial transition function, which we make input-complete with a function \( complete \). The transition function of an inferred MDP may be undefined for some state-input pair if there is no corresponding execution in \(\mathcal {S}_\mathrm {all}\). For this reason, we add transitions to a special state labelled with \( dontKnow \) for undefined state-input pairs. Once we enter that state, we cannot leave it. The label \( dontKnow \) is more generally a special output label, which is not part of the original output alphabet.

Formally, \(\mathcal {M}_\mathrm {h}' = \langle Q_\mathrm {h}',\varSigma ^{\mathrm {in}}, {\varSigma ^\mathrm {out}}_\mathrm {h}',{q_0}_\mathrm {h}', \delta _\mathrm {h}', L_\mathrm {h}'\rangle = \textsc {IOAlergia}{} (\mathcal {S}_\mathrm {all}, \epsilon _\mathrm {\textsc {Alergia}{}})\) and \({\mathcal {M}_\mathrm {h}}_i = complete (\mathcal {M}_\mathrm {h}')\) where \(Q_\mathrm {h} = Q_\mathrm {h}' \cup \{q_\mathrm {undef}\}\), \({\varSigma ^\mathrm {out}}_\mathrm {h} = {\varSigma ^\mathrm {out}}_\mathrm {h}' \cup \{ dontKnow \}\), with \( dontKnow \notin {\varSigma ^\mathrm {out}}_\mathrm {h}'\), \({q_0}_\mathrm {h}' = {q_0}_\mathrm {h}\), \(\delta _\mathrm {h} = \delta _\mathrm {h}' \cup \{(q_\mathrm {undef},i) \mapsto \mathbf {1}_{\{q_\mathrm {undef}\}} | i \in \varSigma ^\mathrm {in}\} \cup \{(q,i) \mapsto \mathbf {1}_{\{q_\mathrm {undef}\}} | q \in Q_\mathrm {h}', i \in \varSigma ^{\mathrm {in}}, \not \exists d: (q,i) \mapsto d \in \delta _\mathrm {h}' \}\) and \(L_\mathrm {h} = L_\mathrm {h}' \cup \{q_\mathrm {undef} \mapsto dontKnow \}\).

Following the terminology of active automata learning [4], we refer to \({\mathcal {M}_\mathrm {h}}_i\) as the current hypothesis. Input completion via \( complete \) is required by Definition 1, but does not affect the reachability analysis. We aim at maximising the probability of desired events, therefore generated schedulers will not choose to execute inputs leading to the state \(q_\mathrm {undef}\) labelled \( dontKnow \). This is due to the fact that once we reached \(q_\mathrm {undef}\), we have a probability of zero to observe anything other than \( dontKnow \) according to our hypothesis.
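A minimal sketch of \( complete \), assuming the partial transition function is stored as a dictionary from state-input pairs to distributions. The dictionary encoding and the constant names are assumptions for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Dist = Dict[str, float]

Q_UNDEF = "q_undef"      # sink state, assumed not to occur in the inferred model
DONT_KNOW = "dontKnow"   # special output label, assumed not to be an original output


def complete(states: Iterable[str], inputs: Iterable[str],
             delta: Dict[Tuple[str, str], Dist],
             labels: Dict[str, str]
             ) -> Tuple[Set[str], Dict[Tuple[str, str], Dist], Dict[str, str]]:
    """Make a partial MDP input-complete by redirecting undefined pairs to Q_UNDEF."""
    inputs = list(inputs)
    new_states = set(states) | {Q_UNDEF}
    new_delta = {key: dict(dist) for key, dist in delta.items()}
    for q in new_states:
        for i in inputs:
            # Undefined pairs, including all inputs in Q_UNDEF itself, lead to the sink.
            new_delta.setdefault((q, i), {Q_UNDEF: 1.0})
    new_labels = dict(labels)
    new_labels[Q_UNDEF] = DONT_KNOW
    return new_states, new_delta, new_labels
```

Existing transitions are copied unchanged, so the completed model agrees with the inferred one wherever the sample provides information.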

*Reachability analysis* Given the current hypothesis inferred in the last step, our implementation of the approach uses the Prism model checker [29] to derive a scheduler for satisfying the property \(\phi \). This is achieved by performing the following steps in a fully automated manner:

1. Translate \({\mathcal {M}_\mathrm {h}}_i\) into the Prism modelling language, whereby we encode
   - 1.1 states using integers,
   - 1.2 inputs using commands labelled with actions, and
   - 1.3 outputs using labels.

2. Since Prism only supports scheduler generation for unbounded reachability properties, we preprocess the translated \({\mathcal {M}_\mathrm {h}}_i\) further [10] and encode \(\phi \) as unbounded property:
   - 2.1 We add a step-counter variable \( steps \) ranging between 0 and *k*, where *k* is the step bound of the examined property.
   - 2.2 The variable \( steps \) is incremented with every execution of an input until the maximal value *k* is reached. Once \( steps = k\), \( steps \) is left unchanged.
   - 2.3 We change \(\phi \) to \(\phi ' = F(\psi \wedge steps < k)\), i.e. we move the bound from the temporal operator to the property that should be reached.

3. Finally, we use the *sparse engine* of Prism to compute the maximum probability \(\max _s \mathbb {P}_{{\mathcal {M}_\mathrm {h}}_i,s}(\phi ')\) for satisfying \(\phi '\) and export the corresponding scheduler \({s_\mathrm {h}}_i\), i.e. we verify the property \( \texttt {Pmax=?[F(psi}~\texttt { \& }~\texttt {steps < k)]}\).

The model \(\mathcal {M}_ steps \) resulting from this preprocessing contains copies (*q*, *st*) of each state *q* of \({\mathcal {M}_\mathrm {h}}_i\), one for each value *st* the variable \( steps \) can take. Note that not all \(k+1\) copies of a state are reachable. If \(q'\) is reachable from *q* in \({\mathcal {M}_\mathrm {h}}_i\), then \((q',st+1)\) is reachable from (*q*, *st*) if \(st < k\). If \(st=k\), then \((q',st)\) is reachable from (*q*, *st*). The target states in \(\mathcal {M}_ steps \) for the unbounded reachability property \(\phi ' = F(\psi \wedge steps < k)\) are all (*q*, *st*) with \(L(q) \models \psi \) and \(st < k\). Furthermore, all (*q*, *st*) with \(st = k\) are non-target states from which we cannot reach target states, as required by the original bounded reachability property \(\phi \). Given \(\mathcal {M}_ steps \) and the unbounded reachability property \(\phi '\), Prism exports memoryless deterministic schedulers. These schedulers, however, do not define input choices for all states, but only for states reachable by the composition of scheduler and corresponding model. To account for cases with undefined scheduler behaviour, we use the notation \({s_\mathrm {h}}_i(q) = \bot \). It denotes that scheduler \({s_\mathrm {h}}_i\) does not define a choice for *q*.
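To make the step-counter construction concrete, the following sketch computes the maximal probability of \(F(\psi \wedge steps < k)\) by backward induction over the pairs (*q*, *st*). It illustrates the quantity that the model checker computes and is not Prism's algorithm or API; the dictionary-based MDP encoding and the name `max_bounded_reach` are assumptions made for this example.

```python
from typing import Callable, Dict, Optional, Tuple

Dist = Dict[str, float]            # successor state -> probability


def max_bounded_reach(delta: Dict[Tuple[str, str], Dist],
                      labels: Dict[str, str],
                      q0: str,
                      is_target: Callable[[str], bool],
                      k: int):
    """Maximise the probability of F(psi & steps < k) from (q0, 0).

    Returns the maximal probability and a scheduler over pairs (q, st).
    """
    states = list(labels)
    inputs = sorted({i for (_, i) in delta})
    # All pairs (q, k) are non-target states from which no target is reachable.
    value = {(q, k): 0.0 for q in states}
    sched: Dict[Tuple[str, int], Optional[str]] = {}
    for st in range(k - 1, -1, -1):
        for q in states:
            if is_target(labels[q]):
                value[(q, st)] = 1.0          # target: psi holds and st < k
                continue
            best, best_in = 0.0, None
            for i in inputs:
                dist = delta.get((q, i))
                if dist is None:
                    continue
                # Every input increments the counter: (q, st) -> (q', st + 1).
                v = sum(p * value[(q2, st + 1)] for q2, p in dist.items())
                if v > best:
                    best, best_in = v, i
            value[(q, st)] = best
            sched[(q, st)] = best_in          # None stands for an undefined choice
    return value[(q0, 0)], sched
```

Unlike the schedulers exported by Prism, this sketch assigns a choice to every pair (*q*, *st*) with \(st < k\) from which a target is reachable, whether or not the pair itself is reachable.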

*Sample with scheduler.*

Property-directed sampling with inferred schedulers aims at exploring more thoroughly those parts of the system that have been identified as relevant to the property. To avoid getting trapped in local maxima, we also explore new paths by choosing actions randomly with probability \({p_\mathrm {rand}}_i\), where *i* corresponds to the current round. This probability is decreased in each round to explore more broadly in the beginning and focus on relevant parts in later rounds. Two parameters control \({p_\mathrm {rand}}_i\): \(p_\mathrm {start} \in [0,1]\) for the initial probability and \(c_\mathrm {change} \in [0,1]\) specifying an exponential decrease, i.e. \({p_\mathrm {rand}}_1 = p_\mathrm {start}\) and \({p_\mathrm {rand}}_{i+1} = c_\mathrm {change} \cdot {p_\mathrm {rand}}_i \) for \(i \ge 1\).

1. The SUT may show outputs not foreseen by \({\mathcal {M}_\mathrm {h}}_i\), i.e. not only probabilities differ. In such cases, we cannot determine the correct state transition in \({\mathcal {M}_\mathrm {h}}_i\).
2. By performing random inputs we may follow a path that is not optimal with respect to \({\mathcal {M}_\mathrm {h}}_i\) and \(\phi \). Thus, we may enter a state where \({s_\mathrm {h}}_i\) is undefined.

Given \({\mathcal {M}_\mathrm {h}}_i\) and the generated scheduler \({s_\mathrm {h}}_i\), sampling requires two auxiliary operations:

- \( reset \): resets the SUT to the initial state and returns the unique initial output symbol
- \( exec \): executes a single input, changing the state of the SUT and returning the corresponding output
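The sampling step can be sketched as follows. Algorithm 1 is not reproduced here, so this is an approximation under stated assumptions rather than the authors' procedure: the SUT is any object offering `reset()` and `exec(inp)`, the hypothesis is deterministic (at most one successor per input and output), the scheduler is a dictionary over pairs of hypothesis state and step counter, and a trace ends after each step with probability `p_quit`. All function and parameter names are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def p_rand(i: int, p_start: float, c_change: float) -> float:
    """Random-input probability in round i >= 1: p_start * c_change^(i-1)."""
    return p_start * c_change ** (i - 1)


def sample_trace(sut, delta: Dict[Tuple[str, str], Dict[str, float]],
                 labels: Dict[str, str], q0: str,
                 sched: Dict[Tuple[str, int], str], inputs: List[str], k: int,
                 p_rnd: float, p_quit: float, rng: random.Random) -> List[str]:
    """Sample one trace (o0, i1, o1, ...) from the SUT, guided by sched."""
    trace = [sut.reset()]
    q: Optional[str] = q0          # current hypothesis state, None once lost
    st = 0                         # step counter, saturates at k
    while True:
        choice = sched.get((q, st)) if q is not None else None
        if choice is None or rng.random() < p_rnd:
            # Undefined scheduler, unknown state, or random exploration.
            choice = rng.choice(inputs)
        out = sut.exec(choice)
        trace += [choice, out]
        if q is not None:
            # Follow the hypothesis; an unforeseen output loses track of the state.
            succ = [q2 for q2 in delta.get((q, choice), {}) if labels[q2] == out]
            q = succ[0] if succ else None
        st = min(st + 1, k)
        if rng.random() < p_quit:
            return trace
```

Both cases listed above fall back to random input selection: an unforeseen output sets the tracked state to `None`, and a state without a scheduler entry yields no choice.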

*Evaluate.* As a result of the reachability analysis, Prism calculates a probability of satisfying \(\phi \). This probability, however, is derived from a learned model which is possibly inaccurate. Therefore, it may greatly differ from the actual probability of reachability with the SUT. To account for that, we evaluate the scheduler \(s_\mathrm {h} = {s_\mathrm {h}}_l\), where *l* is the last round we executed. We accomplish this by sampling a multiset of traces \(\mathcal {S}_\mathrm {eval}\), while generally selecting inputs with \(s_\mathrm {h}\), i.e. we execute Algorithm 1 with \({p_\mathrm {rand}}_i = 0\). Thereby, we implicitly sample traces from the DTMC induced by the composition of \(\mathcal {M}\) and \( randomised (s_\mathrm {h})\). Since this DTMC behaves entirely probabilistically, we can apply SMC. Hence, we estimate \(\mathbb {P}_{\mathcal {M}, randomised (s_\mathrm {h})}(\phi )\) by \(\hat{p}_{\mathcal {M},s_\mathrm {h}} = \frac{\left| \{s \in \mathcal {S}_\mathrm {eval} | s \models \phi \}\right| }{\left| \mathcal {S}_\mathrm {eval}\right| }\). To achieve a given error bound \(\epsilon _\mathrm {eval}\) with a given confidence \(1-\delta _\mathrm {eval}\), we compute the required number of samples \(\left| \mathcal {S}_\mathrm {eval}\right| = n_\mathrm {batch}\) based on a Chernoff bound [40], i.e. we apply (2). The estimation provides an approximate lower bound of the maximal reachability probability with the SUT. We consider \(\hat{p}_{\mathcal {M},s_\mathrm {h}}\) an approximate lower bound, because we know with confidence \(1{-}\delta _\mathrm {eval}\) that \(\max _s \mathbb {P}_{\mathcal {M},s}(\phi )\) is at least as large as \(\hat{p}_{\mathcal {M},s_\mathrm {h}}-\epsilon _\mathrm {eval}\).
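The evaluation step amounts to two small computations, sketched below. Equation (2) is not shown in this section, so the sketch assumes the common two-sided form of the Chernoff bound, \(n \ge \ln (2/\delta ) / (2\epsilon ^2)\); the function names are hypothetical.

```python
import math
from typing import Callable, List


def chernoff_samples(eps: float, delta: float) -> int:
    """Smallest integer n with n >= ln(2 / delta) / (2 * eps^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate(traces: List[list], satisfies: Callable[[list], bool]) -> float:
    """Relative frequency of sampled traces that satisfy the property."""
    return sum(1 for t in traces if satisfies(t)) / len(traces)
```

Under this assumption, the settings \(\epsilon _\mathrm {eval} = 0.01\) and \(\delta _\mathrm {eval} = 0.01\) used in the experiments correspond to 26492 traces per evaluation.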

*Check early stop.* We have observed that the performance of schedulers usually increases with the total amount of available data. Probability estimations derived with intermediate schedulers showed that schedulers generated in later rounds tend to perform better than those generated in earlier rounds. However, we have also seen fluctuations in these estimations over time, i.e. some schedulers may perform worse than schedulers generated in previous rounds. With increasing number of rounds these fluctuations generally diminish and the estimations converge. Intuitively, this can be explained by the influence of \({p_\mathrm {rand}}_{i}\) in Algorithm 1, which controls the probability of selecting random inputs and decreases over time. As the probability \({p_\mathrm {rand}}_{i}\) approaches zero, we will almost always select inputs with generated schedulers. This will generally only increase the confidence in parts of the system we have already explored, but will not explore new parts, and therefore new schedulers are likely to show similar behaviour to previous ones.

Based on these observations, we developed a heuristic check for convergence. If it detects convergence, we stop the iteration early before executing \( maxRounds \) rounds. Two simpler checks form the basis of the heuristic. The first, called \(\textsc {similarSched}\), compares the scheduler generated in the current round to the scheduler from the previous round and returns \( true \) if both behave similarly. The second check, called \(\textsc {conv}\), builds upon the first and reports convergence if we detect statistically similar behaviour via \(\textsc {similarSched}\) in multiple consecutive rounds. The rationale behind this is that schedulers should behave alike after convergence. We check for similarity rather than for equivalence because there may be several optimal inputs in a state and slight variations in transition probabilities in the inferred models may lead to different choices of inputs. Furthermore, we can compare schedulers during sampling by comparing whether they would choose the same inputs. This gives us a large number of events as basis for our decision and does not require additional sampling.

The convergence check has three parameters: \(\alpha _\mathrm {conv}\) controlling the confidence level, an error bound \(\epsilon _\mathrm {conv}\), and a bound on the number of rounds \(r_\mathrm {conv}\). The first two parameters control a statistical test which checks whether two schedulers behave similarly. For this test, we consider Bernoulli random variables \(E_i\) for \(i \in [2.\,.\, maxRounds ]\). \(E_i\) is equal to one if two consecutive schedulers \({s_\mathrm {h}}_i\) and \({s_\mathrm {h}}_{i-1}\) behave equivalently, i.e. choose the same input in some state, and zero otherwise. Let \(p_{E_i}\) be the success probability, the probability of \(E_i\) being equal to one. We observe samples of \(E_i\) in Line 13 of Algorithm 1. Each time we choose an input *i* with \({s_\mathrm {h}}_i\), we also determine which input \(i'\) the previous scheduler \({s_\mathrm {h}}_{i-1}\) would have chosen. We record a positive outcome if \(i = i'\) and a negative outcome otherwise.

Let \(\hat{p}_{E_i}\) be the relative number of positive outcomes, which is an estimate of \(p_{E_i}\). If \(p_{E_i}\) is equal to one, then both schedulers behave equivalently, they always choose the same input. Consequently, we test whether \(\hat{p}_{E_i}\) is close to one. We test the null hypothesis \(H_0: p \le 1-\epsilon _\mathrm {conv}\) against \(H_1: p > 1-\epsilon _\mathrm {conv}\) with a confidence level of \(1-\alpha _\mathrm {conv}\). The hypothesis \(H_1\) denotes that the compared schedulers choose the same inputs in most of the cases. Let \(\textsc {similarSched}(\alpha _\mathrm {conv},\epsilon _\mathrm {conv}, i)\) be the result of this test in round *i*, which is \( true \) if \(H_0\) is rejected and \( false \) otherwise.

Finally, we can formulate the complete convergence check \(\textsc {conv}(\alpha _\mathrm {conv},\epsilon _\mathrm {conv}, i)\). It returns \( true \) in round *i* if \(r_\mathrm {conv}\) consecutive calls of \(\textsc {similarSched}\) returned \( true \), i.e. \(\textsc {conv}(\alpha _\mathrm {conv},\epsilon _\mathrm {conv}, i) = \bigwedge ^i_{j=i-r_\mathrm {conv}+1} \textsc {similarSched}(\alpha _\mathrm {conv},\epsilon _\mathrm {conv}, j)\).
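The two checks can be sketched as follows. The text does not say which statistical test is applied, so this example assumes an exact one-sided binomial test; the names `similar_sched` and `conv` and their signatures are hypothetical, and `eps` must lie strictly between 0 and 1.

```python
import math
from typing import List


def similar_sched(agree: int, total: int, alpha: float, eps: float) -> bool:
    """Reject H0: p <= 1 - eps in favour of H1: p > 1 - eps at level alpha.

    agree is the number of positive outcomes among total samples of E_i.
    """
    if total == 0:
        return False                       # no evidence, no rejection
    p0 = 1.0 - eps
    # Exact p-value P(X >= agree) for X ~ Binomial(total, p0), summed in log space.
    p_value = 0.0
    for j in range(agree, total + 1):
        log_term = (math.lgamma(total + 1) - math.lgamma(j + 1)
                    - math.lgamma(total - j + 1)
                    + j * math.log(p0) + (total - j) * math.log(1.0 - p0))
        p_value += math.exp(log_term)
    return p_value < alpha


def conv(history: List[bool], r_conv: int) -> bool:
    """True if the last r_conv results of similar_sched were all true."""
    return len(history) >= r_conv and all(history[-r_conv:])
```

With \(\alpha _\mathrm {conv} = \epsilon _\mathrm {conv} = 0.01\), even perfect agreement on 100 samples does not reject \(H_0\) under this test, whereas perfect agreement on 1000 samples does, which illustrates why small sample sizes delay the detection of convergence.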

Note that \({p_\mathrm {rand}}_{i}\) implicitly affects the convergence check. We collect samples of \(E_i\) in Line 13 of Algorithm 1, thus a large \({p_\mathrm {rand}}_{i}\) causes Line 13 to be executed infrequently. As a result, sample sizes of \(E_i\) are small. This influence on the convergence check is beneficial because schedulers are more likely to improve if \({p_\mathrm {rand}}_{i}\) is large, as new parts of the system may be explored via frequent random steps.

While the check introduces further parameters, it may simplify the application of the approach in scenarios where we have little knowledge about the system at hand. In such cases, it may be difficult to find a reasonable choice for the number of rounds \( maxRounds \). With this heuristic, it is possible to choose \( maxRounds \) conservatively, but stop early once convergence is detected. However, it may also impair results, if convergence is detected too early.

*Convergence to the true model*

Generally, Mao et al. [36] showed convergence in the large sample limit for IOAlergia. However, the sampling mechanism needs to ensure that sufficiently many executions of all inputs in all states are observed. This is also discussed in [35], which states that IOAlergia requires a *fair* scheduler, one that chooses each input infinitely often. The uniformly randomised \(s_\mathrm {unif}\) satisfies this requirement. As a result, we have convergence in the limit if we perform only a single round of inference, in which we sample with \(s_\mathrm {unif}\).

Property-directed sampling favours certain inputs with increasing number of rounds, but it also selects random inputs with probability \({p_\mathrm {rand}}_i\) in round *i*. If we ensure that \({p_\mathrm {rand}}_i\) is always non-zero, we will select all inputs infinitely often in an infinite number of rounds. Therefore, the inferred models will converge to the true model (up to bisimulation equivalence) and the inferred schedulers will converge to the optimal scheduler.

Another way to approach convergence is to follow a hybrid approach, by collecting traces via property-directed sampling and via uniform sampling in parallel. Uniform sampling ensures that all inputs are executed sufficiently often, which entails convergence. Property-directed sampling explores parts of the system, identified to be relevant, which increases the confidence in the correct inference of those parts. As a result, intermediate schedulers are more likely to perform well.

Hence, we have convergence in the limit under certain assumptions. In practice, i.e. when learning from limited data, uniform schedulers are likely to be insufficient if events occur only after long interaction scenarios. If events occur rarely in the sampled system traces, then it is unlikely that the part modelling those events is accurately learned. Active learning, as described by Chen and Nielsen [13], addresses this issue by guiding sampling so as to reduce the uncertainty in the learned model. Our approach similarly guides sampling, but with the aim of reducing uncertainty along traces that are likely to satisfy a reachability property.

As noted above, we have seen that the inferred schedulers usually converge to a scheduler, which may not be globally optimal, though. We also performed experiments with the outlined hybrid approach to avoid getting trapped in local maxima, by collecting half the system traces through uniform sampling. While it showed favourable performance in a few cases, the incremental approach generally produced better results with the same number of samples. Therefore, we will not discuss experiments with the hybrid approach.

Apart from convergence, it may not always be necessary to find a (near-) optimal scheduler. A requirement may state that the probability of reaching an erroneous state must be smaller than some *p*. Once we found and evaluated a scheduler \(s_\mathrm {h}\) such that the estimation \(\hat{p}_{\mathcal {M},s_\mathrm {h}} \ge p\), we basically show with some confidence that the requirement is violated. Such a requirement could be the basis of another stopping criterion. If in round *i*, a sufficiently large number of the sampled traces \(\mathcal {S}_i\) reaches an erroneous state, we may decide to evaluate the corresponding scheduler \({s_\mathrm {h}}_{i-1}\). We could then stop if \(\hat{p}_{\mathcal {M},{s_\mathrm {h}}_{i-1}} \ge p\) and continue otherwise.

*Application and choice of parameters* We will now briefly discuss the choice of parameters taking our findings into account. A summary of all parameters along with a concise description is given by Table 1.

Table 1 All parameters with short descriptions

Parameter | Description |
---|---|
\(n_\mathrm {batch}\) | Number of traces sampled in one round |
\( maxRounds \) | Maximum number of rounds |
\(p_\mathrm {start}\) | Initial probability of random input selection |
\(c_\mathrm {change}\) | Factor changing the probability of random input selection |
\(p_\mathrm {quit}\) | Parameter of geometric distribution of sampled trace length |
\(\epsilon _\textsc {Alergia}{}\) | Controls significance level of statistical compatibility check of IOAlergia |
\(1-\alpha _\mathrm {conv}\) | Confidence level of convergence check |
\(\epsilon _\mathrm {conv}\) | Error bound of convergence check |
\(1-\delta _\mathrm {eval}\) | Confidence level of scheduler evaluation (Chernoff bound) |
\(\epsilon _\mathrm {eval}\) | Error bound of scheduler evaluation (Chernoff bound) |
\(r_\mathrm {conv}\) | Number of rounds considered in convergence check |

The product \(n_\mathrm {s} = n_\mathrm {batch} \cdot maxRounds \) defines the overall maximum number of samples for inference, thus it could be chosen as large as the testing/simulation budget permits. Increasing \( maxRounds \) while fixing \(n_\mathrm {s}\) increases the time required for learning and model checking. Intuitively, it improves accuracy as well, as sampling is more frequently adjusted towards the considered property. For the systems examined in Sect. 5, values of \( maxRounds \) in the range between 50 and 200 led to reasonable accuracy while incurring acceptable runtime overhead. Runtime overhead is the time spent learning and model checking, as opposed to the time spent doing actual testing, i.e. (property-directed) sampling. The convergence check takes three parameters as input, for which we identified well-suited default values. To ensure high confidence for the statistical test, we set \(\alpha _\mathrm {conv} = 0.01\). Since schedulers should choose the same input in most cases, \(\epsilon _\mathrm {conv}\) should be small, but greater than zero to allow for some variation. In our experiments, we set it to \(\epsilon _\mathrm {conv} = 0.01\) and we set \(r_\mathrm {conv} = 6\). More conservative choices would be possible at the expense of performing additional rounds.

The value of \(p_\mathrm {start}\) should generally be larger than 0.5, while \(c_\mathrm {change}\) should be close to 1. This ensures broad exploration in the beginning and more directed exploration afterwards. Finally, the choice of \(p_\mathrm {quit}\) depends on the simulation budget and the number of inputs. If there is a large number of inputs, it may be highly improbable to reach certain states within a small number of test steps via random testing. Consequently, we should allow for the execution of long tests, in order to reach states requiring complex combinations of inputs. Domain knowledge may also aid in choosing this parameter. If we, e.g., expect a long initialisation phase, \(p_\mathrm {quit}\) should be low to ensure that we reach states following the initialisation.

## 5 Experiments

We evaluated our approach on five case studies from the area of automata learning, control policy synthesis, and probabilistic model-checking. For the first case study, we created our own model of the slot machine described by Mao et al. [36] in the context of learning MDPs. Two case studies consider models of network protocols enhanced with stochastic failures. For that, we transformed deterministic Mealy-machine models as detailed below. The model used in the fourth case study is inspired by the gridworld example, for which Fu and Topcu synthesised control strategies [23]. Finally, we generate schedulers for a consensus protocol [6] which serves as a benchmark in probabilistic model-checking. We discussed the experiments involving the slot machine and the network protocol models before [3]. New additions in this extended version are experiments with the gridworld example, the consensus protocol, and experiments with the convergence check. Note that due to changes of the implementation, measurement results may differ from those in [3].

*Adding stochastic failures* Deterministic Mealy machines serve as the basis for two case studies. These Mealy machines are results from previous learning experiments [20, 49] and model communication protocols. Basically, we simulate stochastic failures by adding outputs represented by the label \( crash \). These occur with a predefined probability instead of the correct output. Upon such a failure, the system is reset. We implemented this by transforming the Mealy machines as follows:

1. Translate Mealy machine into Moore machine: this effectively creates an MDP \(\mathcal {M} = \langle Q,\varSigma ^\mathrm {in}, \varSigma ^\mathrm {out},q_0, \delta , L\rangle \) with a non-probabilistic \(\delta \).
2. Extend \(\varSigma ^\mathrm {out}\) with a new symbol \( crash \) and add \(q_\mathrm {cr}\) to *Q* with \(L(q_\mathrm {cr}) = crash \).
3. For a predefined probability \(p_\mathrm {cr}\) and for all *o* in a predefined set \( Crashes \):
   - 3.1. Find all \(q,q'\in Q\), \(i \in \varSigma ^\mathrm {in}\) such that \(\delta (q,i)(q') = 1\) and \(L(q') = o\)
   - 3.2. Set \(\delta (q,i)(q') = 1 -p_\mathrm {cr}\) and \(\delta (q,i)(q_\mathrm {cr}) = p_\mathrm {cr}\)
   - 3.3. For all \(i\in \varSigma ^\mathrm {in}\) set \(\delta (q_\mathrm {cr},i)(q_\mathrm {cr}) = p_\mathrm {cr}\) and \(\delta (q_\mathrm {cr},i)(q_0) = 1 - p_\mathrm {cr}\)
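Steps 2 and 3 of this transformation can be sketched as follows, starting from the Moore-style MDP produced by step 1. This is an illustrative sketch, not the code used for the experiments; the dictionary encoding and the names `add_failures`, `CRASH_STATE` and `CRASH` are assumptions made for this example.

```python
from typing import Dict, Set, Tuple

Dist = Dict[str, float]            # successor state -> probability
CRASH_STATE = "q_cr"               # hypothetical name of the added state
CRASH = "crash"                    # new output symbol


def add_failures(delta: Dict[Tuple[str, str], Dist],
                 labels: Dict[str, str],
                 q0: str,
                 inputs: Set[str],
                 crashes: Set[str],
                 p_cr: float):
    """Seed stochastic crashes into an MDP with non-probabilistic delta."""
    new_labels = dict(labels)
    new_labels[CRASH_STATE] = CRASH                      # step 2
    new_delta: Dict[Tuple[str, str], Dist] = {}
    for (q, i), dist in delta.items():
        (q2, _), = dist.items()                          # exactly one successor
        if labels[q2] in crashes:                        # steps 3.1 and 3.2
            new_delta[(q, i)] = {q2: 1.0 - p_cr, CRASH_STATE: p_cr}
        else:
            new_delta[(q, i)] = {q2: 1.0}
    for i in inputs:                                     # step 3.3
        new_delta[(CRASH_STATE, i)] = {CRASH_STATE: p_cr, q0: 1.0 - p_cr}
    return new_delta, new_labels
```

The single-element unpacking raises an error if a transition already has more than one successor, which guards the assumption that the input model is deterministic.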

*Measurement setup and criteria* We have complete information about all models. This allows us to compare our results to optimal values. Nevertheless, for the evaluation we treat the systems as black boxes. The state spaces of the models without step-counter variables for bounded reachability are of sizes 471 (slot machine), 63 (MQTT), 157 (TCP), 35 (gridworld), and 272 (consensus protocol), respectively. For each of these systems, we identified an output relevant to the application domain and applied the presented technique to reach states emitting this output with varying numbers of steps. The slot machine grants prizes and we generated strategies to observe the rarest prize. Using the steps discussed above, we seeded stochastic failures into the MQTT and TCP models, which we tried to reach. The gridworld we used in the evaluation contains a dedicated *goal* location that served as a reachability objective. In case of the consensus protocol, we generated strategies to finish the protocol, i.e. reach consensus, with high probability.

We compare four approaches to obtaining schedulers *s* for \(\mathbb {P}_{\mathcal {M},s}(\phi )\):

- Incremental Scheduler Inference
We apply the incremental approach discussed in Sect. 4 with a fixed number of rounds. Inferred schedulers are denoted by \(s_\mathrm {inc}\).

- Incremental with Convergence Check
We apply the incremental approach, but stop if we either detect convergence with \(\textsc {conv}\) or if \( maxRounds \) rounds have been executed. Inferred schedulers are denoted by \(s_\mathrm {conv}\).

- Monolithic Scheduler Inference
To check if the incremental refinement of inferred models pays off, we use the same approach but set \( maxRounds = 1\). In other words, we sample traces by solely choosing inputs randomly. Based on this, we perform a single round, inferring a model and a scheduler which we evaluate. To balance the simulation budget, we collect \( maxRounds \cdot n_\mathrm {batch}\) traces, where \( maxRounds \) and \(n_\mathrm {batch}\) are the parameter settings for inferring \(s_\mathrm {inc}\). We denote monolithically inferred schedulers by \(s_\mathrm {mono}\).

- Uniform Schedulers
As a baseline, we compare to the randomised scheduler \(s_\mathrm {unif}\) which chooses inputs according to a uniform distribution. This resembles random testing without additional knowledge.

We consider a scheduler *s* to be *near optimal* if the estimate \(\hat{p}_{\mathcal {M},s}\) of \(\mathbb {P}_{\mathcal {M},s}(\phi )\) derived via SMC is approximately equal to \(\mathbb {P}_{\mathcal {M},s_\mathrm {opt}}(\phi )\), i.e. \(|\hat{p}_{\mathcal {M},s} - \mathbb {P}_{\mathcal {M},s_\mathrm {opt}}(\phi )| \le \epsilon \), for an \(\epsilon > 0\). In the following, we use \(\epsilon = \epsilon _\mathrm {eval}\) for deciding near optimality, where \(\epsilon _\mathrm {eval}\) is the error bound of the applied Chernoff bound (2).

We balance the number of test steps for the incremental and the monolithic approach by executing the same number of tests. As a result, the simulation costs for executing tests are approximately the same. Since the incremental approach requires model learning and model checking in each round, it will also require more computation time than the monolithic approach. While our main focus is on evaluating with respect to the achieved probability estimation, we will briefly discuss computation cost at the end of the section.

We also briefly discuss estimations based on model checking of inferred models \(\mathcal {M}_\mathrm {h}\), i.e. \(\max _{s} \mathbb {P}_{\mathcal {M}_\mathrm {h},s}(\phi )\) calculated by Prism [29]. These estimations have also been discussed by Mao et al. [36]. They noted that estimations may differ significantly from optimal values in some cases, but generally represent good approximations.

*Implementation and settings* We base the evaluation on our Java implementation of the described technique which can be found at [43]. All experiments were performed with a Lenovo Thinkpad T450 with 16 GB RAM and an Intel Core i7-5600U CPU operating at 2.6 GHz and running Xubuntu Linux 18.04. The systems were modelled with Prism [29]. Prism served three purposes:

- We exported the state, transition, and label information from models. We simulated the models in a black-box fashion with this information.
- The maximal probabilities were computed via Prism.
- Prism’s scheduler generation was used for scheduler inference.

Table 2 General parameter settings for experiments

Parameter | Value |
---|---|
\(p_\mathrm {start}\) | 0.75 |
\(c_\mathrm {change}\) | 0.95 |
\(\epsilon _\textsc {Alergia}{}\) | 0.5 |
\(\alpha _\mathrm {conv}\) | 0.01 |
\(\epsilon _\mathrm {conv}\) | 0.01 |
\(\delta _\mathrm {eval}\) | 0.01 |
\(\epsilon _\mathrm {eval}\) | 0.01 |
\(r_\mathrm {conv}\) | 6 |

Simulation, as well as sampling, is controlled by probabilistic choices. To ensure reproducibility, we used fixed seeds for pseudo-random number generators controlling the choices. All experiments were run with 20 different seeds and we discuss statistics derived from 20 such runs. For the evaluation of schedulers, we applied a Chernoff bound with \(\epsilon _\mathrm {eval}=0.01\) and \(\delta _\mathrm {eval}=0.01\). In contrast to the conference version of this paper [3], we used a fixed significance level for the compatibility check of IOAlergia, by setting \(\epsilon _\mathrm {\textsc {Alergia}{}} = 0.5\), a value also used by Mao et al. [36]. They noted that IOAlergia is generally robust with respect to the choice of this value, but we found that our approach benefits from a larger \(\epsilon _\mathrm {\textsc {Alergia}{}}\), which causes fewer state merges and consequently larger models. Put differently, our approach benefits from more conservative state merging. As noted in Sect. 4, we aim at ensuring broad exploration in the beginning and property-directed exploration in later rounds. Therefore, we set \(p_\mathrm {start}=0.75\) and \(c_\mathrm {change} = 0.95\) unless otherwise noted. We set the convergence-check parameters in all experiments as suggested in Sect. 4: \(\alpha _\mathrm {conv} = 0.01\), \(\epsilon _\mathrm {conv} = 0.01\), and \(r_\mathrm {conv} = 6\). Table 2 summarises parameter settings that apply in general.

### 5.1 Slot machine

The slot machine has three reels, each of which shows either *apple* or *bar* after spinning (one input per reel). With an increasing number of spins, the probability of *bar* decreases. A player is given a number of spins *m*, after which one of three prizes is awarded depending on the reel configuration. A fourth input leads with equal probability either to two extra spins (with a maximum of *m*), or to stopping the game prematurely including issuance of prizes. For the evaluation, we reimplemented the model, therefore probabilities and state space differ from [36]. As property, we investigated reaching the output \( Pr10 \) if \(m=5\), representing a prize that is awarded after stopping the game, if all reels show *bar*. The parameter settings for the learning experiments are given by Table 3, that is, \(p_\mathrm {quit} = 0.05\), \( maxRounds = 100\), and \(n_\mathrm {batch} = 1000\).

Table 3 Parameter settings for the slot-machine case study

Parameter | Value |
---|---|
\(p_\mathrm {quit}\) | 0.05 |
\( maxRounds \) | 100 |
\(n_\mathrm {batch}\) | 1000 |

Figure 3 shows evaluation results comparing the different approaches. Box plots summarising the probability estimations for reaching \( Pr10 \) in less than 8 steps are shown in Fig. 3a, while Fig. 3b shows results for a limit of 14 steps. From left to right, the blue boxes correspond to \(s_\mathrm {mono}\), the black boxes correspond to \(s_\mathrm {inc}\), and the red boxes correspond to \(s_\mathrm {conv}\), i.e. the incremental approach with convergence check. Dashed lines mark optimal probabilities. Note that estimations may be slightly larger than the optimal value in rare cases because they are based on simulations. This can be observed for \(s_\mathrm {inc}\) in Fig. 3a and also in some of the following experiments. The applied Chernoff bound gives a confidence value for staying within error bound \(\epsilon _\mathrm {eval}\), in case we actually found an optimal scheduler.

Estimations with the baseline \(s_\mathrm {unif}\) are fairly constant, at approximately 0.012 for 8 steps and at 0.019 for 14 steps. As estimations with \(s_\mathrm {mono}\), \(s_\mathrm {inc}\), and \(s_\mathrm {conv}\) are significantly higher, this shows that our approach positively influences the probability of reaching a desired event. We further see that the incremental approach performs better than the monolithic one, whereby the gap increases with the step bound. Unlike the monolithic approach, the incremental approach finds near-optimal schedulers in both cases. However, the relative number of near-optimal schedulers decreases with increasing step bound.

As an alternative to simulation-based estimation, estimations may be based on model checking an inferred model [36]. For that, a model \(\mathcal {M}_\mathrm {h}\) is inferred, either incrementally or in a single step, and then a probabilistic model-checker computes \(\max _s \mathbb {P}_{\mathcal {M}_\mathrm {h},s}(\phi )\). In other words, SMC of the actual SUT controlled by an inferred scheduler is replaced by probabilistic model-checking of a learned model. SMC-based estimations are generally bounded above by the optimal probability, while model-checking-based estimations may also overestimate the true optimal probability. An advantage of model-checking-based estimation is that it reduces the simulation cost, since SMC requires additional sampling of the SUT.

Figure 4a and c show model-checking-based estimations of reaching \( Pr10 \) in less than 8 and 14 steps respectively. Here, \(s_\mathrm {mono}\) denotes that the models \(\mathcal {M}_\mathrm {h}\) were inferred in one step, while \(s_\mathrm {inc}\) denotes incremental model-inference. Incremental model-inference with early stopping is labelled \(s_\mathrm {conv}\). The figures demonstrate that these estimations differ from estimations obtained via SMC (see Fig. 3). The monolithic approach significantly overestimates in both cases. The incremental approach leads to more accurate results. None of the measurement results exceeds the optimal value by more than \(\epsilon _\mathrm {eval}\). Note that early stopping did not significantly affect these estimations. Still, the SMC-based estimations are more reliable in the sense that they establish an approximate lower bound for the true optimal probability.

### 5.2 MQTT with stochastic failures

We inferred schedulers *s* maximising \(\mathbb {P}_{\mathcal {M},s}(F^{<k} crash )\) for \(k \in \{5,8,11,14,17\}\). For the sampling, we set \(p_\mathrm {quit} = 0.025\). While this leads to samples longer than necessary for evaluation, e.g., for \(k=5\) the expected length of traces is 43, this increases the chance of seeing \( crash \) in a sample, which is reflected in inferred models. The simulation budget is limited by \( maxRounds = 60\) and \(n_\mathrm {batch} = 100\) for the incremental approach without early stopping. Since the experiments required more than 60 rounds for convergence to be detected, we set \( maxRounds \) to 240 for the incremental approach with convergence check. The parameters are also summarised in Table 4.

Table 4 Parameter settings for the MQTT case study

Parameter | Value |
---|---|
\(p_\mathrm {quit}\) | 0.025 |
\( maxRounds \) | 60 (240 for \(s_\mathrm {conv}\)) |
\(n_\mathrm {batch}\) | 100 |

Figure 5 shows box plots for the learning-based approaches. At each *k*, the box plots from left to right summarise measurements for \(s_\mathrm {mono}\) (blue), \(s_\mathrm {inc}\) (black), and \(s_\mathrm {conv}\) (red). The dashed line is the optimal probability achieved with \(s_\mathrm {opt}\), and the solid line represents the average probability of reaching \( crash \) with a uniformly randomised scheduler. The box plots demonstrate that larger probabilities are achievable with learning-based approaches than with random testing. All runs including outliers reach \( crash \) with a higher probability than random testing. The monolithic approach, however, only performs marginally better in some cases. Both incremental approaches achieve near-optimal results more reliably. All learning-based approaches find at least one near-optimal scheduler out of 20, but incremental inference finds near-optimal schedulers more reliably.

The convergence check causes a reliability gain for \(k=8\) and \(k=17\) in this case study, as it basically detected that executing 60 rounds is not enough. It generally required more than 60 rounds to detect convergence, except in a few cases. Experiments for larger values of *k* required slightly more rounds to be executed, such that on average 79.6 rounds were executed for \(k=17\). In contrast to this, we executed on average only 72.15 rounds for \(k=5\). We also see that most estimations of \(s_\mathrm {inc}\) and \(s_\mathrm {conv}\) are in a small range near the optimal values. However, a few outliers are significantly lower, e.g. at 0.46 for \(k=8\). Therefore, it makes sense to infer multiple schedulers and discard those performing poorly.

Model-checking-based estimations of reaching \( crash \) with the incremental approach led to overestimations in some cases. For instance, the maximal estimation for \(k=11\) is 0.724 while 0.651 is the true optimal value, and also for \(k=5\) one run leads to a model-checking-based estimation of 0.373 although 0.344 is the true optimal value. This is in contrast to the slot machine example (see Fig. 4), where the incremental approach produced results close to or lower than the optimal value.

### 5.3 TCP with stochastic failures

Table 5: Parameter settings for the TCP case study

| Parameter | Value |
|---|---|
| \(p_\mathrm {quit}\) | 0.025 |
| \( maxRounds \) | 120 (240 for \(s_\mathrm {conv}\)) |
| \(n_\mathrm {batch}\) | 250 |

Figure 6 shows box plots summarising the collected probability estimations. As before, there are groups of three box plots at each *k*, which from left to right represent \(s_\mathrm {mono}\), \(s_\mathrm {inc}\), and \(s_\mathrm {conv}\). The figure does not include plots for random testing with \(s_\mathrm {unif}\), because it reaches the crash with very low probability. Estimations produced by \(s_\mathrm {unif}\) are lower than 0.01 for all *k*. This demonstrates that random testing is insufficient in this case to reliably reach crashes of the system.

We further see that all learning-based approaches manage to generate near-optimal schedulers for all *k*. As before, both configurations of the incremental approach are more reliable than the monolithic approach. For this more complex system, the reliability gain from incremental scheduler generation is much larger than for the MQTT experiments. Early stopping affects probability estimations only marginally, which is in line with previous observations.

As for MQTT, we needed to set \( maxRounds \) to a value larger than initially planned for convergence to be detected. There is a large spread in the number of executed rounds, e.g., we executed between 42 and 240 rounds for \(k=14\). In this case, convergence was detected after 133.5 rounds on average. The average number of executed rounds is lower than 135 rounds for all *k*.

### 5.4 Gridworld

The following case study is inspired by a motion-planning scenario discussed by Fu and Topcu [23], also in the context of learning control strategies. In the experiments, we generate schedulers for a robot navigating in a gridworld environment. These schedulers shall with high probability reach a fixed goal location after starting from a fixed initial location.

A gridworld consists of tiles of different terrains and is surrounded by walls. To model obstacles, interior tiles may be walls as well. The robot starts at a predefined location and may move in one of four directions, i.e. we select from four inputs. It can observe changes in the type of terrain, whether it bumped into a wall, and whether it is located at the goal location. If the robot bumps into a wall, it will not change location. Whenever the robot moves, it may miss its target and instead reach a tile neighbouring the target with some probability, unless that neighbouring tile is a wall. For example, if the robot moves north, it may reach the tile situated north-west or north-east of its original position. The probability of such an error depends on the terrain of the target tile. We distinguish the terrains (with error probabilities in parentheses): *Mud* (0.4), *Sand* (0.25), *Concrete* (0), and *Grass* (0.2). As indicated above, *Wall* is also a terrain, one that cannot be entered.
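The movement model described above can be sketched as a function that returns the distribution over successor positions for one move. This is only an illustration, not the implementation used in the experiments: the grid encoding (one letter per tile, `W` for walls), the helper names, the even split of the error probability between the two side tiles, and the treatment of a single blocked side tile (all error mass goes to the remaining side tile) are assumptions that the text does not specify.

```python
from typing import Dict, List, Tuple

Pos = Tuple[int, int]

# Error probabilities per target terrain, as listed in the text.
ERROR_PROB = {"M": 0.4, "S": 0.25, "C": 0.0, "G": 0.2}
WALL = "W"  # assumed encoding for wall tiles

# Moves as (row delta, column delta); row 0 is the northern edge.
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}


def terrain(grid: List[str], pos: Pos) -> str:
    """Terrain at pos; everything outside the grid counts as a wall."""
    row, col = pos
    if 0 <= row < len(grid) and 0 <= col < len(grid[row]):
        return grid[row][col]
    return WALL


def move_distribution(grid: List[str], pos: Pos, direction: str) -> Dict[Pos, float]:
    """Distribution over successor positions for one move (hypothetical helper)."""
    d_row, d_col = MOVES[direction]
    target = (pos[0] + d_row, pos[1] + d_col)
    if terrain(grid, target) == WALL:
        return {pos: 1.0}  # bumping into a wall leaves the location unchanged
    # The two tiles beside the target, perpendicular to the move direction.
    candidates = [(target[0] + d_col, target[1] + d_row),
                  (target[0] - d_col, target[1] - d_row)]
    sides = [s for s in candidates if terrain(grid, s) != WALL]
    err = ERROR_PROB[terrain(grid, target)] if sides else 0.0
    dist = {target: 1.0 - err}
    if err > 0:
        for side in sides:
            dist[side] = err / len(sides)  # assumption: even split of the error
    return dist
```

For instance, moving north onto a *Mud* tile whose two side tiles are free yields the target with probability 0.6 and each side tile with probability 0.2 under these assumptions.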

Figure 7a shows the gridworld we used for evaluation. Black tiles represent walls, while the other terrains are represented by different shades of grey and their initial letters. A circle marks the initial location and a double circle marks the goal location. Although its state space, containing 35 different states (see footnote 2), is relatively small, navigating in this gridworld is challenging without prior knowledge. Initially, three moves to the right are necessary, as walls block direct moves towards the goal. This mimics the requirement of an initialisation routine.

Table 6: Parameter settings for the gridworld case study

| Parameter | Value |
|---|---|
| \(p_\mathrm {quit}\) | 0.5 |
| \( maxRounds \) | 150 |
| \(n_\mathrm {batch}\) | 500 |
| \(c_\mathrm {change}\) | 0.975 |

To infer schedulers, we applied the configuration given by Table 6, that is, \( maxRounds = 150\), and \(n_\mathrm {batch} = 500\), and \(p_\mathrm {quit} = 0.5\). Due to the larger value of \( maxRounds \), we increased \(c_\mathrm {change}\) as well to 0.975. This causes more random choices and thereby broad exploration in a larger number of rounds. As this case study differs significantly from the others, we chose \( maxRounds \) conservatively, performing a larger number of rounds.

Figure 7b shows measured estimations of \(\mathbb {P}_{\mathcal {M},s}(F^{<10} goal )\) for \(s_\mathrm {inc}\), \(s_\mathrm {conv}\), \(s_\mathrm {mono}\), and random testing with \(s_\mathrm {unif}\). The dashed line denotes the optimal probability.

Random testing obviously fails to reach the goal in fewer than ten steps. This is caused by the fact that it is unlikely to navigate past the walls via random exploration. The performance of the monolithic approach is also affected by this issue, because it learns solely from uniformly randomised sample traces. Random exploration covers only the initial part of the state space thoroughly. Therefore, the monolithically generated schedulers tend to perform worse than the incrementally generated ones. By directing exploration towards the goal, the incremental approach succeeds in generating near-optimal schedulers.

We also see that the impact of the convergence check is not severe. Both settings, with and without convergence check, produced similar results. The convergence check was able to reduce simulation costs for all but three runs of the experiment, in which convergence was not detected in fewer than 150 rounds. The incremental scheduler generation required at least 94 rounds, and on average 131.9 rounds were executed before convergence was detected.

### 5.5 Shared coin consensus

The last case study examines scheduler generation for a randomised consensus protocol by Aspnes and Herlihy [6]. In particular, we used a model of the protocol distributed with the PRISM model checker [29] as a basis for this case study.^{3} Note that we did not change the functionality of the protocol, but only performed minor adaptations such as adding action labels for inputs.

The protocol uses a shared counter *c* with a range of \([0.\,.\, 2\cdot (K+1)\cdot N]\), where *N* is the number of processes and *K* is an integer constant. Initially, *c* is set to \((K+1)\cdot N\). All involved processes perform the following steps to locally determine a preferred value *v*:

1. Flip a fair coin (local to the process).
2. Check the coin:
   - 2.1 If the coin shows tails, decrement the shared counter *c*.
   - 2.2 Otherwise, increment *c*.
3. Check the value of *c*:
   - 3.1 If \(c \le N\), then the preferred value is \(v = 1\).
   - 3.2 If \(c \ge 2\cdot (K+1)\cdot N - N\), then \(v = 2\).
   - 3.3 Otherwise, **goto** 1.

[…] *c* represents one step in the protocol. Since the processes execute asynchronously, their actions may be arbitrarily interleaved, whereby the interleavings are controlled by schedulers. A scheduler may choose from *N* inputs \( go _i\), one for each process *i*. Performing \( go _i\) basically instructs process *i* to perform the next step in the protocol. If process *i* already picked a preferred value in Step 3.1 or in Step 3.2, \( go _i\) is simply ignored.
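The protocol steps and the effect of the inputs \( go _i\) can be sketched as a small simulator. This is an illustrative sketch, not the PRISM model used in the case study: the class and attribute names are hypothetical, and it assumes that flipping the coin, updating *c*, and checking *c* each take exactly one \( go _i\) input.

```python
import random
from typing import Optional


class Process:
    """Local state of one process (hypothetical representation)."""

    def __init__(self) -> None:
        self.step = 1                      # next protocol step: 1, 2, or 3
        self.coin: Optional[str] = None    # "heads" or "tails" once flipped
        self.value: Optional[int] = None   # preferred value v once decided


class SharedCoin:
    """Shared-coin protocol with N processes and constant K."""

    def __init__(self, n: int, k: int, rng: random.Random) -> None:
        self.n = n
        self.rng = rng
        self.upper = 2 * (k + 1) * n       # upper end of the counter range
        self.c = (k + 1) * n               # initial value of the shared counter
        self.procs = [Process() for _ in range(n)]

    def go(self, i: int) -> None:
        """Let process i perform its next step; ignored once i has decided."""
        p = self.procs[i]
        if p.value is not None:
            return
        if p.step == 1:                    # Step 1: flip a fair local coin
            p.coin = self.rng.choice(["heads", "tails"])
            p.step = 2
        elif p.step == 2:                  # Step 2: update the shared counter
            self.c += -1 if p.coin == "tails" else 1
            p.step = 3
        elif self.c <= self.n:             # Step 3.1
            p.value = 1
        elif self.c >= self.upper - self.n:  # Step 3.2
            p.value = 2
        else:                              # Step 3.3: start over
            p.step = 1

    def finished(self) -> bool:
        return all(p.value is not None for p in self.procs)
```

A scheduler in this sketch is any function that picks the next process index from the observable state, for example from the counter value and the local coins.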

The visible outputs of the system are sets of propositions that hold in the current step. Firstly, the propositions expose the current value of the shared counter, i.e. they include \((c=k)\) for a \(k\in [0.\,.\, 2\cdot (K+1)\cdot N]\). Secondly, they expose values of the local coins, i.e. the outputs include one \(( coin _i = x)\) for each process *i*, where \(x\in \{ heads , tails \}\). Additionally, the outputs may include a proposition \( finished \), signalling that processes decided on a preferred value. As generating schedulers for this protocol in a learning-based fashion represents a demanding task, we only consider the case of \(K=2\) and \(N=2\), i.e. two asynchronously executing processes. Setting either of these constants to larger values significantly increases the number of steps to reach consensus.

[…] *c*. After performing \( go _2\), we have \( coin _2 = heads \) with probability 0.5 and cannot satisfy \(\phi \) anymore. All other traces would satisfy \(\phi \). Without knowledge about the state of local coins, we would not be able to make sensible choices of inputs. The randomised state machines controlling the processes remain a black box to us, though. Models of their composition are inferred via learning.

Table 7: Parameter settings for the consensus-protocol case study

| Parameter | Value |
|---|---|
| \(p_\mathrm {quit}\) | 0.025 |
| \( maxRounds \) | 100 |
| \(n_\mathrm {batch}\) | 250 |

Figure 8 shows evaluation results for the monolithic and the incremental approach in comparison to random testing. The box plots corresponding to each of these are labelled \(s_\mathrm {mono}\), \(s_\mathrm {inc}\) and \(s_\mathrm {unif}\), respectively. The dashed line represents the optimal probability as before. In contrast to previous experiments, we see that the monolithic approach may perform worse than random testing. For \(k=14\), there are three measurements that are exactly zero, but more than a quarter of the measurement results are near-optimal. For \(k=20\), the number of experiments achieving lower estimations than random testing decreases to two, but none of the generated schedulers is near-optimal. This can be explained by considering the minimum number of steps necessary to reach \( finished \): we need to execute at least 12 steps to observe it. As a result, it may happen that relevant parts of the system, namely states reached only after 12 steps, are inaccurately learned. This exemplifies that incremental scheduler generation pays off, because it is able to generate near-optimal schedulers for both values of *k*. For \(k=14\), three quarters of the incrementally generated schedulers are near-optimal and for \(k=20\), more than one quarter of the schedulers are near-optimal.

This case study actually highlights a weakness of our convergence check. It assumes that the search will converge to some unique behaviour. The protocol is completely symmetric for both processes, so it does not matter which process performs the first step. Hence, there are at least two optimal schedulers which differ in their initial action. This action is present in each of the 250 traces collected in one round, which presumably include further ambiguous choices. This causes \(\textsc {similarSched}\) to return \( false \) in most of the cases. Consequently, we do not discuss results obtained with the convergence check, as it rarely led to early stopping. A possible approach to counter this problem would be, assuming there is a lexicographic ordering on inputs, to always select the lexicographically minimal input, in case the choice is ambiguous.
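The tie-breaking rule suggested above can be sketched in a few lines. The function name, the dictionary representation of per-input values, and the tolerance are hypothetical; the sketch only illustrates selecting the lexicographically minimal input among equally good ones.

```python
from typing import Dict


def pick_input(values: Dict[str, float], tol: float = 1e-9) -> str:
    """Return the lexicographically minimal input among those with maximal value.

    values maps each input to the reachability probability it leads to;
    inputs within tol of the maximum are treated as ambiguous choices.
    """
    best = max(values.values())
    return min(inp for inp, value in values.items() if best - value <= tol)
```

With symmetric processes, `pick_input({"go2": 0.5, "go1": 0.5})` always yields `"go1"`, so two schedulers that differ only in such ambiguous choices would no longer be reported as dissimilar.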

### 5.6 Convergence check

Figure 9 contains graphs showing statistics summarising the collected estimations. The experiment summarised in Fig. 9a optimises reaching \( Pr10 \) in less than 14 steps with the slot machine. Figure 9b shows statistics for reaching \( goal \) in the gridworld in less than 10 steps. The graphs read as follows: the horizontal axis displays the rounds, and the vertical axis displays the value of the probability estimations. The lines from top to bottom represent the maximum, the third quartile, the median, the first quartile, and the minimum computed from the estimations collected in each round. As before, these values were computed from 20 runs.
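The five lines in each graph correspond to a five-number summary per round. A minimal sketch of this computation is given below; the function name is hypothetical, and the quantile method (`"inclusive"`) is an assumption, since the text does not state how quartiles were computed.

```python
import statistics
from typing import Sequence, Tuple


def round_summary(estimates: Sequence[float]) -> Tuple[float, float, float, float, float]:
    """Maximum, third quartile, median, first quartile, and minimum of one round.

    estimates holds one probability estimation per run (20 runs in the text).
    """
    q1, median, q3 = statistics.quantiles(estimates, n=4, method="inclusive")
    return max(estimates), q3, median, q1, min(estimates)
```

Applying `round_summary` to the estimations of every round gives the five series plotted from top to bottom.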

In both cases, we see that fluctuations decrease over time. The interquartile range decreases as well until it becomes relatively stable. Stable estimations are reached at around the \(70\mathrm{th}\) round in Fig. 9a, which is the area where convergence was detected – we stopped on average after 71.05 rounds. We see larger fluctuations of the minimal value in Fig. 9b, but they decline as well. Fluctuations of the minimal value can also be observed after 150 rounds. As a result, we may stop too early in rare cases.

Figure 9b also reveals unexpected behaviour. Testing of the gridworld actually required relatively few rounds of learning to achieve good results. In particular, the estimations after the first rounds were larger than expected, because the basis for the first round of learning is formed by only \(n_\mathrm {batch}\) random tests.

### 5.7 Runtime

Table 8: Average runtime of learning and scheduler generation for various properties (all values in seconds)

| Case study and property | Operation | \(s_\mathrm {mono}\) | \(s_\mathrm {inc}\) | \(s_\mathrm {conv}\) |
|---|---|---|---|---|
| Slot machine: \(F^{<14} Pr10 \) | Learning | 3.4 | 114.4 | 64.8 |
| | Scheduler generation | 3.8 | 364.6 | 226.6 |
| MQTT: \(F^{<17} crash \) | Learning | 1.7 | 50.5 | 75.4 |
| | Scheduler generation | 2.6 | 166.4 | 174.0 |
| TCP: \(F^{<17} crash \) | Learning | 21.3 | 471.7 | 564.4 |
| | Scheduler generation | 3.4 | 256.9 | 248.3 |
| Gridworld: \(F^{<10} goal \) | Learning | 2.1 | 97.5 | 96.7 |
| | Scheduler generation | 2.0 | 324.6 | 317.9 |
| Shared coin: \(F^{<20} finished \) | Learning | 7.7 | 300.9 | – |
| | Scheduler generation | 3.9 | 316.8 | – |

It can be seen that the incremental approaches, denoted by \(s_\mathrm {inc}\) and \(s_\mathrm {conv}\), require considerably more time to complete. Incremental scheduler generation without convergence detection, for instance, takes on average 728.6 s for the TCP property \(F^{<17} crash \), while the monolithic approach requires only 24.7 s. Thus, the better performance with respect to maximising probability estimations comes at the cost of increased runtime for learning and scheduler generation. In a testing scenario with real-world implementations, however, this time overhead may be negligible. If network communication is necessary, e.g., for protocol testing, the simulation time required for interacting with the SUT can be assumed to dominate the overall runtime. To contrast simulation runtime with the runtime of learning and scheduler generation, consider the hypothetical but realistic scenario in which each simulation step takes about 10 milliseconds. Both approaches, the monolithic and the incremental, require approximately the same number of simulation steps, about \(1.7 \cdot 10^6\) for \(F^{<17} crash \). In this scenario, the simulation duration would amount to about 4.7 h, such that the runtime overhead of 703.9 s caused by the incremental approach would be low in comparison. Similar observations can be made for other case studies. Since the time spent simulating the SUT can be expected to dominate the computation time, we conclude that the incremental approach is preferable to the monolithic approach in this context.
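The figures in this comparison follow directly from the TCP rows of the runtime table and the assumed step duration. The short calculation below reproduces them; the variable names are illustrative only.

```python
# TCP, F^{<17} crash: learning + scheduler generation, in seconds.
mono_seconds = 21.3 + 3.4        # monolithic approach
inc_seconds = 471.7 + 256.9      # incremental approach without convergence check
overhead_seconds = inc_seconds - mono_seconds

# Hypothetical scenario from the text: 10 ms per simulation step.
step_seconds = 0.010
simulation_steps = 1.7e6
simulation_seconds = simulation_steps * step_seconds
simulation_hours = simulation_seconds / 3600

# Share of the overhead relative to the simulation time.
overhead_ratio = overhead_seconds / simulation_seconds
```

Here `overhead_seconds` evaluates to roughly 703.9, `simulation_hours` to roughly 4.7, and `overhead_ratio` to about 0.04, i.e. the overhead is around 4% of the assumed simulation time.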

In Table 8, we also see that the convergence detection provides a performance gain for the slot machine, but causes slightly worse runtime for MQTT and TCP. The decreased performance is caused by the fact that convergence often could not be detected within the \( maxRounds \) used for \(s_\mathrm {inc}\). Therefore, we increased \( maxRounds \) of \(s_\mathrm {conv}\) for MQTT and TCP, as discussed above. However, the goal of convergence detection is not a reduction of runtime. With the convergence detection heuristic, we want to provide a stopping criterion that does not solely rely on an arbitrarily chosen \( maxRounds \) parameter.

Finally, we want to discuss the runtime complexity of learning and scheduler generation. The worst-case complexity of IOAlergia is cubic in the size of the merged tree-shaped representation of the sampled traces, but the typical behaviour observed in practice is approximately linear [36]. Hence, it is unlikely that learning runtime could be improved, but the scheduler generation runtime can potentially be improved. Our implementation communicates with Prism via files, standard input and standard output. As a result, there is substantial communication overhead that could be removed via a tighter integration of scheduler generation. Prism’s default technique for scheduler generation, which is called *value iteration*, could also benefit algorithmically from such a tight integration. Since we check reachability with respect to a bound *k*, we could also bound the number of iterations performed by value iteration by *k* [7]. This leads to a worst-case runtime complexity of \(O(k\cdot n^2 \cdot m)\), where *n* is the number of states of a learned model and *m* is the number of inputs. The number of inputs *m* is generally a small constant and we have observed that \(n^2\) is generally smaller than the number of sampling steps required for learning. As a result, we expect the simulation time to generally dominate in non-simulated scenarios.
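To illustrate the bounded variant of value iteration mentioned above, the following sketch computes maximal step-bounded reachability probabilities together with a step-dependent input choice. It is not Prism's implementation: the MDP encoding (a nested dictionary in which every successor state is also a key), the function name, and the `steps` parameter are assumptions. Each of the `steps` iterations visits every state, input, and successor, which matches the \(O(k\cdot n^2 \cdot m)\) bound for dense transition relations.

```python
from typing import Dict, List, Optional, Set, Tuple

# mdp[state][input] is a list of (probability, successor) pairs.
MDP = Dict[str, Dict[str, List[Tuple[float, str]]]]


def bounded_max_reachability(mdp: MDP, goal: Set[str], steps: int):
    """Maximal probability of reaching goal within the given number of steps.

    Returns the value per state and one decision rule per iteration, where
    rules[j] gives the input to choose when j + 1 steps remain.
    """
    values = {s: 1.0 if s in goal else 0.0 for s in mdp}
    rules: List[Dict[str, Optional[str]]] = []
    for _ in range(steps):
        new_values: Dict[str, float] = {}
        rule: Dict[str, Optional[str]] = {}
        for state, actions in mdp.items():
            if state in goal:
                new_values[state] = 1.0
                continue
            best_input, best_value = None, 0.0
            for inp in sorted(actions):  # sorted: deterministic tie-breaking
                value = sum(p * values[succ] for p, succ in actions[inp])
                if best_input is None or value > best_value:
                    best_input, best_value = inp, value
            new_values[state], rule[state] = best_value, best_input
        values = new_values
        rules.append(rule)
    return values, rules
```

Because the loop runs exactly `steps` times, no convergence test is needed; how `steps` relates to the strict bound in \(F^{<k}\) depends on how steps are counted in the learned model and is left to the caller.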

### 5.8 Discussion

We applied our approach to various types of models in several configurations, which we compared with each other, with the true optimal testing strategy, and with random testing as a baseline. The results of the performed experiments show (1) that learning-based approaches outperform the baseline and (2) that the incremental approach is able to generate near-optimal schedulers in all cases. In most experiments, the median probability estimation derived with the incremental approach was near-optimal, thus it generated near-optimal schedulers reliably. We have also seen that the convergence detection heuristic did not have a negative impact on the accuracy of the incremental approach. However, we are not able to give concrete bounds on the required number of samples to achieve a desired success rate. This is due to the fact that we rely on IOAlergia, for which convergence in the limit has been shown [36], but stronger guarantees are currently not available.

We generally targeted systems with small state space. An application in practice therefore requires abstraction to ensure that the state space is not prohibitively large. This is generally required in learning-based verification and several applications have shown that learning combined with abstraction provides effective means to enable model-based analysis [20, 21, 44, 49].

In addition to that, we have also seen limitations that cannot be solved by abstraction. The differences between estimations and optimal values tend to increase with the step bound *k*. This is potentially caused by the exponential growth of the number of different traces. This growth also affects the application of the approach to large gridworld examples. Increasing the width of the gridworld also increases the number of steps required to reach the goal and causes the performance to drop. A possible mitigation would be to identify disabled inputs, i.e. inputs rejected by the SUT, if we have such information. This might prevent certain traces from being executed beyond a disabled input. In the original version of IOAlergia [35], such knowledge facilitates learning, because disabled inputs are assumed to leave the current state unchanged. In the gridworld example, we may consider inputs to be disabled if they cause the robot to move into a wall.

A related issue also affects the case study on the consensus protocol. Changing the number of processes from two to four increases the minimum number of steps to reach \( finished \) to 24. Here, composition causes the state space to grow. We could tackle such problems via decomposition. Instead of learning a large monolithic model, several small models could be learned, which would then be composed for the reachability analysis.

## 6 Conclusion

We presented an approach to infer near-optimal schedulers for reachability objectives of MDPs. To our knowledge, it is the first such approach to be applicable in a purely black-box setting, where only the input interface is known. This is accomplished by incrementally refining the knowledge about the system via model inference and, based on that, property-directed exploration of the system. Section 5 presents promising results, showing that near-optimality can be achieved.

Therefore, we plan to investigate this approach further and extend it. As a first step, we are currently evaluating the method on more case studies. In order to be able to examine more complex systems, we are planning to work on compositional verification. That is, we are investigating how to benefit from decomposition, as opposed to treating composed systems as large monolithic systems. We are also studying the applicability in different testing scenarios. In a testing context, e.g., Nachmanson et al. [38] discussed strategies for bounded reachability games, but with a given model.

Furthermore, non-functional properties like execution time could be considered. For that, we need to devise a model-inference technique which considers both non-deterministic choice of inputs and (continuous) time. Current approaches for probabilistic timed systems do not account for non-determinism of this form [36, 52]. If we had such a technique, we could, e.g., use Prism with the digital clocks engine [29] or Uppaal Stratego [15, 16] to infer schedulers. Another possible extension would be to consider more general properties than reachability. This would require replacing Prism’s scheduler generation in our approach. In conclusion, we believe that our results are encouraging and that there are many promising directions for future research.

## Footnotes

- 1. If we reach the state labelled \( dontKnow \) during sampling, the outputs of the hypothesis and the SUT are guaranteed to differ. As a result, we will continue sampling with random inputs after reaching that state.
- 2. The number of states does not equal the number of reachable locations because locations adjacent to walls require two states in the MDP: one outputting the terrain and one with the output \( wall \). States outputting \( wall \) are reached after the robot bumps into a wall.
- 3. A thorough discussion of the model and related experiments can be found at http://www.prismmodelchecker.org/casestudies/consensus_prism.php. Accessed: 2018-12-03

## Notes

### Acknowledgements

Open access funding provided by Graz University of Technology. This work was supported by the TU Graz LEAD project “Dependable Internet of Things in Adverse Environments”. The authors would like to thank the LEAD project members Roderick Bloem, Masoud Ebrahimi, Franz Pernkopf, Franz Röck, and Tobias Schrank for fruitful discussions.

## References

- 1. Aichernig BK, Mostowski W, Mousavi MR, Tappler M, Taromirad M (2018) Model learning and model-based testing. In: Bennaceur A, Hähnle R, Meinke K (eds) Machine learning for dynamic software analysis: potentials and limits–international Dagstuhl seminar 16172, Dagstuhl Castle, Germany, April 24–27, 2016. Revised papers, Lecture notes in computer science, vol 11026, pp 74–100. Springer. https://doi.org/10.1007/978-3-319-96562-8_3
- 2. Aichernig BK, Tappler M (2017) Learning from faults: mutation testing in active automata learning. In: Barrett C, Davies M, Kahsai T (eds) NASA formal methods–9th international symposium, NFM 2017, Moffett Field, CA, USA, May 16–18, 2017. Proceedings, Lecture notes in computer science, vol 10227, pp 19–34. https://doi.org/10.1007/978-3-319-57288-8_2
- 3. Aichernig BK, Tappler M (2017) Probabilistic black-box reachability checking. In: Lahiri SK, Reger G (eds) Runtime verification–17th international conference, RV 2017, Seattle, WA, USA, September 13–16, 2017. Proceedings, Lecture notes in computer science, vol 10548, pp 50–67. Springer. https://doi.org/10.1007/978-3-319-67531-2_4
- 4. Angluin D (1987) Learning regular sets from queries and counterexamples. Inf. Comput. 75(2):87–106. https://doi.org/10.1016/0890-5401(87)90052-6
- 5. Argyros G, Stais I, Jana S, Keromytis AD, Kiayias A (2016) SFADiff: automated evasion attacks and fingerprinting using black-box differential automata learning. In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp 1690–1701. ACM. https://doi.org/10.1145/2976749.2978383
- 6. Aspnes J, Herlihy M (1990) Fast randomized consensus using shared memory. J Algorithms 11(3):441–461. https://doi.org/10.1016/0196-6774(90)90021-6
- 7. Baier C, Katoen J (2008) Principles of model checking. MIT Press, Cambridge
- 8. Banks A, Gupta R (eds) (2014) MQTT version 3.1.1. OASIS standard. http://docs.oasis-open.org/mqtt/mqtt/v3.1.1/os/mqtt-v3.1.1-os.html
- 9. Beimel A, Bergadano F, Bshouty NH, Kushilevitz E, Varricchio S (2000) Learning functions represented as multiplicity automata. J ACM 47(3):506–530. https://doi.org/10.1145/337244.337257
- 10. Brázdil T, Chatterjee K, Chmelik M, Forejt V, Kretínský J, Kwiatkowska MZ, Parker D, Ujma M (2014) Verification of Markov decision processes using learning algorithms. In: Cassez and Raskin [12], pp 98–114. https://doi.org/10.1007/978-3-319-11936-6_8
- 11. Carrasco RC, Oncina J (1994) Learning stochastic regular grammars by means of a state merging method. In: Carrasco RC, Oncina J (eds) Grammatical inference and applications, second international colloquium, ICGI-94, Alicante, Spain, September 21–23, 1994. Proceedings, Lecture notes in computer science, vol 862, pp 139–152. Springer. https://doi.org/10.1007/3-540-58473-0_144
- 12. Cassez F, Raskin J (eds) (2014) Automated technology for verification and analysis–12th international symposium, ATVA 2014, Sydney, NSW, Australia, November 3–7, 2014. Proceedings, Lecture notes in computer science, vol 8837. Springer. https://doi.org/10.1007/978-3-319-11936-6
- 13. Chen Y, Nielsen TD (2012) Active learning of Markov decision processes for system verification. In: 11th international conference on machine learning and applications, ICMLA, Boca Raton, FL, USA, December 12–15, 2012, vol 2, pp 289–294. IEEE. https://doi.org/10.1109/ICMLA.2012.158
- 14. D’Argenio P, Legay A, Sedwards S, Traonouez L (2015) Smart sampling for lightweight verification of Markov decision processes. STTT 17(4):469–484. https://doi.org/10.1007/s10009-015-0383-0
- 15. David A, Jensen PG, Larsen KG, Legay A, Lime D, Sørensen MG, Taankvist JH (2014) On time with minimal expected cost! In: Cassez and Raskin [12], pp 129–145. https://doi.org/10.1007/978-3-319-11936-6_10
- 16. David A, Jensen PG, Larsen KG, Mikucionis M, Taankvist JH (2015) Uppaal stratego. In: Baier C, Tinelli C (eds) Tools and algorithms for the construction and analysis of systems–21st international conference, TACAS 2015, held as part of the European joint conferences on theory and practice of software, ETAPS 2015, London, April 11–18, 2015. Proceedings, Lecture notes in computer science, vol 9035, pp 206–211. Springer. https://doi.org/10.1007/978-3-662-46681-0_16
- 17. Elkind E, Genest B, Peled DA, Qu H (2006) Grey-box checking. In: Najm E, Pradat-Peyre J, Donzeau-Gouge V (eds) Formal techniques for networked and distributed systems–FORTE 2006, 26th IFIP WG 6.1 international conference, Paris, France, September 26–29, 2006. Lecture notes in computer science, vol 4229, pp 420–435. Springer. https://doi.org/10.1007/11888116_30
- 18. EMQ. http://emqtt.io/. Accessed 3 Dec 2018
- 19. Feng L, Han T, Kwiatkowska MZ, Parker D (2011) Learning-based compositional verification for synchronous probabilistic systems. In: Bultan T, Hsiung P (eds) Automated technology for verification and analysis, 9th international symposium, ATVA 2011, Taipei, Taiwan, October 11–14, 2011. Proceedings, Lecture notes in computer science, vol 6996, pp 511–521. Springer. https://doi.org/10.1007/978-3-642-24372-1_40
- 20. Fiterau-Brostean P, Janssen R, Vaandrager FW (2016) Combining model learning and model checking to analyze TCP implementations. In: Chaudhuri S, Farzan A (eds) Computer aided verification–28th international conference, CAV 2016, Toronto, ON, Canada, July 17–23, 2016. Proceedings, Part II, Lecture notes in computer science, vol 9780, pp 454–471. Springer. https://doi.org/10.1007/978-3-319-41540-6_25
- 21. Fiterau-Brostean P, Lenaerts T, Poll E, de Ruiter J, Vaandrager FW, Verleg P (2017) Model learning and model checking of SSH implementations. In: Erdogmus H, Havelund K (eds) Proceedings of the 24th ACM SIGSOFT international SPIN symposium on model checking of software, Santa Barbara, CA, July 10–14, 2017, pp 142–151. ACM. https://doi.org/10.1145/3092282.3092289
- 22. Forejt V, Kwiatkowska MZ, Norman G, Parker D (2011) Automated verification techniques for probabilistic systems. In: Bernardo M, Issarny V (eds) Formal methods for eternal networked software systems–11th international school on formal methods for the design of computer, communication and software systems, SFM 2011, Bertinoro, Italy, June 13–18, 2011. Advanced lectures, Lecture notes in computer science, vol 6659, pp 53–113. Springer. https://doi.org/10.1007/978-3-642-21455-4_3
- 23. Fu J, Topcu U (2014) Probably approximately correct MDP learning and control with temporal logic constraints. In: Fox D, Kavraki LE, Kurniawati H (eds) Robotics: science and systems X, University of California, Berkeley, July 12–16, 2014. http://www.roboticsproceedings.org/rss10/p39.html
- 24. Giantamidis G, Tripakis S (2016) Learning Moore machines from input-output traces. In: Fitzgerald JS, Heitmeyer CL, Gnesi S, Philippou A (eds) FM 2016: formal methods–21st international symposium, Limassol, Cyprus, November 9–11, 2016. Proceedings, Lecture notes in computer science, vol 9995, pp 291–309. https://doi.org/10.1007/978-3-319-48989-6_18
- 25. Grinchtein O, Jonsson B, Leucker M (2004) Learning of event-recording automata. In: Lakhnech Y, Yovine S (eds) Formal techniques, modelling and analysis of timed and fault-tolerant systems, joint international conferences on formal modelling and analysis of timed systems, FORMATS 2004 and formal techniques in real-time and fault-tolerant systems, FTRTFT 2004, Grenoble, France, September 22–24, 2004. Proceedings, Lecture notes in computer science, vol 3253, pp 379–396. Springer. https://doi.org/10.1007/978-3-540-30206-3_26
- 26.Groce A, Peled DA, Yannakakis M (2002) Adaptive model checking. In: Katoen J, Stevens P (eds) Tools and algorithms for the construction and analysis of systems. In: 8th international conference, TACAS 2002, held as part of the joint European conference on theory and practice of software, ETAPS 2002, Grenoble, France, April 8–12, 2002. Proceedings, Lecture notes in computer scienceCrossRefGoogle Scholar
- 27.de la Higuera C (2010) Grammatical inference: learning automata and grammars. Cambridge University Press, New York, NYCrossRefGoogle Scholar
- 28.Khalili A, Tacchella A (2014) Learning nondeterministic Mealy machines. In: Clark A, Kanazawa M, Yoshinaka R (eds) Proceedings of the 12th international conference on grammatical inference, ICGI 2014, Kyoto, Japan, September 17–19, 2014. JMLR workshop and conference proceedings, vol 34, pp 109–123. http://jmlr.org/proceedings/papers/v34/khalili14a.html
- 29.Kwiatkowska MZ, Norman G, Parker D (2011) PRISM 4.0: verification of probabilistic real-time systems. In: Gopalakrishnan G, Qadeer S (eds) Computer aided verification–23rd international conference, CAV 2011, Snowbird, UT, July 14–20, 2011. Proceedings, Lecture notes in computer science, vol 6806, pp 585–591. Springer. https://doi.org/10.1007/978-3-642-22110-1_47
- 30.Kwiatkowska MZ, Parker D (2013) Automated verification and strategy synthesis for probabilistic systems. In: Hung DV, Ogawa M (eds) Automated technology for verification and analysis–11th international symposium, ATVA 2013, Hanoi, Vietnam, October 15–18, 2013. Proceedings, Lecture notes in computer science, vol 8172, pp 5–22. Springer. https://doi.org/10.1007/978-3-319-02444-8_2
- 31.Larsen KG, Legay A (2016) Statistical model checking: past, present, and future. In: Margaria T, Steffen B (eds) Leveraging applications of formal methods, verification and validation: foundational techniques–7th international symposium, ISoLA 2016, Imperial, Corfu, Greece, October 10–14, 2016. Proceedings, Part I, Lecture notes in computer science, vol 9952, pp 3–15. https://doi.org/10.1007/978-3-319-47166-2_1
- 32.Legay A, Delahaye B, Bensalem S (2010) Statistical model checking: an overview. In: Barringer H, Falcone Y, Finkbeiner B, Havelund K, Lee I, Pace GJ, Rosu G, Sokolsky O, Tillmann N (eds) Runtime verification–first international conference, RV 2010, St. Julians, Malta, November 1–4, 2010. Proceedings, Lecture notes in computer science, vol 6418, pp 122–135. Springer. https://doi.org/10.1007/978-3-642-16612-9_11
- 33.Legay A, Sedwards S, Traonouez L (2014) Scalable verification of Markov decision processes. In: Canal C, Idani A (eds) Software engineering and formal methods–SEFM 2014 collocated workshops: HOFM, SAFOME, OpenCert, MoKMaSD, WS-FMDS, Grenoble, France, September 1–2, 2014. Revised selected papers, Lecture notes in computer science, vol 8938, pp 350–362. Springer. https://doi.org/10.1007/978-3-319-15201-1_23
- 34.Mao H, Chen Y, Jaeger M, Nielsen TD, Larsen KG, Nielsen B (2011) Learning probabilistic automata for model checking. In: Eighth international conference on quantitative evaluation of systems, QEST 2011, Aachen, 5–8 September, 2011, pp 111–120. IEEE Computer Society. https://doi.org/10.1109/QEST.2011.21
- 35.Mao H, Chen Y, Jaeger M, Nielsen TD, Larsen KG, Nielsen B (2012) Learning Markov decision processes for model checking. In: Fahrenberg U, Legay A, Thrane CR (eds) Proceedings quantities in formal methods, QFM 2012, Paris, France, 28 August 2012. EPTCS, vol 103, pp 49–63. https://doi.org/10.4204/EPTCS.103.6
- 36.Mao H, Chen Y, Jaeger M, Nielsen TD, Larsen KG, Nielsen B (2016) Learning deterministic probabilistic automata from a model checking perspective. Mach Learn 105(2):255–299. https://doi.org/10.1007/s10994-016-5565-9
- 37.Margaria T, Niese O, Raffelt H, Steffen B (2004) Efficient test-based model generation for legacy reactive systems. In: Ninth IEEE international high-level design validation and test workshop 2004, Sonoma Valley, CA, USA, November 10–12, 2004, pp 95–100. IEEE Computer Society. https://doi.org/10.1109/HLDVT.2004.1431246
- 38.Nachmanson L, Veanes M, Schulte W, Tillmann N, Grieskamp W (2004) Optimal strategies for testing nondeterministic systems. In: Avrunin GS, Rothermel G (eds) Proceedings of the ACM/SIGSOFT international symposium on software testing and analysis, ISSTA 2004, Boston, MA, USA, July 11–14, 2004, pp 55–64. ACM. https://doi.org/10.1145/1007512.1007520
- 39.Nouri A, Raman B, Bozga M, Legay A, Bensalem S (2014) Faster statistical model checking by means of abstraction and learning. In: Bonakdarpour B, Smolka SA (eds) Runtime verification–5th international conference, RV 2014, Toronto, ON, Canada, September 22–25, 2014. Proceedings, Lecture notes in computer science, vol 8734, pp 340–355. Springer. https://doi.org/10.1007/978-3-319-11164-3_28
- 40.Okamoto M (1959) Some inequalities relating to the partial sum of binomial probabilities. Ann Inst Stat Math 10(1):29–35. https://doi.org/10.1007/BF02883985
- 41.Oncina J, Garcia P (1992) Identifying regular languages in polynomial time. In: Advances in structural and syntactic pattern recognition. Volume 5 of series in Machine perception and artificial intelligence, pp 99–108. World Scientific
- 42.Peled DA, Vardi MY, Yannakakis M (1999) Black box checking. In: Wu J, Chanson ST, Gao Q (eds) Formal methods for protocol engineering and distributed systems, FORTE XII/PSTV XIX’99, IFIP TC6 WG6.1 joint international conference on formal description techniques for distributed systems and communication protocols (FORTE XII) and protocol specification, testing and verification (PSTV XIX), October 5–8, 1999, Beijing, China. IFIP conference proceedings, vol 156, pp 225–240. Kluwer
- 43.prob-black-reach—Java implementation of probabilistic black-box reachability checking. https://github.com/mtappler/prob-black-reach. Accessed 3 Dec 2018
- 44.de Ruiter J, Poll E (2015) Protocol state fuzzing of TLS implementations. In: Jung J, Holz T (eds) 24th USENIX security symposium, USENIX Security 15, Washington, D.C., USA, August 12–14, 2015, pp 193–206. USENIX Association. https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/de-ruiter
- 45.Sen K, Viswanathan M, Agha G (2004) Statistical model checking of black-box probabilistic systems. In: Alur R, Peled DA (eds) Computer aided verification, 16th international conference, CAV 2004, Boston, MA, USA, July 13–17, 2004. Proceedings, Lecture notes in computer science, vol 3114, pp 202–215. Springer. https://doi.org/10.1007/978-3-540-27813-9_16
- 46.Shahbaz M, Groz R (2009) Inferring Mealy machines. In: Cavalcanti A, Dams D (eds) FM 2009: formal methods, second world congress, Eindhoven, The Netherlands, November 2–6, 2009. Proceedings, Lecture notes in computer science, vol 5850, pp 207–222. Springer. https://doi.org/10.1007/978-3-642-05089-3_14
- 47.Shu G, Lee D (2007) Testing security properties of protocol implementations–a machine learning based approach. In: 27th IEEE international conference on distributed computing systems (ICDCS 2007), June 25–29, 2007, Toronto, Ontario, Canada, p 25. IEEE Computer Society. https://doi.org/10.1109/ICDCS.2007.147
- 48.Sivakorn S, Argyros G, Pei K, Keromytis AD, Jana S (2017) HVLearn: automated black-box analysis of hostname verification in SSL/TLS implementations. In: SP 2017, pp 521–538. IEEE Computer Society. https://doi.org/10.1109/SP.2017.46
- 49.Tappler M, Aichernig BK, Bloem R (2017) Model-based testing IoT communication via active automata learning. In: 2017 IEEE international conference on software testing, verification and validation, ICST 2017, Tokyo, Japan, March 13–17, 2017, pp 276–287. IEEE Computer Society. https://doi.org/10.1109/ICST.2017.32
- 50.TCP models. https://gitlab.science.ru.nl/pfiteraubrostean/tcp-learner/tree/cav-aec/models. Accessed 3 Dec 2018
- 51.Utting M, Pretschner A, Legeard B (2012) A taxonomy of model-based testing approaches. Softw Test Verif Reliab 22(5):297–312. https://doi.org/10.1002/stvr.456
- 52.Verwer S, de Weerdt M, Witteveen C (2010) A likelihood-ratio test for identifying probabilistic deterministic real-time automata from positive data. In: Sempere JM, García P (eds) Grammatical inference: theoretical results and applications, 10th international colloquium, ICGI 2010, Valencia, Spain, September 13–16, 2010. Proceedings, Lecture notes in computer science, vol 6339, pp 203–216. Springer. https://doi.org/10.1007/978-3-642-15488-1_17
- 53.Volpato M, Tretmans J (2015) Approximate active learning of nondeterministic input output transition systems. In: ECEASST, vol 72. http://journal.ub.tu-berlin.de/eceasst/article/view/1008
- 54.Wang J, Sun J, Qin S (2016) Verifying complex systems probabilistically through learning, abstraction and refinement. In: CoRR. arXiv:1610.06371
- 55.Younes HLS (2005) Probabilistic verification for “black-box” systems. In: Etessami K, Rajamani SK (eds) Computer aided verification, 17th international conference, CAV 2005, Edinburgh, Scotland, July 6–10, 2005. Proceedings, Lecture notes in computer science, vol 3576, pp 253–265. Springer. https://doi.org/10.1007/11513988_25

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.