Model-based testing of probabilistic systems

This work presents an executable model-based testing framework for probabilistic systems with non-determinism. We provide algorithms to automatically generate, execute and evaluate test cases from a probabilistic requirements specification. The framework connects input/output conformance theory with hypothesis testing: our algorithms handle functional correctness, while statistical methods assess whether the frequencies observed during the test process correspond to the probabilities specified in the requirements. At the core of our work lies the probabilistic input/output conformance relation, enabling us to pin down exactly when an implementation should pass a test case. We establish the correctness of our framework alongside this relation as soundness and completeness: soundness states that a correct implementation indeed passes a test suite, while completeness states that the framework is powerful enough to discover each deviation from a specification up to arbitrary precision for a sufficiently large sample size. The underlying models are probabilistic automata that allow invisible internal progress. We incorporate divergent systems into our framework by phrasing four rules that each well-formed system needs to adhere to. This enables us to treat divergence as the absence of output, or quiescence, which is a well-studied formalism in model-based testing. Lastly, we illustrate the application of our framework on three case studies.


Introduction
Probability. Probability plays an important role in many computer applications. A vast number of randomized algorithms, protocols and computation methods use randomization to achieve their goals. Routing in sensor networks, for instance, can be done via random walks [1]; speech recognition is based on hidden Markov models [32]; population genetics uses Bayesian computation [2]; security protocols use random bits in their encryption methods [10]; control policies in robotics use randomization, leading to the emerging field of probabilistic robotics, concerned with perception and control in the face of uncertainty; and networking algorithms assign bandwidth in a random fashion. Such applications can be implemented in one of the many probabilistic programming languages, such as Probabilistic-C [26] or Figaro [28]. At a higher level, service level agreements are formulated in a stochastic fashion, stating that the average uptime should be at least 99 %, or that the punctuality of train services should be 95 %.
Key question is whether such probabilistic systems are correct: is bandwidth distributed fairly among all parties? Is the up-time, packet delay and jitter according to specification? Do the trains on a certain day run punctually? Our conformance relation pioco pins down when an implementation modelled as a pIOTS conforms to a specification pIOTS. We prove several properties of the pioco-relation, in particular that it is a conservative extension of ioco. Lastly, we illustrate our approach with two case studies: the exponential binary backoff protocol and the IEEE 1394 root contention protocol.
While test efficiency is important, this paper focusses on the methodological set-up and correctness. Important future work is to optimize the statistical verdicts we derive and to provide a fully fledged implementation of our methods.
Related Work. Probabilistic testing preorders and equivalences are well studied [11,13,34], defining when two probabilistic transition systems are equivalent, or one subsumes the other. In particular, early and influential work by [21] introduces the fundamental concept of probabilistic bisimulation via hypothesis testing. Also, [9] shows how to observe trace probabilities via hypothesis testing. Executable test frameworks for probabilistic systems have been defined for probabilistic finite state machines [17,24], dealing with mutations and stochastic timing, for Petri nets [6], and for CSL [35,36]. The important research line of statistical testing [4,42,43] is concerned with choosing the inputs for the SUT in a probabilistic way in order to optimize a certain test metric, such as (weighted) coverage. The question of when to stop statistical testing is tackled in [29].
An approach similar in spirit to ours is that of Hierons et al. [16]. However, our model can be considered an extension of [16], reconciling probabilistic and non-deterministic choices in a fully fledged way. Being more restrictive enables [16] to focus on individual traces, whereas we use trace distributions.
Furthermore, the current paper extends a workshop paper [14] that introduced the pioco-relation and roughly sketched the test process. Novel contributions of the current paper are 1. a more generic pIOTS model that includes internal transitions, 2. the soundness and completeness results, 3. solid definitions of test cases, test execution, and verdicts, 4. the treatment of quiescence, i.e., absence of outputs, and 5. the handling of probabilistic test cases.
Overview of the Paper. Section 2 sets up the mathematical framework and introduces pIOTSs, adversaries and trace distributions. Section 3 shows how we generate and execute probabilistic tests and evaluate them functionally and statistically. Section 4 introduces the pioco relation and shows the soundness and completeness of our testing method. Two case studies can be found in Sect. 5. Lastly, Sect. 6 ends the paper with future work and conclusions.

Probabilistic Input/Output Systems
We start by introducing some standard notions from probability theory. A discrete probability distribution over a set X is a function μ : X → [0, 1] such that Σ_{x∈X} μ(x) = 1. The set of all distributions over X is denoted by Distr(X). The probability distribution that assigns 1 to a certain element x ∈ X is called the Dirac distribution over x and is denoted Dirac(x). A probability space is a triple (Ω, F, P), such that Ω is a set, F is a σ-field over Ω, and P : F → [0, 1] is a probability measure such that P(Ω) = 1 and P(∪_{i∈N} A_i) = Σ_{i∈N} P(A_i) for all countable sequences of sets A_1, A_2, . . . ∈ F that are pairwise disjoint. We use "?" to suffix inputs and "!" to suffix outputs. We write s −μ,a→ s′ if (s, μ) ∈ Δ and μ(a, s′) > 0, and s → a if there are μ ∈ Distr(L × S) and s′ ∈ S such that s −μ,a→ s′ (s ↛ a if not). We write s −μ,a→_A s′, etc., to resolve ambiguities where needed. Lastly, A is input-enabled if for all s ∈ S we have s → a? for all a? ∈ L_I.
Following [15], pIOTSs are input-reactive and output-generative. Upon receiving an input, the pIOTS decides probabilistically which next state to move to. On producing an output, the pIOTS chooses both the output action and the state probabilistically. As required in clause 4 of Definition 1, this means that each transition can either involve a single input action, or several outputs, quiescence or internal actions. Note that a state can enable input and output transitions, albeit not in the same distribution. Furthermore, in testing, a verdict must also be given if the system-under-test is quiescent, i.e. produces no output at all. Hence, the requirements model must explicitly indicate when quiescence is allowed, which is expressed by a special output label δ; for details see [39,41].
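To make the input-reactive/output-generative distinction concrete, the following Python sketch encodes a transition relation in the spirit of Δ ⊆ S × Distr(L × S). The dictionary layout, state names and helper function are our own illustration, not part of the formal development.

```python
import random

# Hypothetical encoding of a shuffle-player-like pIOTS: each state maps to a
# list of enabled distributions, and each distribution maps (label, next_state)
# pairs to probabilities summing to 1.
spec = {
    "s0": [{("shuf?", "s1"): 1.0}],                          # input-reactive: one input label per distribution
    "s1": [{("song1!", "s2"): 0.5, ("song2!", "s2"): 0.5}],  # output-generative: label AND target state random
    "s2": [{("shuf?", "s1"): 1.0}, {("stop?", "s0"): 1.0}],  # inputs enabled, but in separate distributions
}

def step(dist, rng):
    """Sample one (label, next_state) pair from a transition distribution."""
    pairs = list(dist)
    weights = [dist[p] for p in pairs]
    return rng.choices(pairs, weights=weights)[0]

rng = random.Random(1)
label, nxt = step(spec["s1"][0], rng)
assert label in ("song1!", "song2!") and nxt == "s2"
```

Note how the clause on inputs is reflected: an input distribution contains a single input label, whereas an output distribution may randomize over several output labels.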
Example 2. Figure 1 shows three models of a simple shuffle mp3 player with two songs. The pIOTS in Fig. 1a models the requirements: pressing the shuffle button enables the two songs with probability 0.5 each, repeatedly until stop is pressed.
Implementation (Fig. 1b) is subject to a small probabilistic deviation. In implementation (Fig. 1c) the same song cannot be played twice in a row without intervention of the shuffle button. States without enabled output transitions allow quiescence, denoted by δ transitions. The model-based testing framework established in this paper is capable of detecting all of the above flaws.
Parallel composition is defined in the standard fashion. Two pIOTSs in composition synchronize on shared actions and evolve independently on others. Since the transitions in the component pIOTSs are stochastically independent, we multiply the probabilities when taking shared actions, denoted by μ × ν. To avoid name clashes, we only compose compatible pIOTSs. Note that the parallel composition of two input-enabled pIOTSs yields again a pIOTS.
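The multiplication of probabilities on shared actions can be sketched as follows; the encoding of distributions as dictionaries from (label, state) pairs to probabilities and the function name are our own assumptions.

```python
def sync_product(mu, nu, shared):
    """Product distribution mu x nu for a synchronization step: the component
    choices are stochastically independent, so the probabilities of matching
    shared actions multiply; non-matching pairs contribute nothing here."""
    out = {}
    for (a, s), p in mu.items():
        for (b, t), q in nu.items():
            if a == b and a in shared:
                out[(a, (s, t))] = out.get((a, (s, t)), 0.0) + p * q
    return out

mu = {("song1!", "s2"): 0.5, ("song2!", "s3"): 0.5}
nu = {("song1!", "t1"): 0.5, ("song2!", "t2"): 0.5}
prod = sync_product(mu, nu, {"song1!", "song2!"})
assert prod[("song1!", ("s2", "t1"))] == 0.25
```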

Paths and Traces
We define the usual language concepts for LTSs. Let A = (S, s_0, L_I, L_O, L_H, Δ) be a pIOTS. A path π of A is a (possibly) infinite sequence of the form π = s_0 μ_1 a_1 s_1 μ_2 a_2 s_2 . . ., where s_i ∈ S, a_i ∈ L and μ_i ∈ Distr(L × S), such that each finite path ends in a state and s_i −μ_{i+1},a_{i+1}→ s_{i+1} for each non-final i. We use last(π) to denote the last state of a finite path (last(π) = ∞ for infinite paths). The set of all finite paths of A is denoted by Path*(A) and the set of all infinite paths by Path(A).
The associated trace of a path π is obtained by omitting states, distributions and internal actions, i.e. trace(π) = a_1 a_2 a_3 . . . Conversely, trace⁻¹(σ) gives the set of all paths that have trace σ. The length of a path is the number of actions on its associated trace. All finite traces of A are collected in traces(A). The set of complete traces, ctraces(A), contains every trace based on paths ending in states that do not enable any further actions. We write out_A(σ) for the set of all output actions enabled with positive probability after trace σ.

Adversaries and Trace Distributions
Very much like traces of an LTS are obtained by first selecting a path and then removing all states and internal actions, we proceed in the probabilistic case: first we resolve all non-deterministic choices in the pIOTS via an adversary, and then we remove all states to obtain a trace distribution.
The resolution of the non-determinism via an adversary leads to a purely probabilistic system, in which we can assign a probability to each finite path. A classical result in measure theory [12] shows that it is impossible to assign a probability to all sets of traces; hence we use σ-fields F consisting of cones.
Adversaries. Following the standard theory for probabilistic automata [38], we define the behaviour of a pIOTS via adversaries (a.k.a. policies or schedulers) that resolve the non-deterministic choices: in each state of the pIOTS, the adversary may choose which transition to take, or it may halt the execution.
Given any finite history leading to a state, an adversary returns a discrete probability distribution over the set of next transitions.In order to model termination, we define schedulers such that they can continue paths with a halting extension, after which only quiescence is observed.

Definition 4. An adversary E of a pIOTS A = (S, s_0, L_I, L_O, L_H, Δ) is a function E : Path*(A) → Distr(Distr(L × S) ∪ {⊥}) such that for each finite path π, E(π)(μ) > 0 implies (last(π), μ) ∈ Δ.
We say that E is deterministic if E(π) assigns a Dirac distribution for all π ∈ Path*(A). The value E(π)(⊥) is considered as interruption/halting. An adversary E halts on a path π if E(π)(⊥) = 1. We say that an adversary halts after k ∈ N steps if it halts for every path of length greater than or equal to k. We denote the set of all such finite adversaries by adv(A, k).
Intuitively, an adversary tosses a multi-faced and biased die at every step of the computation, thus resulting in a purely probabilistic computation tree. The probability assigned to a finite path π is obtained by multiplying, along π, the probability that E assigns to the chosen distribution μ with the probability μ(a, s) of the chosen step. This enables us to assign a unique probability space (Ω_E, F_E, P_E) to an adversary E.
Trace Distributions. A trace distribution is obtained from (the probability space of) an adversary by removing all states. Thus, the probability assigned to a set of traces X is the probability of all paths whose trace is an element of X.

Definition 5. The trace distribution H of an adversary E is the probability space (Ω_H, F_H, P_H), where Ω_H is the set of traces, F_H is the σ-field generated by the cones of finite traces, and P_H(X) = P_E(trace⁻¹(X)) for all X ∈ F_H.
We write trd(A) for the set of all trace distributions of A and trd(A, k) for those halting after k ∈ N. The fact that (Ω_E, F_E, P_E) and (Ω_H, F_H, P_H) define probability spaces follows from standard measure theory arguments (see for example [12]).
Example 6. Consider (c) in Fig. 1 and an adversary E starting from the beginning state s_0, scheduling probability 1 to shuf?, probability 1 to the distribution consisting of song1! and song2!, and probability 1/2 to each of the two shuf? transitions in s_2. Then choose the paths π = s_0 μ_1 shuf? s_1 μ_2 song1! s_2 μ_3 shuf? s_2 and π′ = s_0 μ_1 shuf? s_1 μ_2 song1! s_2 μ_4 shuf? s_1.

Test Generation
Model-based testing entails the automatic generation, execution and evaluation of test cases based on a requirements model. We provide two algorithms for test case generation: an offline or batch algorithm that generates test cases before their execution, and an online or on-the-fly algorithm that generates test cases during execution. First, we formalize the notion of an (offline) test case over an action signature (L_I, L_O). In each state of a test, the tester can either provide some stimulus a? ∈ L_I, wait for a response of the system, or stop the testing process. Each of these possibilities can be chosen with a certain probability, leading to probabilistic test cases. We model this as a probabilistic choice between the internal actions τ_obs, τ_stop and τ_stim. Note that, even in the non-probabilistic case, test cases are often generated probabilistically in practice, but this is not supported by the classical theory. Thus, our definition fills a small gap here.
Furthermore, note that, when waiting for a system response, we have to take into account all potential outputs in L_O, including the situation that the system provides no response at all, modelled by δ. Since the continuation of a test depends on the history, offline test cases are formalized as trees.
Algorithms. The procedure batch in Algorithm 1 generates test cases from a specification, given a specification pIOTS A_S and a history σ, which is initially empty. In each step a probabilistic choice is made to return an empty test, to observe, or to stimulate, with probabilities p_{σ,1}, p_{σ,2} and p_{σ,3} respectively. The latter two call the procedure batch again. If erroneous output is detected, we stop immediately. We require that p_{σ,1} + p_{σ,2} + p_{σ,3} = 1. Algorithm 2 shows a sound way to derive tests on-the-fly. The inputs are a specification A_S, a concrete implementation A_I and a test length n ∈ N. The algorithm returns a verdict stating whether or not the implementation is ioco correct in the first n steps. If erroneous output was detected, the verdict is fail, and pass otherwise. With probability p_{σ,1} we observe and with probability p_{σ,2} we stimulate. The algorithm stops after n steps. Thus, p_{σ,1} + p_{σ,2} = 1.
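The control flow of the on-the-fly derivation can be sketched as follows, under simplifying assumptions: spec_out, stimulate and observe are hypothetical callbacks standing in for the specification and the black-box SUT, quiescence is modelled by the string "delta", and a single observation probability p_obs replaces the history-dependent p_{σ,1}.

```python
import random

def on_the_fly_test(spec_out, stimulate, observe, n, p_obs=0.5, rng=None):
    """Sketch of an on-the-fly test run: at each of n steps either observe an
    output of the SUT (probability p_obs) and check it against the outputs the
    specification allows after the current history, or stimulate with an input."""
    rng = rng or random.Random()
    sigma = ()                                   # observed history, initially empty
    for _ in range(n):
        if rng.random() < p_obs:                 # observe
            out = observe()                      # may be "delta" (quiescence)
            if out not in spec_out(sigma):
                return "fail"                    # erroneous output: stop immediately
            sigma += (out,)
        else:                                    # stimulate
            sigma += (stimulate(),)
    return "pass"

# Toy run: every output is allowed, so the verdict must be pass.
allow_all = lambda sigma: {"song1!", "song2!", "delta"}
assert on_the_fly_test(allow_all, lambda: "shuf?", lambda: "song1!",
                       10, rng=random.Random(7)) == "pass"
```

With p_obs forced to 1.0 and a specification allowing only song2!, an SUT producing song1! is caught at the first observation and the verdict is fail.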

Test Evaluation
In our framework, we assess functional behaviour by the test verdict a_{A_S,t} and probabilistic behaviour via statistics, as elaborated below.
Statistical Verdict. Given a (black box) implementation, the idea is to run an offline or online test case multiple times in order to collect a sample. Then we check, via statistical hypothesis testing, whether the frequencies of the traces contained in this sample match the probabilities in the specification. However, since the specification contains non-determinism, we cannot apply statistical methods directly. Rather, we check whether the observed trace frequencies can be explained if the occurring non-determinism in the specification is resolved according to some scheduler. We formulate a hypothesised scheduler that makes the occurrence of the sample most likely. This gives rise to a purely probabilistic computation tree, and probabilities and expected values for each trace can be calculated. Based on a predefined level of significance α ∈ (0, 1) we use null hypothesis testing to determine whether to accept or reject the hypothesised scheduler. If it is accepted, we have no reason to assume that the implementation differs probabilistically from the specification and give the pass verdict. If it is rejected, we assign the fail verdict, because there is no scheduler to explain the observed frequencies.
Sampling. To collect a sample, we first fix the length k ∈ N and width m ∈ N of an experiment, i.e. how long we observe the machine and how many times we run it before stopping. Thus, we collect σ_1, . . ., σ_m ∈ traces(A_I) with |σ_i| = k for i = 1, . . ., m. We call O = (σ_1, . . ., σ_m) ∈ (L^k)^m a sample. We assume the system is governed by a trace distribution D_i in every run; thus, running the machine m times means that a sample is generated by a sequence of m (possibly) different trace distributions D = (D_1, D_2, . . ., D_m) ∈ trd(A_I, k)^m. In each run the implementation makes two choices: (1) it chooses a trace distribution D_i, and (2) D_i chooses a trace σ_i. Once a trace distribution D_i is chosen, it is solely responsible for the trace σ_i, meaning that for i ≠ j the choice of σ_i by D_i is independent of the choice of σ_j by D_j.
Frequencies. The frequency function freq : (L^k)^m → Distr(L^k) assigns to a sample its empirical distribution, i.e. freq(O)(σ) is the fraction of the m runs whose trace equals σ. Assume that k, m ∈ N, D and σ ∈ L^k are fixed. Then a sample O can be treated as a Bernoulli experiment of length m, where success occurs in position i ∈ {1, . . ., m} if σ = σ_i. Thus, the success probability in the i-th step is given by P_{D_i}(σ). So assume X_i are Bernoulli distributed random variables for i = 1, . . ., m. We define a new random variable Z = (1/m) Σ_{i=1}^m X_i, which represents the frequency of success in m steps governed by D. Thus, the expected frequency is given as E_D(σ) := E(Z) = (1/m) Σ_{i=1}^m P_{D_i}(σ). It holds that Σ_σ E_D(σ) = 1, which means E_D is the distribution expected under D.
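Both quantities can be computed directly; in this sketch trace distributions are represented as plain dictionaries from traces to probabilities, an encoding of our own.

```python
from collections import Counter

def freq(sample):
    """Empirical distribution freq(O): the fraction of runs with trace sigma."""
    m = len(sample)
    return {sigma: c / m for sigma, c in Counter(sample).items()}

def expected(trace_dists):
    """Expected frequency E_D(sigma) = (1/m) * sum_i P_{D_i}(sigma), where each
    D_i is a dict mapping traces to their probabilities."""
    m = len(trace_dists)
    exp = {}
    for D in trace_dists:
        for sigma, p in D.items():
            exp[sigma] = exp.get(sigma, 0.0) + p / m
    return exp

O = ("ab", "ab", "ba", "ab")
assert freq(O) == {"ab": 0.75, "ba": 0.25}
assert expected([{"ab": 1.0}, {"ba": 1.0}]) == {"ab": 0.5, "ba": 0.5}
```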
Acceptable Outcomes. We accept a sample O if freq(O) lies within some distance r of the expected distribution E_D. Recall the definition of a ball centred at x ∈ X with radius r as B_r(x) = {y ∈ X | dist(x, y) ≤ r}. All distributions deviating at most r from the expected distribution are contained within the ball B_r(E_D), where dist(u, v) := sup_{σ∈L^k} |u(σ) − v(σ)| for distributions u and v. In order to minimize the error of falsely accepting a sample, we choose the smallest radius such that the error of falsely rejecting a sample is not greater than a predefined level of significance α ∈ (0, 1), i.e. r_α := inf {r > 0 | P_D(dist(freq(O), E_D) ≤ r) ≥ 1 − α}.
Definition 11. For k, m ∈ N and a pIOTS A, the acceptable outcomes under D ∈ trd(A, k)^m of significance level α ∈ (0, 1) are given by the set of observations Obs(D, α, k, m) = {O ∈ (L^k)^m | dist(freq(O), E_D) ≤ r_α}. The set of observations of a pIOTS A therefore has two properties, reflecting the errors of false rejection and false acceptance respectively.
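The distance and the acceptance check are straightforward to express, with distributions again represented as dictionaries from traces to probabilities (our own encoding).

```python
def dist(u, v):
    """sup over all traces of |u(sigma) - v(sigma)| for two distributions."""
    return max(abs(u.get(s, 0.0) - v.get(s, 0.0)) for s in set(u) | set(v))

def accept(observed_freq, expected_freq, r):
    """A sample is acceptable iff its frequency lies in the ball B_r(E_D)."""
    return dist(observed_freq, expected_freq) <= r

obs = {"ab": 0.75, "ba": 0.25}
exp = {"ab": 0.5, "ba": 0.5}
assert dist(obs, exp) == 0.25
assert accept(obs, exp, 0.3) and not accept(obs, exp, 0.2)
```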

For D ∈ trd(A) of length k, we have
1. P_D(Obs(D, α, k, m)) ≥ 1 − α, and
2. P_{D′}(Obs(D, α, k, m)) ≤ β_m for every trace distribution D′ whose expected distribution lies outside B_{r_α}(E_D),
where α is the predefined level of significance and β_m is unknown but minimal by construction. Note that β_m → 0 as m → ∞; thus, the error of falsely accepting an observation decreases with increasing sample width.
Application. This framework poses two problems for practical applications: (1) the parameter r may be hard to find, and (2) for a given sample, it is no trivial task to find the trace distribution that gives it maximal likelihood, i.e.

P_A^{k,m}(O) := max_{D ∈ (trd(A,k)\trd(A,k−1))^m} P_D(O).
The parameter r gives the best fit, but finding it is no trivial task. It is of interest for the soundness and completeness proofs, but in practice we use χ² hypothesis testing. The empirical value χ² = Σ_{σ∈L^k} (n(σ) − m · E_D(σ))² / (m · E_D(σ)), where n(σ) is the number of times σ occurred in the sample, is compared to critical values for given degrees of freedom and levels of significance. These values can be calculated or looked up in a χ² table.
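Assuming the standard Pearson form of the χ² statistic, the comparison against a tabulated critical value looks as follows; the traces, counts and the critical value 2.71 (α = 0.1 at one degree of freedom) in the example are purely illustrative.

```python
def chi_square(counts, expected_freq, m):
    """Pearson chi^2 score: sum over traces of (n(sigma) - m*E_D(sigma))^2
    divided by m*E_D(sigma); the caller compares it to a tabulated critical
    value for the chosen degrees of freedom and level of significance."""
    return sum((counts.get(s, 0) - m * e) ** 2 / (m * e)
               for s, e in expected_freq.items() if e > 0)

# 60/40 observed vs. a fair 50/50 expectation over m = 100 runs:
score = chi_square({"ab": 60, "ba": 40}, {"ab": 0.5, "ba": 0.5}, 100)
assert abs(score - 4.0) < 1e-9
# at alpha = 0.1 with 1 degree of freedom the critical value is about 2.71,
# so this deviation would be rejected.
assert score > 2.71
```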
Since expectations in our construction depend on a scheduler/trace distribution to explain a possible sample, it is of interest to find the best fit. Hence, we are trying to minimise the χ² value over the scheduler's choices (1). By construction, we want to optimize the probabilities p_i used by a scheduler to resolve non-determinism. This turns (1) into a minimisation of a rational function f(p)/g(p) with inequality constraints on the vector p. As shown in [25], minimizing rational functions is NP-hard. This approach optimizes one trace distribution to fit the sample data instead of finding m different ones; relaxing this assumption, i.e. letting the implementation choose a different trace distribution in each run, is a topic for future research.
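Since the exact minimisation is NP-hard, a practical fallback is a brute-force search over the scheduler parameter. The sketch below assumes a single parameter p and a caller-supplied function expected_of_p mapping p to an expected distribution (dictionary encoding); all names are our own.

```python
def best_fit_p(counts, m, expected_of_p, steps=1000):
    """Grid search for the scheduler parameter p in [0, 1] minimizing the
    Pearson chi^2 score -- a crude stand-in for the NP-hard rational-function
    minimisation over scheduler probabilities."""
    def score(p):
        return sum((counts.get(s, 0) - m * e) ** 2 / (m * e)
                   for s, e in expected_of_p(p).items() if e > 0)
    best_score, best_p = min((score(i / steps), i / steps)
                             for i in range(steps + 1))
    return best_p, best_score

# A 70/30 sample is best explained by a scheduler choosing p = 0.7:
p, s = best_fit_p({"ab": 70, "ba": 30}, 100,
                  lambda p: {"ab": p, "ba": 1 - p})
assert abs(p - 0.7) < 1e-9 and s < 1e-6
```

A finer grid or a dedicated solver can replace the fixed step count; the point is only that a single scalar parameter already makes the optimization tractable in practice.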
Verdict Function. With this framework in place, the following decision process determines whether an implementation fails on functional and/or statistical behaviour.

Soundness and Completeness
Talking about soundness and completeness when referring to probabilistic systems is not a trivial topic, since one of the main inherent difficulties of statistical analysis is the possibility of false rejection or false acceptance.
The former is of interest when we refer to soundness (i.e. what is the probability that we erroneously assign fail to a correct implementation), and the latter is important when we talk about completeness (i.e. what is the probability that we assign pass to an erroneous implementation). Thus, a test suite can only fulfil these properties with a guaranteed (high) probability (c.f. Definition 12).
Definition 16. Let A_S be a specification over an action signature (L_I, L_O), α ∈ (0, 1) the level of significance and T an annotated test suite for A_S. Then
- T is sound for A_S with respect to pioco, if for all input-enabled implementations A_I ∈ pIOTS and sufficiently large m ∈ N it holds that A_I pioco A_S implies that A_I passes T with probability at least 1 − α;
- T is complete for A_S with respect to pioco, if for all input-enabled implementations A_I ∈ pIOTS and sufficiently large m ∈ N it holds that A_I not conforming to A_S under pioco implies that A_I passes T with probability at most α.
Soundness for a given α ∈ (0, 1) expresses that we have a 1 − α chance that a correct system will pass the annotated test suite for sufficiently large sample width m. This relates to false rejection of a correct hypothesis or correct implementation respectively.
Completeness of a test suite is inherently a theoretical result. Since we allow loops, we require a test suite of infinite size. Moreover, there is still the chance of falsely accepting an erroneous implementation. However, this error is bounded from above by construction, and decreases for larger sample sizes (c.f. Definition 11).
Theorem 18 (Completeness). The set of all annotated test cases for a specification A_S is complete for every level of significance α ∈ (0, 1) with respect to pioco.

Experimental Validation
To apply our framework, we implemented two well-known randomized communication protocols in Java and tested them with the MBT tool JTorX [3]. The statistical verdicts were calculated in MATLAB with a level of significance α = 0.1.

Binary Exponential Backoff
The Binary Exponential Backoff protocol is a data transmission protocol between N hosts trying to send information via one bus [19]. If two hosts send simultaneously, their messages collide and each picks a new waiting time before trying again: after i collisions, a host randomly chooses a slot in {0, . . ., 2^i − 1} until the message gets through.
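The slot-selection rule can be sketched directly; the host count and the success criterion for a single round are our own simplification of the protocol.

```python
import random

def backoff_slot(collisions, rng):
    """After i collisions a host picks a slot uniformly from {0, ..., 2^i - 1}."""
    return rng.randrange(2 ** collisions)

def attempt_succeeds(n_hosts, collisions, rng):
    """One round: every host draws a slot; the transmission succeeds only if
    the earliest slot is chosen by exactly one host (otherwise a collision)."""
    slots = [backoff_slot(collisions, rng) for _ in range(n_hosts)]
    return slots.count(min(slots)) == 1

rng = random.Random(0)
assert backoff_slot(0, rng) == 0          # {0, ..., 2^0 - 1} = {0}
assert all(0 <= backoff_slot(3, rng) < 8 for _ in range(100))
```

Doubling the slot range after each collision is what makes repeated collisions between the same hosts increasingly unlikely.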
A sample of the protocol is shown in Table 1. Note that our specification of this protocol contains no non-determinism; thus, calculations in this example are not subject to optimization to find the best trace distribution. The column n in Table 1 shows how many times each trace occurred and E_σ gives the expected value. The interval [l_0.1, r_0.1] represents the 90 % confidence interval under the assumption of a normal distribution. It gives a rough idea of how much values will deviate for the given level of confidence. However, we are interested in the multinomial deviation (i.e. less deviation for one trace allows higher deviation for another trace). For that purpose we use the χ² score, given by the sum of the entries of the last column. Calculation shows χ² = 14.84 < 17.28 = χ²_{0.1}, which is the critical value for 11 degrees of freedom and α = 0.1. Consequently, we accept the hypothesis that the probabilities are implemented correctly.

IEEE 1394 FireWire Root Contention Protocol
The IEEE 1394 FireWire Root Contention Protocol [37] elects a leader between two nodes via coin flips: if head comes up, node i picks a waiting time fast_i ∈ [0.24 μs, 0.26 μs]; if tail comes up, it waits slow_i ∈ [0.57 μs, 0.60 μs]. After the waiting time has elapsed, the node checks whether a message has arrived: if so, the node declares itself leader. If not, the node sends out a message itself, asking the other node to be the leader. Thus, the four outcomes of the coin flips are: {fast_1, fast_2}, {slow_1, slow_2}, {fast_1, slow_2} and {slow_1, fast_2}. The protocol contains inherent non-determinism [37]: if different times were picked, the protocol always terminates; if equal times were picked, it may either elect a leader or retry, depending on the resolution of the non-determinism. Table 2 shows the recorded traces, where c1? and c2? denote coin_1 and coin_2 respectively. We tested five implementations: implementation Correct implements fair coins, while the mutants M_1, M_2, M_3 and M_4 were subject to probabilistic deviations giving advantage to the second node, i.e. P(fast_1) = P(slow_2) = 0.1, 0.4, 0.45 and 0.49 for mutants 1, 2, 3 and 4 respectively. The expected value E_D(σ) depends on resolving one non-deterministic choice by varying p (which coin was flipped first). Note that the other non-determinism was not subject to optimization, but immediately clear from the trace frequencies. The calculated χ² scores are based on an optimized value of p for each sample and compared to the critical value χ²_{0.1} = 17.28, resulting in the verdicts shown.
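The outcome logic of one contention round can be sketched as follows, abstracting from the concrete timing intervals; the function and its leader-election shortcut are our own simplification. A mutant such as M_1 would correspond to biased parameters, e.g. contention_round(0.1, 0.9, rng) for P(fast_1) = 0.1 and P(slow_2) = 0.1.

```python
import random

def contention_round(p_fast1, p_fast2, rng):
    """One round of root contention with (possibly biased) coins: a node
    flipping head waits a short ('fast') time, tail a long ('slow') time.
    With distinct speeds the slower node receives the faster node's request
    first and declares itself leader; with equal speeds the outcome is left
    to the non-determinism (elect a leader or retry)."""
    n1 = "fast" if rng.random() < p_fast1 else "slow"
    n2 = "fast" if rng.random() < p_fast2 else "slow"
    if n1 != n2:
        return "node1" if n1 == "slow" else "node2"
    return None  # equal waiting times: resolution is non-deterministic

rng = random.Random(0)
assert contention_round(1.0, 0.0, rng) == "node2"   # node1 fast, node2 slow
assert contention_round(1.0, 1.0, rng) is None      # both fast: retry possible
```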

Conclusions and Future Work
We defined a sound and complete framework to test probabilistic systems, defined a conformance relation in the ioco tradition called pioco, and showed how to derive probabilistic tests from a requirements model. Verdicts that handle the functional and statistical behaviour are assigned after a test is applied. We showed that the correct verdict can be assigned up to arbitrary precision by setting a level of significance and a sufficiently large sample size. Future work should focus on the practical aspects of our theory: tool support, larger case studies and more powerful statistical methods to increase efficiency.

Fig. 1 .
Fig. 1. Specification and two implementations of a shuffle music player. Actions separated by commas indicate that two transitions are enabled from the state.
Definition 1. A probabilistic input/output transition system (pIOTS) is a six-tuple A = (S, s_0, L_I, L_O, L_H, Δ), where
- S is a finite set of states,
- s_0 ∈ S is the unique starting state,
- L_I, L_O and L_H are disjoint sets of input, output and internal labels respectively, containing a special quiescence label δ. We write L = L_I ∪ L_O^δ ∪ L_H for the set of all labels.
- Δ ⊆ S × Distr(L × S) is a finite transition relation such that for all input actions a?, μ(a?, s′) > 0 implies μ(b, s″) = 0 for all b ≠ a?.

Fig. 2 .
Fig. 2. Two tests derived from the specification in Fig. 1

Table 2 .
A sample O of length k = 5 and width m = 10^5 of the FireWire root contention protocol. Calculations of χ² are done after optimization in p.

Definition 7 .
A test or test case over an action signature (L_I, L_O) is a pIOTS of the form t = (S, s_0, L_O \ {δ}, L_I ∪ {δ}, {τ_obs, τ_stim, τ_stop}, Δ) such that for every state s ∈ S either
- after(s) ⊆ {τ_obs, τ_stim, τ_stop}, or
- after(s) = L_I ∪ {δ}, or
- after(s) = L_out, such that L_out ⊆ L_O \ {δ},
where after(s) is the set of actions enabled in state s. A test suite T is a set of test cases. A test case (suite) for a pIOTS A_S = (S, s_0, L_I, L_O, L_H, Δ) is a test case (suite) over (L_I, L_O). Note that the action signature of tests has switched input and output label sets.
Definition 8. For a given test t a test annotation is a function a : ctraces(t) → {pass, fail}. A pair t̂ = (t, a) consisting of a test and a test annotation is called an annotated test. The set of all such t̂, denoted by T̂ = {(t_i, a_i)}_{i∈I} for some index set I, is called an annotated test suite. If t is a test case for a specification A_S, we define the test annotation a_{A_S,t} : ctraces(t) → {pass, fail} by assigning fail to σ if there are ϱ ∈ traces(A_S) and an output a! such that ϱ a! is a prefix of σ and a! ∉ out_{A_S}(ϱ), and pass otherwise.
Example 9. Figure 2 shows two derived tests for the specification in Fig. 1. Note that the action signature is mirrored. Therefore, if s −μ,a→_t s′ with a an output action of the specification, then we have μ = Dirac. Test t_2 shows how we apply stimuli, observe, or stop with probability 1/3 each. If we stimulate, we apply stop! and shuf! with probability 1/2 each.

Definition 12 .
Given a specification A_S, an annotated test t̂ for A_S, k, m ∈ N where k is given by the trace length of t, and a level of significance α ∈ (0, 1), we define the functional verdict as the function v_t : pIOTS → {pass, fail}, with v_t(A_I) = pass if a_{A_S,t}(σ) = pass for all complete traces σ that A_I can exhibit when run against t, and v_t(A_I) = fail otherwise.
The implementation is always assumed to be input-enabled. If the specification is input-enabled too, then pioco coincides with trace distribution inclusion. Moreover, our results show that pioco is transitive, just like ioco. Let A, B and C be pIOTSs and let A and B be input-enabled; then
- A pioco B if and only if A ⊑_TD B,
- if A pioco B and B pioco C, then A pioco C.