Model-Based Testing of Probabilistic Systems
Abstract
This paper presents a model-based testing framework for probabilistic systems. We provide algorithms to generate, execute and evaluate test cases from a probabilistic requirements model. In doing so, we connect ioco theory for model-based testing and statistical hypothesis testing: our ioco-style algorithms handle the functional aspects, while statistical methods, using \(\chi ^2\) tests and fitting functions, assess if the frequencies observed during test execution correspond to the probabilities specified in the requirements.
Key results of our paper are the classical soundness and completeness properties, establishing the mathematical correctness of our framework: soundness states that each test case is assigned the right verdict; completeness states that the framework is powerful enough to discover each probabilistic deviation from the specification, with arbitrary precision.
We illustrate the use of our framework via two case studies.
Keywords
Statistical Hypothesis Testing · Trace Distribution · False Rejection · Probabilistic Automaton · Finite Path
1 Introduction
Probability. Probability plays an important role in many computer applications. A vast number of randomized algorithms, protocols and computation methods use randomization to achieve their goals. Routing in sensor networks, for instance, can be done via random walks [1]; speech recognition is based on hidden Markov models [32]; population genetics uses Bayesian computation [2]; security protocols use random bits in their encryption methods [10]; control policies in robotics employ randomization, leading to the emerging field of probabilistic robotics, concerned with perception and control in the face of uncertainty; and networking algorithms assign bandwidth in a random fashion. Such applications can be implemented in one of the many probabilistic programming languages, such as Probabilistic C [26] or Figaro [28]. At a higher level, service level agreements are formulated in a stochastic fashion, stating that the average uptime should be at least 99 %, or that the punctuality of train services should be 95 %.
The key question is whether such probabilistic systems are correct: is bandwidth distributed fairly among all parties? Are the uptime, packet delay and jitter according to specification? Do the trains on a certain day run punctually enough? To investigate such questions, probabilistic verification has become a mature research field, putting forward models like probabilistic automata (PAs) [33, 38], Markov decision processes [30] and (generalized) stochastic Petri nets [23], with verification techniques like stochastic model checking [31] and tools like Prism [20].
Testing. In practice, the most common validation technique is testing, where we subject the system under test to many well-designed test cases and compare the outcome to the specification. Surprisingly, only a few papers are concerned with the testing of probabilistic systems^{1}, with notable exceptions being [16, 18].
This paper presents a model-based testing framework for probabilistic systems. Model-based testing (MBT) is an innovative method to automatically generate, execute and evaluate test cases from a requirements model. By providing faster and more thorough testing at lower cost, MBT has gained rapid popularity in industry. A wide variety of MBT frameworks exist, capable of handling different system aspects, such as functional properties [40], real-time behaviour [5, 8, 22], quantitative aspects [7], and continuous behaviour [27]. As stated, MBT approaches dealing with probability are underdeveloped.
Our Approach. Our specification is given as a probabilistic input/output transition system (pIOTS), a mild generalization of the PA model. As usual, pIOTSs contain two types of choices: nondeterministic choices model choices that are not under the control of the system. As argued in [33], these are needed to model phenomena like implementation freedom, scheduler choices, intervals of probability and interleaving. Probabilistic choices model random choices made by the system (e.g., coin tosses) or nature (e.g., failure probabilities, degradation rates).
An important contribution is our set of algorithms to automatically generate, execute and evaluate test cases from a specification pIOTS. These test cases are probabilistic and check whether both the functional and the probabilistic behaviour conform to the specification. Probability is observed through frequencies, hence we execute each test multiple times. We use statistical hypothesis testing, in particular the \(\chi ^2\) test, to assess whether a test case should pass or fail. A technical complication here is the nondeterminism in pIOTSs, which prevents us from directly applying the \(\chi ^2\) test. Rather, we first need to find the best resolution of the nondeterminism that could have led to these observations. To do so, we set up a nonlinear optimization problem that finds the best fit for the \(\chi ^2\) test.
A key result of our paper is the soundness and completeness of our framework. Soundness states that each test case we derive is assigned the correct verdict: a pass if the behaviour observed during testing conforms to the requirements; a fail if it does not. Completeness states that the framework is powerful enough to discover each deviation of nonconforming implementations. Formulating the soundness and completeness results requires a formal notion of conformance. Here, we propose the pioco relation, which pins down when an implementation modelled as a pIOTS conforms to a specification pIOTS. We prove several properties of the pioco relation, in particular that it is a conservative extension of ioco. Lastly, we illustrate our approach with two case studies: the exponential binary backoff protocol and the IEEE 1394 root contention protocol.
While test efficiency is important, this paper focuses on the methodological setup and correctness. Important future work is to optimize the statistical verdicts we derive and to provide a fully fledged implementation of our methods.
Related Work. Probabilistic testing preorders and equivalences are well studied [11, 13, 34], defining when two probabilistic transition systems are equivalent, or one subsumes the other. In particular, early and influential work by [21] introduces the fundamental concept of probabilistic bisimulation via hypothesis testing. Also, [9] shows how to observe trace probabilities via hypothesis testing. Executable test frameworks for probabilistic systems have been defined for probabilistic finite state machines [17, 24], dealing with mutations and stochastic timing, Petri nets [6], and CSL [35, 36]. The important research line of statistical testing [4, 42, 43] is concerned with choosing the inputs for the SUT in a probabilistic way in order to optimize a certain test metric, such as (weighted) coverage. The question on when to stop statistical testing is tackled in [29].
An approach similar in spirit to ours is by Hierons et al. [16]. However, our model can be considered an extension of [16], reconciling probabilistic and nondeterministic choices in a fully fledged way. Being more restrictive enables [16] to focus on individual traces, whereas we use trace distributions.
Furthermore, the current paper extends a workshop paper by [14] that introduced the pioco relation and roughly sketched the test process. Novel contributions of our current paper are 1. a more generic pIOTS model that includes internal transitions, 2. the soundness and completeness results, 3. solid definitions of test cases, test execution and verdicts, 4. the treatment of quiescence, i.e., the absence of outputs, and 5. the handling of probabilistic test cases.
Overview of the Paper. Section 2 sets up the mathematical framework and introduces pIOTSs, adversaries and trace distributions. Section 3 shows how we generate and execute probabilistic tests and evaluate them functionally and statistically. Section 4 introduces the pioco relation and shows the soundness and completeness of our testing method. Two case studies can be found in Sect. 5. Lastly, Sect. 6 ends the paper with conclusions and future work.
2 Preliminaries
2.1 Probabilistic Input/Output Systems
We start by introducing some standard notions from probability theory. A discrete probability distribution over a set X is a function \(\mu :X\longrightarrow \left[ 0,1\right] \) such that \(\sum _{x\in X}\mu \left( x\right) =1\). The set of all distributions over X is denoted by \( Distr \left( X\right) \). The probability distribution that assigns 1 to a certain element \(x\in X\) is called the Dirac distribution over x and is denoted \( Dirac \left( x\right) \).
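These notions translate directly into code. Below is a minimal Python sketch of discrete distributions and the Dirac distribution; the encoding of distributions as dictionaries and all names are our own illustration, not part of the paper:

```python
from typing import Dict, Hashable

def is_distribution(mu: Dict[Hashable, float], tol: float = 1e-9) -> bool:
    """A discrete probability distribution maps each element to [0, 1]
    and its values sum to 1."""
    return (all(0.0 <= p <= 1.0 for p in mu.values())
            and abs(sum(mu.values()) - 1.0) < tol)

def dirac(x: Hashable) -> Dict[Hashable, float]:
    """The Dirac distribution assigns probability 1 to x."""
    return {x: 1.0}

# A fair coin as a distribution over two outcomes:
fair_coin = {"heads": 0.5, "tails": 0.5}
```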
A probability space is a triple \(\left( \varOmega ,\mathcal {F},\mathbb {P}\right) \), such that \(\varOmega \) is a set, \(\mathcal {F}\) is a \(\sigma \)-field of \(\varOmega \), and \(\mathbb {P}:\mathcal {F}\rightarrow \left[ 0,1\right] \) is a probability measure such that \(\mathbb {P}\left( \varOmega \right) =1\) and \(\mathbb {P}\left( \bigcup _{i=1}^{\infty }A_{i}\right) =\sum _{i=1}^{\infty }\mathbb {P}\left( A_{i}\right) \) for pairwise disjoint \(A_{i}\in \mathcal {F}\), \(i=1,2,\ldots \)
Definition 1
A probabilistic input/output transition system (pIOTS) is a tuple \(\left( S,s_0,L_I,L_O,L_H,\varDelta \right) \), where
S is a finite set of states,

\(s_0\) is the unique starting state,

\(L_I\), \(L_O\) and \(L_H\) are disjoint sets of input, output and internal labels respectively, containing a special quiescence label \(\delta \in L_O\). We write \(L=L_I\cup L_O^\delta \cup L_H\) for the set of all labels.

\(\varDelta \subseteq S\times Distr \left( L\times S\right) \) is a finite transition relation such that for all input actions a?, \(\mu \left( a?,s'\right) >0\) implies \(\mu \left( b,s''\right) =0\) for all \(b\ne a?\).
We use “?” to suffix inputs and “!” to suffix outputs. We write \(s\xrightarrow {\mu ,a}s'\) if \(\left( s,\mu \right) \in \varDelta \) and \(\mu \left( a,s'\right) >0\); and \(s\rightarrow a\) if there are \(\mu \in Distr \left( L\times S\right) \) and \(s'\in S\) such that \(s\xrightarrow {\mu ,a}s'\) (and \(s\not \rightarrow a\) if not). We write \(s\xrightarrow {\mu ,a}_\mathcal {A}s'\), etc. to clarify ambiguities if needed. Lastly, \(\mathcal {A}\) is input-enabled if for all \(s\in S\) we have \(s\rightarrow a?\) for all \(a\in L_I\).
Example 2
Figure 1 shows three models of a simple shuffle mp3 player with two songs. The pIOTS in (Fig. 1a) models the requirements: pressing the shuffle button enables the two songs with probability 0.5 each, repeatedly until stop is pressed.
Implementation (Fig. 1b) is subject to a small probabilistic deviation. In implementation (Fig. 1c) the same song cannot be played twice in a row without intervention via the shuffle button. States without an enabled output transition allow quiescence, denoted by \(\delta \) transitions. The model-based testing framework established in this paper is capable of detecting all of the above flaws.
Parallel composition is defined in a standard fashion. Two pIOTSs in composition synchronize on shared actions, and evolve independently on others. Since the transitions in the component pIOTSs are stochastically independent, we multiply the probabilities when taking shared actions, denoted by \(\mu \times \nu \). To avoid name clashes, we only compose compatible pIOTSs. Note that the parallel composition of two input-enabled pIOTSs yields a pIOTS.
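The product \(\mu \times \nu \) on shared actions can be sketched as follows; the dictionary encoding of distributions over \(L\times S\) is our own, for illustration only:

```python
def product(mu, nu):
    """For a shared action a, the component choices are independent, so
    (mu x nu)(a, (s, t)) = mu(a, s) * nu(a, t).
    Distributions are dicts mapping (action, state) to probability."""
    result = {}
    for (a, s), p in mu.items():
        for (b, t), q in nu.items():
            if a == b:  # synchronize on the shared action only
                result[(a, (s, t))] = p * q
    return result
```

For example, composing a 0.5/0.5 choice with a deterministic partner leaves the probabilities 0.5 each, with paired states.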
Definition 3
2.2 Paths and Traces
The associated trace of a path \(\pi \) is obtained by omitting states, distributions and internal actions, i.e. \( trace \left( \pi \right) =a_{1}a_{2}a_{3}\ldots \). Conversely, \( trace ^{-1}\left( \sigma \right) \) gives the set of all paths that have trace \(\sigma \). The length of a path is the number of actions occurring on its associated trace. All finite traces of \(\mathcal {A}\) are summarized in \( traces \left( \mathcal {A}\right) \). The set of complete traces, \( ctraces \left( \mathcal {A}\right) \), contains every trace based on paths ending in states that do not enable any more actions. We write \( out {}_{\mathcal {A}}\left( \sigma \right) \) for the set of all output actions enabled with positive probability after trace \(\sigma \).
2.3 Adversaries and Trace Distributions
Just as traces of LTSs are obtained by first selecting a path and then removing all states and internal actions, we proceed analogously in the probabilistic case. First, we resolve all nondeterministic choices in the pIOTS via an adversary, and then we remove all states to obtain the trace distribution.
The resolution of the nondeterminism via an adversary leads to a purely probabilistic system, in which we can assign a probability to each finite path. A classical result in measure theory [12] shows that it is impossible to assign a probability to all sets of traces, hence we use \(\sigma \)-fields \(\mathcal {F}\) consisting of cones.
Adversaries. Following the standard theory for probabilistic automata [38], we define the behaviour of a pIOTS via adversaries (a.k.a. policies or schedulers) that resolve the nondeterministic choices in pIOTSs; in each state of the pIOTS, the adversary may choose which transition to take, or it may halt the execution.
Given any finite history leading to a state, an adversary returns a discrete probability distribution over the set of next transitions. In order to model termination, we define schedulers such that they can continue paths with a halting extension, after which only quiescence is observed.
Definition 4
Intuitively, an adversary tosses a many-sided biased die at every step of the computation, thus resulting in a purely probabilistic computation tree. The probability assigned to a path \(\pi \) is obtained via the probability of its cone \(C_\pi = \left\{ \pi '\in Path \left( \mathcal {A}\right) \mid \pi \sqsubseteq \pi '\right\} \). We use the inductively defined path probability function \(Q^E\), i.e. \(Q^E\left( s_0\right) =1\) and \(Q^E\left( \pi \mu a s\right) =Q^E\left( \pi \right) E\left( \pi \right) \left( \mu \right) \mu \left( a,s\right) \). This function enables us to assign a unique probability space \(\left( \varOmega _E,\mathcal {F}_E,P_E\right) \) to an adversary E. Thus, the probability of \(\pi \) is \(P_E\left( \pi \right) =P_E\left( C_\pi \right) =Q^E\left( \pi \right) \).
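The inductive computation of \(Q^E\) can be sketched as follows; the encoding of paths, distributions and the scheduler is our own, chosen only to mirror the definition:

```python
def path_probability(steps, scheduler, dists):
    """Q^E(s0) = 1 and Q^E(pi . mu a s) = Q^E(pi) * E(pi)(mu) * mu(a, s).
    `steps` is a list of (mu_id, action, state) triples extending s0;
    `scheduler(prefix)` returns a dict over distribution ids, i.e. E(pi);
    `dists[mu_id]` maps (action, state) to its probability, i.e. mu."""
    q = 1.0
    prefix = ()
    for mu_id, action, state in steps:
        q *= scheduler(prefix).get(mu_id, 0.0)       # E(pi)(mu)
        q *= dists[mu_id].get((action, state), 0.0)  # mu(a, s)
        prefix = prefix + ((mu_id, action, state),)
    return q
```

A scheduler that deterministically picks the first two steps and then chooses one of two continuations with probability \(\frac{1}{2}\) yields path probability \(\frac{1}{4}\) for each full path.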
Trace Distributions. A trace distribution is obtained from (the probability space of) an adversary by removing all states. Thus, the probability assigned to a set of traces X is the probability of all paths whose trace is an element of X.
Definition 5
 1.
\(\varOmega _{H}=L^\omega \),
 2.
\(\mathcal {F}_{H}\) is the smallest \(\sigma \)-field containing the set \(\left\{ C_{\beta }\subseteq \varOmega _H\mid \beta \in L^{*}\right\} \),
 3.
\(P_{H}\) is the unique probability measure on \(\mathcal {F}_{H}\) such that \(P_{H}\left( X\right) =P_{E}\left( trace ^{-1}\left( X\right) \right) \) for \(X\in \mathcal {F}_{H}\).
We write \( trd \left( \mathcal {A}\right) \) for the set of all trace distributions of \(\mathcal {A}\) and \( trd \left( \mathcal {A},k\right) \) for those halting after \(k\in \mathbb {N}\). Lastly we write \(\mathcal {A}=_{ TD }\mathcal {B}\) if \( trd \left( \mathcal {A}\right) = trd \left( \mathcal {B}\right) \), \(\mathcal {A}\sqsubseteq _{ TD }\mathcal {B}\) if \( trd \left( \mathcal {A}\right) \subseteq trd \left( \mathcal {B}\right) \) and \(\mathcal {A}\sqsubseteq _{ TD }^{k}\mathcal {B}\) if \( trd \left( \mathcal {A},k\right) \subseteq trd \left( \mathcal {B},k\right) \) for \(k\in \mathbb {N}\).
The fact that \(\left( \varOmega _{E},\mathcal {F}_{E},P_{E}\right) \) and \(\left( \varOmega _{H},\mathcal {F}_{H},P_{H}\right) \) define probability spaces follows from standard measure theory arguments (see for example [12]).
Example 6
Consider (c) in Fig. 1 and an adversary E starting from the initial state \(s_0\), scheduling probability 1 to shuf?, probability 1 to the distribution consisting of song1! and song2!, and \(\frac{1}{2}\) to each of the two shuf? transitions in \(s_2\). Then choose the paths \(\pi =s_0\mu _1\text{ shuf? }s_1 \mu _2 \text{ song1! } s_2 \mu _3 \text{ shuf? } s_2\) and \(\pi '=s_0\mu _1\text{ shuf? }s_1 \mu _2 \text{ song1! } s_2 \mu _4 \text{ shuf? } s_1\).
We see that \(\sigma = trace \left( \pi \right) = trace \left( \pi '\right) \) and \(P_E\left( \pi \right) =Q^E\left( \pi \right) =\frac{1}{4}\) and \(P_E\left( \pi '\right) =Q^E\left( \pi '\right) =\frac{1}{4}\), but \(P_{ trd \left( E\right) }\left( \sigma \right) =P_E\left( trace ^{-1}\left( \sigma \right) \right) =P_E\left( \left\{ \pi ,\pi '\right\} \right) =\frac{1}{2}\).
3 Testing with pIOTS
3.1 Test Generation
Model-based testing entails automatic test case generation, execution and evaluation based on a requirements model. We provide two algorithms for test case generation: an offline or batch algorithm that generates test cases before their execution; and an online or on-the-fly algorithm that generates test cases during execution.
First, we formalize the notion of an (offline) test case over an action signature \(\left( L_I,L_O\right) \). In each state of a test, the tester can either provide some stimulus \(a?\in L_I\), wait for a response of the system, or stop the testing process.^{2} Each of these possibilities can be chosen with a certain probability, leading to probabilistic test cases. We model this as a probabilistic choice between the internal actions \(\tau _{\textit{obs}}\), \(\tau _{\textit{stop}}\) and \(\tau _{\textit{stim}}\). Note that, even in the non-probabilistic case, test cases are often generated probabilistically in practice, but this is not supported by the theory. Thus, our definition fills a small gap here.
Furthermore, note that, when waiting for a system response, we have to take into account all potential outputs in \(L_O\), including the situation that the system provides no response at all, modelled by \(\delta \). Since the continuation of a test depends on the history, offline test cases are formalized as trees.
Definition 7
A test over an action signature \(\left( L_I,L_O\right) \) is a pIOTS \(t\) such that:
\(t\) is internally deterministic and does not contain an infinite path;

\(t\) is acyclic and connected;
 For every state \(s\in S\), we either have
 
\( after \left( s\right) =\emptyset \), or
 
\( after \left( s\right) =\left\{ \tau _{\textit{obs}},\tau _{\textit{stim}},\tau _{\textit{stop}}\right\} \), or
 
\( after \left( s\right) =L_{I}\cup \left\{ \delta \right\} \), or
 
\( after \left( s\right) =L_{ out }\), such that \(L_{ out }\subseteq L_{O}\backslash \left\{ \delta \right\} \),
 
Note that the action signature of tests has switched input and output label sets.
Definition 8
Example 9
Figure 2 shows two tests derived for the specification in Fig. 1. Note that the action signature is mirrored. Therefore, if \(s\xrightarrow {\mu ,a}_{\widehat{t}}s'\) with a an output action of the specification, then \(\mu \) is a Dirac distribution. Test \(\widehat{t}_2\) shows how we apply stimuli, observe or stop with probability \(\frac{1}{3}\) each. If we stimulate, we apply stop! and shuf! with probability \(\frac{1}{2}\) each.
Algorithms. The procedure batch in Algorithm 1 generates test cases from a specification, given a specification pIOTS \(\mathcal {A}_\mathcal {S}\) and a history \(\sigma \), which is initially \(\epsilon \). At each step, a probabilistic choice is made to return an empty test, to observe, or to stimulate, with probabilities \(p_{\sigma ,1}, p_{\sigma ,2}\) and \(p_{\sigma ,3}\) respectively. The latter two call the procedure batch again. If erroneous output is detected, we stop immediately. We require that \(p_{\sigma ,1}+p_{\sigma ,2}+p_{\sigma ,3}=1\).
Algorithm 2 shows a sound way to derive tests on-the-fly. The inputs are a specification \(\mathcal {A}_\mathcal {S}\), a concrete implementation \(\mathcal {A}_\mathcal {I}\) and a test length \(n\in \mathbb {N}\). The algorithm returns a verdict on whether or not the implementation is ioco-correct in the first n steps. If erroneous output was detected, the verdict is \( fail \), and \( pass \) otherwise. With probability \(p_{\sigma ,1}\) we observe and with probability \(p_{\sigma ,2}\) we stimulate. The algorithm stops after n steps.
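A minimal sketch of such an on-the-fly loop in Python; the `implementation` interface (`observe`/`stimulate`) and the helper `spec_outputs_after` are our own assumptions for illustration, not the paper's Algorithm 2 verbatim:

```python
import random

def on_the_fly_test(spec_outputs_after, inputs, implementation, n,
                    p_observe=0.5):
    """For n steps: observe with probability p_observe, otherwise stimulate.
    Fail as soon as an observed output (or quiescence) is not allowed by
    the specification after the current trace; pass after n correct steps."""
    trace = []
    for _ in range(n):
        if random.random() < p_observe:
            out = implementation.observe()  # may return quiescence "delta"
            if out not in spec_outputs_after(tuple(trace)):
                return "fail"
            trace.append(out)
        else:
            a = random.choice(inputs)
            implementation.stimulate(a)
            trace.append(a)
    return "pass"
```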
3.2 Test Evaluation
In our framework, we assess functional behaviour by the test verdict \(a_{\mathcal {A}_\mathcal {S},t}\) and probabilistic behaviour via statistics, as elaborated below.
Statistical Verdict. Given a (black-box) implementation, the idea is to run an offline or online test case multiple times in order to collect a sample. Then, we check via statistical hypothesis testing whether the frequencies of the traces contained in this sample match the probabilities in the specification. However, since the specification contains nondeterminism, we cannot apply statistical means directly. Rather, we check whether the observed trace frequencies can be explained if we resolve the occurring nondeterminism in the specification according to some scheduler.
We formulate a hypothesised scheduler that makes the occurrence of the sample most likely. This gives rise to a purely probabilistic computation tree and probabilities and expected values for each trace can be calculated. Based on a predefined level of significance \(\alpha \in \left( 0,1\right) \) we use null hypothesis testing to determine whether to accept or reject the hypothesised scheduler. If it is accepted, we have no reason to assume that the implementation differs probabilistically from the specification and give the pass label. If it is rejected, we assign the fail verdict, because there is no scheduler to explain the observed frequencies.
Sampling. To collect a sample, we first define the length \(k\in \mathbb {N}\) and width \(m\in \mathbb {N}\) of an experiment, i.e. how long we observe the machine and how many times we run it before stopping. Thus, we collect \(\sigma _1,\ldots ,\sigma _m\in traces \left( \mathcal {A}_\mathcal {I}\right) \) with \(\left| \sigma _i\right| =k\) for \(i=1,\ldots ,m\). We call \(O=\left( \sigma _1,\ldots , \sigma _m\right) \in \left( L^k\right) ^m\) a sample. We assume the system is governed by a trace distribution \(D_i\) in every run; thus, running the machine m times means that a sample is generated by a sequence of m (possibly) different trace distributions \(\varvec{D}=\left( D_1,D_2,\ldots ,D_m\right) \in trd \left( \mathcal {A}_\mathcal {I},k\right) ^m\).
In each run, the implementation makes two choices: (1) it chooses a trace distribution \(D_i\), and (2) \(D_i\) chooses a trace \(\sigma _i\). Once a trace distribution \(D_i\) is chosen, it is solely responsible for the trace \(\sigma _i\), meaning that for \(i\ne j\) the choice of \(\sigma _i\) by \(D_i\) is independent of the choice of \(\sigma _j\) by \(D_j\).
Acceptable Outcomes. We will accept a sample O if \( freq \left( O\right) \) lies within some distance r of the expected distribution \(\mathbb {E}^{\varvec{D}}\). Recall the definition of a ball centred at \(x\in X\) with radius r as \(B_r\left( x\right) =\left\{ y\in X\mid dist \left( x,y\right) \le r\right\} \). All distributions deviating by at most r from the expected distribution are contained within the ball \(B_r\left( \mathbb {E}^{\varvec{D}}\right) \), where \( dist \left( u,v\right) :=\sup _{\sigma \in L^k}\mid u\left( \sigma \right) -v\left( \sigma \right) \mid \) for distributions u and v. In order to minimize the error of falsely accepting a sample, we choose the smallest radius such that the error of falsely rejecting a sample is not greater than a predefined level of significance \(\alpha \in \left( 0,1\right) \), i.e. \(\bar{r}:=\inf \left\{ r\mid P_{\varvec{D}}\left( freq ^{-1}\left( B_r\left( \mathbb {E}^{\varvec{D}}\right) \right) \right) >1-\alpha \right\} .\)
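The acceptance criterion translates into a few lines of Python; `freq`, `dist` and the ball check follow the definitions above, with our own dictionary encoding of distributions over traces:

```python
from collections import Counter

def freq(sample):
    """Empirical distribution: the relative frequency of each trace."""
    counts = Counter(sample)
    m = len(sample)
    return {trace: c / m for trace, c in counts.items()}

def dist(u, v):
    """Supremum distance between two distributions over traces."""
    support = set(u) | set(v)
    return max(abs(u.get(s, 0.0) - v.get(s, 0.0)) for s in support)

def accept(sample, expected, radius):
    """Accept the sample iff freq(sample) lies in B_radius(expected)."""
    return dist(freq(sample), expected) <= radius
```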
Definition 11
The defined set of observations of a pIOTS \(\mathcal {A}\) therefore has two properties, reflecting the error of false rejection and false acceptance respectively.
 1.
For \(\varvec{D}\in trd \left( \mathcal {A}\right) \) of length k, we have \(P_{\varvec{D}}\left( Obs \left( \mathcal {A},\alpha ,k,m\right) \right) \ge 1-\alpha \),
 2.
For \(\varvec{D}'\notin trd \left( \mathcal {A}\right) \) of length k, we have \(P_{\varvec{D}'}\left( Obs \left( \mathcal {A},\alpha ,k,m\right) \right) \le \beta _m\),
Verdict Function. With this framework in place, the following decision process summarizes whether an implementation fails on functional and/or statistical behaviour.
Definition 12
4 Conformance, Soundness and Completeness
A key result of our paper is the correctness of our framework, formalized as soundness and completeness. Soundness states that each test case is assigned the correct verdict. Completeness states that the framework is powerful enough to discover each deviation from the specification. Formulating these properties requires a formal notion of conformance, which we formalize as the pioco relation.
4.1 Probabilistic Input/Output Conformance \(\sqsubseteq _{ pioco }\)
 1.
the prefix relation for trace distributions \(H \sqsubseteq _{k} H'\) is the analogue of trace prefixes, i.e. \(H\sqsubseteq _{k} H'\) iff \(\forall \sigma \in L^k: P_H\left( \sigma \right) =P_{H'}\left( \sigma \right) \)
 2.
for a pIOTS \(\mathcal {A}\) and a trace distribution H of length k, the output continuation of H in \(\mathcal {A}\) contains all trace distributions that are equal to H up to length k and assign probability 0 to every trace of length \(k+1\) ending in an input. We set
Definition 13
The pioco relation conservatively extends the ioco relation, i.e. both relations coincide for IOTSs.
Theorem 14
The implementation is always assumed to be input-enabled. If the specification is input-enabled too, then pioco coincides with trace distribution inclusion. Moreover, our results show that pioco is transitive, just like ioco.
Theorem 15

\(\mathcal {A}\sqsubseteq _{ pioco }\mathcal {B}\) if and only if \(\mathcal {A}\sqsubseteq _{TD}\mathcal {B}\).

If \(\mathcal {A}\sqsubseteq _{ pioco }\mathcal {B}\) and \(\mathcal {B}\sqsubseteq _{ pioco }\mathcal {C}\), then \(\mathcal {A}\sqsubseteq _{ pioco }\mathcal {C}\).
4.2 Soundness and Completeness
Talking about soundness and completeness for probabilistic systems is not trivial, since one of the main inherent difficulties of statistical analysis is the possibility of false rejection or false acceptance.
The former is of interest when we refer to soundness (i.e. what is the probability that we erroneously assign \( fail \) to a correct implementation), and the latter is important when we talk about completeness (i.e. what is the probability that we assign \( pass \) to an erroneous implementation). Thus, a test suite can only fulfil these properties with a guaranteed (high) probability (cf. Definition 12).
Definition 16
\(\widehat{T}\) is sound for \(\mathcal {A}_\mathcal {S}\) with respect to \(\sqsubseteq _{ pioco }\), if for all input-enabled implementations \(\mathcal {A}_\mathcal {I}\in pIOTS \) and sufficiently large \(m\in \mathbb {N}\) it holds that$$\begin{aligned} \mathcal {A}_\mathcal {I}\sqsubseteq _{ pioco }\mathcal {A}_\mathcal {S}\Longrightarrow V_{\widehat{T}}^{\alpha ,m}\left( \mathcal {A}_\mathcal {I}\right) = pass . \end{aligned}$$
\(\widehat{T}\) is complete for \(\mathcal {A}_\mathcal {S}\) with respect to \(\sqsubseteq _{ pioco }\), if for all input-enabled implementations \(\mathcal {A}_\mathcal {I}\in pIOTS \) and sufficiently large \(m\in \mathbb {N}\) it holds that
Soundness for a given \(\alpha \in \left( 0,1\right) \) expresses that we have a \(1-\alpha \) chance that a correct system will pass the annotated suite for sufficiently large sample width m. This relates to the false rejection of a correct hypothesis or correct implementation, respectively.
Theorem 17
(Soundness). Each annotated test for a pIOTS \(\mathcal {A}_\mathcal {S}\) is sound for every level of significance \(\alpha \in \left( 0,1\right) \) wrt pioco.
Completeness of a test suite is inherently a theoretical result. Since we allow loops, we require a test suite of infinite size. Moreover, there is still the chance of falsely accepting an erroneous implementation. However, this is bounded from above by construction, and decreases for larger sample sizes (cf. Definition 11).
Theorem 18
(Completeness). The set of all annotated test cases for a specification \(\mathcal {A}_\mathcal {S}\) is complete for every level of significance \(\alpha \in \left( 0,1\right) \) wrt pioco.
5 Experimental Validation
To apply our framework, we implemented two well-known randomized communication protocols in Java, and tested them with the MBT tool JTorX [3]. The statistical verdicts were calculated in MatLab with a level of significance \(\alpha =0.1\).
5.1 Binary Exponential Backoff
The Binary Exponential Backoff protocol is a data transmission protocol between N hosts trying to send information via one bus [19]. If two hosts send simultaneously, their messages collide and each picks a new waiting time before trying again: after i collisions, a host randomly chooses a slot in \(\{0,\ldots ,2^i-1\}\) until the message gets through.
Table 1. A sample O of trace length \(k=5\) and depth (number of test runs) \(m=10^5\). Calculations yield \(\chi ^2=14.84<17.28=\chi ^2_{0.1}\), hence we accept the implementation.
The column n in Table 1 shows how many times each trace occurred, and \(E_\sigma \) gives the expected value. The interval \(\left[ l_{0.1},r_{0.1}\right] \) represents the 90 % confidence interval under the assumption of a normal distribution. It gives a rough idea of how much the values deviate for the given level of confidence. However, we are interested in the multinomial deviation (i.e. less deviation in one trace allows a higher deviation in another trace). For that purpose we use the \(\chi ^2\) score, given by the sum of the entries of the last column. Calculation shows \(\chi ^2=14.84<17.28 =\chi ^{2}_{0.1}\), which is the critical value for 11 degrees of freedom and \(\alpha =0.1\). Consequently, we accept the hypothesis that the probabilities are implemented correctly.
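The \(\chi ^2\) score is the standard Pearson statistic \(\sum _\sigma \left( n_\sigma -E_\sigma \right) ^2/E_\sigma \). A small self-contained sketch follows; the counts are invented for illustration, and the critical value must be looked up for the appropriate degrees of freedom (e.g. 17.28 for 11 degrees of freedom at \(\alpha =0.1\)):

```python
def chi_square_score(observed, expected):
    """Pearson chi-square statistic over per-trace counts."""
    return sum((observed[s] - expected[s]) ** 2 / expected[s]
               for s in expected)

def chi_square_verdict(observed, expected, critical):
    """Accept (pass) iff the score stays below the critical value."""
    return chi_square_score(observed, expected) < critical

# Invented two-outcome example, 1 degree of freedom, critical value 2.706:
observed = {"trace_a": 60, "trace_b": 40}
expected = {"trace_a": 50, "trace_b": 50}
```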
5.2 IEEE 1394 FireWire Root Contention Protocol
Table 2. A sample O of length \(k=5\) and depth \(m=10^5\) of the FireWire root contention protocol. Calculations of \(\chi ^2\) are done after optimization in p.
Table 2 shows the recorded traces, where c1? and c2? denote coin1 and coin2 respectively. We tested five implementations: implementation Correct implements fair coins, while the mutants \(M_1\), \(M_2\), \(M_3\) and \(M_4\) were subject to probabilistic deviations giving an advantage to the second node, i.e. \(P\left( fast_1\right) =P\left( slow_2\right) =0.1\), \(P\left( fast_1\right) =P\left( slow_2\right) =0.4\), \(P\left( fast_1\right) =P\left( slow_2\right) =0.45\) and \(P\left( fast_1\right) =P\left( slow_2\right) =0.49\) for mutants 1, 2, 3 and 4 respectively. The expected value \(E^{\varvec{D}}_\sigma \) depends on resolving one nondeterministic choice by varying p (which coin was flipped first). Note that the remaining nondeterminism was not subject to optimization, but was immediately clear from the trace frequencies. The calculated \(\chi ^2\) scores are based on an optimized value of p for each sample and are compared to the critical value \(\chi ^2_{0.1}=17.28\), resulting in the verdicts shown.
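Finding the scheduler parameter p that best explains a sample can be sketched as a one-dimensional grid search minimizing the \(\chi ^2\) score; the function `expected_fn`, mapping p to expected counts, is our own stand-in for the specification model:

```python
def best_fit_chi_square(observed, expected_fn, grid=1000):
    """Minimize the chi-square score over p in [0, 1] by grid search.
    expected_fn(p) returns the expected count of every trace under the
    scheduler resolving the nondeterministic choice with probability p."""
    best_score, best_p = float("inf"), None
    for i in range(grid + 1):
        p = i / grid
        expected = expected_fn(p)
        if any(e <= 0 for e in expected.values()):
            continue  # chi-square is undefined for zero expected counts
        score = sum((observed[s] - expected[s]) ** 2 / expected[s]
                    for s in expected)
        if score < best_score:
            best_score, best_p = score, p
    return best_score, best_p
```

A sample in which one branch occurs 30 % of the time is then explained best by \(p=0.3\), with score near zero.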
6 Conclusions and Future Work
We defined a sound and complete framework to test probabilistic systems, defined a conformance relation in the ioco tradition called pioco, and showed how to derive probabilistic tests from a requirements model. Verdicts that handle the functional and statistical behaviour are assigned after a test is applied. We showed that the correct verdict can be assigned up to arbitrary precision by setting a level of significance and a sufficiently large sample size.
Future work should focus on the practical aspects of our theory: tool support, larger case studies and more powerful statistical methods to increase efficiency.
Footnotes
 1.
Note that the popular topic of statistical testing is concerned with choosing the test inputs probabilistically; it does not check for the correctness of the random choices made by a system itself.
 2.
Note that in more recent versions of ioco theory [41], test cases are input-enabled. This can easily be incorporated into our framework.
References
1. Al-Karaki, J.N., Kamal, A.E.: Routing techniques in wireless sensor networks: a survey. IEEE Wireless Commun. 11(6), 6–28 (2004)
2. Beaumont, M.A., Zhang, W., Balding, D.J.: Approximate Bayesian computation in population genetics. Genetics 162(4), 2025–2035 (2002)
3. Belinfante, A.: JTorX: a tool for online model-driven test derivation and execution. In: Esparza, J., Majumdar, R. (eds.) TACAS 2010. LNCS, vol. 6015, pp. 266–270. Springer, Heidelberg (2010)
4. Beyer, M., Dulz, W.: Scenario-based statistical testing of quality of service requirements. In: Leue, S., Systä, T.J. (eds.) Scenarios: Models, Transformations and Tools. LNCS, vol. 3466, pp. 152–173. Springer, Heidelberg (2005)
5. Bohnenkamp, H.C., Belinfante, A.: Timed testing with TorX. In: Fitzgerald, J.S., Hayes, I.J., Tarlecki, A. (eds.) FM 2005. LNCS, vol. 3582, pp. 173–188. Springer, Heidelberg (2005)
6. Böhr, F.: Model based statistical testing of embedded systems. In: IEEE 4th International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 18–25 (2011)
7. Bozga, M., David, A., Hartmanns, A., Hermanns, H., Larsen, K.G., Legay, A., Tretmans, J.: State-of-the-art tools and techniques for quantitative modeling and analysis of embedded systems. In: DATE, pp. 370–375 (2012)
8. Brandán Briones, L., Brinksma, E.: A test generation framework for quiescent real-time systems. In: Grabowski, J., Nielsen, B. (eds.) FATES 2004. LNCS, vol. 3395, pp. 64–78. Springer, Heidelberg (2005)
9. Cheung, L., Stoelinga, M., Vaandrager, F.: A testing scenario for probabilistic automata. J. ACM 54(6), Article No. 29 (2007)
10. Choi, S.G., Dachman-Soled, D., Malkin, T., Wee, H.: Improved non-committing encryption with applications to adaptively secure protocols. In: Matsui, M. (ed.) ASIACRYPT 2009. LNCS, vol. 5912, pp. 287–302. Springer, Heidelberg (2009)
11. Cleaveland, R., Dayar, Z., Smolka, S.A., Yuen, S.: Testing preorders for probabilistic processes. Inform. Comput. 154(2), 93–148 (1999)
12. Cohn, D.L.: Measure Theory. Birkhäuser, Boston (1980)
13. Deng, Y., Hennessy, M., van Glabbeek, R.J., Morgan, C.: Characterising testing preorders for finite probabilistic processes. CoRR (2008)
14. Gerhold, M., Stoelinga, M.: Ioco theory for probabilistic automata. In: Proceedings of the Tenth Workshop on MBT, pp. 23–40 (2015)
15. van Glabbeek, R.J., Smolka, S.A., Steffen, B., Tofts, C.: Reactive, generative, and stratified models of probabilistic processes, pp. 130–141. IEEE Computer Society Press (1990)
16. Hierons, R.M., Núñez, M.: Testing probabilistic distributed systems. In: Hatcliff, J., Zucca, E. (eds.) FMOODS 2010, Part II. LNCS, vol. 6117, pp. 63–77. Springer, Heidelberg (2010)
17. Hierons, R.M., Merayo, M.G.: Mutation testing from probabilistic and stochastic finite state machines. J. Syst. Softw. 82, 1804–1818 (2009)
18. Hwang, I., Cavalli, A.R.: Testing a probabilistic FSM using interval estimation. Comput. Netw. 54, 1108–1125 (2010)
19. Jeannet, B., D'Argenio, P.R., Larsen, K.G.: Rapture: a tool for verifying Markov decision processes. In: Tools Day (2002)
20. Kwiatkowska, M., Norman, G., Parker, D.: PRISM: probabilistic symbolic model checker. In: Field, T., Harrison, P.G., Bradley, J., Harder, U. (eds.) TOOLS 2002. LNCS, vol. 2324, pp. 200–204. Springer, Heidelberg (2002)
21. Larsen, K.G., Skou, A.: Bisimulation through probabilistic testing, pp. 344–352. ACM Press (1989)
22. Larsen, K.G., Mikucionis, M., Nielsen, B.: Online testing of real-time systems using Uppaal. In: Grabowski, J., Nielsen, B. (eds.) FATES 2004. LNCS, vol. 3395, pp. 79–94. Springer, Heidelberg (2005)
23. Marsan, M.A., Balbo, G., Conte, G., Donatelli, S., Franceschinis, G.: Modelling with Generalized Stochastic Petri Nets. Wiley, New York (1994)
24. Merayo, M.G., Hwang, I., Núñez, M., Cavalli, A.: A statistical approach to test stochastic and probabilistic systems. In: Breitman, K., Cavalcanti, A. (eds.) ICFEM 2009. LNCS, vol. 5885, pp. 186–205. Springer, Heidelberg (2009)
25. Nie, J., Demmel, J., Gu, M.: Global minimization of rational functions and the nearest GCDs. J. Global Optim. 40(4), 697–718 (2008)
26. Paige, B., Wood, F.: A compilation target for probabilistic programming languages. CoRR arXiv:1403.0504 (2014)
27. Peters, H., Knieke, C., Brox, O., Jauns-Seyfried, S., Krämer, M., Schulze, A.: A test-driven approach for model-based development of powertrain functions. In: Cantone, G., Marchesi, M. (eds.) XP 2014. LNBIP, vol. 179, pp. 294–301. Springer, Heidelberg (2014)
28. Pfeffer, A.: Practical probabilistic programming. In: Frasconi, P., Lisi, F.A. (eds.) ILP 2010. LNCS, vol. 6489, pp. 2–3. Springer, Heidelberg (2011)
29. Prowell, S.J.: Computations for Markov chain usage models. Technical Report (2003)
30. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York (2014)
31. Remke, A., Stoelinga, M. (eds.): Stochastic Model Checking. LNCS, vol. 8453. Springer, Heidelberg (2014)
32. Russell, N., Moore, R.: Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition. In: Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP 1985, vol. 10, pp. 5–8 (1985)
33. Segala, R.: Modeling and verification of randomized distributed real-time systems. Ph.D. thesis, Cambridge, MA, USA (1995)
34. Segala, R.: Testing probabilistic automata. In: Sassone, V., Montanari, U. (eds.) CONCUR 1996. LNCS, vol. 1119, pp. 299–314. Springer, Heidelberg (1996)
35. Sen, K., Viswanathan, M., Agha, G.: Statistical model checking of black-box probabilistic systems. In: Alur, R., Peled, D.A. (eds.) CAV 2004. LNCS, vol. 3114, pp. 202–215. Springer, Heidelberg (2004)
36. Sen, K., Viswanathan, M., Agha, G.: On statistical model checking of stochastic systems. In: Etessami, K., Rajamani, S.K. (eds.) CAV 2005. LNCS, vol. 3576, pp. 266–280. Springer, Heidelberg (2005)
37. Stoelinga, M., Vaandrager, F.W.: Root contention in IEEE 1394. In: Katoen, J.P. (ed.) AMAST-ARTS 1999, ARTS 1999, and AMAST-WS 1999. LNCS, vol. 1601, pp. 53–74. Springer, Heidelberg (1999)
38. Stoelinga, M.: Alea jacta est: verification of probabilistic, real-time and parametric systems. Ph.D. thesis, Radboud University of Nijmegen (2002)
39. Stokkink, W.G.J., Timmer, M., Stoelinga, M.I.A.: Divergent quiescent transition systems. In: Veanes, M., Viganò, L. (eds.) TAP 2013. LNCS, vol. 7942, pp. 214–231. Springer, Heidelberg (2013)
40. Tretmans, J.: Test generation with inputs, outputs and repetitive quiescence. Softw. Concepts Tools 17(3), 103–120 (1996)
41. Tretmans, J.: Model based testing with labelled transition systems. In: Hierons, R.M., Bowen, J.P., Harman, M. (eds.) FORTEST. LNCS, vol. 4949, pp. 1–38. Springer, Heidelberg (2008)
42. Walton, G.H., Poore, J.H., Trammell, C.J.: Statistical testing of software based on a usage model. Softw. Pract. Exper. 25(1), 97–108 (1995)
43. Whittaker, J.A., Thomason, M.G.: A Markov chain model for statistical software testing. IEEE Trans. Softw. Eng. 20(10), 812–824 (1994)