
1 Introduction

Probability. Probability plays an important role in many computer applications. A vast number of randomized algorithms, protocols and computation methods use randomization to achieve their goals. Routing in sensor networks, for instance, can be done via random walks [1]; speech recognition is based on hidden Markov models [32]; population genetics uses Bayesian computation [2]; security protocols use random bits in their encryption methods [10]; control policies in robotics have given rise to the emerging field of probabilistic robotics, concerned with perception and control in the face of uncertainty; and networking algorithms assign bandwidth in a random fashion. Such applications can be implemented in one of the many probabilistic programming languages, such as Probabilistic-C [26] or Figaro [28]. At a higher level, service level agreements are formulated in a stochastic fashion, stating that the average uptime should be at least 99 %, or that the punctuality of train services should be 95 %.

A key question is whether such probabilistic systems are correct: is bandwidth distributed fairly among all parties? Are the up-time, packet delay and jitter according to specification? Do the trains on a certain day run punctually enough? To investigate such questions, probabilistic verification has become a mature research field, putting forward models like probabilistic automata (PAs) [33, 38], Markov decision processes [30] and (generalized) stochastic Petri nets [23], with verification techniques like stochastic model checking [31], and tools like Prism [20].

Testing. In practice, the most common validation technique is testing, where we subject the system under test to many well-designed test cases and compare the outcome to the specification. Surprisingly, only a few papers are concerned with the testing of probabilistic systems, with notable exceptions being [16, 18].

This paper presents a model-based testing framework for probabilistic systems. Model-based testing (MBT) is an innovative method to automatically generate, execute and evaluate test cases from a requirements model. By providing faster and more thorough testing at lower cost, MBT has gained rapid popularity in industry. A wide variety of MBT frameworks exist, capable of handling different system aspects, such as functional properties [40], real-time [5, 8, 22], quantitative aspects [7], and continuous behaviour [27]. As stated, MBT approaches dealing with probability are underdeveloped.

Our Approach. Our specification is given as a probabilistic input/output transition system (pIOTS), a mild generalization of the PA model. As usual, pIOTSs contain two types of choices. Non-deterministic choices model choices that are not under the control of the system; as argued in [33], these are needed to model phenomena like implementation freedom, scheduler choices, intervals of probability and interleaving. Probabilistic choices model random choices made by the system (e.g., coin tosses) or nature (e.g., failure probabilities, degradation rates).

Important contributions are our algorithms to automatically generate, execute and evaluate test cases from a specification pIOTS. These test cases are probabilistic and check whether both the functional and the probabilistic behaviour conform to the specification. Probability is observed through frequencies, hence we execute each test multiple times. We use statistical hypothesis testing, in particular the \(\chi ^2\) test, to assess whether a test case should pass or fail. A technical complication here is the non-determinism in pIOTSs, which prevents us from applying the \(\chi ^2\) test directly. Rather, we first need to find the best resolution of the non-determinism that could have led to these observations. To do so, we set up a non-linear optimization problem that finds the best fit for the \(\chi ^2\) test.

A key result of our paper is the soundness and completeness of our framework. Soundness states that each test case we derive carries the correct verdict: a pass if the behaviour observed during testing conforms to the requirements; a fail if it does not. Completeness states that the framework is powerful enough to discover each deviation of non-conforming implementations. Formulating the soundness and completeness results requires a formal notion of conformance. Here, we propose the pioco-relation, which pins down when an implementation modelled as a pIOTS conforms to a specification pIOTS. We prove several properties of the pioco-relation, in particular that it is a conservative extension of ioco. Lastly, we illustrate our approach with two case studies: the binary exponential backoff protocol, and the IEEE 1394 root contention protocol.

While test efficiency is important, this paper focusses on the methodological set up and correctness. Important future work is to optimize the statistical verdicts we derive and to provide a fully fledged implementation of our methods.

Related Work. Probabilistic testing preorders and equivalences are well studied [11, 13, 34], defining when two probabilistic transition systems are equivalent, or when one subsumes the other. In particular, early and influential work by [21] introduces the fundamental concept of probabilistic bisimulation via hypothesis testing. Also, [9] shows how to observe trace probabilities via hypothesis testing. Executable test frameworks for probabilistic systems have been defined for probabilistic finite state machines [17, 24], dealing with mutations and stochastic timing, for Petri nets [6], and for CSL [35, 36]. The important research line of statistical testing [4, 42, 43] is concerned with choosing the inputs for the SUT in a probabilistic way in order to optimize a certain test metric, such as (weighted) coverage. The question of when to stop statistical testing is tackled in [29].

An approach similar in spirit to ours is that of Hierons et al. [16]. However, our model can be considered an extension of [16], reconciling probabilistic and non-deterministic choices in a fully fledged way. Being more restrictive enables [16] to focus on individual traces, whereas we use trace distributions.

Furthermore, the current paper extends a workshop paper [14] that introduced the pioco-relation and roughly sketched the test process. Novel contributions of our current paper are 1. a more generic pIOTS model that includes internal transitions, 2. the soundness and completeness results, 3. solid definitions of test cases, test execution, and verdicts, 4. the treatment of quiescence, i.e., absence of outputs, and 5. the handling of probabilistic test cases.

Overview of the Paper. Section 2 sets the mathematical framework and introduces pIOTSs, adversaries and trace distributions. Section 3 shows how we generate and execute probabilistic tests and evaluate them functionally and statistically. Section 4 introduces the pioco relation and shows the soundness and completeness of our testing method. Two case studies can be found in Sect. 5. Lastly, Sect. 6 ends the paper with future work and conclusions.

2 Preliminaries

2.1 Probabilistic Input/Output Systems

We start by introducing some standard notions from probability theory. A discrete probability distribution over a set X is a function \(\mu :X\longrightarrow \left[ 0,1\right] \) such that \(\sum _{x\in X}\mu \left( x\right) =1\). The set of all distributions over X is denoted by \( Distr \left( X\right) \). The probability distribution that assigns 1 to a certain element \(x\in X\) is called the Dirac distribution over x and is denoted \( Dirac \left( x\right) \).
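As a small illustration (ours, not part of the formal development), discrete distributions and the Dirac construction can be represented directly in Python:

```python
from typing import Dict, Hashable

Distr = Dict[Hashable, float]   # a discrete distribution: element -> probability

def is_distribution(mu: Distr, tol: float = 1e-9) -> bool:
    """All probabilities lie in [0, 1] and sum to 1."""
    return all(0.0 <= p <= 1.0 for p in mu.values()) and abs(sum(mu.values()) - 1.0) <= tol

def dirac(x) -> Distr:
    """The Dirac distribution assigning probability 1 to x."""
    return {x: 1.0}

assert is_distribution({"heads": 0.5, "tails": 0.5}) and is_distribution(dirac("a"))
```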

A probability space is a triple \(\left( \varOmega ,\mathcal {F},\mathbb {P}\right) \), such that \(\varOmega \) is a set, \(\mathcal {F}\) is a \(\sigma \)-field over \(\varOmega \), and \(\mathbb {P}:\mathcal {F}\rightarrow \left[ 0,1\right] \) is a probability measure such that \(\mathbb {P}\left( \varOmega \right) =1\) and \(\mathbb {P}\left( \bigcup _{i=1}^{\infty }A_{i}\right) =\sum _{i=1}^{\infty }\mathbb {P}\left( A_{i}\right) \) for pairwise disjoint \(A_{i}\in \mathcal {F}\), \(i=1,2,\ldots \)

Definition 1

A probabilistic input/output transition system (pIOTS) is a six-tuple \(\mathcal {A}=\left( S,s_0,L_I,L_O,L_H,\varDelta \right) \), where

  • S is a finite set of states,

  • \(s_0\) is the unique starting state,

  • \(L_I\), \(L_O\) and \(L_H\) are disjoint sets of input, output and internal labels respectively, containing a special quiescence label \(\delta \in L_O\). We write \(L=L_I\cup L_O^\delta \cup L_H\) for the set of all labels.

  • \(\varDelta \subseteq S\times Distr \left( L\times S\right) \) is a finite transition relation such that for all input actions a?, \(\mu \left( a?,s'\right) >0\) implies \(\mu \left( b,s''\right) =0\) for all \(b\ne a?\).

We use "?" to suffix input labels and "!" to suffix output labels. We write \(s\xrightarrow {\mu ,a}s'\) if \(\left( s,\mu \right) \in \varDelta \) and \(\mu \left( a,s'\right) >0\); and \(s\rightarrow a\) if there are \(\mu \in Distr \left( L\times S\right) \) and \(s'\in S\) such that \(s\xrightarrow {\mu ,a}s'\) (and \(s\not \rightarrow a\) if not). We write \(s\xrightarrow {\mu ,a}_\mathcal {A}s'\), etc. to clarify ambiguities if needed. Lastly, \(\mathcal {A}\) is input-enabled if for all \(s\in S\) we have \(s\rightarrow a?\) for all \(a?\in L_I\).

Following [15], pIOTSs are input-reactive and output-generative. Upon receiving an input, the pIOTS decides probabilistically which next state to move to. On producing an output, the pIOTS chooses both the output action and the next state probabilistically. As required in clause 4 of Definition 1, this means that each transition involves either a single input action, or several outputs, quiescence or internal actions. Note that a state can enable input and output transitions, albeit not in the same distribution. Furthermore, in testing, a verdict must also be given if the system under test is quiescent, i.e., produces no output at all. Hence, the requirements model must explicitly indicate when quiescence is allowed, which is expressed by a special output label \(\delta \); for details see [39, 41].
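To make Definition 1 concrete, the following sketch (our own illustration; the field names are not from the paper) encodes a pIOTS and checks the input-reactivity requirement of clause 4:

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

State, Label = str, str
Distr = Dict[Tuple[Label, State], float]   # distribution over (label, target state)

@dataclass
class PIOTS:
    states: FrozenSet[State]
    start: State
    inputs: FrozenSet[Label]     # suffixed "?"
    outputs: FrozenSet[Label]    # suffixed "!", includes the quiescence label "delta"
    internal: FrozenSet[Label]
    delta: List[Tuple[State, Distr]]   # the transition relation of Definition 1

    def well_formed(self) -> bool:
        """Clause 4: if a distribution gives positive probability to an input a?,
        every other label must receive probability 0 in that distribution."""
        for _, mu in self.delta:
            positive = {a for (a, _), p in mu.items() if p > 0}
            if positive & self.inputs and len(positive) > 1:
                return False
        return True
```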

Fig. 1. Specification and two implementations of a shuffle music player. Actions separated by commas indicate that two transitions are enabled from the state.

Example 2

Figure 1 shows three models of a simple shuffle mp3 player with two songs. The pIOTS in Fig. 1a models the requirements: pressing the shuffle button enables the two songs with probability 0.5 each, repeatedly until stop is pressed.

Implementation (Fig. 1b) is subject to a small probabilistic deviation. In implementation (Fig. 1c), the same song cannot be played twice in a row without intervention of the shuffle button. States without an enabled output transition allow quiescence, denoted by \(\delta \) transitions. The model-based testing framework established in this paper is capable of detecting all of the above flaws.

Parallel composition is defined in a standard fashion. Two pIOTSs in composition synchronize on shared actions, and evolve independently on others. Since the transitions in the component pIOTSs are stochastically independent, we multiply the probabilities when taking shared actions, denoted by \(\mu \times \nu \). To avoid name clashes, we only compose compatible pIOTSs. Note that parallel composition of two input-enabled pIOTSs yields a pIOTS.

Definition 3

Two pIOTSs \(\mathcal {A}=\left( S,s_0,L_I,L_O,L_H,\varDelta \right) \) and \(\mathcal {A}'=(S',s_0',L_I',L_O^{\prime },L_H',\varDelta ')\), are compatible if \(L_O\cap L_O^{\prime }=\left\{ \delta \right\} \), \(L_H\cap L'=\emptyset \) and \(L\cap L_H'=\emptyset \). Their parallel composition is the tuple

$$\begin{aligned} \mathcal {A}\mid \mid \mathcal {A}'=\left( S'',\left( s_0,s_0'\right) ,L_{I}'',L_{O}^{\prime \prime }, L_H'',\varDelta ''\right) ,\,\text {where} \end{aligned}$$

\(S''=S\times S'\), \(L_I''=\left( L_I\cup L_I'\right) \backslash \left( L_O\cup L_O'\right) \), \(L_O^{\prime \prime }=L_O\cup L_O^{\prime }\), \(L_H''=L_H\cup L_H'\), and finally \(\varDelta ''=\left\{ \left( \left( s,t\right) ,\mu \right) \in S''\times Distr \left( L''\times S''\right) \mid \mu =\nu _1\times \nu _2\,\vee \,\mu =\nu _1\times \mathbb {1}\,\vee \,\mu =\mathbb {1}\times \nu _2\right\} \),

where \(\left( s,\nu _1\right) \in \varDelta \) and \(\left( t,\nu _2\right) \in \varDelta '\) respectively, and \(\left( \nu _1\times \mathbb {1}\right) \left( \left( s,t\right) ,a\right) =\nu _1\left( s,a\right) \cdot 1\) and \(\left( \mathbb {1}\times \nu _2\right) \left( \left( s,t\right) ,a\right) =1\cdot \nu _2\left( t,a\right) \).
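The product distributions used above can be sketched as follows; this is our own illustration, where sync_product, lift_left and lift_right correspond to \(\nu _1\times \nu _2\), \(\nu _1\times \mathbb {1}\) and \(\mathbb {1}\times \nu _2\), and the input distributions are assumed to range over the relevant shared or non-shared actions:

```python
from typing import Dict, Tuple

State, Label = str, str
Distr = Dict[Tuple[Label, State], float]
PairDistr = Dict[Tuple[Label, Tuple[State, State]], float]

def sync_product(nu1: Distr, nu2: Distr) -> PairDistr:
    """nu1 x nu2: both components take a shared action a; since their transitions
    are stochastically independent, the probabilities are multiplied."""
    result: PairDistr = {}
    for (a, s), p in nu1.items():
        for (b, t), q in nu2.items():
            if a == b:
                result[(a, (s, t))] = result.get((a, (s, t)), 0.0) + p * q
    return result

def lift_left(nu1: Distr, t: State) -> PairDistr:
    """nu1 x 1: the left component moves alone while the right stays in state t."""
    return {(a, (s, t)): p for (a, s), p in nu1.items()}

def lift_right(s: State, nu2: Distr) -> PairDistr:
    """1 x nu2: the symmetric case for the right component."""
    return {(a, (s, t)): p for (a, t), p in nu2.items()}
```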

2.2 Paths and Traces

We define the usual language concepts for LTSs. Let \(\mathcal {A}=\left( S,s_{0},L_{I},L_{O},L_H,\varDelta \right) \) be a pIOTS. A path \(\pi \) of \(\mathcal {A}\) is a (possibly) infinite sequence of the following form

$$\begin{aligned} \pi =s_{1}\mu _{1}a_{1}s_{2}\mu _{2}a_{2}s_{3}\mu _{3}a_{3}s_{4}\ldots , \end{aligned}$$

where \(s_{i}\in S\), \(a_{i}\in L\) and \(\mu _i\in Distr \left( L\times S\right) \), such that each finite path ends in a state and \(s_{i}\xrightarrow {\mu _{i+1},a_{i+1}}s_{i+1}\) for each non-final i. We use \( last \left( \pi \right) \) to denote the last state of a finite path (\( last \left( \pi \right) =\infty \) for infinite paths). The set of all finite paths of \(\mathcal {A}\) is denoted by \( Path ^{*}\left( \mathcal {A}\right) \) and all infinite paths by \( Path \left( \mathcal {A}\right) \).

The associated trace of a path \(\pi \) is obtained by omitting states, distributions and internal actions, i.e. \( trace \left( \pi \right) =a_{1}a_{2}a_{3}\ldots \). Conversely, \( trace ^{-1}\left( \sigma \right) \) gives the set of all paths that have trace \(\sigma \). The length of a path is the number of actions occurring on its associated trace. All finite traces of \(\mathcal {A}\) are summarized in \( traces \left( \mathcal {A}\right) \). The set of complete traces, \( ctraces \left( \mathcal {A}\right) \), contains every trace based on paths ending in states that do not enable any further actions. We write \( out {}_{\mathcal {A}}\left( \sigma \right) \) for the set of all output actions enabled with positive probability after trace \(\sigma \).

2.3 Adversaries and Trace Distributions

Just as traces of an LTS are obtained by first selecting a path and then removing all states and internal actions, we proceed analogously in the probabilistic case: first, we resolve all non-deterministic choices in the pIOTS via an adversary, and then we remove all states to obtain the trace distribution.

The resolution of the non-determinism via an adversary leads to a purely probabilistic system, in which we can assign a probability to each finite path. A classical result in measure theory [12] shows that it is impossible to assign a probability to all sets of traces, hence we use \(\sigma \)-fields \(\mathcal {F}\) consisting of cones.

Adversaries. Following the standard theory for probabilistic automata [38], we define the behaviour of a pIOTS via adversaries (a.k.a. policies or schedulers) that resolve the non-deterministic choices in pIOTSs; in each state of the pIOTS, the adversary may choose which transition to take, or it may halt the execution.

Given any finite history leading to a state, an adversary returns a discrete probability distribution over the set of next transitions. In order to model termination, we define schedulers such that they can continue paths with a halting extension, after which only quiescence is observed.

Definition 4

An adversary E of a pIOTS \(\mathcal {A}=\left( S,s_{0},L_{I},L_{O},L_H,\varDelta \right) \) is a function

$$ E: Path {}^{*}\left( \mathcal {A}\right) \longrightarrow Distr \left( Distr \left( L\times S\right) \cup \left\{ \perp \right\} \right) , $$

such that for each finite path \(\pi \), if \(E\left( \pi \right) \left( \mu \right) >0\), then \(\left( last\left( \pi \right) ,\mu \right) \in \varDelta \) or \(\mu \equiv \perp \). We say that E is deterministic if \(E\left( \pi \right) \) is a Dirac distribution for every \(\pi \in Path ^{*}\left( \mathcal {A}\right) \). The value \(E\left( \pi \right) \left( \perp \right) \) is interpreted as interruption/halting. An adversary E halts on a path \(\pi \) if \(E\left( \pi \right) \left( \perp \right) =1\). We say that an adversary halts after \(k\in \mathbb {N}\) steps if it halts for every path of length greater than or equal to k. We denote the set of all such finite adversaries by \( adv \left( \mathcal {A},k\right) \).

Intuitively an adversary tosses a multi-faced and biased die at every step of the computation, thus resulting in a purely probabilistic computation tree. The probability assigned to a path \(\pi \) is obtained by the probability of its cone \(C_\pi = \left\{ \pi '\in Path \left( \mathcal {A}\right) \mid \pi \sqsubseteq \pi '\right\} \). We use the inductively defined path probability function \(Q^E\), i.e. \(Q^E\left( s_0\right) =1\) and \(Q^E\left( \pi \mu a s\right) =Q^E\left( \pi \right) E\left( \pi \right) \left( \mu \right) \mu \left( a,s\right) \). This function enables us to assign a unique probability space \(\left( \varOmega _E,\mathcal {F}_E,P_E\right) \) associated to an adversary E. Thus, the probability of \(\pi \) is \(P_E\left( \pi \right) =P_E\left( C_\pi \right) =Q^E\left( \pi \right) \).
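As an illustration of the path probability function \(Q^E\) (a minimal sketch under our own encoding of paths and adversaries, not the paper's notation), it can be computed as follows:

```python
from typing import Callable, Dict, List, Tuple

State, Label = str, str
Distr = Dict[Tuple[Label, State], float]
Step = Tuple[Distr, Label, State]          # one (mu, a, s) extension of a path

def path_probability(start: State, steps: List[Step],
                     adversary: Callable[[State, List[Step], Distr], float]) -> float:
    """Q^E from the text: Q^E(s0) = 1 and Q^E(pi mu a s) = Q^E(pi) * E(pi)(mu) * mu(a, s).
    `adversary(start, prefix, mu)` returns the probability E assigns to mu after prefix."""
    q, prefix = 1.0, []                    # Q^E(s0) = 1
    for mu, a, s in steps:
        q *= adversary(start, prefix, mu) * mu.get((a, s), 0.0)
        prefix = prefix + [(mu, a, s)]
    return q
```

For the deterministic choices of an adversary the factor \(E\left( \pi \right) \left( \mu \right) \) is 1, so \(Q^E\) simply multiplies the transition probabilities along the path, as in Example 6 below.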

Trace Distributions. A trace distribution is obtained from (the probability space of) an adversary by removing all states. Thus, the probability assigned to a set of traces X is the probability of all paths whose trace is an element of X.

Definition 5

The trace distribution H of an adversary E, denoted \(H= trd \left( E\right) \) is the probability space \(\left( \varOmega _{H},\mathcal {F}_{H},P_{H}\right) \) given by

  1. \(\varOmega _{H}=L^\omega \),

  2. \(\mathcal {F}_{H}\) is the smallest \(\sigma \)-field containing the set \(\left\{ C_{\beta }\subseteq \varOmega _H\mid \beta \in L^{\omega }\right\} \),

  3. \(P_{H}\) is the unique probability measure on \(\mathcal {F}_{H}\) such that \(P_{H}\left( X\right) =P_{E}\left( trace {}^{-1}\left( X\right) \right) \) for \(X\in \mathcal {F}_{H}\).

We write \( trd \left( \mathcal {A}\right) \) for the set of all trace distributions of \(\mathcal {A}\) and \( trd \left( \mathcal {A},k\right) \) for those halting after \(k\in \mathbb {N}\). Lastly we write \(\mathcal {A}=_{ TD }\mathcal {B}\) if \( trd \left( \mathcal {A}\right) = trd \left( \mathcal {B}\right) \), \(\mathcal {A}\sqsubseteq _{ TD }\mathcal {B}\) if \( trd \left( \mathcal {A}\right) \subseteq trd \left( \mathcal {B}\right) \) and \(\mathcal {A}\sqsubseteq _{ TD }^{k}\mathcal {B}\) if \( trd \left( \mathcal {A},k\right) \subseteq trd \left( \mathcal {B},k\right) \) for \(k\in \mathbb {N}\).

The fact that \(\left( \varOmega _{E},\mathcal {F}_{E},P_{E}\right) \) and \(\left( \varOmega _{H},\mathcal {F}_{H},P_{H}\right) \) define probability spaces follows from standard measure theory arguments (see for example [12]).

Example 6

Consider (c) in Fig. 1 and an adversary E starting from the initial state \(s_0\) that schedules probability 1 to shuf?, probability 1 to the distribution consisting of song1! and song2!, and \(\frac{1}{2}\) to each of the two shuf? transitions in \(s_2\). Then choose the paths \(\pi =s_0\mu _1\text{ shuf? }s_1 \mu _2 \text{ song1! } s_2 \mu _3 \text{ shuf? } s_2\) and \(\pi '=s_0\mu _1\text{ shuf? }s_1 \mu _2 \text{ song1! } s_2 \mu _4 \text{ shuf? } s_1\).

We see that \(\sigma = trace \left( \pi \right) = trace \left( \pi '\right) \) and \(P_E\left( \pi \right) =Q^E\left( \pi \right) =\frac{1}{4}\) and \(P_E\left( \pi '\right) =Q^E\left( \pi '\right) =\frac{1}{4}\), but \(P_{ trd \left( E\right) }\left( \sigma \right) =P_E\left( trace ^{-1}\left( \sigma \right) \right) =P_E\left( \left\{ \pi ,\pi '\right\} \right) =\frac{1}{2}\).

3 Testing with pIOTS

3.1 Test Generation

Model-based testing entails the automatic test case generation, execution and evaluation based on a requirements model. We provide two algorithms for test case generation: an offline or batch algorithm that generates test cases before their execution; and an online or on-the-fly algorithm generating test cases during execution.

First, we formalize the notion of an (offline) test case over an action signature \(\left( L_I,L_O\right) \). In each state of a test, the tester can either provide some stimulus \(a?\in L_I\), wait for a response of the system, or stop the testing process. Each of these possibilities can be chosen with a certain probability, leading to probabilistic test cases. We model this as a probabilistic choice between the internal actions \(\tau _{\textit{obs}}\), \(\tau _{\textit{stop}}\) and \(\tau _{\textit{stim}}\). Note that, even in the non-probabilistic case, test cases are often generated probabilistically in practice, but this is not reflected in the theory. Thus, our definition fills a small gap here.

Furthermore, note that, when waiting for a system response, we have to take into account all potential outputs in \(L_O\), including the situation that the system provides no response at all, modelled by \(\delta \). Since the continuation of a test depends on the history, offline test cases are formalized as trees.

Definition 7

A test or test case over an action signature \(\left( L_{I},L_{O}\right) \) is a pIOTS of the form \(t=\left( S,s_{0},L_{O}\backslash \left\{ \delta \right\} ,L_{I}\cup \left\{ \delta \right\} ,\left\{ \tau _{\textit{obs}},\tau _{\textit{stim}},\tau _{\textit{stop}}\right\} ,\varDelta \right) \) such that

  • \(t\) is internally deterministic and does not contain an infinite path;

  • \(t\) is acyclic and connected;

  • For every state \(s\in S\), we either have

    – \( after \left( s\right) =\emptyset \), or

    – \( after \left( s\right) =\left\{ \tau _{\textit{obs}},\tau _{\textit{stim}},\tau _{\textit{stop}}\right\} \), or

    – \( after \left( s\right) =L_{I}\cup \left\{ \delta \right\} \), or

    – \( after \left( s\right) =L_{ out }\), such that \(L_{ out }\subseteq L_{O}\backslash \left\{ \delta \right\} \),

where \( after \left( s\right) \) is the set of actions in state s. A test suite \(T\) is a set of test cases. A test case (suite) for a pIOTS \(\mathcal {A}_\mathcal {S}=\left( S,s_{0},L_{I},L_{O},L_H,\varDelta \right) \), is a test case (suite) over \(\left( L_{I},L_{O}\right) \).

Note that the action signature of tests has switched input and output label sets.

Definition 8

For a given test \(t\) a test annotation is a function

$$ a: ctraces \left( t\right) \longrightarrow \left\{ pass , fail \right\} \!. $$

A pair \(\widehat{t}=\left( t,a\right) \) consisting of a test and a test annotation is called an annotated test. The set of all such \(\widehat{t}\), denoted by \(\widehat{T}=\left\{ \left( t_{i},a_{i}\right) _{i\in \mathcal {I}}\right\} \) for some index set \(\mathcal {I}\), is called an annotated test suite. If \(t\) is a test case for a specification \(\mathcal {A}_\mathcal {S}\) we define the test annotation \(a_{\mathcal {A}_\mathcal {S},t}: ctraces \left( t\right) \longrightarrow \left\{ pass , fail \right\} \) by

$$ a_{\mathcal {A}_\mathcal {S},t}\left( \sigma \right) ={\left\{ \begin{array}{ll} fail &{} \text {if}\,\exists \varrho \in traces \left( \mathcal {A}_\mathcal {S}\right) , a!\in L_{O}^\delta : \varrho a!\sqsubseteq \sigma \wedge \varrho a!\notin traces \left( \mathcal {A}_\mathcal {S}\right) ;\\ pass &{} \text {otherwise.} \end{array}\right. } $$
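A direct reading of this annotation can be sketched as follows (our own illustration; spec_traces stands for \( traces \left( \mathcal {A}_\mathcal {S}\right) \) restricted to the relevant prefixes, and outputs for \(L_O^\delta \)):

```python
from typing import Set, Tuple

Trace = Tuple[str, ...]

def annotate(sigma: Trace, spec_traces: Set[Trace], outputs: Set[str]) -> str:
    """Definition 8: fail iff some prefix rho a! of sigma extends a specified trace rho
    with an output a! (quiescence delta included) that the specification does not allow."""
    for i in range(1, len(sigma) + 1):
        rho, a = sigma[:i - 1], sigma[i - 1]
        if a in outputs and rho in spec_traces and sigma[:i] not in spec_traces:
            return "fail"
    return "pass"
```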

Example 9

Figure 2 shows two tests derived from the specification in Fig. 1. Note that the action signature is mirrored; therefore, if \(s\xrightarrow {\mu ,a}_{\widehat{t}}s'\) with a an output action of the specification, then \(\mu \) is a Dirac distribution. Test \(\widehat{t}_2\) shows how we apply stimuli, observe or stop with probability \(\frac{1}{3}\) each. If we stimulate, we apply stop! and shuf! with probability \(\frac{1}{2}\) each.

Fig. 2. Two tests derived from the specification in Fig. 1.

Algorithms. The procedure batch in Algorithm 1 generates test cases from a specification pIOTS \(\mathcal {A}_\mathcal {S}\) and a history \(\sigma \), which is initially \(\epsilon \). At each step, a probabilistic choice is made to return an empty test, to observe, or to stimulate, with probabilities \(p_{\sigma ,1}\), \(p_{\sigma ,2}\) and \(p_{\sigma ,3}\) respectively. The latter two call the procedure batch again. If erroneous output is detected, we stop immediately. We require that \(p_{\sigma ,1}+p_{\sigma ,2}+p_{\sigma ,3}=1\).

Algorithm 2 shows a sound way to derive tests on-the-fly. The inputs are a specification \(\mathcal {A}_\mathcal {S}\), a concrete implementation \(\mathcal {A}_\mathcal {I}\) and a test length \(n\in \mathbb {N}\). The algorithm returns a verdict of whether or not the implementation is ioco correct in the first n steps. If erroneous output was detected, the verdict will be \( fail \) and \( pass \) otherwise. With probability \(p_{\sigma ,1}\) we observe and with probability \(p_{\sigma ,2}\) we stimulate. The algorithm stops after n steps. Thus, we require \(p_{\sigma ,1}+p_{\sigma ,2}=1\).
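As a rough illustration of the on-the-fly procedure just described (not the authors' pseudocode; spec.outputs_after, spec.inputs_after, sut.observe and sut.stimulate are hypothetical interfaces):

```python
import random

def test_on_the_fly(spec, sut, n: int, p_obs: float) -> str:
    """For n steps, observe with probability p_obs (= p_{sigma,1}) and stimulate
    otherwise; fail as soon as an unspecified output is observed."""
    sigma = []                                        # the history built so far
    for _ in range(n):
        if random.random() < p_obs:
            b = sut.observe()                         # an output or quiescence "delta"
            if b not in spec.outputs_after(tuple(sigma)):
                return "fail"                         # erroneous output detected
            sigma.append(b)
        else:
            a = random.choice(spec.inputs_after(tuple(sigma)))
            sut.stimulate(a)
            sigma.append(a)
    return "pass"
```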

Theorem 10

All test cases generated by Algorithm 1 are test cases according to Definition 7. All test cases generated by Algorithm 2 assign the correct verdict according to Definition 8.


3.2 Test Evaluation

In our framework, we assess functional behaviour by the test verdict \(a_{\mathcal {A}_\mathcal {S},t}\) and probabilistic behaviour via statistics, as elaborated below.

Statistical Verdict. Given a (black box) implementation, the idea is to run an offline or online test case multiple times, in order to collect a sample. Then, we check via statistical hypothesis testing whether the frequencies of the traces contained in this sample match the probabilities in the specification. However, since the specification contains non-determinism, we cannot apply statistical means directly. Rather, we check whether the observed trace frequencies can be explained by resolving the non-determinism in the specification according to some scheduler.

We formulate a hypothesised scheduler that makes the occurrence of the sample most likely. This gives rise to a purely probabilistic computation tree, and probabilities and expected values for each trace can be calculated. Based on a predefined level of significance \(\alpha \in \left( 0,1\right) \), we use null hypothesis testing to determine whether to accept or reject the hypothesised scheduler. If it is accepted, we have no reason to assume that the implementation differs probabilistically from the specification, and we assign the pass verdict. If it is rejected, we assign the fail verdict, because there is no scheduler that explains the observed frequencies.

Sampling. To collect a sample, we first define the length \(k\in \mathbb {N}\) and width \(m\in \mathbb {N}\) of an experiment, i.e. how long we observe the machine and how many times we run it before stopping. Thus, we collect \(\sigma _1,\ldots ,\sigma _m\in traces \left( \mathcal {A}_\mathcal {I}\right) \) with \(\left| \sigma _i\right| =k\) for \(i=1,\ldots ,m\). We call \(O=\left( \sigma _1,\ldots , \sigma _m\right) \in \left( L^k\right) ^m\) a sample. We assume the system is governed by a trace distribution \(D_i\) in every run; thus, running the machine m times means that a sample is generated by a sequence of m (possibly different) trace distributions \(\varvec{D}=\left( D_1,D_2,\ldots ,D_m\right) \in trd \left( \mathcal {A}_\mathcal {I},k\right) ^m\).

In each run, the implementation makes two choices: (1) it chooses a trace distribution \(D_i\), and (2) \(D_i\) chooses a trace \(\sigma _i\). Once a trace distribution \(D_i\) is chosen, it is solely responsible for the trace \(\sigma _i\), meaning that for \(i\ne j\) the choice of \(\sigma _i\) by \(D_i\) is independent of the choice of \(\sigma _j\) by \(D_j\).

Frequencies. The frequency function is defined as \( freq :\left( L^k\right) ^m\rightarrow Distr \left( L^k\right) \), such that \( freq \left( O\right) \left( \sigma \right) =\frac{\left| \left\{ i\in \left\{ 1,\ldots ,m\right\} \mid \sigma =\sigma _i\right\} \right| }{m}\). Assume that \(k,m\in \mathbb {N}\), \(\varvec{D}\) and \(\sigma \in L^k\) are fixed. Then a sample O can be treated as a Bernoulli experiment of length m, where success occurs in position \(i\in \left\{ 1,\ldots ,m\right\} \) if \(\sigma =\sigma _i\). Thus, the success probability in the i-th step is given by \(P_{D_i}\left( \sigma \right) \). Now assume \(X_i\) are Bernoulli distributed random variables for \(i=1,\ldots , m\). We define a new random variable \(Z=\frac{1}{m}\sum _{i=1}^{m}X_i\), which represents the frequency of success in m steps governed by \(\varvec{D}\). Thus the expected frequency is given as

$$ \mathbb {E}^{\varvec{D}}_{\sigma }:=\mathbb {E}\left( Z\right) =\frac{1}{m}\sum _{i=1}^m\mathbb {E}\left( X_i\right) =\frac{1}{m}\sum _{i=1}^m P_{D_i}\left( \sigma \right) \!. $$

We have \(\sum _\sigma \mathbb {E}^{\varvec{D}}_\sigma =1\), which means that \(\mathbb {E}^{\varvec{D}}\) is the distribution expected under \(\varvec{D}\).
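Both quantities are straightforward to compute; the following sketch (our own illustration) derives \( freq \left( O\right) \) from a sample and \(\mathbb {E}^{\varvec{D}}_\sigma \) from per-run trace probabilities \(P_{D_i}\left( \sigma \right) \):

```python
from collections import Counter
from typing import Dict, List, Tuple

Trace = Tuple[str, ...]

def freq(sample: List[Trace]) -> Dict[Trace, float]:
    """freq(O)(sigma): the fraction of the m runs in which sigma was observed."""
    m = len(sample)
    return {sigma: c / m for sigma, c in Counter(sample).items()}

def expected_freq(run_probs: List[Dict[Trace, float]]) -> Dict[Trace, float]:
    """E^D_sigma = (1/m) * sum_i P_{D_i}(sigma), where run i is governed by D_i."""
    m = len(run_probs)
    expected: Dict[Trace, float] = {}
    for probs in run_probs:
        for sigma, p in probs.items():
            expected[sigma] = expected.get(sigma, 0.0) + p / m
    return expected
```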

Acceptable Outcomes. We accept a sample O if \( freq \left( O\right) \) lies within some distance r of the expected distribution \(\mathbb {E}^{\varvec{D}}\). Recall that the ball centred at \(x\in X\) with radius r is \(B_r\left( x\right) =\left\{ y\in X\mid dist \left( x,y\right) \le r\right\} \). All distributions deviating at most r from the expected distribution are contained in the ball \(B_r\left( \mathbb {E}^{\varvec{D}}\right) \), where \( dist \left( u,v\right) :=\sup _{\sigma \in L^k}\mid u\left( \sigma \right) -v\left( \sigma \right) \mid \) for distributions u and v. In order to minimize the error of falsely accepting a sample, we choose the smallest radius such that the error of falsely rejecting a sample is not greater than a predefined level of significance \(\alpha \in \left( 0,1\right) \), i.e. \(\bar{r}:=\inf \left\{ r\mid P_{\varvec{D}}\left( freq ^{-1}\left( B_r\left( \mathbb {E}^{\varvec{D}}\right) \right) \right) >1-\alpha \right\} .\)

Definition 11

For \(k,m\in \mathbb {N}\) and a pIOTS \(\mathcal {A}\) the acceptable outcomes under \(\varvec{D}\in trd \left( \mathcal {A},k\right) ^m\) of significance level \(\alpha \in \left( 0,1\right) \) are given by the set of observations \( Obs \left( \varvec{D},\alpha ,k,m\right) =\left\{ O\in \left( L^k\right) ^m \mid dist \left( freq \left( O\right) ,\mathbb {E}^{\varvec{D}}\right) \le \bar{r}\right\} \). The set of observations of \(\mathcal {A}\) of significance level \(\alpha \in \left( 0,1\right) \) is given by

$$ Obs \left( \mathcal {A},\alpha ,k,m\right) =\bigcup _{\varvec{D}\in trd \left( \mathcal {A},k\right) ^m} Obs \left( \varvec{D},\alpha ,k,m\right) \!. $$

The defined set of observations of a pIOTS \(\mathcal {A}\) therefore has two properties, reflecting the error of false rejection and false acceptance respectively.

  1. For \(\varvec{D}\in trd \left( \mathcal {A}\right) \) of length k, we have \(P_{\varvec{D}}\left( Obs \left( \mathcal {A},\alpha ,k,m\right) \right) \ge 1-\alpha \),

  2. For \(\varvec{D}'\notin trd \left( \mathcal {A}\right) \) of length k, we have \(P_{\varvec{D}'}\left( Obs \left( \mathcal {A},\alpha ,k,m\right) \right) \le \beta _m\),

where \(\alpha \) is the predefined level of significance and \(\beta _m\) is unknown but minimal by construction. Note that \(\beta _m\rightarrow 0\) as \(m\rightarrow \infty \), thus the error of falsely accepting an observation decreases with increasing sample width.

Application. This framework has two problems for practical applications: (1) the parameter \(\bar{r}\) may be hard to find, and (2) for a given sample, it is no trivial task to find the trace distribution that gives it maximal likelihood, i.e.

$$ \mathbb {P}_\mathcal {A}^{k,m}\left( O\right) :=\max _{\varvec{D}\in \left( trd \left( \mathcal {A},k\right) \backslash trd \left( \mathcal {A},k-1\right) \right) ^m} P_{\varvec{D}}\left( O\right) \!. $$

The parameter \(\bar{r}\) gives the best fit, but finding it is no trivial task. It is of interest for the soundness and completeness proofs, but in practice we use \(\chi ^2\) hypothesis testing. The empirical value \(\chi ^2=\sum _{i=1}^m (n\left( \sigma _i\right) -mE^{\varvec{D}}_{\sigma _i})^2/(mE^{\varvec{D}}_{\sigma _i})\), where \(n\left( \sigma \right) \) is the number of times \(\sigma \) occurred in the sample, is compared to critical values for given degrees of freedom and levels of significance. These values can be calculated or looked up in a \(\chi ^2\) table.
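As an illustration, the comparison can be carried out as follows, assuming SciPy is available and taking the degrees of freedom as the number of distinct traces minus one (as in the example of Sect. 5.1):

```python
from typing import Dict, Tuple
from scipy.stats import chi2

Trace = Tuple[str, ...]

def chi_square_verdict(counts: Dict[Trace, int], expected: Dict[Trace, float],
                       m: int, alpha: float = 0.1):
    """Compare the empirical chi-square score with the critical value.
    `counts[sigma]` is n(sigma); `expected[sigma]` is E^D_sigma."""
    score = sum((counts[s] - m * expected[s]) ** 2 / (m * expected[s]) for s in counts)
    dof = len(counts) - 1                      # e.g. 11 for the 12 traces of Table 1
    critical = chi2.ppf(1.0 - alpha, dof)
    return ("pass" if score < critical else "fail"), score, critical
```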

Since expectations in our construction depend on a scheduler/trace distribution to explain a possible sample, it is of interest to find the best fit. Hence, we are trying to solve the minimisation

$$\begin{aligned} \min _{\varvec{D}} \sum _{i=1}^m \frac{\left( n\left( \sigma _i\right) -mE^{\varvec{D}}_{\sigma _i}\right) ^2}{mE^{\varvec{D}}_{\sigma _i}}. \end{aligned}$$
(1)

By construction, we want to optimize the probabilities \(p_i\) used by a scheduler to resolve non-determinism. This turns (1) into a minimisation of a rational function \(f\left( p\right) /g\left( p\right) \) with inequality constraints on the vector p. As shown in [25], minimizing rational functions is NP-hard in general. Note that this approach optimizes a single trace distribution to fit the sample data, instead of finding m possibly different ones; relaxing this assumption and letting the implementation choose a different trace distribution in each run is a topic for future research.
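A numerical sketch of this best-fit search is given below; it is our own illustration, expected_of(p) is a hypothetical, model-specific function returning the expected frequencies induced by the scheduler parametrised by p, and additional constraints (e.g. that certain entries of p sum to 1) may be needed depending on that parametrisation:

```python
import numpy as np
from scipy.optimize import minimize

def best_fit_chi_square(counts, expected_of, m, p0):
    """Minimise the chi-square score of (1) over the scheduler probabilities p."""
    def score(p):
        exp = expected_of(p)
        return sum((counts[s] - m * exp[s]) ** 2 / (m * exp[s]) for s in counts)
    bounds = [(0.0, 1.0)] * len(p0)            # each p_i is a probability
    res = minimize(score, np.asarray(p0, dtype=float), bounds=bounds)
    return res.fun, res.x                      # best score and the optimising p
```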

Verdict Function. With this framework, the following decision process summarizes if an implementation fails for functional and/or statistical behaviour.

Definition 12

Given a specification \(\mathcal {A}_\mathcal {S}\), an annotated test \(\widehat{t}\) for \(\mathcal {A}_\mathcal {S}\), \(k,m\in \mathbb {N}\) where k is given by the trace length of \(\widehat{t}\), and a level of significance \(\alpha \in \left( 0,1\right) \), we define the functional verdict as the function \(v_{\widehat{t}}: pIOTS \longrightarrow \left\{ pass , fail \right\} \), with

$$ v_{\widehat{t}}\left( \mathcal {A}_\mathcal {I}\right) ={\left\{ \begin{array}{ll} pass &{} \text {if}\,\forall \sigma \in ctraces \left( \mathcal {A}_\mathcal {I}\left| \right| t\right) \cap ctraces \left( t\right) : a\left( \sigma \right) = pass \\ fail &{} \text {otherwise,}\end{array}\right. } $$

the statistical verdict as the function \(v_t^{\alpha ,m}: pIOTS \longrightarrow \left\{ pass , fail \right\} \), with

$$ v_{t}^{\alpha ,m}\left( \mathcal {A}_\mathcal {I}\right) ={\left\{ \begin{array}{ll} pass &{} \text {if}\,\mathbb {P}^{k,m}_{\mathcal {A}_\mathcal {S}}\left( Obs \left( \mathcal {A}_\mathcal {I}\left| \right| t,\alpha ,k,m\right) \right) \ge 1-\alpha \\ fail &{} \text {otherwise,}\end{array}\right. } $$

and finally the overall verdict as the function \(V_{\widehat{t}}^{\alpha ,m}: pIOTS \rightarrow \left\{ pass , fail \right\} \), with \(V^{\alpha ,m}_{\widehat{t}}\left( \mathcal {A}_\mathcal {I}\right) = pass \) if \(v_{\widehat{t}}\left( \mathcal {A}_\mathcal {I}\right) =v_t^{\alpha ,m}\left( \mathcal {A}_\mathcal {I}\right) = pass \) and \(V^{\alpha ,m}_{\widehat{t}}\left( \mathcal {A}_\mathcal {I}\right) = fail \) otherwise. For an annotated test suite \(\widehat{T}\) for \(\mathcal {A}_\mathcal {S}\) we lift this to \(V_{\widehat{T}}^{\alpha ,m}\left( \mathcal {A}_\mathcal {I}\right) = pass \) if \(V_{\widehat{t}}^{\alpha ,m}\left( \mathcal {A}_\mathcal {I}\right) = pass \) for each \(\widehat{t}\in \widehat{T}\) and \(V_{\widehat{T}}^{\alpha ,m}\left( \mathcal {A}_\mathcal {I}\right) = fail \) otherwise.

4 Conformance, Soundness and Completeness

A key result of our paper is the correctness of our framework, formalized as soundness and completeness. Soundness states that each test case is assigned the correct verdict. Completeness states that the framework is powerful enough to discover each deviation from the specification. Formulating these properties requires a formal notion of conformance that we formalize as the pioco-relation.

4.1 Probabilistic Input/Output Conformance \(\sqsubseteq _{ pioco }\)

The classical ioco relation [40] states that an implementation conforms to a specification, if it never provides any unspecified output or quiescence, i.e. for two IOTSs \(\mathcal {A}_\mathcal {I}\) and \(\mathcal {A}_\mathcal {S}\), with \(\mathcal {A}_\mathcal {I}\) input-enabled, we say \(\mathcal {A}_\mathcal {I}\sqsubseteq _{ioco}\mathcal {A}_\mathcal {S}\), iff

$$ \forall \sigma \in traces \left( \mathcal {A}_\mathcal {S}\right) : out _{\mathcal {A}_\mathcal {I}}\left( \sigma \right) \subseteq out _{\mathcal {A}_\mathcal {S}}\left( \sigma \right) . $$

To generalize ioco to pIOTSs, we introduce two auxiliary concepts:

  1. 1.

    the prefix relation for trace distributions \(H \sqsubseteq _{k} H'\) is the analogue of trace prefixes, i.e. \(H\sqsubseteq _{k} H'\) iff \(\forall \sigma \in L^k: P_H\left( \sigma \right) =P_{H'}\left( \sigma \right) \)

  2. 2.

    for a pIOTS \(\mathcal {A}\) and a trace distribution H of length k, the output continuation of H in \(\mathcal {A}\) contains all trace distributions that are equal up to length k and assign probability 0 to every trace of length \(k+1\) ending in an input. We set

$$ outcont \left( H,\mathcal {A}\right) : = \left\{ H'\in trd \left( \mathcal {A},k+1\right) \mid H\sqsubseteq _{k}H'\wedge \forall \sigma \in L^{k}L_{I}:P_{H'}\left( \sigma \right) =0\right\} . $$

Intuitively, an implementation should conform to a specification if the probability of every trace in \(\mathcal {A}_\mathcal {I}\) that is specified in \(\mathcal {A}_\mathcal {S}\) can be matched. Just like in ioco, we neglect unspecified traces ending in input actions. However, if there is unspecified output in the implementation, there is at least one adversary that schedules positive probability to this continuation.

Definition 13

Let \(\mathcal {A}_\mathcal {I}\) and \(\mathcal {A}_\mathcal {S}\) be two pIOTSs. Furthermore let \(\mathcal {A}_\mathcal {I}\) be input-enabled, then we say \(\mathcal {A}_\mathcal {I}\sqsubseteq _{ pioco }\mathcal {A}_\mathcal {S}\) iff

$$ \forall k\in \mathbb {N}\ \forall H\in trd \left( \mathcal {A}_\mathcal {S},k\right) : outcont \left( H,\mathcal {A}_\mathcal {I}\right) \subseteq outcont \left( H,\mathcal {A}_\mathcal {S}\right) . $$

The pioco relation conservatively extends the ioco relation, i.e. both relations coincide for IOTSs.

Theorem 14

Let \(\mathcal {A}\) and \(\mathcal {B}\) be two IOTSs and \(\mathcal {A}\) be input-enabled, then

$$ \mathcal {A}\sqsubseteq _{ ioco }\mathcal {B} \Longleftrightarrow \mathcal {A} \sqsubseteq _{ pioco }\mathcal {B}. $$

The implementation is always assumed to be input-enabled. If the specification is input-enabled too, then pioco coincides with trace distribution inclusion. Moreover, our results show that pioco is transitive, just like ioco.

Theorem 15

Let \(\mathcal {A}\), \(\mathcal {B}\) and \(\mathcal {C}\) be pIOTSs and let \(\mathcal {A}\) and \(\mathcal {B}\) be input-enabled, then

  • \(\mathcal {A}\sqsubseteq _{ pioco }\mathcal {B}\) if and only if \(\mathcal {A}\sqsubseteq _{TD}\mathcal {B}\).

  • If \(\mathcal {A}\sqsubseteq _{ pioco }\mathcal {B}\) and \(\mathcal {B}\sqsubseteq _{ pioco }\mathcal {C}\), then \(\mathcal {A}\sqsubseteq _{ pioco }\mathcal {C}\).

4.2 Soundness and Completeness

Talking about soundness and completeness when referring to probabilistic systems is not a trivial topic, since one of the main inherent difficulties of statistical analysis is the possibility of false rejection or false acceptance.

The former is of interest when we refer to soundness (i.e. what is the probability that we erroneously assign \( fail \) to a correct implementation), and the latter is important when we talk about completeness (i.e. what is the probability that we assign \( pass \) to an erroneous implementation). Thus, a test suite can only fulfil these properties with a guaranteed (high) probability (cf. Definition 12).

Definition 16

Let \(\mathcal {A}_\mathcal {S}\) be a specification over an action signature \(\left( L_I,L_O\right) \), \(\alpha \in \left( 0,1\right) \) the level of significance and \(\widehat{T}\) an annotated test suite for \(\mathcal {A}_\mathcal {S}\). Then

  • \(\widehat{T}\) is sound for \(\mathcal {A}_\mathcal {S}\) with respect to \(\sqsubseteq _{ pioco }\), if for all input-enabled implementations \(\mathcal {A}_\mathcal {I}\in pIOTS \) and sufficiently large \(m\in \mathbb {N}\) it holds that

    $$\begin{aligned} \mathcal {A}_\mathcal {I}\sqsubseteq _{ pioco }\mathcal {A}_\mathcal {S}\Longrightarrow V_{\widehat{T}}^{\alpha ,m}\left( \mathcal {A}_\mathcal {I}\right) = pass . \end{aligned}$$
  • \(\widehat{T}\) is complete for \(\mathcal {A}_\mathcal {S}\) with respect to \(\sqsubseteq _{ pioco }\), if for all input-enabled implementations \(\mathcal {A}_\mathcal {I}\in pIOTS \) and sufficiently large \(m\in \mathbb {N}\) it holds that

    $$\begin{aligned} \mathcal {A}_\mathcal {I}\not \sqsubseteq _{ pioco }\mathcal {A}_\mathcal {S}\Longrightarrow V_{\widehat{T}}^{\alpha ,m}\left( \mathcal {A}_\mathcal {I}\right) = fail . \end{aligned}$$

Soundness for a given \(\alpha \in \left( 0,1\right) \) expresses that we have a \(1-\alpha \) chance that a correct system will pass the annotated suite for sufficiently large sample width m. This relates to false rejection of a correct hypothesis or correct implementation respectively.

Theorem 17

(Soundness). Each annotated test for a pIOTS \(\mathcal {A}_\mathcal {S}\) is sound for every level of significance \(\alpha \in \left( 0,1\right) \) wrt pioco.

Completeness of a test suite is inherently a theoretical result. Since we allow loops, we require a test suite of infinite size. Moreover, there is still the chance of falsely accepting an erroneous implementation. However, this probability is bounded from above by construction, and it decreases for larger sample sizes (cf. Definition 11).

Theorem 18

(Completeness). The set of all annotated test cases for a specification \(\mathcal {A}_\mathcal {S}\) is complete for every level of significance \(\alpha \in \left( 0,1\right) \) wrt pioco.

5 Experimental Validation

To apply our framework, we implemented two well-known randomized communication protocols in Java, and tested these with the MBT tool JTorX [3]. The statistical verdicts were calculated in MatLab with a level of significance \(\alpha =0.1\).

5.1 Binary Exponential Backoff

The Binary Exponential Backoff protocol is a data transmission protocol between N hosts trying to send information via one bus [19]. If two hosts send simultaneously, their messages collide and each picks a new waiting time before trying again: after i collisions, a host randomly chooses a slot in \(\{0,\ldots ,2^i-1\}\) until the message gets through.

A sample of the protocol is shown in TableĀ 1. Note that our specification of this protocol contains no non-determinism. Thus, calculations in this example are not subject to optimization to find the best trace distribution.

Table 1. A sample O of trace length \(k=5\) and depth (number of test runs) \(m=10^5\). Calculations yield \(\chi ^2=14.84<17.28=\chi ^2_{0.1}\), hence we accept the implementation.

The column n in Table 1 shows how many times each trace occurred, and \(E_\sigma \) gives the expected value. The interval \(\left[ l_{0.1},r_{0.1}\right] \) represents the 90 % confidence interval under the assumption of a normal distribution. It gives a rough idea of how much the values may deviate for the given level of confidence. However, we are interested in the multinomial deviation (i.e. less deviation of one trace allows higher deviation for another trace). For that purpose we use the \(\chi ^2\) score, given by the sum of the entries of the last column. Calculation shows \(\chi ^2=14.84<17.28 =\chi ^{2}_{0.1}\), which is the critical value for 11 degrees of freedom and \(\alpha =0.1\). Consequently, we accept the hypothesis that the probabilities are implemented correctly.
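The quoted critical value can be reproduced with a one-liner, assuming SciPy:

```python
from scipy.stats import chi2

# Critical value for 11 degrees of freedom at level of significance alpha = 0.1,
# as quoted for Table 1; the reported score 14.84 lies below it, so we accept.
print(round(chi2.ppf(0.9, 11), 2))   # 17.28
```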

5.2 IEEE 1394 FireWire Root Contention Protocol

The IEEE 1394 FireWire Root Contention Protocol [37] elects a leader between two nodes via coin flips: if heads comes up, node i picks a waiting time \( fast _i\in \left[ 0.24\,\mu s, 0.26\,\mu s\right] \); if tails comes up, it waits \( slow _i\in \left[ 0.57\,\mu s, 0.60\,\mu s\right] \). After the waiting time has elapsed, the node checks whether a message has arrived: if so, the node declares itself leader; if not, the node sends out a message itself, asking the other node to be the leader. Thus, the four outcomes of the coin flips are \(\left\{ fast _1, fast _2\right\} ,\left\{ slow _1, slow _2\right\} ,\left\{ fast _1, slow _2\right\} \) and \(\left\{ slow _1, fast _2\right\} \). The protocol contains inherent non-determinism [37]: if different times were picked, the protocol always terminates; however, if equal times were picked, it may either elect a leader or retry, depending on the resolution of the non-determinism.

Table 2. A sample O of length \(k=5\) and depth \(m=10^5\) of the FireWire root contention protocol. Calculations of \(\chi ^2\) are done after optimization in p.

Table 2 shows the recorded traces, where c1? and c2? denote coin1 and coin2 respectively. We tested five implementations: implementation Correct implements fair coins, while the mutants \(M_1\), \(M_2\), \(M_3\) and \(M_4\) were subject to probabilistic deviations giving an advantage to the second node, i.e. \(P\left( fast_1\right) =P\left( slow_2\right) =0.1\), \(P\left( fast_1\right) =P\left( slow_2\right) =0.4\), \(P\left( fast_1\right) =P\left( slow_2\right) =0.45\) and \(P\left( fast_1\right) =P\left( slow_2\right) =0.49\) for mutants 1, 2, 3 and 4 respectively. The expected value \(E^{\varvec{D}}_\sigma \) depends on resolving one non-deterministic choice by varying p (which coin was flipped first). Note that the remaining non-determinism was not subject to optimization, but was immediately clear from the trace frequencies. The calculated \(\chi ^2\) scores are based on an optimized value of p for each sample and compared to the critical value \(\chi ^2_{0.1}=17.28\), resulting in the verdicts shown.

6 Conclusions and Future Work

We defined a sound and complete framework to test probabilistic systems, defined a conformance relation in the ioco tradition called pioco, and showed how to derive probabilistic tests from a requirements model. Verdicts that handle the functional and statistical behaviour are assigned after a test is applied. We showed that the correct verdict can be assigned up to arbitrary precision by setting a level of significance and a sufficiently large sample size.

Future work should focus on the practical aspects of our theory: tool support, larger case studies and more powerful statistical methods to increase efficiency.