1 Introduction

Statistical model checking (SMC) [27] is increasingly seen as a powerful alternative to numerical model checking. This is witnessed by two main developments. The first is the implementation of SMC techniques in classical model checking tools such as UPPAAL [8], PRISM [25] and MRMC [24]. The second is that several libraries explicitly dedicated to SMC techniques have recently been developed, e.g., COSMOS [7] and PLASMA [22]. The main reason behind this increase in popularity is the fact that SMC can in many cases avoid problems that have long plagued numerical model checking. These include the state space explosion problem (the memory requirements of SMC only depend on the high-level description of the model) and the fact that numerical techniques that deal with more complicated models—e.g., Markov reward models or probabilistic timed automata with uniformly distributed transition times—quickly become computationally (i.e., numerically) infeasible.

The core idea underlying statistical model checking is to use a computer program to repeatedly simulate the behaviour of the system model in order to say something about the system’s performance in terms of a given performance measure. Throughout this paper this will be some probability of interest \(p\). The exact way in which these simulation runs are then interpreted depends on the interests of the investigator. First of all, she could be interested in a quantitative statement, consisting of an estimate of the performance measure with a corresponding confidence interval (e.g., with 95 % confidence, the probability of deadlock before termination is 10 % with a 2 % margin of error). Secondly, she could be interested in a qualitative statement about a performance property, specified as a hypothesis that asserts that the true probability \(p\) is larger (or smaller) than some boundary value \(p_0\) (e.g., with 95 % confidence, the probability of deadlock before termination is greater than 5 %).

The two approaches are closely related. Given a procedure to construct confidence intervals, one obtains a hypothesis test in the following way: construct the confidence interval, then check whether the boundary value \(p_0\) is inside the interval. If not, accept or reject the assertion \(p>p_0\) depending on whether \(p_0\) is to the ‘left’ or ‘right’ of the interval. Despite this relationship, procedures for constructing confidence intervals are sometimes implemented completely in parallel to procedures exclusively focused on hypothesis testing.

In this paper we present a general framework for hypothesis testing that allows for a clear and intuitive comparison of both ‘pure’ hypothesis tests and tests based on confidence intervals. We use the framework to describe, classify and parametrise the tests so far implemented in model checking tools, and introduce two tests that are new to the field of SMC, whose output guarantees are fundamentally different from those of the other tests. We also compare the tests empirically in a comprehensive case study. To help the reader get a feeling for the difference between tests and the influence of parameters, we have built a companion website to this paper, where investigators can interactively modify parameters; see [1].

The structure of this paper is as follows: We present a single framework that allows comparison of the hypothesis tests discussed in this paper in Sect. 2, and discuss the main criteria by which to judge a test. In Sect. 3, we present an overview of these tests using the framework of Sect. 2. In Sect. 4, we discuss how these tests must be parametrised to ensure that the output guarantees are satisfied. We compare the performance of all these tests empirically in Sect. 5. Section 6 concludes the paper.

2 General framework

In this section we discuss the framework that we use to compare the tests described in Sect. 3. We start in Sect. 2.1 with a discussion of the model setting; we focus particularly on the generality of the framework. Section 2.2 begins with a summary of elementary statistical methodology in order to fix terminology and notation; we then move on to discussing the features specific to this paper. Having discussed the framework for hypothesis tests, we focus on criteria for comparing hypothesis tests in Sect. 2.3, and introduce a classification of tests.

2.1 Model setting

As mentioned in the introduction, we are interested in comparing the probability \(p\) to a given boundary value \(p_0\). Typically, \(p\) denotes the probability that a formula expressed in a temporal logic is satisfied. With \(\phi \) denoting the temporal logic formula, the performance property that we seek to evaluate is then often expressed as \({\mathcal {P}}_{>p_0} (\phi )\). Formally, this performance property holds in a state if, from that state, the probability that an execution path (generated randomly using the system specification) satisfies the property \(\phi \) is greater than \(p_0\). The only requirement that our tests impose on the system model is that we can randomly generate execution paths in order to obtain information about whether or not \({\mathcal {P}}_{>p_0} (\phi )\) is satisfied. This translates into the following requirements:

  1. we can generate execution paths from the model, according to a well-defined probability measure on the execution paths;

  2. with probability \(1\), these paths are generated in a finite amount of time and we can test, also in a finite amount of time, whether the property \(\phi \) holds on a path; and

  3. we either do not encounter nondeterminism [6], or we have a well-defined policy or scheduler to resolve it.

In principle, we do not need any additional information about the system model as long as we can obtain execution paths that satisfy these three requirements. A system about which additional information is not available is commonly called a black-box system [34].

In practice, a system model is often available that allows us to write a computer program that can generate execution paths, which means that the system is not completely black-box. Popular modelling formalisms include generalized semi-Markov processes (GSMPs, [14, 28]) and stochastic (possibly non-Markovian) Petri nets [16]. Requirements 2 and 3 will not be satisfied in all GSMPs or stochastic Petri nets. For example, if the property \(\phi \) does not involve a time bound then requirement 2 may be violated, e.g., when the system reaches a bottom strongly connected component that does not contain termination states. Also requirement 3 may be violated in a GSMP when two transitions are scheduled to occur at the same time, e.g., when some of the transitions have deterministic delays. However, even in such cases it might still be possible to apply a refined form of statistical model checking (see, for example [12, 35], or [40], in which requirement 2 is not satisfied). Judging whether requirements 2 and 3 are satisfied given a system model and performance property is a field of study in itself. As this paper is not about generating sample paths but about the interpretation of the results, we refer the interested reader to the vast literature on stochastic simulation [13, 31], and from now on assume that we draw samples from a black-box system in order to say something about \({\mathcal {P}}_{>p_0}(\phi )\).

In this paper, we will not consider nested probabilistic operators. To read about how nested operators are treated in other settings, see, e.g., Sect. 3.2 of [35] or [41], in which a combined numerical/statistical procedure is used.

2.2 Statistical framework

With \(i=1,\ldots ,N\), let \(\omega _i\) be the execution path in the \(i\)th sample, and define:

$$\begin{aligned} X_i \triangleq \mathbf{1}_{\phi } (\omega _i) = \left\{ \begin{array}{r@{\quad }l} 1 &{} \text { if } \phi \text { holds on } \omega _i, \\ 0 &{} \text { otherwise.} \end{array} \right. \end{aligned}$$
(1)

Then \(X_i\) has a Bernoulli distribution with parameter \(p\), where \(p\) denotes the true probability that \(\phi \) is satisfied. This means that,

$$\begin{aligned} \mathbb {P}(X_i=x) = \left\{ \begin{array}{l@{\quad }l} p&{} \text { if } x = 1, \\ 1-p&{} \text { if } x = 0 . \end{array} \right. \end{aligned}$$

The total sample \(\varvec{X} \triangleq (X_i)_{i=1,\ldots ,N}\) will be used to perform a statistical test. To do this, we combine all relevant information from the individual sample paths into a function from \(\{0,1\}^N\) to \(\mathbb {R}\), called the test statistic. We use the test statistic to falsify claims about \(p\), called hypotheses. If we can show that, under the condition that some hypothesis \(H\) is true, the probability of observing the actual outcome of the test statistic is smaller than some given \(\alpha \in (0,\frac{1}{2})\), then we reject \(H\). The parameter \(\alpha \) is called the significance parameter, and \(1-\alpha \) is called the confidence of the test. A hypothesis that can be rejected this way is called a null hypothesis, while a hypothesis that can be accepted through the rejection of a null hypothesis is called an alternative hypothesis. Rejecting a valid null hypothesis is called an error of the first kind (or a false positive). Not accepting a valid alternative hypothesis is called an error of the second kind (or a false negative).

Since we are interested in checking whether \({\mathcal {P}}_{>p_0}(\phi )\) holds, there are two relevant claims: \(p> p_0\) and \(p\le p_0\). There is no clear distinction between a null and alternative hypothesis, as there is no asymmetry in our desire to reject any one of the two claims. Accordingly, we specify two alternative hypotheses, each of which we would like to accept if it were true:

$$\begin{aligned}&\begin{aligned}&H_{+1} : p> p_0,\\&H_{-1} : p< p_0. \end{aligned} \end{aligned}$$
(2)

Additionally, we have the null hypothesis:

$$\begin{aligned} H_{0} : p= p_0. \end{aligned}$$

Note that the null hypothesis cannot be shown to be correct, as its negation \(p\ne p_0\) cannot be disproved statistically. The reason is that no matter how many samples we draw and no matter how much evidence we see for \(p=p_0\), there will always be some small \(\epsilon \) such that we cannot reject the claim that \(p= p_0 + \epsilon \). However, \(H_0\) can be shown to be incorrect.

The procedure to test which of the alternative hypotheses is true is as follows: after having drawn \(N\) samples, we let \(S_N(\varvec{X})\) be the test statistic given by the sum of \(X_1\) up to \(X_N\), i.e.,

$$\begin{aligned} S_N(\varvec{X}) = \sum _{i=1}^N X_i, \end{aligned}$$

and omit the argument \(\varvec{X}\) for brevity. We can then view the evolution of \(S_N\) as the evolution of a discrete-time Markov chain on state space \({\mathbb {N}}^{2}\), with the number of drawn samples on the \(x\)-axis and the value of the test statistic on the \(y\)-axis, where in each step we take a jump to the right or top-right (as can be seen in Fig. 1a).

Fig. 1 Markov chain representations of the processes \(S_N\) and \(Z_N\). In each state, both processes jump up with probability \(p\) and down with probability \(1-p\). The two processes have the same structure; the only difference is a normalisation of the variable on the vertical axis

While we are drawing samples, the expected behaviour of the process \(S_N\) is that it drifts away from the \(x\)-axis. The true parameter \(p\) determines the speed of this drift. Remember that our main interest is to test whether \(p-p_0\) is positive or negative. Hence, we focus on the shifted test statistic,

$$\begin{aligned} Z_N \triangleq S_N - N p_0. \end{aligned}$$

The process \(Z_N\) is essentially a random walk that always jumps up by \(1-p_0\) with probability \(p\), or down by \(p_0\) with probability \(1-p\). Its evolution is depicted in Fig. 1b. The speed at which \(Z_N\) drifts away from the \(x\)-axis is completely determined by \(p-p_0\). If \(Z_N \gg 0\) then this is strong evidence for \(H_{+1}\), while if \(Z_N \ll 0\) then this is strong evidence for \(H_{-1}\).
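
To make the random-walk view concrete, the following sketch simulates a trajectory of \(Z_N\). It is an illustration only: the Bernoulli draw with the stand-in parameter `p_true` replaces what would, in reality, be the simulation of an execution path and the check whether \(\phi \) holds on it.

```python
import random

def simulate_z(p_true, p0, n_steps, seed=0):
    """Simulate the shifted test statistic Z_N = S_N - N * p0.

    `p_true` is a stand-in for the unknown probability that a sampled
    execution path satisfies phi; a real implementation would run the
    model simulator and check phi on the generated path instead.
    """
    rng = random.Random(seed)
    z, trajectory = 0.0, []
    for _ in range(n_steps):
        x = 1 if rng.random() < p_true else 0  # X_i ~ Bernoulli(p)
        z += x - p0                            # up by 1 - p0, or down by p0
        trajectory.append(z)
    return trajectory

# With p > p0 the walk drifts upwards at expected rate p - p0 per sample.
print(simulate_z(p_true=0.55, p0=0.5, n_steps=10))
```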

We then specify four test decision areas which are subsets of \(\mathbb {R}^2\). Three of them are called critical, which means that we draw a conclusion as soon as they are entered by \(Z_N\). The first critical area \(\mathcal {U}\) is the area such that as soon as \(Z_N\) enters \(\mathcal {U}\), we accept \(H_{+1}\). The second critical area \(\mathcal {L}\) does the same for \(H_{-1}\). As soon as \(Z_N\) enters the critical area \(\mathcal {I}\), we stop the test without accepting any hypothesis. We accordingly say that the test was inconclusive. All that is outside these three areas makes up the non-critical area \(\mathcal {NC}\).

The tests that we consider in this paper are completely determined by the shape of these areas. There are two main types of tests: fixed sample size tests, where the decision is taken after an a-priori determined number of samples, and sequential tests, where the decision whether or not to continue sampling is made on the basis of the samples so far; mixtures of both types are also possible. Typical examples of both types are illustrated in Fig. 2. The relevant parts of these figures are the boundaries between the area \(\mathcal {NC}\) and the three other areas. After all, we only continue the testing procedure while we are in \(\mathcal {NC}\); we stop when a relevant boundary is crossed. For example, the exact shape of the border between \(\mathcal {U}\) and \(\mathcal {I}\) in Fig. 2a is irrelevant because we stop when we enter either.

Fig. 2 Graphical representation of the test decision areas \(\mathcal {L}\), \(\mathcal {U}\), \(\mathcal {I}\) and \(\mathcal {NC}\). Left: a typical fixed sample size test. Right: a typical sequential test. Grey areas represent areas in which \(Z_N\) cannot go

In a fixed sample size test, as illustrated in Fig. 2a, the border between \(\mathcal {NC}\) and the other areas is a single vertical line, because the point at which we stop is always at the same value of \(N\) on the \(x\)-axis. In a sequential test like the one in Fig. 2b, the key feature is that \(\mathcal {NC}\) continues indefinitely, since we keep sampling until we draw a conclusion. Note that this also implies that typical sequential tests in principle do not have an area \(\mathcal {I}\). Hence, the structure of the sequential tests is entirely determined by two borders: the \(\mathcal {L}\)-\(\mathcal {NC}\) boundary, denoted by \(l(N)\), and the \(\mathcal {U}\)-\(\mathcal {NC}\) boundary, denoted by \(u(N)\). Most of the discussion of the sequential tests will therefore focus on the shape of these functions. For fixed sample size tests, on the other hand, we merely need to determine two numbers, \(u^*\) and \(l^*\), which depend on the chosen sample size, but are not functions of the number of samples \(N\) drawn so far.

Given \(\mathcal {L}\), \(\mathcal {U}\) and \(\mathcal {I}\), we want to bound the probability that these areas are entered given that a hypothesis is valid. To formalise this, for \(i\in \{-1,+1\},\) let \(A_{i}\) be the event that we reject \(H_0\) in favour of \(H_i\), and let \(A_0\) be the event that we do not reject \(H_0\), meaning that the test remains inconclusive. More specifically we have,

$$\begin{aligned} A_{+1}&= \{\text {reach }\, \mathcal {U}\,\text {before } \,\mathcal {L}\,\text {or } \,\mathcal {I}\}, \\ A_{-1}&= \{\text {reach }\, \mathcal {L}\,\text {before } \,\mathcal {U}\,\text {or } \,\mathcal {I}\}, \\ A_{0}&= \{\text {reach }\, \mathcal {I}\,\text {or }\, \text {stay}\, \text {in } \,\mathcal {NC}\}, \\ \lnot A_{+1}&= A_{-1} \cup A_0,\\ \lnot A_{-1}&= A_{+1} \cup A_0. \end{aligned}$$

Then we typically impose the following two conditions on the two errors of the first kind (‘false positives’):

$$\begin{aligned}&\mathbb {P}(A_{+1} \text { } | \text { } \lnot H_{+1}) \le \alpha _1, \end{aligned}$$
(3)
$$\begin{aligned}&\mathbb {P}(A_{-1} \text { } | \text { } \lnot H_{-1}) \le \alpha _2. \end{aligned}$$
(4)

These probabilities deal with drawing a wrong conclusion. We will usually bound these probabilities by replacing the condition \(\lnot H_{+1}\) (or \(\lnot H_{-1}\)) by the worst case, which is \(H_0\). A more detailed explanation of this is given in Sect. 3.1.

We would also like to impose conditions on the two errors of the second kind (‘false negatives’):

$$\begin{aligned}&\mathbb {P}(\lnot A_{+1} \text { } | \text { } H_{+1}) \le \beta _1, \end{aligned}$$
(5)
$$\begin{aligned}&\mathbb {P}(\lnot A_{-1} \text { } | \text { } H_{-1}) \le \beta _2. \end{aligned}$$
(6)

These probabilities deal with drawing no (or a wrong) conclusion. For tests that always draw a conclusion (i.e., where \(A_0\) never happens), these probabilities coincide with the ones in (3) and (4), assuming that \(H_0\) is never exactly true. For tests that may end inconclusively, the probabilities in (5) and (6) are usually only slightly larger than the probability of \(A_0\) (given \(H_{+1}\) or \(H_{-1}\), respectively) since, e.g., under \(H_{+1}\) the event \(A_{-1}\) is much less likely than \(A_0\). This is the reason why, instead of (5) and (6), we will often use the power, which is a function of the real value of \(p\), and is defined as:

$$\begin{aligned} \mathbb {P}(A_{-1} \cup A_{+1}) = 1-\mathbb {P}(A_0). \end{aligned}$$

Throughout this paper, we will choose \(\beta = \beta _1 = \beta _2\) for simplicity. In principle, the total probability of error of the first kind is \(\alpha _1 + \alpha _2\), since if \(H_0\) is true, accepting either \(H_{+1}\) or \(H_{-1}\) constitutes an error. But if \(H_{+1}\) or \(H_{-1}\) is true, the probability of error of the first kind is only \(\alpha _2\) or \(\alpha _1\), respectively. We argue that we should focus on the latter case. This is not to say that \(H_0\) cannot hold in practice. However, if it does, then statistical model checking cannot be used to show it holds, as argued earlier. Thus, an investigator who wants to know whether \(H_0\) is true should use a different model checking technique. Furthermore, an investigator who does not care about \(H_0\) probably does not mind either \(H_{+1}\) or \(H_{-1}\) being accepted in that case. We will therefore also choose \(\alpha = \alpha _1=\alpha _2\) throughout. An investigator who does care about \(H_0\) could replace \(\alpha \) by \(\alpha /2\) in all tests at the cost of increasing the computational effort.
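
All sequential tests discussed in Sect. 3 fit a single driver in which only the boundary functions \(l(N)\) and \(u(N)\) differ. Below is a minimal sketch of such a driver; the function names are ours, and the sampling function is assumed to return \(1\) exactly when \(\phi \) holds on a freshly simulated path.

```python
import random

def run_sequential_test(sample, p0, lower, upper, max_n=10**6):
    """Draw samples until Z_N leaves the non-critical area NC.

    `lower(n)` and `upper(n)` are the L-NC and U-NC boundaries l(N), u(N).
    Returns +1 (accept H_{+1}), -1 (accept H_{-1}), or 0 if the cap max_n
    is hit (for a true sequential test NC extends indefinitely; the cap
    only keeps this sketch terminating).
    """
    z = 0.0
    for n in range(1, max_n + 1):
        z += sample() - p0
        if z >= upper(n):
            return +1
        if z <= lower(n):
            return -1
    return 0

# Toy usage with constant boundaries and a Bernoulli(0.6) stand-in sampler:
rng = random.Random(1)
print(run_sequential_test(lambda: rng.random() < 0.6, 0.5,
                          lower=lambda n: -20.0, upper=lambda n: 20.0))
```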

2.3 Main criteria and classification of tests

Given a selection of tests specified using the framework of Sect. 2.2, it is up to the investigator to decide which test she finds the most appealing. We use three main criteria by which to judge the appeal of these tests:

  1. the correctness: we call a test correct if its probability of not drawing the correct conclusion is guaranteed to be smaller than \(\alpha \), where \(1-\alpha \) is the confidence level; mathematically, this means (3) and (4) hold;

  2. the power: recall the definition of power from Sect. 2.2 as the probability that the test will eventually draw a conclusion, i.e., \(1-\mathbb {P}(A_0)\);

  3. the efficiency: the number of samples needed (in expectation) before a conclusion can be drawn.

As these three criteria are partly contradictory, each test will be affected adversely on at least one criterion when \(p\) is close to \(p_0\). We introduce three classes of tests, based on which criterion is affected (most):

  I. Tests whose probability of drawing a wrong conclusion exceeds \(\alpha \) when \(|p-p_0|\) is small.

  II. Tests whose probability of drawing no conclusion (or a wrong conclusion) exceeds \(\beta \) when \(|p-p_0|\) is small.

  III. Tests that are always correct and always draw a conclusion, at the cost of drawing an extremely large number of samples before reaching a conclusion when \(|p-p_0|\) is small.

Note that this classification in itself is independent of the type of test as described in the previous subsection, namely fixed sample size or sequential. However, it is worth mentioning at this point that a fixed sample size test that satisfies criterion 1 can never also satisfy criterion 2, at least not for all possible values of \(p\) close to \(p_0\). Such a test will therefore always be in class II. In other words, tests in class III, which satisfy both criteria 1 and 2, are necessarily sequential tests.

For each class, we introduce an extra input parameter, which influences how poor the performance will be when \(|p-p_0|\) is small. For classes I and II, the extra parameter is a threshold on \(|p-p_0|\), below which the investigator no longer cares about the test’s correctness or the power, respectively. We call these parameters the correctness-indifference level \(\delta \) for class-I tests, and the power-indifference level \(\zeta \) for class II. Class-III tests do not need such a threshold parameter, since their correctness and power do not suffer when \(|p-p_0|\) is small; however, they may use a guess called \(\gamma \), representing the investigator’s expectation of \(|p-p_0|\), to minimise the runtime for that case.

We emphasise that, although the three parameters \(\delta \), \(\zeta \), and \(\gamma \) are all related to the difference between \(p\) and \(p_0\), their meaning is different. The choice of \(\delta \) or \(\zeta \) (in class I/II tests) depends on the interest of the investigator (namely, in what case she no longer cares either about the correctness or the probability of receiving a meaningful answer), while the choice of \(\gamma \) depends on her expectation of the true \(p\), and only influences the running time, but never the correctness or power.

All of the above is summarised in Table 1, which also shows the tests we will consider in Sect. 3, including their classes and types.

Table 1 Overview of test classes

3 Overview of the tests

In this section, we discuss several hypothesis tests that an investigator can choose to use. In particular, we focus on how they fit into the framework of Sect. 2.2. For a quick overview we refer to Table 1 (a more detailed overview follows later in Table 5). How these tests can be expressed in terms of the parameters of Sect. 2.3 is the subject of Sect. 4. The first five tests (which belong to classes I and II) have been implemented in existing model checking tools, or are described in the model checking literature, while the others (belonging to class III), to the best of our knowledge, are not.

The outline of this section is as follows: Starting with class II tests, we begin in Sect. 3.1 with a discussion of a hypothesis testing procedure that uses a confidence interval based on the Gaussian approximation and a sample size that is fixed beforehand. In Sect. 3.2 we focus on a similar method based on the Chernoff–Hoeffding bound. In Sect. 3.3, we discuss the Chow–Robbins test, which is based on confidence intervals that are sequential in the sense that we continue sampling until the width of the confidence interval has reached a given value. Turning to class I tests, we discuss the sequential probability ratio test (SPRT) in Sect. 3.4, followed by its ‘fixed sample size variant’, the Gauss-SSP test, in Sect. 3.5. In Sects. 3.6 and 3.7 we discuss the two tests in class III, namely the Azuma test and the Darling–Robbins test, respectively. These two tests have not been implemented in model checking tools so far. Finally, in Sect. 3.8 we briefly discuss some noteworthy tests that have been proposed but (to the best of our knowledge) never implemented.

3.1 Binomial and Gaussian confidence intervals

The idea behind the test described in this section is the confidence interval based on an a priori fixed sample size \(N\). Formally, a \((1-\alpha )\)-confidence interval is an interval \([l,u]\) that is constructed using a procedure that, with probability \(1-\alpha \), produces intervals containing the true probability \(p\). As we argued in the introduction, a confidence interval can be used for a hypothesis test by checking if \(p_0\) is inside the interval.

The critical regions for this test have the form displayed in Fig. 2a. Since the number of samples drawn is fixed to be \(N\), the non-critical region \(\mathcal {NC}\) consists of all points \((n,z)\) for which \(n < N\). The other regions can be characterised by two values, namely \(l^*\), which is the border between \(\mathcal {L}\) and \(\mathcal {I}\), and \(u^*\) which is the border between \(\mathcal {I}\) and \(\mathcal {U}\). According to (3), we must choose \(u^*\) such that when \(H_{0}\) or \(H_{-1}\) is true, the probability that \(Z_N>u^*\) is smaller than \(\alpha \). As we already mentioned in Sect. 2.2, it is sufficient to check this under the worst case assumption, i.e., whether

$$\begin{aligned} \mathbb {P}(Z_N>u^*|H_0)<\alpha \end{aligned}$$
(7)

holds. The reason is that under \(H_{-1}\) (i.e., for any true \(p\le p_0\)), high values of \(Z_N\) are even less likely than under \(H_0\) (when \(p=p_0\)), so that \(\mathbb {P}(Z_N>u^*|\lnot H_{+1}) \,\,\le \,\, \mathbb {P}(Z_N>u^*|H_0)\). Hence (7) implies (3). Analogously, we base \(l^*\) only on \(H_0\) and not on \(H_{+1}\).

If \(N\) is large enough, we can use the central limit theorem (CLT) to argue that the distribution of \(Z_N\) is well approximated by a normal distribution. Let \(\Phi \) be the standard normal cumulative distribution function and \(\text {Var}(Z_N) = \text {Var}(Z_N|H_0) = N p_0 (1-p_0)\); it then follows from basic statistical analysis (see [29] for details) that

$$\begin{aligned}&l^* = \Phi ^{-1}\left( \alpha \right) \sqrt{\text {Var}(Z_N)},\end{aligned}$$
(8)
$$\begin{aligned}&u^* = \Phi ^{-1}\left( 1-\alpha \right) \sqrt{\text {Var}(Z_N)} = -l^*. \end{aligned}$$
(9)

Note that the procedure above is not exactly the same as constructing a confidence interval and checking whether \(p_0\) is inside the interval. The one difference is that under \(H_0\), we can assume that the variance of both \(S_N\) and \(Z_N\) is given by \(N p_0 (1-p_0)\), while for a regular confidence interval this would be estimated using the realisation of \(S_N\), i.e., \(\text {Var}(Z_N) = S_N (1-S_N / N)\). This difference is only noticeable when \(|p-p_0|\) is large.

In this paper we call the test described above the ‘Gauss-CI’ test because of its relationship with the Gaussian confidence interval obtained using the CLT. Alternatively, confidence intervals can be based on the exact binomial distribution; they are called ‘Clopper–Pearson’ intervals in the scientific literature. A third alternative exists in the form of the ‘Agresti–Coull’ confidence intervals, which are between the binomial and Gaussian confidence intervals in terms of the degree of approximation—such intervals have been implemented in the tool MRMC. Hypothesis tests can also be based on such confidence intervals, but since the difference with Gaussian intervals is only noticeable at very small \(N\), we will not separately consider such tests in this paper.
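
As an illustration, the decision rule of the Gauss-CI test can be sketched as follows, using Python’s `statistics.NormalDist` for \(\Phi ^{-1}\); the Bernoulli sampler is again a stand-in for the model simulator, and the function name is ours.

```python
import random
from statistics import NormalDist

def gauss_ci_test(samples, p0, alpha):
    """Fixed sample size Gauss-CI test on the observations X_1, ..., X_N."""
    n = len(samples)
    z = sum(samples) - n * p0                      # Z_N = S_N - N * p0
    u_star = NormalDist().inv_cdf(1 - alpha) * (n * p0 * (1 - p0)) ** 0.5  # (9)
    if z > u_star:
        return +1        # accept H_{+1}: p > p0
    if z < -u_star:      # l* = -u*, cf. (8)
        return -1        # accept H_{-1}: p < p0
    return 0             # inconclusive

rng = random.Random(2)
xs = [1 if rng.random() < 0.55 else 0 for _ in range(2000)]  # stand-in sampler
print(gauss_ci_test(xs, p0=0.5, alpha=0.05))
```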

The choice of \(N\) is non-trivial. It impacts both the efficiency (obviously) and the power. In Sect. 4.1, we demonstrate how to determine \(N\) such that for a given power-indifference level \(\zeta \) the power of the test is guaranteed to be at least \(1-\beta \).

3.2 Confidence intervals using the Chernoff–Hoeffding bound

The test described in this section is a fixed sample size test based on a different type of confidence interval. Its basis is the Chernoff–Hoeffding bound [20], which states the following: for any sequence \(X_1,X_2,\ldots ,X_N\) of independent random variables with \(\mathbb {P}(0 \le X_i \le 1)=1\), it holds for all \(t>0\) that,

$$\begin{aligned} \mathbb {P}(|\bar{X} - {\mathbb {E}}(\bar{X})| > t) \le 2e^{-2Nt^2}, \end{aligned}$$
(10)

where \(\bar{X} = \frac{1}{N}\sum _{i=1}^N X_i\). A test that is analogous to the Gauss-CI test of Sect. 3.1 is then as follows. The investigator chooses a significance parameter \(\alpha \) and a so-called ‘approximation parameter’ \(\epsilon \). She then draws

$$\begin{aligned} N = \frac{1}{2\epsilon ^2} {\log \left( \textstyle \frac{2}{\alpha }\right) } \end{aligned}$$
(11)

samples. We can then rewrite (10) to

$$\begin{aligned} \mathbb {P}(|\bar{X}-p_0|\ge \epsilon ) \le \alpha . \end{aligned}$$
(12)

The test is then as follows: we draw \(N\) samples and check if \(|\bar{X}-p_0|>\epsilon \). If so, we reject the null hypothesis, otherwise the test is inconclusive. If we reject the null hypothesis, we accept \(H_{+1}\) if \(\bar{X}>p_0\) and we accept \(H_{-1}\) otherwise. The test satisfies (3) and (4) because under the null hypothesis \({\mathbb {E}}(\bar{X}) = p_0\), so that (12) is really an upper bound for the probability of rejecting the null hypothesis when it is valid. A test of this form is implemented in the tool PRISM. Since we assume that \(H_0\) does not hold, as we argued in Sect. 2.2, we replace \(\frac{2}{\alpha }\) in (11) by \(\frac{1}{\alpha }\) when we compute \(N\) for the tables in Sect. 5.
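
A sketch of this procedure follows, with the replacement of \(\frac{2}{\alpha }\) by \(\frac{1}{\alpha }\) left as an option; the sampling function is again a Bernoulli stand-in for the model simulator, and the helper name is ours.

```python
import math, random

def chernoff_ci_test(sample, p0, alpha, eps, assume_h0_false=True):
    """Chernoff-CI test: fixed N from (11), then decide on |Xbar - p0| > eps."""
    factor = 1.0 if assume_h0_false else 2.0   # 1/alpha instead of 2/alpha
    n = math.ceil(math.log(factor / alpha) / (2 * eps ** 2))
    xbar = sum(sample() for _ in range(n)) / n
    if abs(xbar - p0) <= eps:
        return 0                               # inconclusive
    return +1 if xbar > p0 else -1

rng = random.Random(3)
print(chernoff_ci_test(lambda: rng.random() < 0.6, p0=0.5, alpha=0.05, eps=0.025))
```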

As with the Gauss-CI test, the shape of the critical regions is as displayed in Fig. 2a. Apart from \(\alpha \) and \(p_0\), the main parameter that determines the location of the critical region boundaries is \(\epsilon \). Since \(\epsilon \) does not have a clear interpretation in terms of the output guarantees described in Sect. 2.3, we discuss in Sect. 4.2 how to calculate it from a power-indifference level \(\zeta \) instead. As with the Gauss-CI test, \(\zeta \) will turn out to impact both the power and efficiency of the test.

3.3 Chow–Robbins test

The test described in this section is similar to the test described in Sect. 3.1, but the difference is that we continue drawing samples until the width of the confidence interval for \(\hat{p}_N = S_N / N\) has reached some given value, denoted by \(2\epsilon \), at confidence level \(1-2\alpha \). Then \(H_{+1}\) can be accepted if this confidence interval is entirely above \(p_0\), \(H_{-1}\) if it is entirely below \(p_0\), and the test is inconclusive otherwise.

After having drawn \(N\) samples, the width of this confidence interval (at confidence level \(1-2\alpha \)) equals \(2\Phi ^{-1}\left( 1-\alpha \right) \sqrt{\text {Var}(\hat{p}_N)}\), where \(\text {Var}(\hat{p}_N) = \hat{p}_N (1-\hat{p}_N)/N\). This width is maximal when \(\hat{p}_N = \frac{1}{2}\) and smaller when \(\hat{p}_N\) is closer to \(0\) or \(1\). Hence, this test can reach a conclusion more quickly than the Gauss-CI test when \(p\) is further away from \(\frac{1}{2}\) than \(p_0\), and takes longer otherwise. We call this test the ‘Chow–Robbins test’ after the authors of [10], who showed that a confidence interval created this way asymptotically satisfies the requirements on the errors of the first kind.
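
A sketch of the sequential procedure just described; the `min_n` guard (ours, not part of the original test) merely prevents the sketch from stopping on a degenerate early estimate \(\hat{p}_N \in \{0,1\}\).

```python
import random
from statistics import NormalDist

def chow_robbins_test(sample, p0, alpha, eps, min_n=50):
    """Sample until the CI for p has half-width <= eps, then compare with p0."""
    xi = NormalDist().inv_cdf(1 - alpha)
    s = n = 0
    while True:
        s += sample()
        n += 1
        p_hat = s / n
        half_width = xi * (p_hat * (1 - p_hat) / n) ** 0.5
        if n >= min_n and half_width <= eps:
            break
    if p_hat - half_width > p0:
        return +1
    if p_hat + half_width < p0:
        return -1
    return 0    # p0 lies inside the interval: inconclusive

rng = random.Random(4)
print(chow_robbins_test(lambda: rng.random() < 0.6, p0=0.5, alpha=0.05, eps=0.02))
```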

The critical areas of this test do not look like those depicted in Fig. 2. The test sits between a fixed sample size test and a sequential test: even though the sample size is clearly not fixed, it is upper bounded, since there is a maximal \(N\) at which the confidence interval reaches the specified width even if the variance of \(\hat{p}_N\) is maximal (i.e., when \(\hat{p}_N = \frac{1}{2}\)). The exact shape of the critical regions is discussed further in Sect. 5.1.

What is left is choosing the half-width of the confidence interval, denoted by \(\epsilon \) because of analogy with \(\epsilon \) in the Chernoff-CI test of Sect. 3.2. The parameter \(\epsilon \) impacts both the power and the efficiency. In Sect. 4.3, we show how to choose \(\epsilon \) based on the power-indifference level \(\zeta \).

3.4 Sequential probability ratio test

The sequential probability ratio test (SPRT) for statistical model checking was introduced by Younes in [42], based on ideas that go back to [37]. In [37], Wald considers the problem of sequentially testing which of the following two hypotheses is true,

$$\begin{aligned}&\begin{aligned}&H_{+1} : p\ge p_{+1}, \\&H_{-1} : p\le p_{-1} \end{aligned} \end{aligned}$$
(13)

for values \(p_{-1} < p_{+1}\). He argues that a suitable test statistic is the likelihood ratio of the two hypotheses:

$$\begin{aligned} T_N \triangleq \frac{p_{+1}^{S_N} (1-p_{+1})^{N-{S_N}}}{p_{-1}^{S_N} (1-p_{-1})^{N-{S_N}}}. \end{aligned}$$

Clearly, small values of \(T_N\) speak in favour of \(H_{-1}\) while large values speak for \(H_{+1}\). The idea is then to construct boundaries \(l'\) and \(u'\) such that when \(T_N\) crosses either of these boundaries we accept the corresponding hypothesis. We then have to bound, for given boundaries \(l' < u'\), the probability of crossing \(l'\) given \(H_{+1}\) and the probability of crossing \(u'\) given \(H_{-1}\). Wald showed how to achieve such a bound. In particular, for \(l' = \alpha _2/(1-\alpha _1)\) and \(u' = (1-\alpha _2)/\alpha _1\) one knows that the probability of accepting \(H_{-1}\) while \(H_{+1}\) is true is smaller than \(\alpha _2\), while the probability of accepting \(H_{+1}\) while \(H_{-1}\) is true is smaller than \(\alpha _1\).

To evaluate the validity of \(\mathcal {P}_{>p_0}(\phi )\), we have the hypotheses of (2), which are similar to those of (13) with \(p_{+1} = p_{-1} = p_0\). Unfortunately, in this case the value \(T_N\) is always \(1\). The idea proposed in [42] is to choose an indifference level \(\delta \) such that we can safely assume that the true value for \(p\) is not inside the interval \([p_0-\delta ,p_0+\delta ]\). Then we can set \(p_{-1} = p_0-\delta \) and \(p_{+1} = p_0+\delta \) and carry out the above procedure. To be precise, the hypotheses in this setting are given by:

$$\begin{aligned}&\begin{aligned}&H'_{+1} : p> p_0+\delta , \text { and }\\&H'_{-1} : p< p_0-\delta . \end{aligned} \end{aligned}$$
(14)

To see how this test fits into the framework of Sect. 2.2, first note that instead of the test statistic \(T_N\) we could also use:

$$\begin{aligned} \log T_N \triangleq q_1 S_N + q_2 N{,} \end{aligned}$$

where

$$\begin{aligned} q_1 = \log \left( \frac{p_{+1} \cdot (1-p_{-1})}{(1-p_{+1}) \cdot p_{-1}} \right) , \,\, q_2 = \log \left( \frac{1-p_{+1}}{1-p_{-1}} \right) . \end{aligned}$$

Hence, an equivalent formulation is to use the process \(Z_N = S_N - N p_0\) of Fig. 2 as a test statistic, with boundaries:

$$\begin{aligned} l(N)&= \displaystyle \frac{1}{q_1}(\log l'-q_2 N) - N p_0 , \end{aligned}$$
(15)
$$\begin{aligned} u(N)&= \displaystyle \frac{1}{q_1}(\log u'-q_2 N) - N p_0. \end{aligned}$$
(16)

These are linear functions in \(N\). So, whereas the boundaries of (8) and (9) are proportional to \(\sqrt{N}\), the boundaries of (15) and (16) increase linearly. One can verify that when \(p_0 = \frac{1}{2}\) or in the limit \(\delta \downarrow 0\) the boundaries are constants.

The bounds on the error probabilities are only valid if \(p\) does not lie in \([p_0-\delta , p_0+\delta ]\); consequently, \(\delta \) impacts the correctness of the test. Furthermore, the efficiency is affected. Since \(\delta \) is the only parameter, and has a clear interpretation in terms of the output guarantees of Sect. 2.3, the parameter choice for this test is not discussed further in Sect. 4.
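
A sketch of the SPRT in terms of the log-likelihood ratio, which is equivalent to tracking \(Z_N\) against the linear boundaries (15) and (16); the thresholds follow Wald’s choice as given above, the sampler is a Bernoulli stand-in for the model simulator, and the cap `max_n` is ours.

```python
import math, random

def sprt(sample, p0, delta, alpha1, alpha2, max_n=10**6):
    """Wald's SPRT for H'_{+1}: p > p0 + delta vs H'_{-1}: p < p0 - delta."""
    pp, pm = p0 + delta, p0 - delta
    log_l = math.log(alpha2 / (1 - alpha1))    # log l': accept H'_{-1} below
    log_u = math.log((1 - alpha2) / alpha1)    # log u': accept H'_{+1} above
    llr = 0.0                                  # log T_N
    for _ in range(max_n):
        if sample():
            llr += math.log(pp / pm)
        else:
            llr += math.log((1 - pp) / (1 - pm))
        if llr >= log_u:
            return +1
        if llr <= log_l:
            return -1
    return 0

rng = random.Random(5)
print(sprt(lambda: rng.random() < 0.6, p0=0.5, delta=0.025,
           alpha1=0.05, alpha2=0.05))
```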

3.5 Gauss-SSP test

The test discussed in this section goes back to [15], was discussed in [38] and [35] and has been implemented in the tool VeStA and its offshoots. It can be seen as a fixed sample size version of the SPRT. As with the SPRT, we assume that \(p\) is outside the interval \([p_0-\delta ,p_0+\delta ]\) and, hence, consider the hypotheses of (14). The idea is then to draw \(N\) samples, with \(N\) fixed beforehand, and accept \(H'_{+1}\) if \(Z_N \ge 0\) and to accept \(H'_{-1}\) otherwise. The sample size \(N\) is computed such that the requirements on the two errors of the first kind are met. To make this precise: we can write the first error of the first kind (given in the general setting by (3)) as follows:

$$\begin{aligned} \mathbb {P}(A_{+1} \text { } | \text { } H'_{-1})&\le \mathbb {P}(Z_N \ge 0 \text { } | \text { } p = p_0-\delta ) \\&= \mathbb {P}\left( \left. Y_N \ge \frac{N \delta }{\sqrt{\text {Var}(Z_N)}} \text { } \right| \text { } p = p_0-\delta \right) , \end{aligned}$$

where \(\text {Var}(Z_N) = N (p_0-\delta )(1-p_0+\delta )\) if \(p = p_0-\delta \) and

$$\begin{aligned} Y_N = \frac{Z_N + N \delta }{\sqrt{\text {Var}(Z_N)}} \end{aligned}$$

is a normalised version of \(Z_N\). One obtains a similar expression for the second error of the first kind. In [35] the exact binomial distribution of \(Z_N\) is used to find an upper bound for these probabilities. In this paper, we use the fact that \(Y_N\) is approximately normally distributed for large \(N\), which leads to the following requirements on \(N\) in order to bound the errors of the first kind:

$$\begin{aligned}&N\ge \left( \frac{\Phi ^{-1}(1-\alpha _1)}{\delta } \right) ^2 (p_0-\delta )(1-p_0+\delta )\\&N\ge \left( \frac{\Phi ^{-1}(\alpha _2)}{\delta } \right) ^2 (p_0+\delta )(1-p_0-\delta ), \end{aligned}$$

where \(\Phi \) (as in Sect. 3.1) denotes the Gaussian cumulative distribution function.
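
In code, the resulting sample size computation is a one-liner per bound (a sketch; the helper name is ours, and `NormalDist().inv_cdf` plays the role of \(\Phi ^{-1}\)):

```python
import math
from statistics import NormalDist

def gauss_ssp_sample_size(p0, delta, alpha1, alpha2):
    """Smallest N satisfying both requirements on the errors of the first kind."""
    q = NormalDist().inv_cdf
    n1 = (q(1 - alpha1) / delta) ** 2 * (p0 - delta) * (1 - p0 + delta)
    n2 = (q(alpha2) / delta) ** 2 * (p0 + delta) * (1 - p0 - delta)
    return math.ceil(max(n1, n2))

# Draw this many samples, then accept H'_{+1} iff Z_N >= 0:
print(gauss_ssp_sample_size(p0=0.5, delta=0.025, alpha1=0.05, alpha2=0.05))
```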

As with the SPRT, the indifference parameter \(\delta \) impacts both the correctness and efficiency of the test and because its interpretation is clear, this test is not discussed further in Sect. 4.

We call this test the ‘Gauss-SSP’ test; SSP stands for single sampling plan as it was called in [43]. An SSP test variant that uses the Chernoff–Hoeffding bound instead of the Gaussian approximation is discussed in [18] (the authors call the method based on this test ‘approximate model checking’). This test can be called Chernoff-SSP, and compares to the Gauss-SSP in a way that is similar to how the Chernoff-CI test compares to the Gauss-CI; it will not be discussed here further.

3.6 Azuma test

The test of this section is the first of two class-III tests to be discussed in this paper. These tests are different from the tests discussed previously in the sense that their input parameters only determine the efficiency of the test. So far, no class-III tests have been implemented in model checking tools. These tests have the shape of the typical sequential test depicted in Fig. 2b: they are characterised by the functions \(u(N)\) and \(l(N)\) denoting the boundaries between \(\mathcal {U}\) and \(\mathcal {NC}\), and between \(\mathcal {L}\) and \(\mathcal {NC}\), respectively. We assume that the tests are symmetric (i.e., \(l(N) = -u(N)\)), which means that only \(u(N)\) remains to be chosen such that (3)–(6) are satisfied.

The function \(u(N)\) must asymptotically grow faster than \(\sqrt{N}\), otherwise errors of the first kind will be too likely for small \(|p-p_0|\). An informal argument is that the standard deviation of the process \(Z_N\) grows proportionally to \(\sqrt{N}\), so that even under \(H_0\), given an infinite amount of time such boundaries will eventually be crossed with probability \(1\). This is discussed in greater detail in [29, 30]. Also, \(u(N)\) must grow slower than linearly in \(N\), otherwise errors of the second kind will be too likely for small \(|p-p_0|\). The argument here is that even under one of the alternative hypotheses, the drift of \(Z_N\) is only linear, so that for \(|p-p_0|\) small enough the function \(u(N)\) will diverge linearly from the expected trajectory. As a result, the probability of ever crossing a linearly increasing \(u(N)\), and thus taking a decision, is too small when \(|p-p_0|\) is tiny.

The first shape of \(u(N)\) that we consider is the form:

$$\begin{aligned}&u(N) = a(N+k)^{b}\text {, with }b \in \left( \textstyle \frac{1}{2},1\right) ,\\&\hbox {for some } a > 0 \hbox { and } k > 0. \end{aligned}$$

For this case, both the correctness of the test and a lower bound on the power are proven in [29, 30] using a bounding result that is comparable to, and inspired by, Ross’ “generalized Azuma inequality” in Sect. 6.5 of [32] (which also explains the name of the test). In particular, (3), (4), (5) and (6) are all satisfied, with

$$\begin{aligned}&\alpha _1=\alpha _2=\beta _1=\beta _2 \nonumber \\&\quad = e^{-8(3b-2) a^2 k^{2b-1}}, \hbox {for some } a > 0 \hbox { and } k > 0. \end{aligned}$$
(17)

The input parameters of the test are \(a\) and \(k\); we discuss in Sect. 4.4 how to choose these parameters, based on a guess \(\gamma \) which only affects the efficiency.
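
In code, the boundary and the corresponding choice of \(k\) obtained by solving (17) for a given error bound \(\alpha \) look as follows (a sketch; the helper names are ours, and the value of \(a\) is left free here, to be tuned from the guess \(\gamma \) as in Sect. 4.4):

```python
import math

def azuma_k(a, alpha, b=0.75):
    """Solve (17) for k so that the error bound equals alpha (requires b > 2/3)."""
    return (-math.log(alpha) / (8 * (3 * b - 2) * a ** 2)) ** (1 / (2 * b - 1))

def azuma_boundary(a, alpha, b=0.75):
    """U-NC boundary u(N) = a * (N + k)^b of the Azuma test; l(N) = -u(N)."""
    k = azuma_k(a, alpha, b)
    return lambda n: a * (n + k) ** b

u = azuma_boundary(a=0.5, alpha=0.05)
print(u(1), u(10**4))   # the boundary grows like N^(3/4)
```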

3.7 Darling–Robbins test

In this section, we consider a test similar to the one described in the previous section but with a different form of \(u(N)\). It is based on [11] (Theorem 3), in which the following statement is proven for the test of Sect. 2.2 for general \(\mathcal {U}\)-\(\mathcal {NC}\) boundary \(u(N)\) and \(\mathcal {L}\)-\(\mathcal {NC}\) boundary \(-u(N)\): if one can find an \(\epsilon > 0\) such that,

$$\begin{aligned} \sum _{n=1}^\infty e^{-\frac{u^2(n)}{n+1}} \le \epsilon \end{aligned}$$
(18)

then the probability of error is bounded from above by \(2 \sqrt{2} \epsilon \). If we assume that \(H_0\) does not hold, then the probability of error can be upper bounded by \(\sqrt{2} \epsilon \). The idea is then to carry out the test of Sect. 2.2, with \(u(N)\) chosen such that (18) can be used to show that (3)–(6) hold.

The bound (18) can in principle also be applied to the test from the previous section, with \(u(N)=a(N+k)^{b}\), but turns out to be much looser than the bound of (17). On the other hand, the proof in [29] of (17) requires analytical steps that do not work for boundaries that are of order \(N^{2/3}\) or tighter, such as \(\sqrt{N\log (N)}\). So in order to evaluate such tighter boundaries, only (18) is available. Using this rather loose bound will negatively affect the efficiency of the resulting method.

In this paper we apply the test based on (18), which we call the ‘Darling–Robbins’ or ‘Darling’ test for brevity, only to boundaries of the form:

$$\begin{aligned}&u(N) \!=\! \sqrt{a (N+1) \log {(N+k)}},\\&\hbox {for some } a \!>\! 0 \hbox { and } k > 0. \end{aligned}$$

As with the Azuma test, the remaining input parameters of this test are \(a\) and \(k\); how they can be chosen based on a guess \(\gamma \), which only affects the efficiency, is discussed in Sect. 4.4.
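
For these boundaries, the summand in (18) simplifies to \((n+k)^{-a}\), so the error bound is easy to evaluate. A sketch of the resulting test follows, with \(k\) chosen via the integral approximation derived in Sect. 4.4 and \(a=1.6\) picked ad hoc (Sect. 4.4 shows how to tune it from a guess \(\gamma \)); the sampler is a Bernoulli stand-in and the cap `max_n` is ours.

```python
import math, random

def darling_k(a, alpha):
    """k such that the integral approximation of (18) yields error <= alpha."""
    return (alpha * (a - 1) / math.sqrt(2)) ** (-1 / (a - 1)) - 1

def darling_test(sample, p0, a, alpha, max_n=10**7):
    k = darling_k(a, alpha)
    z = 0.0
    for n in range(1, max_n + 1):
        z += sample() - p0
        u = math.sqrt(a * (n + 1) * math.log(n + k))   # u(N), with l(N) = -u(N)
        if z >= u:
            return +1
        if z <= -u:
            return -1
    return 0   # cap hit; in principle NC continues indefinitely

rng = random.Random(6)
print(darling_test(lambda: rng.random() < 0.6, p0=0.5, a=1.6, alpha=0.05))
```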

3.8 Other tests

In this section, we mention some other hypothesis tests that could also be applied in the context of statistical model checking. None of these tests has been implemented in the major model checking tools, and we will not discuss them in the rest of this paper.

The first is the Bayesian SPRT which was proposed for statistical model checking in [23] and which is based on ideas going back to [21]. In Bayesian statistics, the true parameter \(p\) is itself seen as the realisation of a random variable, of which a prior distribution must be given, which affects not only the efficiency of the test but also its correctness. For a more detailed discussion of the Bayesian SPRT in our framework, see [29].

A second test is the one proposed in [26] and which is mentioned, among others, in [41]. The input of this test is a constant \(c\) which represents the relative cost of drawing a sample compared to the cost of accepting an invalid hypothesis. The critical areas are then constructed such that the expected cost is minimised in a Bayesian setting.

Finally, in [39] a variant of the SPRT is proposed that includes an inconclusive area \(\mathcal {I}\) (thus, in our terminology, it essentially turns the SPRT from a class I test into a class II test). In fact, the test entails that two SPRT tests are performed simultaneously (i.e., based on the same sample path of \(Z_N\)), namely one testing \(p\ge p_0+\delta \) against \(p\le p_0\), and one testing \(p\ge p_0\) against \(p\le p_0-\delta \). At first sight this test seems to fit in the framework of Sect. 2.2, with a somewhat remarkable shape of \(\mathcal {NC}\) (see Fig. 1b in [39]), but one needs to be careful here: since the sample path is not stopped when one of the sub-tests draws a conclusion, one should not just look at where the process \(Z_N\) eventually ends up, but also take into account its whole sample path. Thus, when implemented correctly, the test does not formally fit in the framework of Sect. 2.2.

4 Choice of parameters

In Sect. 3, we discussed a range of tests in terms of the framework of Sect. 2.2; in particular, we focused on the general shape of the critical areas. We found that for each test, an additional parameter is needed to determine the exact shape of the critical areas. For the tests in class I (i.e., the SPRT and Gauss-SSP test of Sects. 3.4 and 3.5), this was the indifference level \(\delta \). This parameter has a clear interpretation as discussed in Sect. 2.3; consequently, these tests do not appear further in this section. For the other tests we discuss how to choose their parameters such that they have clear interpretations in terms of the output guarantees of Sect. 2.3.

The class II tests are treated in Sects. 4.1, 4.2 and 4.3, where we discuss how to parametrise the Gauss-CI, Chernoff-CI and Chow–Robbins tests, respectively, using the power-indifference level \(\zeta \). This replaces their previous parametrisation in terms of either the sample size \(N\) or the confidence interval width \(\epsilon \), of which the latter has a clear interpretation for making quantitative statements (i.e., confidence intervals) but less so for hypothesis testing. In Sect. 4.4, we discuss the parametrisation of the class III tests (Azuma and Darling) in terms of the guess \(\gamma \).

4.1 Choice of parameters for the Gauss-CI test

In Sect. 3.1, we derived the expressions (8) and (9) for the critical region boundaries, with \(\alpha \), \(p_0\) and \(N\) left as parameters. While the interpretation of \(\alpha \) and \(p_0\) is clear, the choice of \(N\) for the Gauss-CI test is non-trivial as it settles the trade-off between the power and the efficiency. If \(p\) is very different from \(p_0\), a small value for \(N\) suffices—choosing \(N\) too large then leads to extra inefficiency. Alternatively, if \(p\) is close to \(p_0\) a large value for \(N\) is needed—choosing \(N\) too small then leads to a decrease in power. For making quantitative statements, the goal is often to choose \(N\) such that the width of the confidence interval has a certain value. But since we focus on hypothesis testing, we want a procedure for choosing \(N\) such that (5) and (6) are satisfied.

If \(p- p_0\) were known to be equal to some given value \(\zeta >0\), then the minimal choice of \(N\) for which (5) and (6) are still satisfied can be calculated. For large \(N\), \(\hat{p}_N \triangleq S_N/N\) can be well approximated by a normally distributed random variable with mean \(p_0+\zeta \) and variance \(\sigma ^2\) \(\triangleq \) \((p_0+\zeta )~(1-p_0-\zeta )/N\). Writing \(\xi = \Phi ^{-1}(1-\alpha )\) and \(\sigma _{H_0}^2\) \(\triangleq \) \((p_0)~(1-p_0)/N\), the probability of not being able to accept \(H_{+1}\) after drawing \(N\) samples is given by:

$$\begin{aligned}&\displaystyle \mathbb {P}\left( \hat{p}_N \le p_0 + \xi {\sigma _{H_0}} \right) \nonumber \\&\quad = \displaystyle \mathbb {P}\left( \frac{\hat{p}_N - p_0 - {{\zeta }}}{\sigma } \le \frac{\xi \sigma _{H_0} - {\zeta }}{\sigma }\right) \nonumber \\&\quad = \displaystyle \Phi \left( \frac{\xi \sigma _{H_0} -{\zeta }}{\sigma }\right) = \displaystyle \Phi \left( \frac{\xi \sqrt{p_0 (1-p_0)} -{\zeta \sqrt{N}}}{\sqrt{(p_0+\zeta )(1-p_0-\zeta )}}\right) . \end{aligned}$$
(19)

Setting this equation equal to \(\beta \) and solving for \(N\) yields the following expression:

$$\begin{aligned} N_G = \left( \frac{\xi \sqrt{p_0(1-p_0)} - \Phi ^{-1}(\beta )\sqrt{(p_0+\zeta )(1-p_0-\zeta )} }{\zeta }\right) ^2. \end{aligned}$$

An analogous procedure can be carried out for \(p = p_0-\zeta \), which means that we have two expressions for \(N\). Taking the maximum of the two guarantees that if this many samples are drawn, (5) and (6) hold when \(|p-p_0|>\zeta \).
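
A sketch of this computation (the helper name is ours):

```python
import math
from statistics import NormalDist

def gauss_ci_sample_size(p0, zeta, alpha, beta):
    """N_G such that (5) and (6) hold for the Gauss-CI test when |p-p0| > zeta."""
    q = NormalDist().inv_cdf
    def n_one_side(z):   # z = +zeta or -zeta
        num = (q(1 - alpha) * math.sqrt(p0 * (1 - p0))
               - q(beta) * math.sqrt((p0 + z) * (1 - p0 - z)))
        return (num / zeta) ** 2
    return math.ceil(max(n_one_side(zeta), n_one_side(-zeta)))

print(gauss_ci_sample_size(p0=0.5, zeta=0.025, alpha=0.05, beta=0.05))
```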

4.2 Choice of parameters for the Chernoff-CI test

For the Chernoff-CI test of Sect. 3.2, the remaining parameter is \(\epsilon \), which is related to the width of a confidence interval. Since this has an impact on the power, we use it to establish an upper bound on the error probability of the second kind. Assume, without loss of generality, that \(H_{+1}\) holds, so that \(p= p_0 + \Delta \) for some \(\Delta >0\); outside the power-indifference region, we have \(\Delta \ge \zeta \). Note that we can use (11) to write \(\epsilon \) as \(\epsilon _N\), i.e., as a function of \(N\). For an error of the second kind to occur it must hold that after \(N\) samples we have that \(\bar{X}-p_0<\epsilon _{N}\). We can use a form of the Chernoff–Hoeffding bound [20] and the fact that \({\mathbb {E}}(p_0-\bar{X}) = -\Delta \) to establish

$$\begin{aligned} \mathbb {P}(\bar{X}-p_0<{\epsilon _{N}})&= \mathbb {P}(p_0-\bar{X}>-{\epsilon _{N}})\\&= \mathbb {P}(p_0-\bar{X} + \Delta >\Delta -{\epsilon _{N}})\\&\le e^{-2N (\Delta -{\epsilon _{N}})^2}. \end{aligned}$$

Setting \(\beta = e^{-2N (\Delta -\epsilon _{N})^2}\) means that (5) is valid. It has two solutions for \(N\), one of which gives positive \(\Delta -\epsilon _N\) (which is a requirement of the Chernoff–Hoeffding bound). Setting \(\Delta =\zeta \) in this solution, we find the worst-case number of samples needed outside the power-indifference region:

$$\begin{aligned} N_{C} = \frac{2\sqrt{\log (\beta )\log (\alpha )}-\log (\alpha \beta )}{2\zeta ^2}. \end{aligned}$$
(20)

A similar argument can be made for \(\Delta <0\) and (6), which leads to the same value for \(N_{C}\) for both error probabilities of the second kind.
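
In code (a sketch mirroring (20); the helper name is ours):

```python
import math

def chernoff_ci_sample_size(zeta, alpha, beta):
    """Worst-case sample size N_C of (20); note it does not depend on p0."""
    la, lb = math.log(alpha), math.log(beta)
    return math.ceil((2 * math.sqrt(la * lb) - (la + lb)) / (2 * zeta ** 2))

print(chernoff_ci_sample_size(zeta=0.025, alpha=0.05, beta=0.05))
```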

Table 2 compares the sample size for the Chernoff-CI test, \(N_C\), and the sample size for the Gauss-CI test, \(N_G\), for the same values of \(\alpha \) and \(\beta \), calculated using (19). Note that the sample size of the Chernoff-CI test does not depend on \(p_0\), whereas that of the Gauss-CI test does. Also, in all cases shown, the sample size for the Chernoff-CI test is larger than for the Gauss-CI test.

Table 2 Chernoff (\(N_C\)) and Gaussian (\(N_G\)) sample sizes, \(\beta = \alpha \)

4.3 Choice of parameters for the Chow–Robbins test

For the Chow–Robbins test of Sect. 3.3, the only parameter left to choose is the (half-)width of the confidence interval \(\epsilon \), such that the error probability of the second kind is bounded as desired.

To obtain a value for \(\epsilon \) we start by observing that \(\hat{p}_N\) is approximately normally distributed with mean \(p_0+\zeta \) and standard deviation \(\sigma =\epsilon /\Phi ^{-1}(1-\alpha )=-\epsilon /\Phi ^{-1}(\alpha )\). The test will not accept \(H_{+1}\) if \(\hat{p}_N\le p_0+\epsilon \), leading to (19) with \(\xi \sigma _{H_0} = \epsilon \) substituted. Setting this to \(\beta \), one finds

$$\begin{aligned} \epsilon = \frac{\zeta }{1+\Phi ^{-1}(\beta )/\Phi ^{-1}(\alpha )}. \end{aligned}$$
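
In code (a sketch; the helper name is ours, and for \(\alpha =\beta \) this reduces to \(\epsilon =\zeta /2\)):

```python
from statistics import NormalDist

def chow_robbins_eps(zeta, alpha, beta):
    """Half-width eps such that the power is at least 1-beta when |p-p0| > zeta."""
    q = NormalDist().inv_cdf
    return zeta / (1 + q(beta) / q(alpha))

print(chow_robbins_eps(zeta=0.025, alpha=0.05, beta=0.05))   # -> 0.0125
```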

4.4 Choice of parameters for the Azuma and Darling tests

Since the Azuma and Darling tests are closely related, we discuss their parameter choices together. These two tests have non-critical area upper boundaries \(u(N;a,k)\) given by \(a(N+k)^{b}\) and \(\sqrt{a(N+1)\log (N+k)}\), respectively. The impact of the parameters \(a\) and \(k\) on the cone that defines \(\mathcal {NC}\) is illustrated in Fig. 3 for the Azuma test. The parameter \(a\) influences the increase in width of \(\mathcal {NC}\) and its influence does not fade relative to \(N\) when \(N\) grows large. A high value of the parameter \(k\) on the other hand makes the area \(\mathcal {NC}\) wider for small values of \(N\). For the Darling test, the influence of these two parameters is similar. The Azuma test additionally depends on a parameter \(b\); a high value for \(b\) means that the area \(\mathcal {NC}\) boundary \(u(N)\) will more closely resemble a straight line, which means that it will grow much wider asymptotically.

Fig. 3 Impact of the parameters \(a\) and \(k\) on the shape of the critical regions of the Azuma test

A high value for \(k\) makes it harder to accept an alternative hypothesis in the beginning, but—since \(a\) and \(b\) can then be chosen smaller while maintaining the same significance level \(\alpha \)—easier as \(N\) grows bigger. Since the upper bound on the probability of error is fixed to equal \(\alpha \), we can determine \(k\) as a function of \(a\), \(\alpha \) and \(b\). For the Azuma test, we easily derive from (17) that,

$$\begin{aligned} k_{\text {Azuma}}(a,\alpha ,b) = \left( \frac{\log \left( \alpha \right) }{8a^2(2-3b)} \right) ^{\frac{1}{2b-1}}. \end{aligned}$$

In Table 3, we show the (approximately) optimal parameters \(a\) and \(k\) that we found for both tests for several values of \(\gamma \) (recall that this is our guess for \(|p-p_0|\)). We can see that for the Azuma test, \(a\) grows proportional to \(\sqrt{\gamma }\), and \(k\) inversely proportional to \({\gamma }^2\).

Table 3 Approximately optimal parameter choices for \(\alpha =0.05\). For this table we used \(b = \frac{3}{4}\)

For the Darling test, it is harder to obtain a similar expression from (18) since we have to solve for the lower bound of a summation, but for practical purposes the summation in (18) can be approximated by the integral

$$\begin{aligned} \int _{1}^\infty e^{-\frac{u^2(x)}{x+1}} dx. \end{aligned}$$

We then derive

$$\begin{aligned} k_{\text {Darling}}(a,\alpha ) = \left( \frac{\alpha (a-1)}{\sqrt{2}}\right) ^{-\frac{1}{a-1}}-1. \end{aligned}$$

We then minimise the expected number of samples drawn, which we approximate using the intersection of the expected trajectory of \(Z_N\) and \(u(N)\). This means that we have to solve

$$\begin{aligned} |p-p_0|N = u(N;a,k(a,\alpha )) \end{aligned}$$
(21)

for \(N\) and then minimise over \(a\). Unfortunately, both in the cases of the Azuma and the Darling tests, solving (21) for \(N\) does not lead to a closed form expression. However, in both cases we can do the minimisation numerically, since the function \(u(N)-|p-p_0|N\) has a derivative simple enough to allow for Newton’s method to find its roots. We seek the minimum of \(N(a)\) for \(a \in [0,\infty )\), but for the sake of being able to use straightforward numerical techniques, we search for the minimum of \(N(\frac{1}{1+a})\) for \(\frac{1}{1+a} \in (0,1]\). Since this is a bounded interval, we can use techniques such as golden section search [9] to find the minimum. For the Darling test we even know that \(a>1\), meaning that we can minimise \(N(\frac{1}{a})\) on \((0,1]\).
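
A sketch of this numerical procedure for the Darling test follows. It uses bisection instead of Newton’s method for robustness, and a search bracket for \(a\) chosen, by us, to keep \(k(a,\alpha )\) numerically well-behaved; all function names are ours.

```python
import math

def darling_k(a, alpha):
    return (alpha * (a - 1) / math.sqrt(2)) ** (-1 / (a - 1)) - 1

def crossing_n(a, alpha, gamma):
    """Solve (21): the N at which gamma*N meets u(N; a, k(a, alpha))."""
    k = darling_k(a, alpha)
    u = lambda n: math.sqrt(a * (n + 1) * math.log(n + k))
    lo, hi = 1.0, 2.0
    while gamma * hi < u(hi):          # grow until the drift line overtakes u
        hi *= 2
    for _ in range(60):                # bisect on gamma*n - u(n)
        mid = (lo + hi) / 2
        if gamma * mid < u(mid):
            lo = mid
        else:
            hi = mid
    return hi

def optimal_a(alpha, gamma):
    """Golden section search over t = 1/a, minimising the crossing point."""
    phi = (math.sqrt(5) - 1) / 2
    lo, hi = 1 / 20, 1 / 1.2           # i.e. a in [1.2, 20]
    c, d = hi - phi * (hi - lo), lo + phi * (hi - lo)
    for _ in range(60):
        if crossing_n(1 / c, alpha, gamma) < crossing_n(1 / d, alpha, gamma):
            hi, d = d, c
            c = hi - phi * (hi - lo)
        else:
            lo, c = c, d
            d = lo + phi * (hi - lo)
    return 1 / ((lo + hi) / 2)

print(optimal_a(alpha=0.05, gamma=0.1))   # close to the fitted a_Darling below
```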

The final remaining value to choose is then the parameter \(b\) of the Azuma test. A higher value for \(b\) means a tighter bound on the error probability of the first kind, but the area \(\mathcal {NC}\) will grow larger asymptotically. The difference in terms of the tightness of the bound can be observed in Table 4, where we display the solutions to equation (21) for the Azuma test with several values of \(b\) and the Darling test (with \(a\) and \(k\) chosen optimally). The impact of a low value for \(b\) is twofold: the expected number of needed samples when the guess is correct will be higher, but the test will become less sensitive to the guess \(\gamma \). Note, however, that even for very low values of \(b\) (e.g., \(0.67\)), the Azuma test will still be more sensitive than the Darling test. Since for \(b=0.67\) the Azuma test has a higher expected number of needed samples than the Darling test, while it is still more sensitive, the Azuma test has no advantages over the Darling test, so we can say that it performs strictly worse than the Darling test. The choice \(b=0.9\) on the other hand leads to enormous parameter sensitivity. Values of \(b\) around \(\frac{3}{4}\) seem to strike a nice balance, and in Sect. 5, where we empirically validate the analysis of this section, we will only consider the Azuma test with this parameter choice.

Table 4 For each combination \((\gamma ,|p-p_0|,\text {test type})\), we display the solution to (21), i.e., the \(N\) for which the expected trajectory leaves \(\mathcal {NC}\) with parameters \(a\) and \(k\) chosen optimally. Bold values indicate that \(\gamma = |p-p_0|\), i.e., that the guess is correct. In all cases, \(\alpha =0.05\)

By going through the above numerical procedure for a wide range of values of \(\alpha \) and \(\gamma \), for \(b=3/4\), and then fitting a function, we have obtained the following approximate solutions:

$$\begin{aligned} a_{\text {Azuma}} \approx (0.25-0.144 \alpha ^{0.15}) \sqrt{\gamma /0.0243} \end{aligned}$$

and

$$\begin{aligned} a_{\text {Darling}}&\approx \exp \big ( 0.4913 -0.0715 x + 0.0988 y -0.00089 x^2 \\&\quad +\, 0.00639 y^2 -0.00361 xy \big ), \end{aligned}$$

with \(x=\log \alpha \) and \(y=\log \gamma \). Note that we have not thoroughly quantified the precision of the above approximations. However, they need not be very precise: after all, these calculations are only used to optimise the convergence speed for a guess \(\gamma \) of \(|p-p_0|\), and that guess will typically be imprecise by itself; furthermore, any error in the calculation only affects the efficiency of the test, not its correctness. Thus, simple approximations like the above can suffice for use in tools.
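For tool builders, the closed form for \(k_{\text {Darling}}\) and the two fitted approximations translate directly into code. A minimal sketch (the function names are ours):

```python
import math

def k_darling(a, alpha):
    """Closed-form choice of k for the Darling test (requires a > 1)."""
    return (alpha * (a - 1) / math.sqrt(2)) ** (-1.0 / (a - 1)) - 1

def a_azuma(alpha, gamma):
    """Fitted approximation to the optimal a for the Azuma test (b = 3/4)."""
    return (0.25 - 0.144 * alpha ** 0.15) * math.sqrt(gamma / 0.0243)

def a_darling(alpha, gamma):
    """Fitted approximation to the optimal a for the Darling test."""
    x, y = math.log(alpha), math.log(gamma)
    return math.exp(0.4913 - 0.0715 * x + 0.0988 * y
                    - 0.00089 * x ** 2 + 0.00639 * y ** 2
                    - 0.00361 * x * y)

# Example: alpha = 0.05, gamma = 0.1
#   a = a_darling(0.05, 0.1); k = k_darling(a, 0.05)
```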

5 Results and comparisons

In this section we compare the performance of the tests discussed in Sect. 3—see Table 5 for a summary. We do this in two ways: we begin in Sect. 5.1 by comparing the tests in terms of the implied test decision areas discussed in Sect. 2.2, and by examining how these areas behave as a function of the number of samples drawn. In Sect. 5.2, we then compare the tests by the three performance measures mentioned in Sect. 2.3: correctness, power and efficiency. In Sect. 5.3, we discuss the implementation of the tests in model checking tools.

5.1 Shape of the non-critical areas (\(\mathcal {NC}\))

As was explained in Sect. 2.2, all of the tests in this paper can be considered in the context of a single framework: a random walk \(Z_n\) that in each step jumps up by \(1-p_0\) with probability \(p\) and down by \(p_0\) with probability \(1-p\). The tests can then be defined in terms of the boundaries of the test decision area \(\mathcal {NC}\), as sketched in Fig. 2. In Figs. 4 and 5, we compare the shapes of these boundaries for all tests introduced before. For tests that can end inconclusively, the boundary of the corresponding decision area \(\mathcal {I}\) is drawn as a grey line.
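For concreteness, one step of this walk can be sketched as follows (equivalently, \(Z_n = \sum _{i=1}^n X_i - n p_0\) for Bernoulli(\(p\)) samples \(X_i\)):

```python
import random

def z_step(z, p, p0):
    """One step of the random walk Z_n: up by 1 - p0 with
    probability p, down by p0 with probability 1 - p."""
    return z + (1 - p0) if random.random() < p else z - p0
```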

Table 5 Summary of the tests
Fig. 4

Critical regions, \(p_0 = 0.5\), \(\delta = \zeta = 0.025\), \(\gamma = 0.1\), \(\alpha =\beta =0.05\). Solid lines indicate boundaries of the critical regions; grey lines indicate where the test is inconclusive. Dotted lines indicate thresholds of Gauss-CI and Chernoff-CI for different \(\beta \). Dashed lines indicate the expected sample paths for \(p=p_0+\delta \), i.e., the edge of the indifference region, and for \(p=p_0+\gamma \), i.e., the value for which the Azuma and Darling tests have been optimally parametrised

Figure 4 shows the decision boundaries for the symmetrical situation \(p_0 = \frac{1}{2}\). For the parametrisation, the accepted error probabilities of first and second kind \(\alpha \) and \(\beta \) are set to 0.05. The indifference parameters \(\delta \) and \(\zeta \) are set to 0.025, while the guess \(\gamma \) for \(p-p_0\) is 0.1. Note that choosing \(\gamma >\delta \) makes sense: it expresses the investigator’s guess that \(p\) is 0.1 away from \(p_0\) (and that she wishes to optimise the Darling and Azuma tests for that case), but also that she wishes to have reliable results even if \(p\) turns out to be only 0.025 away from \(p_0\).

First, consider the Darling and Azuma tests. Although they never terminate inconclusively, they may take very long if \(p\) is very close to \(p_0\). Comparing their \(\mathcal {NC}\) regions, we see that the Azuma region is narrower than the Darling region for small values of \(N\), but the Azuma boundaries eventually overtake those of the Darling test; this is to be expected, since functions of type \(N^{\frac{3}{4}}\) grow asymptotically faster than those of type \(\sqrt{N \log (N)}\).
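Explicitly,

$$\begin{aligned} \frac{N^{3/4}}{\sqrt{N \log N}} = \frac{N^{1/4}}{\sqrt{\log N}} \rightarrow \infty \quad \text {as } N \rightarrow \infty . \end{aligned}$$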

The SPRT is also a sequential test and may theoretically take indefinitely long. However, its \(\mathcal {NC}\) region is narrow, so long runs are unlikely. The price for this is that the SPRT may draw an incorrect conclusion with probability more than \(\alpha \) if the true \(p\) is not at least \(\delta \) away from \(p_0\).

Finally, consider the four (almost) fixed-\(N\) tests: Chernoff-CI, Gauss-CI, Chow–Robbins and Gauss-SSP. As was pointed out in Sect. 4.2, the Chernoff-CI test is based on a looser bound than the others and therefore needs more samples for the same confidence level. The Gauss-CI and Chow–Robbins tests use the same bound and only differ in how they determine at what \(N\) to terminate. For the Gauss-CI test, this \(N\) is determined in advance, based on obtaining a sufficiently narrow confidence interval under the null hypothesis (\(p=p_0\)). The Chow–Robbins test, on the other hand, stops as soon as the confidence interval is narrow enough based on the actual samples; if \(p\) is close to 0 or 1, this may occur much sooner. The Gauss-SSP test is similar to the Gauss-CI test in that its stopping time is determined in advance. However, it stops earlier because, like the SPRT, it accepts the risk of drawing the wrong conclusion with probability more than \(\alpha \) if \(|p-p_0|<\delta \), whereas the Gauss-CI test in that case mostly risks terminating inconclusively.
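The difference between the two stopping rules can be sketched as follows. This is a minimal illustration only: we assume for simplicity that the prescribed half-width of the interval is \(\zeta \) and use the normal approximation; the exact parametrisation of Sect. 4 may differ in its constants, and details such as minimum sample sizes are chosen arbitrarily here.

```python
import math

def gauss_ci_samples(p0, zeta, z=1.96):
    """Gauss-CI: N is fixed in advance so that the CI half-width
    under the null hypothesis (p = p0) equals zeta."""
    return math.ceil(z ** 2 * p0 * (1 - p0) / zeta ** 2)

def chow_robbins_samples(sample, zeta, z=1.96, n_min=30):
    """Chow-Robbins: keep drawing until the CI around the running
    mean, based on the actual samples, has half-width <= zeta.
    `sample()` is assumed to return a Bernoulli draw (0 or 1)."""
    n, successes = 0, 0
    while True:
        successes += sample()
        n += 1
        p_hat = successes / n
        if n >= n_min and z * math.sqrt(p_hat * (1 - p_hat) / n) <= zeta:
            return n
```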

Fig. 5

As Fig. 4, but for \(p_0=0.2\)

As an aid to understanding, a dashed line shows the expected sample path if \(p=p_0+\delta \), i.e., at the border of where the investigator is indifferent about the outcome. This line crosses the Gauss-CI, Chernoff-CI and Chow–Robbins boundaries well away from the grey (inconclusive) parts, showing that these tests are indeed likely to terminate with a conclusion (namely with probability 95 %), and extremely unlikely to draw the wrong conclusion. The SPRT area is rather narrow; the dashed line leaves it soon, at a point still relatively near the lower edge of this area, illustrating the 5 % risk of drawing the wrong conclusion. The Gauss-SSP test runs the same risk, due to its early termination. The Azuma and Darling boundaries are only crossed beyond the edge of the figure. These tests take rather long in this case because we chose to parametrise them optimally for a larger difference between \(p\) and \(p_0\), namely \(\gamma =0.1\) rather than \(\delta =0.025\), as illustrated by the other dashed line.

Figure 5 is similar to Fig. 4 but with \(p_0\) set to \(0.2\). We mention the main differences. First, although this is barely visible, the boundaries of the SPRT are not constant, but they decrease in \(N\) (see (15) and (16)). Second, the Gauss-CI test’s area \(\mathcal {NC}\) is less broad due to the smaller variance under the null hypothesis. Third, the Chow–Robbins test may now take longer than the Gauss-CI test, because the Chow–Robbins test continues until the confidence interval has reached a prescribed width, which takes longest if \(p=0.5\). The Gauss-CI test stops earlier in that case, because \(p\) is so far away from its value under the null hypothesis that a much wider confidence interval still allows for a confident decision.

Figures 4 and 5 are only two examples of the figures that can be generated interactively on the website [1] mentioned earlier.

5.2 Simulation results

In this section, we compare the tests discussed in this paper by empirically evaluating their performance for a range of underlying parameter values.Footnote 11 Since we only compare different statistical tests, we do not need to consider the simulation aspect of statistical model checking. Accordingly, we let our computer program draw samples directly from a Bernoulli distribution with (known) parameter \(p\). With \(p\) fixed, the remaining parameter to choose is \(\delta \), \(\zeta \) or \(\gamma \), depending on the test. In all cases \(\alpha = \beta = 0.05\).

For each test we estimate the following metrics:

1. \(\rho \), the probability that a test accepts the right hypothesis, used as a measure for the confidence (the higher the better);

2. \(\upsilon \), the probability that a test proves inconclusive, used as a measure for the power (the lower the better);

3. \(\eta \), the expected number of samples drawn before the test is concluded, used as a measure for the efficiency (the lower the better).

The procedure is as follows: we conduct each test \(1\,000\) times, let \(\hat{\rho }\) be the fraction of correct conclusions, \(\hat{\upsilon }\) the fraction of tests that remained inconclusive (where for the sequential tests, we set a 60 s time bound) and \(\hat{\eta }\) the average number of samples drawn. In Tables 6, 7 and 8, we display these estimates plus or minus the half-width of a 95 %-CI around the estimate. In Tables 6 and 8 we have set \(p_0=0.5\); the only difference between these two tables is the choice of \(|p-p_0|\), which equals \(0.1\) for the former and \(0.001\) for the latter. For Table 7 we have set \(p_0=0.2\) and \(|p-p_0|=0.01\). The rows in bold indicate that the input parameter \(\delta \), \(\zeta \) or \(\gamma \) is exactly equal to \(|p-p_0|\).
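The estimates and the half-widths of their 95 %-CIs are computed in the standard way; the sketch below illustrates this, where `results` is assumed to hold one (conclusion, sample count) pair per run:

```python
import math

def summarise(results, z=1.96):
    """Estimate rho, upsilon and eta with 95%-CI half-widths from a
    list of (conclusion, n) pairs, where conclusion is one of
    'correct', 'wrong', 'inconclusive'."""
    m = len(results)
    rho = sum(c == 'correct' for c, _ in results) / m
    ups = sum(c == 'inconclusive' for c, _ in results) / m
    eta = sum(n for _, n in results) / m
    s2 = sum((n - eta) ** 2 for _, n in results) / (m - 1)
    hw = lambda q: z * math.sqrt(q * (1 - q) / m)  # binomial half-width
    return ((rho, hw(rho)), (ups, hw(ups)), (eta, z * math.sqrt(s2 / m)))
```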

Table 6 Simulation results for \(p_0 = 0.5\), \(p= 0.6\)

The number of samples needed for the Gauss-CI test grows inversely proportionally to the square of \(\zeta \). Because the Gauss-CI test is a fixed sample size test, \(\hat{\eta }\) has no variance. The main drawback is that if \(\zeta \) is considerably larger than \(|p-p_0|\), the Gauss-CI test will almost never draw a conclusion. This is witnessed by \(\hat{\upsilon } \gg 0\), seen particularly in Table 8. The bounds on the error probabilities are very tight: in all tables we see that if \(\zeta = |p-p_0|\), the probability of drawing the correct conclusion is close to \(1-\beta = 0.95\). Furthermore, in Table 8 we observe that when \(\zeta \) is chosen much too large, the proportion of incorrect conclusions (i.e., \(1-\hat{\rho } -\hat{\upsilon }\)) is close to \(\alpha =0.05\).
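To see where the quadratic scaling comes from (a back-of-the-envelope sketch, writing \(z\) for the relevant normal quantile): whatever the exact parametrisation of Sect. 4, the half-width of a Gaussian confidence interval scales as \(z\sqrt{p_0(1-p_0)/N}\), so requiring it to scale with \(\zeta \) forces

$$\begin{aligned} z\sqrt{\frac{p_0(1-p_0)}{N}} \propto \zeta \quad \Longrightarrow \quad N \propto \frac{z^2\, p_0(1-p_0)}{\zeta ^2}. \end{aligned}$$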

We see in general that the Chernoff-CI test requires more samples than the Gauss-CI test; in Table 7, for which \(p_0\) equals \(0.2\) instead of \(0.5\), the difference between the sample sizes of the Gauss-CI and Chernoff-CI tests is larger than in Tables 6 and 8. This is consistent with the discussion of Sect. 3.2. The bound on the probability of error of the second kind for the Chernoff-CI test appears to be rather loose; when the power-indifference level \(\zeta \) equals the actual difference \(p-p_0\), the estimate for the probability of inconclusive termination \(\hat{\upsilon }\) is well below \(\beta =0.05\).

That the Chow–Robbins test is a mixture of a fixed sample size test and a sequential test can be seen from the low variance of the number of samples drawn. In Table 6, the variance of \(Z_N\) under \(p_0\) is considerably higher than under \(p\), so the Chow–Robbins test requires a noticeably smaller sample size on average than the Gauss-CI test. However, the reverse is true in Table 7 and the Chow–Robbins test does slightly worse than the Gauss-CI test as a consequence. Overall, the two tests have similar efficiency.

The SPRT is the most efficient among all tests when \(\delta \) is picked just right; in each table, its value of \(\hat{\eta }\) is the lowest among all tests that satisfy correctness. However, we indeed see its performance degrade when its assumptions are violated, i.e., when \(|p-p_0|\) turns out to be smaller than \(\delta \). In Table 8, the CI for \(\hat{\rho }\) contains \(\frac{1}{2}\) when \(\delta \) is large, which is the worst level of \(\rho \) that a test can have (after all, if the confidence were even lower, one could always use the opposite of the test's result and obtain a confidence that is \(>\!\!\frac{1}{2}\)). The average number of samples needed seems to grow inversely proportionally to \(\delta \).

The Gauss-SSP test is similar to the SPRT, albeit slightly less efficient. This was to be expected; see also [39], where the same observation was made.

Both the Azuma and Darling tests are very conservative: they have a \(\hat{\rho }\) of well over 95 \(\%\). When the guess is (almost) correct, the Azuma test is more efficient than the Darling test. However, if \(\gamma \) is taken to be considerably larger than \(|p-p_0|\), the number of samples needed for the Azuma test grows rapidly, while the Darling test remains remarkably insensitive to the model parameters, as can be seen in all tables. The Azuma result \(\hat{\upsilon } \approx 1\) in Table 8 means that the Azuma test did not draw a conclusion within the 60 s time bound.

Table 7 Simulation results for \(p_0 = 0.2\), \(p= 0.21\)
Table 8 Simulation results for \(p_0 = 0.5\), \(p= 0.501\)

5.3 Tool implementation

Table 9 contains a summary of the implementation of the tests of Sect. 3 in model checking tools. In this section, we discuss each of the tools in some detail.

Table 9 Tool implementation

UPPAAL allows the user to check qualitative as well as quantitative statements (as described in the introduction). Qualitative statements are evaluated using the SPRT. Quantitative statements were evaluated using a sample size determined using the Chernoff–Hoeffding bound; since version 4.1.15, the Chow–Robbins procedure is used to construct a Clopper–Pearson confidence interval. PRISM (version 4.1.beta2) implements all four methods in the context of making qualitative statements; Gauss-CI and Chow–Robbins are implemented as versions of the ‘ACI’ method, Chernoff as the ‘APMC’ method and the SPRT as the ‘SPRT’ method. PRISM does not allow the user to directly create confidence intervals for the sake of making quantitative statements; however, confidence intervals are created as a by-product of hypothesis tests and can be found in the ‘log’ section. MRMC (v1.5) [24] only implements the Chow–Robbins test, but, unlike PRISM, also allows this method to be used to evaluate steady-state properties (which we do not discuss in this paper).

COSMOS (v1.0) [7] implements the Chow–Robbins test for quantitative purposes. PLASMA (version 1.2.8) [22] implements the SPRT for qualitative statements and the Chernoff test for quantitative statements.

YMER uses the SPRT. Different versions of YMER feature different add-ons; e.g., version 3.0.9 includes a numerical solution engine that allows the user to check nested operators, while version 4.0 includes support for unbounded until. The tool PVeStA, which is based on the tool VeStA [36], implements the Gauss-SSP test and the Chow–Robbins method. Another variant of VeStA, MultiVeStA [33], implements the Chow–Robbins procedure for quantitative purposes. APMC (v3.0) [19] implements an SSP test based on the Chernoff bound, cf. the end of Sect. 3.5.

6 Conclusions

We have presented a common framework that allows the hypothesis testing methods—both ‘pure’ hypothesis tests and those based on confidence intervals—proposed earlier in the statistical model checking literature to be compared in a mathematically solid, yet intuitive manner. Previously, these methods were often implemented in tools completely in parallel to one another, with little information given about the subtle differences between the methods and their parameters. Our contribution aids the general understanding of these methods, reducing the likelihood of incorrect interpretation of their outcomes.

In order for the methods to be meaningfully compared to each other, they have to be parametrised. Tools typically ask the user to specify values for parameters that are specific to a method (such as the number of samples), without a clear indication of the consequences for the outcomes. We have expressed all method-specific parameters in terms of quantities that are meaningful to the user, such as the confidence level, the risk of inconclusive termination, and indifference levels.

Having parametrised the methods consistently, we compared them graphically and numerically, highlighting each method’s properties, and demonstrating quantitative performance differences.

Besides all methods known to us and implemented in tools, our comparison has also included two hypothesis testing methods that had not been discussed in the SMC context before. These two methods (called Azuma and Darling–Robbins in this paper) are sequential methods. They behave fundamentally differently from the other methods in cases where the model probability being studied is very close to the threshold under consideration: these methods will neither terminate inconclusively nor see their confidence level drop.

There is no single best method to be recommended, since this depends on the requirements of the user. The present paper gives an overview both of the methods’ characteristics and their performance, summarised in Table 5, and thus can help tool users and authors in making a well-informed choice.