1 Introduction

A typical problem in machine learning is to compare the accuracy of two competing classifiers on a data set \(D\). Usually one measures the accuracy of both classifiers via \(k\)-fold cross-validation. After having performed cross-validation, one has to decide whether the accuracy of the two classifiers on data set \(D\) is significantly different. The decision is made using a statistical hypothesis test which analyzes the measures of accuracy yielded by cross-validation on the different folds. Using a \(t\) test is however a naive choice. The \(t\) test assumes the measures of accuracy taken on the different folds to be independent. Such measures are instead correlated because of the overlap of the training sets built during cross-validation. As a result the \(t\) test is not calibrated, namely its rate of Type I errors is much larger than the nominal size \(\alpha \) of the test. Thus the \(t\) test is not suitable for analyzing the cross-validation results (Dietterich 1998; Nadeau and Bengio 2003).

A suitable approach is instead the correlated \(t\) test (Nadeau and Bengio 2003), which adjusts the \(t\) test to account for correlation. The statistic of the correlated \(t\) test is composed of two pieces of information: the mean difference of accuracy between the two classifiers (computed by averaging over the different folds) and the uncertainty of such an estimate, known as the standard error. Unlike that of the standard \(t\) test, the standard error of the correlated \(t\) test accounts for the correlation. The correlated \(t\) test is the recommended approach for the analysis of cross-validation results on a single data set (Nadeau and Bengio 2003; Bouckaert 2003).

Assume now that the two classifiers have been assessed via cross-validation on a collection of data sets \(\varvec{D}=\{D_1,D_2,\ldots ,D_q\}\). One has to decide whether the difference of accuracy between the two classifiers on the multiple data sets of \(\varvec{D}\) is significant. The recommended approach is the signed-rank test (Demšar 2006). It is a non-parametric test; as such it is derived under mild assumptions and is robust to outliers. A Bayesian counterpart of the signed-rank test (Benavoli et al. 2014) has also been proposed recently. However the signed-rank test considers only the mean difference of accuracy measured on each data set, ignoring the associated uncertainty.

Dietterich (1998) pointed out the need for a test able to compare two classifiers on multiple data sets while accounting for the uncertainty of the results on each data set. Tests dealing with this issue have been devised only recently. Otero et al. (2014) propose an interval-valued approach that considers the uncertainty of the cross-validation results on each data set. When working with multiple data sets, the interval uncertainty is propagated. In some cases the interval becomes wide, preventing any conclusion from being drawn.

The Poisson-binomial test (Lacoste et al. 2012) performs inference on multiple data sets accounting for the uncertainty of the result on each data set. First it computes on each data set the posterior probability of the difference of accuracy being significant; then it merges such probabilities through a Poisson-binomial distribution to make inference on \(\varvec{D}\). Its limitation is that the posterior probabilities computed on the individual data sets assume that the two classifiers have been compared on a single test set. It does not handle the multiple correlated test sets produced by cross-validation. This limits its applicability, since classifiers are typically assessed by cross-validation.

Designing a test that performs inference on multiple data sets while accounting for the uncertainty of the estimates yielded by cross-validation is a challenging task.

In this paper we solve this problem. Our solution is based on two main steps. First we develop a Bayesian counterpart of the correlated \(t\) test (its posterior probabilities are later exploited to build a Poisson-binomial distribution). We design a generative model for the correlated results of cross-validation and we analytically derive the posterior distribution of the mean difference of accuracy between the two classifiers. Moreover, we show that for a particular choice of the prior over the parameters, the posterior distribution coincides with the sampling distribution of the correlated \(t\) test by Nadeau and Bengio (2003). Under the matching prior the inferences of the Bayesian correlated \(t\) test and of the frequentist correlated \(t\) test are numerically equivalent. The meaning of the inferences is however different. The inference of the frequentist test is a \(p\) value; the inference of the Bayesian test is a posterior probability. The posterior probabilities computed on the individual data sets can be combined to make further Bayesian inference on multiple data sets.

After having computed the posterior probabilities on each individual data set through the correlated Bayesian \(t\) test, we merge them to make inference on \(\varvec{D}\), borrowing the intuition of the Poisson-binomial test (Lacoste et al. 2012). This is the second piece of the solution. We model each data set as a Bernoulli trial, whose possible outcomes are the win of the first or the second classifier. The probability of success of the Bernoulli trial corresponds to the posterior probability computed by the Bayesian correlated \(t\) test on that data set. The number of data sets on which the first classifier is more accurate than the second is a random variable which follows a Poisson-binomial distribution. We use this distribution to make inference about the difference of accuracy of the two classifiers on \(\varvec{D}\). The resulting approach couples the Bayesian correlated \(t\) test and the Poisson-binomial approach; we call it the Poisson test.

It is worth discussing an important difference between the signed-rank and the Poisson test. The signed-rank test assumes the results on the individual data sets to be i.i.d. The Poisson test assumes them to be independent but not identically distributed, which can be justified as follows. The data sets \(D_1,\ldots ,D_q\) differ in size and complexity. The uncertainty of the cross-validation result is thus different on each data set, breaking the assumption that the results on different data sets are identically distributed.

We compare the Poisson and the signed-rank test through extensive simulations, performing either one run or ten runs of cross-validation. When we perform one run of cross-validation, the estimates are affected by considerable uncertainty. In this case the Poisson test behaves cautiously and is less powerful than the signed-rank test. When we perform ten runs of cross-validation, the uncertainty of the cross-validation estimate decreases. In this case the Poisson test is generally more powerful than the signed-rank test. Performing ten runs rather than a single run of cross-validation is in any case recommended to obtain robust cross-validation estimates (Bouckaert 2003). The signed-rank test does not account for the uncertainty of the estimates and thus its power is roughly the same whether one or ten runs of cross-validation are performed.

Under the null hypothesis, the Type I errors of both tests are correctly calibrated in all the investigated settings.

The paper is organized as follows: Sect. 2 presents the methods for inference on a single data set; Sect. 3 presents the methods for inference on multiple data sets; Sect. 4 presents the experimental results.

2 Inference from cross-validation results on a single data set

2.1 Problem statement and frequentist tests

We want to statistically compare the accuracy of two classifiers which have been assessed via \(m\) runs of \(k\)-fold cross-validation. We provide both classifiers with the same training and test sets and we compute the difference of accuracy between the two classifiers on each test set. This yields the differences of accuracy \(\varvec{x}=\{x_1,x_2,\ldots ,x_n\}\), where \(n=mk\). We denote the sample mean and the sample variance of the differences as \(\overline{x}\) and \(\hat{\sigma }^2\).

A statistical test has to establish whether the mean difference between the two classifiers is significantly different from zero, analyzing the vector of results \(\varvec{x}\). Such results are correlated because of the overlapping training sets. Nadeau and Bengio (2003) prove that there is no unbiased estimator of such correlation. They assume the correlation to be \(\rho =\frac{n_{te}}{n_{tot}}\), where \(n_{tr}, n_{te}\) and \(n_{tot}\) denote the size of the training set, of the test set and of the whole available data set. Thus \(n_{tot}=n_{tr}+n_{te}\). The statistic of the correlated \(t\) test is:

$$\begin{aligned} t= \frac{\overline{x}}{\sqrt{\hat{\sigma }^2\left( \frac{1}{n}+\frac{\rho }{1-\rho }\right) }}= \frac{\overline{x}}{\sqrt{\hat{\sigma }^2\left( \frac{1}{n}+\frac{n_{te}}{n_{tr}}\right) }}. \end{aligned}$$
(1)

Its sampling distribution is a Student distribution with \(n-1\) degrees of freedom. The correlation heuristic has proven to be effective and the correlated \(t\) test is much closer to correct calibration than the standard \(t\) test (Nadeau and Bengio 2003). The correlation heuristic of Nadeau and Bengio (2003) is derived assuming random selection of the instances which compose the different training and test sets used in cross-validation. Under random selection the different test sets overlap, whereas standard cross-validation yields non-overlapping test sets. The latter is the setup we consider in this paper. The correlation heuristic of Nadeau and Bengio (2003) is nevertheless effective also with standard cross-validation (Bouckaert 2003).

The denominator of the statistic is the standard error, namely the standard deviation of the estimate \(\overline{x}\). The standard error increases with \(\hat{\sigma }^2\), which typically increases on smaller data sets. On the other hand the standard error decreases with \(n=mk\). Previous studies (Kohavi 1995) recommend setting the number of folds to \(k=10\) to obtain a reliable estimate from cross-validation. This has become a standard choice. Having set \(k=10\), one can further decrease the standard error of the test by increasing the number of runs \(m\). Indeed Bouckaert (2003) and (Witten et al. 2011, Sec. 5.3) recommend performing \(m=10\) runs of ten-fold cross-validation.
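As an illustration, the statistic of Eq. (1) can be computed directly from the vector of fold-wise differences. The following is a minimal sketch in Python (NumPy/SciPy); the function name and the toy data are ours:

```python
import numpy as np
from scipy import stats

def correlated_ttest(x, n_te, n_tr):
    """One-sided correlated t test of Nadeau and Bengio (2003).

    x    : differences of accuracy on the n = m*k folds
    n_te : size of each test set
    n_tr : size of each training set
    Returns the statistic of Eq. (1) and the p value for the alternative mean > 0.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    var = x.var(ddof=1)                        # sample variance
    # standard error corrected with rho / (1 - rho) = n_te / n_tr
    se = np.sqrt(var * (1.0 / n + n_te / n_tr))
    t = xbar / se
    p = 1.0 - stats.t.cdf(t, df=n - 1)         # one-sided p value
    return t, p

# toy example: ten fold-wise differences on a data set of 100 instances
diffs = [0.02, 0.05, 0.0, 0.03, 0.04, -0.01, 0.02, 0.06, 0.01, 0.03]
print(correlated_ttest(diffs, n_te=10, n_tr=90))
```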

The correlated \(t\) test was originally designed to analyze the results of a single run of cross-validation. Indeed its correlation heuristic models the correlation due to overlapping training sets. When multiple runs of cross-validation are performed, there is an additional correlation due to overlapping test sets. We are unaware of approaches able to represent also this second type of correlation, which is usually ignored.

2.2 Bayesian \(t\) test for uncorrelated observations

Before introducing the Bayesian \(t\) test for correlated observations, we briefly discuss Bayesian inference in the uncorrelated case. Assume we have a vector of independent and identically distributed observations of a variable \(X\), i.e., \(\varvec{x}=\{x_1,x_2,\ldots ,x_n\}\), and that we aim to test whether the mean of \(X\) is positive. In the Bayesian \(t\) test we assume that the likelihood of the observations is Normal with unknown mean \(\mu \) and unknown precision \(\nu \) (the precision is the inverse of the variance, \(\nu =1/\sigma ^2\)):

$$\begin{aligned} p(\varvec{x}|\mu ,\nu )=\prod \limits _{i=1}^n N(x_i;\mu ,1/\nu ). \end{aligned}$$
(2)

Our aim is to compute the posterior of \(\mu \) (here \(\nu \) is a nuisance parameter). A natural prior for \(\mu ,\nu \) is the Normal-Gamma distribution (Bernardo and Smith 2009, Chap. 5), which is conjugate with the likelihood model:

$$\begin{aligned} p(\mu ,\nu |\mu _0,k_0,a,b)=N\left( \mu ;\mu _0,\frac{k_0}{\nu }\right) G\left( \nu ;a,b\right) =NG(\mu ,\nu ;\mu _0,k_0,a,b). \end{aligned}$$

It is the product of a Normal distribution over \(\mu \) (with precision \(\nu /k_0\) proportional to \(\nu \)) and a Gamma distribution over \(\nu \), and depends on four parameters \(\mu _0,k_0,a,b\). Updating the Normal-Gamma prior with the Normal likelihood, one obtains a posterior Normal-Gamma joint distribution with updated parameters \((\mu _n,k_n,a_n,b_n)\), whose values are reported in the first column of Table 1 (see also Murphy 2012, Chap. 4). Marginalizing out the precision from the Normal-Gamma posterior one obtains the posterior marginal distribution of the mean, which follows a Student distribution:

$$\begin{aligned} p(\mu |\varvec{x},\mu _0,k_0,a,b)=\text {St}\left( \mu ;2a_n,\mu _n, \frac{b_n k_n}{ a_n} \right) \!. \end{aligned}$$

Then, the Bayesian \(t\) test for the positivity of \(\mu \) is:

$$\begin{aligned} P(\mu >0|\varvec{x},\mu _0,k_0,a,b)=\int _0^{\infty }\text {St}\left( \mu ;2a_n,\mu _n, \frac{b_n k_n}{ a_n} \right) d\mu =\mathcal {T}_{2a_n}\left( \frac{\mu _n}{\sqrt{\frac{b_n k_n}{ a_n}}} \right) >1-\alpha , \end{aligned}$$
(3)

where \(\mathcal {T}_{2a_n}(z)\) denotes the cumulative distribution of the standardized Student distribution with \(2a_n\) degrees of freedom computed at \(z\). By choosing \(\alpha =0.05\), we can assess the positivity of \(\mu \) with posterior probability \(0.95\). If the prior parameters are set as \(\{\mu _0=0, k_0 \rightarrow \infty , a=-1/2, b=0\}\), it follows from Eq. (3) that \(P(\mu >0|\varvec{x},\mu _0,k_0,a,b)=1-p\), where \(p\) is the \(p\) value of the one-sided frequentist \(t\) test (see Murphy 2012, Chap. 4 for further details on the correspondence between frequentist and Bayesian \(t\) tests). In fact, for these values, the posterior reduces to \(\text {St}\left( \mu ;n-1,\overline{x}, \hat{\sigma }^2/n \right) \), as shown in the second column of Table 1. Therefore, if we adopt this matching (improper) prior, the Bayesian and the frequentist \(t\) test coincide.
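A sketch of this computation in Python (SciPy), using the textbook Normal-Gamma update in the parameterization of Murphy (2012, Chap. 4), where \(\kappa _0=1/k_0\) scales the prior precision of \(\mu \); it is meant as an illustration rather than a verbatim transcription of Table 1. Under the matching prior (\(\kappa _0=0\), \(a=-1/2\), \(b=0\)) it returns \(1-p\), with \(p\) the \(p\) value of the one-sided frequentist \(t\) test:

```python
import numpy as np
from scipy import stats

def bayesian_ttest(x, mu0=0.0, kappa0=0.0, a=-0.5, b=0.0):
    """Posterior P(mu > 0) for the uncorrelated Bayesian t test.

    Normal likelihood with unknown mean and precision, Normal-Gamma prior.
    kappa0 scales the prior precision of mu; kappa0 = 0, a = -1/2, b = 0 is the
    (improper) matching prior, under which the result equals 1 - p of the
    one-sided frequentist t test.
    """
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    ss = np.sum((x - xbar) ** 2)
    # conjugate Normal-Gamma update (Murphy 2012, Chap. 4)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    a_n = a + n / 2.0
    b_n = b + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    # marginal posterior of mu: Student with 2*a_n dof, location mu_n,
    # squared scale b_n / (a_n * kappa_n)
    scale = np.sqrt(b_n / (a_n * kappa_n))
    return 1.0 - stats.t.cdf(0.0, df=2 * a_n, loc=mu_n, scale=scale)

x = np.array([0.01, 0.03, -0.02, 0.04, 0.02, 0.05, 0.00, 0.03, 0.01, 0.02])
print(bayesian_ttest(x))                                   # matching prior
# frequentist check: 1 - p of the one-sided t test, computed directly
t = x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))
print(stats.t.cdf(t, df=len(x) - 1))                       # same value
```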

Table 1 Posterior parameters for the uncorrelated case

2.3 A novel Bayesian \(t\) test for correlated observations

Assume now that the observations of the variable \(X\), \(\varvec{x}=\{x_1,x_2,\ldots ,x_n\}\), are identically distributed but dependent. In particular, consider the case in which the observations have the same mean \(\mu \), the same precision \(\nu \) and are equally correlated with each other with correlation \(\rho >0\). This is for instance the case in which the \(n\) observations are the \(n\) differences of accuracy between two classifiers yielded by cross-validation. The data generating process can be modelled as follows:

$$\begin{aligned} \mathbf {x}=\mathbf {H} \mu +\mathbf {v} \end{aligned}$$
(4)

where \(\mathbf {H}_{\scriptscriptstyle n\times 1}\) is a vector of ones (\(\mathbf {H}_{\scriptscriptstyle n\times 1} = \mathbf {1}_{\scriptscriptstyle n\times 1}\)) and \(\mathbf {v}\) is a noise vector with zero mean and covariance matrix \(\varvec{\Sigma }_{\scriptscriptstyle n\times n}\) patterned as follows: each diagonal element equals \(\sigma ^2=1/\nu \); each off-diagonal element equals \(\rho \sigma ^2\). This is the so-called intraclass covariance matrix (Press 2012). We define \(\varvec{\varSigma }=\sigma ^2\varvec{M}\), where \(\varvec{M}\) is the \(n \times n\) correlation matrix. As an example, with \(n=3\) we have:

$$\begin{aligned} \varvec{\varSigma }= \left[ \begin{array}{c@{\quad }c@{\quad }c} \sigma ^{2} &{} \rho \sigma ^{2} &{} \rho \sigma ^{2}\\ \rho \sigma ^{2} &{} \sigma ^{2} &{} \rho \sigma ^{2}\\ \rho \sigma ^{2} &{} \rho \sigma ^{2} &{} \sigma ^{2} \end{array}\right] \qquad \qquad \varvec{M}= \left[ \begin{array}{c@{\quad }c@{\quad }c} 1 &{} \rho &{} \rho \\ \rho &{} 1 &{} \rho \\ \rho &{} \rho &{} 1 \end{array}\right] \end{aligned}$$
(5)

For \(\varvec{\varSigma }\) to be invertible and positive definite, we require \(\sigma ^2>0\) and \(0\le \rho <1\). The correlation among the cross-validation results is in any case positive (Nadeau and Bengio 2003). These two conditions define the admissibility region of the parameters.

In the Bayesian \(t\) test for correlated observations, we assume the noise vector \(\mathbf {v}\) to follow a multivariate Normal distribution: \(\mathbf {v}\sim \mathrm {MVN}(0,\varvec{\varSigma })\). The likelihood corresponding to (4) is:

$$\begin{aligned} p\left( \mathbf {x}|\mu ,\varvec{\varSigma }\right) =\dfrac{\exp \left( -\frac{1}{2}(\mathbf {x}-\mathbf {H}\mu )^{T}\varvec{\varSigma }^{-1}(\mathbf {x}-\mathbf {H}\mu )\right) }{(2\pi )^{n/2}\sqrt{|\varvec{\varSigma }|}}. \end{aligned}$$
(6)

Equation (6) reduces to Eq. (2) in the uncorrelated case (\(\rho =0\)). As in the previous section, our aim is to test the positivity of \(\mu \). To this end, we need to estimate the model parameters: \(\mu \), \(\sigma ^2\) and \(\rho \).
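For concreteness, the intraclass covariance matrix of Eq. (5) and the log-likelihood of Eq. (6) can be coded as follows (a minimal Python/NumPy sketch; the helper names and the toy data are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def intraclass_cov(n, sigma2, rho):
    """Covariance of Eq. (5): sigma2 on the diagonal, rho*sigma2 elsewhere."""
    M = (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))   # correlation matrix M
    return sigma2 * M

def log_likelihood(x, mu, sigma2, rho):
    """Log of the likelihood (6): x ~ MVN(mu * 1, sigma2 * M)."""
    n = len(x)
    return multivariate_normal.logpdf(x, mean=mu * np.ones(n),
                                      cov=intraclass_cov(n, sigma2, rho))

x = np.array([0.02, 0.05, 0.0, 0.03, 0.04, -0.01, 0.02, 0.06, 0.01, 0.03])
print(log_likelihood(x, mu=0.02, sigma2=0.001, rho=0.1))
```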

Theorem 1

The maximum likelihood estimator of (\(\mu \),\(\sigma ^2\),\(\rho \)) from the model (6) is not asymptotically consistent: it does not converge to the true value of the parameters as \(n\rightarrow \infty \).

The proof is given in the “Appendix”. By computing the derivatives of the likelihood w.r.t. the parameters, it shows that the maximum likelihood estimates of \(\mu \) and \(\sigma ^2\) are \(\hat{\mu }=\frac{1}{n} \sum _{i=1}^n x_i\) and \(\hat{\sigma }^2=tr(\varvec{M}^{-1} \mathbf {Z})\) respectively, where \(\mathbf {Z}=(\mathbf {x}-\mathbf {H}\hat{\mu })(\mathbf {x}-\mathbf {H}\hat{\mu })^{T}\). Thus \(\hat{\sigma }^2\) depends on \(\rho \) through \(\varvec{M}\). By plugging these estimates into the likelihood and computing the derivative w.r.t. \(\rho \), we show that the derivative is never zero in the admissibility region. The derivative decreases with \(\rho \) and does not depend on the data. Hence, the maximum likelihood estimate of \(\rho \) is \(\hat{\rho }=0\) regardless of the observations. When the number of observations \(n\) increases, the likelihood gets more concentrated around the maximum likelihood estimate. Thus the maximum likelihood estimate is not asymptotically consistent whenever \(\rho \ne 0\). This also holds for the Bayesian estimate, since the likelihood dominates the conjugate prior for large \(n\). This means that we cannot consistently estimate all three parameters (\(\mu \), \(\sigma ^2\), \(\rho \)) from the data.
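The behaviour described in the proof can also be checked numerically: for any fixed \(\rho \), plugging the maximum likelihood estimates of \(\mu \) and \(\sigma ^2\) into the likelihood yields a profile log-likelihood that decreases monotonically with \(\rho \), whatever the data. A sketch of this check (Python/NumPy; the data are simulated with a truly positive correlation):

```python
import numpy as np

rng = np.random.default_rng(0)

# data generated with a truly positive correlation (rho = 0.5)
n, rho_true = 20, 0.5
M_true = (1 - rho_true) * np.eye(n) + rho_true * np.ones((n, n))
x = rng.multivariate_normal(np.full(n, 0.03), 0.001 * M_true)

def profile_loglik(x, rho):
    """Log-likelihood of Eq. (6), up to an additive constant, with the ML
    estimates of mu and sigma^2 for this rho plugged in."""
    n = len(x)
    M = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
    Minv = np.linalg.inv(M)
    r = x - x.mean()                  # the ML estimate of mu is the sample mean
    sigma2_hat = r @ Minv @ r / n     # ML estimate of sigma^2 given rho
    _, logdet = np.linalg.slogdet(M)
    return -0.5 * n * np.log(sigma2_hat) - 0.5 * logdet - 0.5 * n

for rho in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9]:
    print(rho, round(profile_loglik(x, rho), 3))
# the profile log-likelihood decreases with rho, so the ML estimate is rho = 0
```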

2.4 Introducing the correlation heuristic

To enable inferences from correlated samples we give up estimating \(\rho \) from the data. We adopt instead the correlation heuristic of Nadeau and Bengio (2003), setting \(\rho =\frac{n_{te}}{n_{tot}}\), where \(n_{te}\) and \(n_{tot}\) are the size of the test set and of the entire data set. Having fixed the value of \(\rho \), we can derive the posterior marginal distribution of \(\mu \).

Theorem 2

Choose \(p(\mu ,\nu |\mu _0,k_0,a,b)=NG(\mu ,\nu ;\mu _0,k_0,a,b)\) as joint prior over \(\mu ,\nu \). Update it with the likelihood of Eq. (6). The posterior distribution of the parameters is \(p(\mu ,\nu |\mathbf {x},\mu _0,k_0,a,b,\rho )=NG(\mu ,\nu ;\tilde{\mu }_n,\tilde{k}_n,\tilde{a}_n,\tilde{b}_n)\) and the posterior marginal over \(\mu \) is a Student distribution:

$$\begin{aligned} p(\mu |\mathbf {x},\mu _0,k_0,a,b,\rho )=St\left( \mu ;2\tilde{a}_n,\tilde{\mu }_n,\frac{\tilde{b}_n \tilde{k}_n}{\tilde{a}_n}\right) \!. \end{aligned}$$
(7)

The expressions of the posterior parameters are reported in Table 2.

Table 2 Posterior parameters for the correlated case

Corollary 1

Under the matching prior (\(\mu _0=0, k_0 \rightarrow \infty , a=-1/2, b=0\)), the posterior marginal distribution of \(\mu \) simplifies to:

$$\begin{aligned} St\left( \mu ;n-1,\bar{x},\left( \frac{1}{n}+\frac{\rho }{1-\rho }\right) \hat{\sigma }^2\right) \end{aligned}$$
(8)

where \(\bar{x}=\tfrac{\sum _{i=1}^n x_i}{n}\) and \(\hat{\sigma }^2=\tfrac{\sum _{i=1}^n (x_i-\bar{x})^2}{n-1}\) and, therefore,

$$\begin{aligned} P[\mu >0|\varvec{x},\mu _0,k_0,a,b,\rho ]=\mathcal {T}_{n-1}\left( \frac{\bar{x}}{\hat{\sigma }\sqrt{\frac{1}{n}+\frac{\rho }{1-\rho }}}\right) \end{aligned}$$
(9)

The proofs of both the theorem and the corollary are given in the “Appendix”. Under the matching prior the posterior Student distribution (8) coincides with the sampling distribution of the statistic of the correlated \(t\) test by Nadeau and Bengio (2003). This implies that, given the same test size \(\alpha \), the Bayesian correlated \(t\) test and the frequentist correlated \(t\) test take the same decisions. In other words, the posterior probability \(P(\mu >0|\varvec{x},\mu _0,k_0,a,b,\rho )\) equals \(1-p\), where \(p\) is the \(p\) value of the one-sided correlated \(t\) test.
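Under the matching prior, Eq. (9) amounts to a few lines of code. A sketch in Python (SciPy); the function name and the toy data are ours:

```python
import numpy as np
from scipy import stats

def correlated_bayesian_ttest(x, rho):
    """Posterior P(mu > 0) of Eq. (9) under the matching prior.

    x   : differences of accuracy on the n = m*k folds
    rho : correlation heuristic, rho = n_te / n_tot
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    var = x.var(ddof=1)
    se = np.sqrt(var * (1.0 / n + rho / (1.0 - rho)))
    return stats.t.cdf(xbar / se, df=n - 1)

# ten-fold cross-validation: each test set holds 1/10 of the data, so rho = 0.1
diffs = [0.02, 0.05, 0.0, 0.03, 0.04, -0.01, 0.02, 0.06, 0.01, 0.03]
print(correlated_bayesian_ttest(diffs, rho=0.1))
```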

3 Inference on multiple data sets

Consider now the problem of comparing two classifiers on \(q\) different data sets, after having assessed both classifiers via cross-validation on each data set. The mean differences of accuracy on each data set are stored in the vector \(\varvec{\overline{x}}=\{\overline{x}_1,\overline{x}_2,\ldots ,\overline{x}_q\}\). The recommended test to compare two classifiers on multiple data sets is the signed-rank test (Demšar 2006).

The signed-rank test assumes the \(\overline{x}_i\)’s to be i.i.d. and generated from a symmetric distribution. The null hypothesis is that the median of the distribution is \(M\). When the test accepts the alternative hypothesis, it claims that the median of the distribution is significantly different from \(M\).

The test ranks the \(\overline{x}_i\)’s according to their absolute value and then compares the ranks of the positive and negative differences. The test statistic is:

$$\begin{aligned} T^+=\sum \limits _{\{i:~\overline{x}_i\ge 0\}} r_i (|\overline{x}_i|)= \sum \limits _{1\le i \le j \le q} T^+_{ij}, \end{aligned}$$

where \(r_i (|\overline{x}_i|)\) is the rank of \(|\overline{x}_i|\) and

$$\begin{aligned} T^+_{ij}=\left\{ \begin{array}{l@{\quad }l} 1 &{}\quad \textit{if }\,\overline{x}_i + \overline{x}_j \ge 0,\\ 0 &{}\quad \textit{otherwise. } \\ \end{array}\right. \end{aligned}$$

For a large enough number of samples (e.g., \(q> 10\)), the sampling distribution of the statistic under the null hypothesis is approximately normal with mean \(q(q+1)/4\) and variance \(q(q+1)(2q+1)/24\). Being non-parametric, the signed-rank test does not average the results across data sets. This is a sensible approach, since the average of results referring to different domains is in general meaningless. The test is moreover robust to outliers.
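In practice the signed-rank test is applied directly to the vector of mean differences \(\varvec{\overline{x}}\); for instance, with SciPy (the toy values are ours):

```python
import numpy as np
from scipy.stats import wilcoxon

# mean differences of accuracy (second classifier minus first) on q = 12 data sets
xbar = np.array([0.021, 0.012, -0.005, 0.034, 0.015, 0.004,
                 -0.011, 0.022, 0.027, 0.009, 0.018, 0.006])

# one-sided test: the alternative is that the median difference is positive
stat, p = wilcoxon(xbar, alternative='greater')
print(stat, p)
```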

A limit of the signed-rank test is that it does not consider the standard error of the \(\overline{x}_i\)’s. It assumes the samples to be i.i.d. and thus all the \(\overline{x}_i\)’s to have equal uncertainty. This is a questionable assumption. The data sets typically have different size and complexity. Moreover, one could have performed a different number of cross-validation runs on different data sets. For these reasons the \(\overline{x}_i\)’s typically have different uncertainties; thus they are not identically distributed.

3.1 Poisson-binomial inference on multiple data sets

Our approach to inference on multiple data sets is inspired by the Poisson-binomial test (Lacoste et al. 2012). As a preliminary step we perform cross-validation on each data set and we analyze the results through the Bayesian correlated \(t\) test. We denote by \(p_i\) the posterior probability that the second classifier is more accurate than the first on the \(i\)th data set. This is computed according to Eq. (9): \(p_i=p(\mu _i>0|\mathbf {x}_i,\mu _0,k_0,a,b,\rho )\). We consider each data set as an independent Bernoulli trial, whose possible outcomes are the win of the first or of the second classifier. The probability of success (win of the second classifier) of the \(i\)th Bernoulli trial is \(p_i\).

The number of data sets on which the second classifier is more accurate than the first is a random variable \(X\) which follows a Poisson-binomial distribution (Lacoste et al. 2012). The Poisson-binomial distribution is a generalization of the binomial distribution in which the Bernoulli trials are allowed to have different probabilities of success. These probabilities are computed by the Bayesian correlated \(t\) test and thus account for both the mean and the standard error of the cross-validation estimates. The probability of success is different on each data set, and thus the test does not assume the results on the different data sets to be identically distributed.

The cumulative distribution function of \(X\) is:

$$\begin{aligned} P(X \le k)=\sum _{i=0}^{k}\xi (i)=\sum _{i=0}^{k} \left( \sum _{A \in \mathcal {F}_i} \prod _{j \in A} p_j \prod _{j \in A^c} (1-p_j) \right) \end{aligned}$$
(10)

where \(\xi (i)=P(X=i)\), \(\mathcal {F}_i\) is the set of all subsets of \(i\) integers that can be drawn from \(\{1,2,3,\ldots ,q\}\) and \(A^c\) is the complement of \(A\): \(A^c=\{1,2,3,\ldots ,q\}\backslash A\). Hong (2013) discusses several methods to exactly compute the Poisson-binomial distribution. We adopt a sampling approach. We simulate \(q\) biased coins, one for each data set. The bias of the \(i\)th coin is \(p_i\). We simulate the \(q\) coins 100,000 times. We count the proportion of times in which \(X=0, X=1,\ldots , X=q\) out of the 100,000 trials. This yields a numerical approximation of the Poisson-binomial distribution.

The Poisson-binomial test declares the second classifier significantly more accurate than the first if \(P(X> q/2)>1-\alpha \), namely if the probability of the second classifier being more accurate than the first on more than half of the data sets is larger than \(1-\alpha \).
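A sketch of the sampling approximation and of the decision rule (Python/NumPy; the function name and the example probabilities are ours):

```python
import numpy as np

def poisson_test(p, alpha=0.05, n_samples=100_000, seed=0):
    """Poisson-binomial test over q data sets.

    p : posterior probabilities p_i = P(mu_i > 0) returned by the Bayesian
        correlated t test on each data set (win of the second classifier).
    Returns P(X > q/2) and the decision at size alpha.
    """
    p = np.asarray(p, dtype=float)
    q = p.size
    rng = np.random.default_rng(seed)
    # simulate q biased coins n_samples times; X counts the wins of the second classifier
    wins = rng.random((n_samples, q)) < p
    X = wins.sum(axis=1)
    prob = np.mean(X > q / 2)
    return prob, prob > 1 - alpha

# posterior probabilities from the Bayesian correlated t test on q = 10 data sets
p = [0.97, 0.97, 0.94, 0.98, 0.92, 0.88, 0.91, 0.95, 0.60, 0.55]
print(poisson_test(p))
```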

3.2 Example

In order to understand the differences between the Poisson test and the Wilcoxon signed-rank test, consider the artificial example of Table 3.

Table 3 Example of comparison of two classifiers in multiple datasets

In case 1, classifier A is more accurate than classifier B on five data sets. Classifier B is more accurate than classifier A on the remaining five data sets. Parameters \(\mu _i\) and \(\sigma _i\) represent the mean and the standard deviation of the actual difference of accuracy between the two classifiers on each data set. The absolute value of \(\mu _i\) is equal on all data sets, and \(\sigma _i\) is equal on all data sets.

In case 2, the mean differences \(\mu _i\) are the same as in case 1, but the standard deviation in \(D_6,\ldots ,D_{10}\) is three times larger. We have generated observations as follows

$$\begin{aligned} x_{ji}\sim N\left( \mu _i,\sigma _i^2\right) \!, \end{aligned}$$

for the folds \(j=1,\ldots ,10\) (ten-fold cross-validation) and the data sets \(i=1,\ldots ,10\) (here \(\rho =0\), but the results are similar if we consider a correlated model). Figure 1 shows the distribution of \(P(X> q/2)\) (classifier \(A\) is better than \(B\)) for the Poisson test and the distribution of the \(p\) values for the Wilcoxon signed-rank test in the two cases (computed over \(5000\) Monte Carlo runs). It can be observed that the distribution for the Wilcoxon signed-rank test is practically unchanged in the two cases, while the distribution for the Poisson test is very different. The Poisson test is thus able to distinguish the two cases: it takes into account the variance of the mean accuracy in the ten-fold cross-validation of each data set, while the Wilcoxon signed-rank test does not.
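The mechanics of this example can be sketched as follows (Python/SciPy; since Table 3 is not reproduced here, the means and standard deviations below are illustrative assumptions; for brevity we use the uncorrelated Bayesian \(t\) test, i.e. \(\rho =0\), approximate the Poisson-binomial by sampling, and show a single Monte Carlo run per case, whereas the paper repeats this 5000 times to obtain the distributions of Fig. 1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
q, k = 10, 10                                     # data sets and folds
mu = np.array([0.01] * 5 + [-0.01] * 5)           # same means in both cases
sigma_case1 = np.full(q, 0.01)
sigma_case2 = np.array([0.01] * 5 + [0.03] * 5)   # three times larger on D6..D10

def poisson_prob(mu_i, sigma_i):
    """Generate fold-wise differences, apply the (rho = 0) Bayesian t test on
    each data set, and estimate P(X > q/2) by sampling."""
    x = rng.normal(mu_i[:, None], sigma_i[:, None], size=(q, k))
    xbar, s = x.mean(axis=1), x.std(axis=1, ddof=1)
    p = stats.t.cdf(xbar / (s / np.sqrt(k)), df=k - 1)   # P(mu_i > 0) per data set
    wins = rng.random((100_000, q)) < p
    return np.mean(wins.sum(axis=1) > q / 2)

print("case 1:", poisson_prob(mu, sigma_case1))
print("case 2:", poisson_prob(mu, sigma_case2))
```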

Fig. 1 Distribution of \(P(X> q/2)\) for the Poisson test and distribution of the \(p\) values for the Wilcoxon signed-rank test in the two cases. a Wilcoxon case 1. b Wilcoxon case 2. c Poisson case 1. d Poisson case 2

4 Experiments

The calibration and the power of the correlated \(t\) test have already been extensively studied (Nadeau and Bengio 2003; Bouckaert 2003) and we do not repeat those analyses here. The same results apply to the Bayesian correlated \(t\) test, since the frequentist and the Bayesian correlated \(t\) test take the same decisions. The main result of such studies is that the rate of Type I errors of the correlated \(t\) test is considerably closer to the nominal test size \(\alpha \) than the rate of Type I errors of the standard \(t\) test. In the following we thus present results dealing with the inference on multiple data sets.

4.1 Two classifiers with known difference of accuracy

We generate the data sets by sampling instances from the Bayesian network \(C\rightarrow F\), where \(C\) is the binary class with states \(\{c_0,c_1\}\) and \(F\) is a binary feature with states \(\{f_0,f_1\}\). The parameters are: \(P(c_0)=0.5; P(f_0|c_0)=\theta ; P(f_0|c_1)=1-\theta \) with \(\theta >0.5\). We refer to this model with exactly these parameters as BN.

Notice that if the BN model is used both to generate the instances and to issue the predictions, its expected accuracy is \(\theta \).

Once a data set is generated, we assess via cross-validation the accuracy of two classifiers. The first classifier is the majority predictor, also known as zeroR. It predicts the most frequent class observed in the training set. If the two classes are equally frequent in the training set, it randomly draws the prediction. Its expected accuracy is thus 0.5.

The second classifier is \(\hat{BN}\), namely the Bayesian network \(C\rightarrow F\) with parameters learned from the training data. The actual difference of accuracy between the two classifiers is thus approximately \(\delta _{acc}=\theta -0.5\). To simulate the difference of accuracy \(\delta _{acc}\) between the two classifiers we set \(\theta =0.5+\delta _{acc}\) in the parameters of the BN model. We repeat experiments using different values of \(\delta _{acc}\).
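A minimal sketch of this generative setup (Python/NumPy; a single training/test split rather than the full cross-validation protocol, and all function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(n, theta):
    """Sample n instances from the BN model C -> F."""
    c = rng.integers(0, 2, size=n)                # P(c0) = P(c1) = 0.5
    p_f0 = np.where(c == 0, theta, 1 - theta)     # P(f0|c0) = theta, P(f0|c1) = 1 - theta
    f = (rng.random(n) >= p_f0).astype(int)       # 0 stands for f0, 1 for f1
    return c, f

def zero_r(c_train, f_test):
    """Majority predictor: the most frequent training class for every test instance."""
    majority = int(np.bincount(c_train, minlength=2).argmax())
    return np.full(len(f_test), majority)

def bn_hat(c_train, f_train, f_test):
    """BN with parameters estimated from the training data (Laplace smoothing)."""
    pred = np.empty(len(f_test), dtype=int)
    for i, f in enumerate(f_test):
        # unnormalized posterior P(c) * P(f | c) for each class
        post = [(np.sum(c_train == c) + 1) *
                (np.sum((c_train == c) & (f_train == f)) + 1) /
                (np.sum(c_train == c) + 2) for c in (0, 1)]
        pred[i] = int(np.argmax(post))
    return pred

# one training/test split with delta_acc = 0.1, i.e. theta = 0.6
c_tr, f_tr = generate(900, theta=0.6)
c_te, f_te = generate(100, theta=0.6)
print("zeroR accuracy:", np.mean(zero_r(c_tr, f_te) == c_te))
print("BN-hat accuracy:", np.mean(bn_hat(c_tr, f_tr, f_te) == c_te))
```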

We perform the tests in a one-sided fashion: the null hypothesis is that \(\hat{BN}\) is less accurate than, or equally accurate to, zeroR; the alternative hypothesis is that \(\hat{BN}\) is more accurate than zeroR. We set the size of both the signed-rank and the Poisson test to \(\alpha =0.05\). We measure the power of a test as the rate of rejection of the null hypothesis when \(\delta _{acc}>0\).

We present results obtained with \(m=1\) and \(m=10\) runs of cross-validation.

4.2 Fixed difference of accuracy on all data sets

As a first experiment, we set the actual difference of accuracy \(\delta _{acc}\) between the two classifiers to be identical on all the \(q\) data sets. We assume the availability of \(q=50\) data sets. This is a common size for a comparison of classifiers. We consider the following values of \(\delta _{acc}\): \(\{0,0.01,0.02,\ldots ,0.1\}\).

For each value of \(\delta _{acc}\) we repeat 5000 experiments as follows. We allow the various data sets to have different sizes \(\varvec{s}=\{s_1,s_2,\ldots ,s_q\}\). We draw the sample size of each data set uniformly from \(\mathcal {S}=\{25,50,100,250,500,1000\}\). We generate each data set using the BN model; then we assess via cross-validation the accuracy of both zeroR and \(\hat{BN}\). We then compare the two classifiers via the Poisson and the signed-rank test.

The results are shown in Fig. 2a. Both tests yield a Type I error rate lower than 0.05 when \(\delta _{acc}=0\); thus they are correctly calibrated. The power of the tests can be assessed by looking at the results for strictly positive values of \(\delta _{acc}\). If one run of cross-validation is performed, the Poisson test is generally less powerful than the signed-rank test. However, if ten runs of cross-validation are performed, the Poisson test is generally more powerful than the signed-rank test. The signed-rank test does not account for the uncertainty of the estimates and thus its power is roughly the same regardless of whether one or ten runs of cross-validation have been performed.

The same conclusions are confirmed in the case \(q=25\).

Fig. 2 Power and calibration of the tests over multiple data sets. The plots share the same legend. The Poisson test has square marks. The signed-rank test has circle marks. Dashed lines refer to one run of cross-validation, solid lines refer to ten runs of cross-validation. The plots refer to the case \(q=50\). a Difference of accuracy \((\delta _{acc})\). b Mean difference of accuracy \((\bar{\delta }_{acc})\)

4.3 Difference of accuracy sampled from a Cauchy distribution

We remove the assumption of \(\delta _{acc}\) being equal for all data sets. Instead, for each data set we sample \(\delta _{acc}\) from a Cauchy distribution. We set the median and the scale parameter of the Cauchy to a value \(\overline{\delta }_{acc}>0\). A different value of \(\overline{\delta }_{acc}\) defines a different experimental setting. We consider the following values of \(\overline{\delta }_{acc}\): \(\{0,0.01,0.02,\ldots ,0.05\}\). We run 5000 experiments for each value of \(\overline{\delta }_{acc}\). We assume the availability of \(q=50\) data sets.

Sampling from the Cauchy, one sometimes obtains values of \(\delta _{acc}\) whose absolute value is larger than 0.5. It is not possible to simulate differences of accuracy that large. Thus sampled values of \(\delta _{acc}\) larger than 0.5 or smaller than \(-0.5\) are capped to 0.5 and \(-0.5\) respectively.
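For instance, the sampling-and-capping step can be written as follows (Python/NumPy; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
q = 50
delta_bar = 0.02                      # median and scale of the Cauchy

# one delta_acc per data set: location-scale transform of a standard Cauchy,
# capped to the largest simulable difference of accuracy
delta = delta_bar + delta_bar * rng.standard_cauchy(q)
delta = np.clip(delta, -0.5, 0.5)
print(delta[:5])
```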

The results are given in Fig. 2b. Both tests are correctly calibrated for \(\overline{\delta }_{acc}=0\). This is noteworthy since values sampled from the Cauchy are often extreme and can easily affect the inference of parametric tests.

Let us analyze the power of the tests for \(\overline{\delta }_{acc}>0\). If one run of cross-validation is performed, the Poisson test is slightly less powerful than the signed-rank test. If ten runs of cross-validation are performed, the Poisson test is more powerful than the signed-rank test.

Such findings are confirmed by repeating the simulation with a number of data sets \(q=25\).

4.4 Application to real data sets

We consider 54 data sets from the UCI repository. We consider five different classifiers: naive Bayes, averaged one-dependence estimator (AODE), hidden naive Bayes (HNB), the J48 decision tree and J48 grafted (J48-gr). All the algorithms are described in Witten et al. (2011). On each data set we perform ten runs of ten-fold cross-validation using the WEKA software.

We then compare each pair of classifiers via the signed-rank and the Poisson test.

We sort the data sets alphabetically and repeat the analysis three times. The first time we compare the classifiers on data sets 1–27; the second time on data sets 28–54; the third time on all data sets. The results are given in Table 4. The zeros and the ones in Table 4 indicate that the null or the alternative hypothesis, respectively, has been accepted.

Table 4 Comparison of the decision of the Poisson and the signed-rank test on real data sets

The Poisson test detects seven significant differences out of the ten comparisons in all three experiments. It consistently detects the same seven significant differences in each experiment. The signed-rank test is less powerful. It detects only three significant differences in the first and in the second experiment. When all data sets are available its power increases and it detects three further differences, arriving at six detected differences. Overall the Poisson test is both more powerful and more replicable.

The detected differences are in agreement with what is known in the literature: both AODE and HNB are recognized as significantly more accurate than naive Bayes, and J48-gr is recognized as significantly more accurate than both naive Bayes and J48. The two tests take different decisions when comparing pairs of high-performance classifiers such as HNB, AODE and J48-gr.

4.5 Software

At the link www.idsia.ch/~giorgio/poisson/test-package.zip we provide both the Matlab and the R implementation of our test. They can be used by a researcher who wants to compare any two algorithms assessed via cross-validation on multiple data sets. The package also allows reproducing the experiments of this paper.

The procedure can easily be implemented also in other computational environments. The standard \(t\) test is available in every computational package. The frequentist correlated \(t\) test can be implemented by simply changing the statistic of the standard \(t\) test according to Eq. (1). Under the matching prior, the posterior probability of the null hypothesis computed by the Bayesian correlated \(t\) test corresponds to the \(p\) value computed by the one-sided frequentist correlated \(t\) test. Once the posterior probabilities have been computed on each data set, it remains to compute the Poisson-binomial probability distribution. The Poisson-binomial distribution can be straightforwardly computed via sampling, while exact approaches (Hong 2013) are more difficult to implement.

5 Conclusions

To our knowledge, the Poisson test is the first test which compares two classifiers on multiple data sets accounting for the correlation and the uncertainty of the results generated by cross-validation on each individual data set. The test is usually more powerful than the signed-rank test if ten runs of cross-validation are performed, which is in any case common practice. A limit of the approach based on the Poisson-binomial distribution is that its inferences refer to the sample of provided data sets rather than to the population from which the data sets have been drawn. A way to overcome this limit could be the development of a hierarchical test able to make inference on the population of data sets.