# A Bayesian approach for comparing cross-validated algorithms on multiple data sets

## Abstract

We present a Bayesian approach for making statistical inference about the accuracy (or any other score) of two competing algorithms which have been assessed via cross-validation on multiple data sets. The approach consists of two pieces. The first is a novel *correlated* Bayesian \(t\) test for the analysis of the cross-validation results on a single data set which accounts for the correlation due to the overlapping training sets. The second piece merges the posterior probabilities computed by the Bayesian correlated \(t\) test on the different data sets to make inference on multiple data sets. It does so by adopting a Poisson-binomial model. The inferences on multiple data sets account for the different uncertainty of the cross-validation results on the different data sets. It is the first test able to achieve this goal. It is generally more powerful than the signed-rank test if ten runs of cross-validation are performed, as is generally recommended anyway.

## Keywords

Bayesian hypothesis tests, Signed-rank test, Cross-validation, Poisson-binomial, Hypothesis test, Evaluation of classifiers

## 1 Introduction

A typical problem in machine learning is to compare the accuracy of two competing classifiers on a data set \(D\). Usually one measures the accuracy of both classifiers via \(k\)-fold cross-validation. After having performed cross-validation, one has to decide if the accuracy of the two classifiers on data set \(D\) is significantly different. The decision is made using a statistical hypothesis test which analyzes the measures of accuracy yielded by cross-validation on the different folds. Using a \(t\) test is however a naive choice. The \(t\) test assumes the measures of accuracy taken on the different folds to be independent. Such measures are instead correlated because of the overlap of the training sets built during cross-validation. As a result the \(t\) test is *not* calibrated, namely its rate of Type I errors is much larger than the nominal size^{1} \(\alpha \) of the test. Thus the \(t\) test is not suitable for analyzing the cross-validation results (Dietterich 1998; Nadeau and Bengio 2003).

A suitable approach is instead the correlated^{2} \(t\) test (Nadeau and Bengio 2003), which adjusts the \(t\) test to account for correlation. The statistic of the correlated \(t\) test is composed of two pieces of information: the mean difference of accuracy between the two classifiers (computed by averaging over the different folds) and the uncertainty of this estimate, known as the *standard error*. Unlike the standard \(t\) test, the correlated \(t\) test uses a standard error which accounts for correlation. The correlated \(t\) test is the recommended approach for the analysis of cross-validation results on a single data set (Nadeau and Bengio 2003; Bouckaert 2003).

Assume now that the two classifiers have been assessed via cross-validation on a collection of data sets \(\varvec{D}=\{D_1,D_2,\ldots ,D_q\}\). One has to decide if the difference of accuracy between the two classifiers on the multiple data sets of \(\varvec{D}\) is significant. The recommended approach is the signed-rank test (Demšar 2006). It is a non-parametric test. As such it is derived under mild assumptions and is robust to outliers. A Bayesian counterpart of the signed-rank test (Benavoli et al. 2014) has also been recently proposed. However, the signed-rank test considers only the mean difference of accuracy measured on each data set, ignoring the associated uncertainty.

Dietterich (1998) pointed out the need for a test able to compare two classifiers on multiple data sets accounting for the uncertainty of the results on each data set. Tests dealing with this issue have been devised only recently. Otero et al. (2014) propose an interval-valued approach that considers the uncertainty of the cross-validation results on each data set. When working with multiple data sets, the interval uncertainty is propagated. In some cases the interval becomes wide, preventing a conclusion from being reached.

The Poisson-binomial test (Lacoste et al. 2012) performs inference on multiple data sets accounting for the uncertainty of the result on each data set. First it computes on each data set the posterior probability of the difference of accuracy being significant; then it merges such probabilities through a Poisson-binomial distribution to make inference on \(\varvec{D}\). Its limit is that the posterior probabilities computed on the individual data sets assume that the two classifiers have been compared on a *single* test set. It does not manage the multiple correlated test sets produced by cross-validation. This limits its applicability, since classifiers are typically assessed by cross-validation.

To design a test able to perform inference on multiple data sets accounting for the uncertainty of the estimates yielded by cross-validation is a challenging task.

In this paper we solve this problem. Our solution is based on two main steps. First we develop a Bayesian counterpart of the correlated \(t\) test (its posterior probabilities are later exploited to build a Poisson-binomial distribution). We design a generative model for the correlated results of cross-validation and we analytically derive the posterior distribution of the mean difference of accuracy between the two classifiers. Moreover, we show that for a particular choice of the prior over the parameters, the posterior distribution coincides with the sampling distribution of the correlated \(t\) test by Nadeau and Bengio (2003). Under the matching prior the inferences of the Bayesian correlated \(t\) test and of the frequentist correlated \(t\) test are numerically equivalent. The meaning of the inferences is however different. The inference of the frequentist test is a \(p\) value; the inference of the Bayesian test is a posterior probability. The posterior probabilities computed on the individual data sets can be combined to make further Bayesian inference on multiple data sets.

After having computed the posterior probabilities on each individual data set through the correlated Bayesian \(t\) test, we merge them to make inference on \(\varvec{D}\), borrowing the intuition of the Poisson-binomial test (Lacoste et al. 2012). This is the second piece of the solution. We model each data set as a Bernoulli trial, whose possible outcomes are the win of the first or the second classifier. The probability of success of the Bernoulli trial corresponds to the posterior probability computed by the Bayesian correlated \(t\) test on that data set. The number of data sets on which the first classifier is more accurate than the second is a random variable which follows a Poisson-binomial distribution. We use this distribution to make inference about the difference of accuracy of the two classifiers on \(\varvec{D}\). The resulting approach couples the Bayesian correlated \(t\) test and the Poisson-binomial approach; we call it the *Poisson test*.

It is worth discussing an important difference between the signed-rank and the Poisson test. The signed-rank test assumes the results on the individual data sets to be i.i.d. The Poisson test assumes them to be independent but *not* identically distributed, which can be justified as follows. The different data sets \(D_1,\ldots ,D_q\) have different size and complexity. The uncertainty of the cross-validation result is thus different on each data set, which violates the assumption that the results on different data sets are identically distributed.

We compare the Poisson and the signed-rank test through extensive simulations, performing either one run or ten runs of cross-validation. When we perform one run of cross-validation, the estimates are affected by considerable uncertainty. In this case the Poisson test behaves cautiously and it is less powerful than the signed-rank test. When we perform ten runs of cross-validation, the uncertainty of the cross-validation estimate decreases. In this case the Poisson test is generally *more* powerful than the signed-rank test. Performing ten runs rather than a single run of cross-validation is anyway recommended to obtain robust cross-validation estimates (Bouckaert 2003). The signed-rank test does not account for the uncertainty of the estimates and thus its power is roughly the same whether one or ten runs of cross-validation are performed.

Under the null hypothesis, the Type I errors of both tests are correctly calibrated in all the investigated settings.

The paper is organized as follows: Sect. 2 presents the methods for inference on a single data set; Sect. 3 presents the methods for inference on multiple data sets; Sect. 4 presents the experimental results.

## 2 Inference from cross-validation results on a single data set

### 2.1 Problem statement and frequentist tests

We want to statistically compare the accuracy of two classifiers which have been assessed via \(m\) runs of \(k\)-fold cross-validation. We provide both classifiers with the same training and test sets and we compute the difference of accuracy between the two classifiers on each test set. This yields the *differences of accuracy* \(\varvec{x}=\{x_1,x_2,\ldots ,x_n\}\), where \(n=mk\). We denote the sample mean and the sample variance of the differences as \(\overline{x}\) and \(\hat{\sigma }^2\).

*random selection* of the instances which compose the different training and test sets used in cross-validation. Under random selection the different test sets overlap. Standard cross-validation instead yields non-overlapping test sets, and this is the setup we consider in this paper. The correlation heuristic of Nadeau and Bengio (2003) is nevertheless effective also with standard cross-validation (Bouckaert 2003).

The denominator of the statistic is the *standard error*, namely the standard deviation of the estimate \(\overline{x}\). The standard error increases with \(\hat{\sigma }^2\), which typically increases on smaller data sets. On the other hand the standard error decreases with \(n=mk\). Previous studies (Kohavi 1995) recommend setting the number of folds to \(k=10\) to obtain a reliable estimate from cross-validation. This has become a standard choice. Having set \(k=10\), one can further decrease the standard error of the test by increasing the number of runs \(m\). Indeed Bouckaert (2003) and Witten et al. (2011, Sec. 5.3) recommend performing \(m=10\) runs of ten-fold cross-validation.
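The behaviour of the standard error described above can be illustrated with a minimal Python sketch. It assumes the form of the statistic given by Nadeau and Bengio (2003), in which the usual \(1/n\) term is inflated by \(\rho /(1-\rho )=n_{te}/n_{tr}\); the function name and the numerical differences of accuracy are hypothetical and serve only as an illustration.

```python
import math
from statistics import mean, variance


def correlated_t_statistic(diffs, rho):
    """Return (mean, standard_error, t) for the correlated t test.

    diffs: differences of accuracy x_1..x_n, one per test fold (n = m * k).
    rho:   correlation heuristic n_te / n_tot; rho = 0 gives the standard t test.
    """
    n = len(diffs)
    x_bar = mean(diffs)
    sigma2_hat = variance(diffs)  # sample variance, (n - 1) denominator
    # Nadeau-Bengio correction: 1/n is inflated by rho / (1 - rho) = n_te / n_tr.
    std_error = math.sqrt(sigma2_hat * (1.0 / n + rho / (1.0 - rho)))
    return x_bar, std_error, x_bar / std_error


# Hypothetical differences of accuracy from one run of 10-fold cross-validation.
diffs = [0.02, 0.04, 0.01, 0.03, 0.05, 0.00, 0.02, 0.03, 0.01, 0.04]
x_bar, se_corr, t_corr = correlated_t_statistic(diffs, rho=1 / 10)
_, se_naive, t_naive = correlated_t_statistic(diffs, rho=0.0)
```

With \(\rho >0\) the standard error is larger than that of the standard \(t\) test, so the statistic is smaller in absolute value and the test is more conservative.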

The correlated \(t\) test has been originally designed to analyze the results of a single run of cross-validation. Indeed its correlation heuristic models the correlation due to overlapping training sets. When multiple runs of cross-validation are performed, there is an additional correlation due to overlapping test sets. We are unaware of approaches able to represent also this second type of correlation, which is usually ignored.

### 2.2 Bayesian \(t\) test for uncorrelated observations

Posterior parameters for the uncorrelated case

Parameter | Analytical expression | Under matching prior |
---|---|---|
\(\mu _n\) | \(\frac{\mu _0/k_0 +n\overline{x}}{\frac{1}{k_0}+n}\) | \(\overline{x}\) |
\(k_n\) | \(\frac{1}{\frac{1}{k_0}+n}\) | \(\frac{1}{n}\) |
\(a_n\) | \(a+\frac{n}{2}\) | \(\frac{n-1}{2}\) |
\(b_n\) | \(b+\frac{1}{2}\sum _{i=1}^{n}(x_i-\overline{x})^2+\frac{\frac{1}{k_0} n (\overline{x}-\mu _0)^2}{2(\frac{1}{k_0}+n)}\) | \(\frac{1}{2}\sum _{i=1}^{n}(x_i-\overline{x})^2\) |
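The update formulas in the table can be checked numerically with a short Python sketch. The function name and the data are hypothetical. The matching prior is assumed here to be \(a=-1/2\), \(b=0\) and \(1/k_0\rightarrow 0\), which is what the right-hand column of the table implies; the limit is approximated by a very large \(k_0\).

```python
def normal_gamma_posterior(x, mu0, k0, a, b):
    """Posterior parameters (mu_n, k_n, a_n, b_n) for uncorrelated observations x."""
    n = len(x)
    x_bar = sum(x) / n
    ss = sum((xi - x_bar) ** 2 for xi in x)   # sum of squared deviations
    inv_k0 = 1.0 / k0                         # the term 1/k_0 of the table
    mu_n = (mu0 * inv_k0 + n * x_bar) / (inv_k0 + n)
    k_n = 1.0 / (inv_k0 + n)
    a_n = a + n / 2.0
    b_n = b + 0.5 * ss + inv_k0 * n * (x_bar - mu0) ** 2 / (2.0 * (inv_k0 + n))
    return mu_n, k_n, a_n, b_n


# Hypothetical differences of accuracy.
x = [0.02, 0.04, 0.01, 0.03, 0.05]
# Assumed matching prior: k0 -> infinity (approximated), a = -1/2, b = 0.
mu_n, k_n, a_n, b_n = normal_gamma_posterior(x, mu0=0.0, k0=1e12, a=-0.5, b=0.0)
```

Under this prior the posterior parameters reduce to the sample mean, \(1/n\), \((n-1)/2\) and half the sum of squared deviations, as listed in the third column.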

### 2.3 A novel Bayesian \(t\) test for correlated observations

*intraclass covariance matrix* (Press 2012). We define \(\varvec{\varSigma }=\sigma ^2\varvec{M}\), where \(\varvec{M}\) is the (\(n\times n\)) correlation matrix. As an example, with \(n=3\) we have:

\[\varvec{M}=\begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{bmatrix}\]

*admissibility region* of the parameters.

### **Theorem 1**

The maximum likelihood estimator of (\(\mu \),\(\sigma ^2\),\(\rho \)) from the model (6) is not asymptotically consistent: it does not converge to the true value of the parameters as \(n\rightarrow \infty \).

The proof is given in “Appendix”. By computing the derivatives of the likelihood w.r.t. the parameters, it shows that the maximum likelihood estimates of \(\mu \) and \(\sigma ^2\) are, respectively, \(\hat{\mu }=\frac{1}{n} \sum _{i=1}^n x_i\) and \(\hat{\sigma }^2=tr(\varvec{M}^{-1} \mathbf {Z})\), where \(\mathbf {Z}=(\mathbf {x}-\mathbf {H}\hat{\mu })(\mathbf {x}-\mathbf {H}\hat{\mu })^{T}\). Thus \(\hat{\sigma }^2\) depends on \(\rho \) through \(\varvec{M}\). By plugging these estimates into the likelihood and computing the derivative w.r.t. \(\rho \), we show that the derivative is never zero in the admissibility region. The derivative decreases with \(\rho \) and does not depend on the data. Hence, the maximum likelihood estimate of \(\rho \) is \(\hat{\rho }=0\) regardless of the observations. When the number of observations \(n\) increases, the likelihood gets more concentrated around the maximum likelihood estimate. Thus the maximum likelihood estimate is not asymptotically consistent whenever \(\rho \ne 0\). This will also be true for the Bayesian estimate, since the likelihood dominates the conjugate prior for large \(n\). This means that we cannot consistently estimate all three parameters (\(\mu \), \(\sigma ^2\), \(\rho \)) from data.

### 2.4 Introducing the correlation heuristic

To enable inferences from correlated samples we renounce estimating \(\rho \) from data. We adopt instead the correlation heuristic of Nadeau and Bengio (2003), setting \(\rho =\frac{n_{te}}{n_{tot}}\), where \(n_{te}\) and \(n_{tot}\) are the sizes of the test set and of the entire data set. Having fixed the value of \(\rho \), we can derive the posterior marginal distribution of \(\mu \).

### **Theorem 2**

Posterior parameters for the correlated case

Parameter | Analytical expression | Under matching prior |
---|---|---|
\(\tilde{\mu }_n\) | \(\displaystyle \frac{\mathbf {H}^T\varvec{M}^{-1}\mathbf {x}+\frac{\mu _0}{k_0}}{\mathbf {H}^T\varvec{M}^{-1}\mathbf {H}+\frac{1}{k_0}}\) | \(\displaystyle \frac{\sum _{i=1}^n x_i}{n}\) |
\(\tilde{k}_n\) | \(\frac{1}{\mathbf {H}^T\varvec{M}^{-1}\mathbf {H}+\frac{1}{k_0}}\) | \(\frac{1}{\mathbf {H}^T\varvec{M}^{-1}\mathbf {H}}\) |
\(\tilde{a}_n\) | \(\displaystyle a+\frac{n}{2}\) | \(\displaystyle \frac{n-1}{2}\) |
\(\tilde{b}_n\) | \(\begin{array}{l} \frac{1}{2}\Big ((\mathbf {x}-\mathbf {H}\hat{\mu })^{T}\varvec{M}^{-1}(\mathbf {x}-\mathbf {H}\hat{\mu })+2b-\frac{\mu _0^2}{k_0}\\ -\hat{\mu }^2\mathbf {H}^T\varvec{M}^{-1}\mathbf {H}+\tilde{\mu }^2\left( \mathbf {H}^T\varvec{M}^{-1}\mathbf {H}+\frac{1}{k_0}\right) \Big ) \end{array}\) | \( \frac{1}{2}(\mathbf {x}-\mathbf {H}\hat{\mu })^{T}\varvec{M}^{-1}(\mathbf {x}-\mathbf {H}\hat{\mu })\) |
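The matching-prior column of the table can be evaluated with a short Python sketch. Several ingredients are assumptions made for illustration: \(\mathbf {H}\) is taken to be the \(n\)-vector of ones, \(\varvec{M}\) is taken to have ones on the diagonal and \(\rho \) elsewhere (so that its inverse has a closed form), and the squared scale of the Student marginal is taken to be \(\tilde{b}_n\tilde{k}_n/\tilde{a}_n\), as for a standard Normal-Gamma posterior. All function names and data are hypothetical.

```python
def intraclass_inverse(n, rho):
    """Inverse of M = (1 - rho) * I + rho * 1 1^T (closed form, Sherman-Morrison)."""
    off = -rho / ((1.0 - rho) * (1.0 + (n - 1) * rho))
    diag = 1.0 / (1.0 - rho) + off
    return [[diag if i == j else off for j in range(n)] for i in range(n)]


def quad_form(u, A, v):
    """Compute u^T A v for plain Python lists."""
    return sum(u[i] * A[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


def correlated_posterior_matching_prior(x, rho):
    """Right-hand column of the table: (mu_n, k_n, a_n, b_n) under the matching prior."""
    n = len(x)
    H = [1.0] * n                      # assumption: H is the n-vector of ones
    M_inv = intraclass_inverse(n, rho)
    mu_n = sum(x) / n
    k_n = 1.0 / quad_form(H, M_inv, H)
    a_n = (n - 1) / 2.0
    resid = [xi - mu_n for xi in x]    # x - H * mu_hat
    b_n = 0.5 * quad_form(resid, M_inv, resid)
    return mu_n, k_n, a_n, b_n


# Hypothetical differences of accuracy and correlation.
x = [0.02, 0.04, 0.01, 0.03, 0.05]
rho = 0.1
mu_n, k_n, a_n, b_n = correlated_posterior_matching_prior(x, rho)
# Assumed squared scale of the Student marginal of mu: b_n * k_n / a_n.
scale2 = b_n * k_n / a_n
```

Under these assumptions `scale2` equals \(\hat{\sigma }^2\left( \frac{1}{n}+\frac{\rho }{1-\rho }\right) \), namely the squared standard error used by the correlated \(t\) test of Nadeau and Bengio (2003).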

### **Corollary 1**

The proofs of both the theorem and the corollary are given in “Appendix”. Under the matching prior the posterior Student distribution (9) coincides with the sampling distribution of the statistic of the correlated \(t\) test by Nadeau and Bengio (2003). This implies that given the same test size \(\alpha \), the Bayesian correlated \(t\) test and the frequentist correlated \(t\) test take the same decisions. In other words, the posterior probability \(P(\mu >0|\varvec{x},\mu _0,k_0,a,b,\rho )\) equals \(1-p\) where \(p\) is the \(p\) value of the correlated \(t\) test.
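The relation between the posterior probability and the \(p\) value can be sketched in Python as follows. This is an illustration under the matching prior only: the posterior of \(\mu \) is taken to be a Student distribution with \(n-1\) degrees of freedom, centred in \(\overline{x}\) and scaled by the standard error of Nadeau and Bengio (2003). The Student CDF is obtained by numerical integration to keep the example free of external libraries; the function names and the data are hypothetical.

```python
import math
from statistics import mean, variance


def student_t_cdf(t, df, steps=2000):
    """Student t CDF by Simpson integration of the density on [0, |t|]."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = abs(t) / steps                  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    upper = 0.5 + total * h / 3.0       # CDF at |t|, using symmetry around 0
    return upper if t >= 0 else 1.0 - upper


def posterior_prob_positive(diffs, rho):
    """P(mu > 0 | x) under the matching prior: Student CDF of the correlated t statistic."""
    n = len(diffs)
    std_error = math.sqrt(variance(diffs) * (1.0 / n + rho / (1.0 - rho)))
    return student_t_cdf(mean(diffs) / std_error, df=n - 1)


# Hypothetical differences of accuracy from one run of 10-fold cross-validation.
diffs = [0.02, 0.04, 0.01, 0.03, 0.05, 0.00, 0.02, 0.03, 0.01, 0.04]
p_win = posterior_prob_positive(diffs, rho=0.1)
p_value = 1.0 - p_win   # one-sided p value of the frequentist correlated t test
```

The same number is thus read either as a posterior probability (`p_win`) or, through its complement, as a frequentist \(p\) value.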

## 3 Inference on multiple data sets

Consider now the problem of comparing two classifiers on \(q\) different data sets, after having assessed both classifiers via cross-validation on each data set. The mean *differences* of accuracy on each data set are stored in the vector \(\varvec{\overline{x}}=\{\overline{x}_1,\overline{x}_2,\ldots ,\overline{x}_q\}\). The recommended test to compare two classifiers on multiple data sets is the signed-rank test (Demšar 2006).

The signed-rank test assumes the \(\overline{x}_i\)’s to be i.i.d. and generated from a symmetric distribution. The null hypothesis is that the median of the distribution is \(M\). When the test accepts the alternative hypothesis it claims that the median of the distribution is significantly different from \(M\).

*not* average the results across data sets. This is a sensible approach since the average of results referring to different domains is in general meaningless. The test is moreover robust to outliers.

A limit of the signed-rank test is that it does not consider the standard error of the \(\overline{x}_i\)’s. It assumes the samples to be i.i.d. and thus all the \(\overline{x}_i\)’s to have equal uncertainty. This is a questionable assumption. The data sets typically have different size and complexity. Moreover one could have performed a different number of cross-validation runs on different data sets. For these reasons the \(\overline{x}_i\)’s typically have different uncertainties; thus they are *not* identically distributed.

### 3.1 Poisson-binomial inference on multiple data sets

Our approach to making inference on multiple data sets is inspired by the Poisson-binomial test (Lacoste et al. 2012). As a preliminary step we perform cross-validation on each data set and we analyze the results through the Bayesian correlated \(t\) test. We denote by \(p_i\) the posterior probability that the second classifier is more accurate than the first on the \(i\)th data set. This is computed according to Eq. (9): \(p_i=p(\mu _i>0|\mathbf {x}_i,\mu _0,k_0,a,b,\rho )\). We consider each data set as an independent Bernoulli trial, whose possible outcomes are the win of the first or of the second classifier. The probability of success (win of the second classifier) of the \(i\)th Bernoulli trial is \(p_i\).

The number of data sets in which the second classifier is more accurate than the first classifier is a random variable \(X\) which follows a Poisson-binomial distribution (Lacoste et al. 2012). The Poisson-binomial distribution is a generalization of the binomial distribution in which the Bernoulli trials are allowed to have different probabilities of success. These probabilities are computed by the Bayesian correlated \(t\) test and thus account for both the mean and the standard error of the cross-validation estimates. The probability of success is different on each data set, and thus the test does not assume the results on the different data sets to be identically distributed.

The Poisson-binomial test declares the second classifier significantly more accurate than the first classifier if \(P(X> q/2)>1-\alpha \), namely if the probability of the second classifier being more accurate than the first on more than half the data sets is larger than \(1-\alpha \).
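The decision rule \(P(X>q/2)>1-\alpha \) can be illustrated with a minimal Python sketch. It computes the Poisson-binomial distribution by adding one Bernoulli trial at a time, a plain convolution chosen here for brevity and not necessarily the computation used in the authors' software. The function names and the posterior probabilities are hypothetical.

```python
def poisson_binomial_pmf(probs):
    """P(X = j), j = 0..q, for a sum of independent Bernoulli(p_i) trials."""
    pmf = [1.0]                          # distribution of the sum of zero trials
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for j, mass in enumerate(pmf):
            nxt[j] += mass * (1.0 - p)   # trial is a loss: count stays at j
            nxt[j + 1] += mass * p       # trial is a win: count moves to j + 1
        pmf = nxt
    return pmf


def poisson_test(probs, alpha=0.05):
    """Return (P(X > q/2), decision) for the rule P(X > q/2) > 1 - alpha."""
    q = len(probs)
    pmf = poisson_binomial_pmf(probs)
    prob_majority = sum(mass for j, mass in enumerate(pmf) if j > q / 2)
    return prob_majority, prob_majority > 1.0 - alpha


# Hypothetical posterior probabilities p_i on q = 5 data sets.
p_i = [0.99, 0.97, 0.95, 0.60, 0.98]
prob_majority, significant = poisson_test(p_i)
```

When all the \(p_i\) are equal the distribution reduces to the binomial, which provides a simple sanity check of the recursion.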

### 3.2 Example

Example of comparison of two classifiers in multiple datasets

Case | Datasets | \(\mu _i\) | \(\sigma _i\) |
---|---|---|---|
Case 1 | \(D_1,\ldots ,D_5\) | 0.1 | 0.05 |
Case 1 | \(D_6,\ldots ,D_{10}\) | -0.1 | 0.05 |
Case 2 | \(D_1,\ldots ,D_5\) | 0.1 | 0.05 |
Case 2 | \(D_6,\ldots ,D_{10}\) | -0.1 | 0.15 |

In case 1, classifier A is more accurate than classifier B on five data sets. Classifier B is more accurate than classifier A on the remaining five data sets. Parameters \(\mu _i\) and \(\sigma _i\) represent the mean and the standard deviation of the actual difference of accuracy between the two classifiers on each data set. The absolute value of \(\mu _i\) is equal on all data sets and \(\sigma _i\) is equal on all data sets.

## 4 Experiments

The calibration and the power of the correlated \(t\) test have already been extensively studied by Nadeau and Bengio (2003) and Bouckaert (2003), and we refrain from doing it here. The same results apply to the Bayesian correlated \(t\) test, since the frequentist and the Bayesian correlated \(t\) test take the same decisions. The main result of such studies is that the rate of Type I errors of the correlated \(t\) test is considerably closer to the nominal test size \(\alpha \) than the rate of Type I errors of the standard \(t\) test. In the following we thus present results dealing with the inference on multiple data sets.

### 4.1 Two classifiers with known difference of accuracy

We generate the data sets sampling the instances from the Bayesian network \(C\rightarrow F\), where \(C\) is the binary class with states \(\{c_0,c_1\}\) and \(F\) is a binary feature with states \(\{f_0,f_1\}\). The parameters are: \(P(c_0)=0.5; P(f_0|c_0)=\theta ; P(f_0|c_1)=1-\theta \) with \(\theta >0.5\). We refer to this model with exactly these parameters as BN.

Notice that if the BN model is used both to generate the instances and to issue the predictions, its expected accuracy is^{3} \(\theta \).

Once a data set is generated, we assess via cross-validation the accuracy of two classifiers. The first classifier is the majority predictor also known as *zeroR*. It predicts the most frequent class observed in the training set. If the two classes are equally frequent in the training set, it randomly draws the prediction. Its expected accuracy is thus 0.5.

The second classifier is \(\hat{BN}\), namely the Bayesian network \(C\rightarrow F\) with parameters learned from the training data. The actual difference of accuracy between the two classifiers is thus approximately \(\delta _{acc}=\theta -0.5\). To simulate the difference of accuracy \(\delta _{acc}\) between the two classifiers we set \(\theta =0.5+\delta _{acc}\) in the parameters of the BN model. We repeat experiments using different values of \(\delta _{acc}\).
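The data-generating model can be sketched in Python as follows. This is only a sanity check of the generator, not the cross-validation procedure of the experiments: it draws instances from the BN model and verifies empirically that predicting with the true model yields an accuracy close to \(\theta \). The function names, the seed and the sample size are arbitrary choices made for this illustration.

```python
import random


def sample_bn(n, theta, rng):
    """Draw n (c, f) pairs from C -> F with P(c0)=0.5, P(f0|c0)=theta, P(f0|c1)=1-theta."""
    data = []
    for _ in range(n):
        c = 0 if rng.random() < 0.5 else 1
        p_f0 = theta if c == 0 else 1.0 - theta
        f = 0 if rng.random() < p_f0 else 1
        data.append((c, f))
    return data


def bn_predict(f):
    """Prediction of the true BN model when theta > 0.5: c0 for f0, c1 for f1."""
    return f


rng = random.Random(0)          # fixed seed, chosen arbitrarily for reproducibility
delta_acc = 0.1
theta = 0.5 + delta_acc
sample = sample_bn(20000, theta, rng)
bn_accuracy = sum(bn_predict(f) == c for c, f in sample) / len(sample)
class_balance = sum(c == 0 for c, _ in sample) / len(sample)
```

Since the two classes are balanced, a majority predictor cannot do better than about 0.5 on such data, while the true model reaches about \(\theta =0.5+\delta _{acc}\).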

We perform the tests in a one-sided fashion: the null hypothesis is that \(\hat{BN}\) is less accurate than or as accurate as zeroR. The alternative hypothesis is that \(\hat{BN}\) is more accurate than zeroR. We set the size of both the signed-rank and the Poisson tests to \(\alpha =0.05\). We measure the power of a test as the rate of rejection of the null hypothesis when \(\delta _{acc}>0\).

We present results obtained with \(m=1\) and \(m=10\) runs of cross-validation.

### 4.2 Fixed difference of accuracy on all data sets

As a first experiment, we set the actual difference of accuracy \(\delta _{acc}\) between the two classifiers as identical on all the \(q\) data sets. We assume the availability of \(q=50\) data sets. This is a common size for a comparison of classifiers. We consider the following different values of \(\delta _{acc}\): \(\{0,0.01,0.02,\ldots ,0.1\}\).

For each value of \(\delta _{acc}\) we repeat 5000 experiments as follows. We allow the various data sets to have different sizes \(\varvec{s}=s_1\),\(s_2\),\(\ldots \),\(s_q\). We draw the sample size of each data set uniformly from \(\mathcal {S}=\{25,50,100,250,500,1000\}\). We generate each data set using the BN model; then we assess via cross-validation the accuracy of both zeroR and \(\hat{BN}\). We then compare the two classifiers via the Poisson and the signed-rank test.

The results are shown in Fig. 2a. Both tests yield a Type I error rate lower than 0.05 when \(\delta _{acc}=0\); thus they are correctly calibrated. The power of the tests can be assessed by looking at the results for strictly positive values of \(\delta _{acc}\). If one run of cross-validation is performed, the Poisson test is generally *less* powerful than the signed-rank test. However if ten runs of cross-validation are performed, the Poisson test is generally *more* powerful than the signed-rank test. The signed-rank test does not account for the uncertainty of the estimates and thus its power is roughly the same regardless of whether one or ten runs of cross-validation have been performed.

### 4.3 Difference of accuracy sampled from the Cauchy distribution

We remove the assumption of \(\delta _{acc}\) being equal for all data sets. Instead for each data set we sample \(\delta _{acc}\) from a Cauchy distribution. We set both the median and the scale parameter of the Cauchy to a value \(\overline{\delta }_{acc}>0\). A different value of \(\overline{\delta }_{acc}\) defines a different experimental setting. We consider the following values of \(\overline{\delta }_{acc}\): \(\{0,0.01,0.02,\ldots ,0.05\}\). We run 5000 experiments for each value of \(\overline{\delta }_{acc}\). We assume the availability of \(q=50\) data sets.

Sampling from the Cauchy one sometimes obtains values of \(\delta _{acc}\) whose absolute value is larger than 0.5. It is not possible to simulate a difference of accuracy that large. Thus sampled values of \(\delta _{acc}\) larger than 0.5 or smaller than \({-}\)0.5 are capped to 0.5 and \({-}\)0.5 respectively.
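The sampling and capping step can be sketched in Python as follows. The draw uses the inverse-CDF formula of the Cauchy distribution, with the median and the scale both set to the same value as described above; the function name, the seed and the value 0.03 are hypothetical choices made for illustration.

```python
import math
import random


def sample_delta_acc(center, rng, cap=0.5):
    """Draw delta_acc from Cauchy(median=center, scale=center), capped to [-cap, cap]."""
    u = rng.random()
    draw = center + center * math.tan(math.pi * (u - 0.5))   # inverse-CDF sampling
    return max(-cap, min(cap, draw))


rng = random.Random(0)   # arbitrary seed
deltas = [sample_delta_acc(0.03, rng) for _ in range(50)]    # one value per data set, q = 50
```

Because of the heavy tails of the Cauchy, some of the sampled values can be negative or reach the caps even when the median is small and positive.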

The results are given in Fig. 2b. Both tests are correctly calibrated for \(\delta _{acc}=0\). This is noteworthy since values sampled from the Cauchy are often aberrant and can easily affect the inference of parametric tests.

Let us analyze the power of the tests for \(\delta _{acc}>0\). If one run of cross-validation is performed, the Poisson test is slightly *less* powerful than the signed-rank test. If ten runs of cross-validation are performed, the Poisson test is *more* powerful than the signed-rank test.

Such findings are confirmed by repeating the simulation with a number of data sets \(q=25\).

### 4.4 Application to real data sets

We consider 54 data sets^{4} from the UCI repository. We consider five different classifiers: naive Bayes, averaged one-dependence estimator (AODE), hidden naive Bayes (HNB), J48 decision tree and J48 grafted (J48-gr). All the algorithms are described in Witten et al. (2011). On each data set we perform ten runs of ten-fold cross-validation using the WEKA^{5} software.

We then compare each pair of classifiers via the signed-rank and the Poisson test.

Comparison of the decisions of the Poisson and the signed-rank test on real data sets. Each cell reports the decision of the Poisson test / the decision of the signed-rank test (1: significant difference detected; 0: no significant difference detected)

Classifier | Naive Bayes | J48 | J48-gr | AODE | HNB |
---|---|---|---|---|---|
*First experiment* | | | | | |
Naive Bayes | – | 1/0 | 1/0 | 1/1 | 1/1 |
J48 | – | – | 1/1 | 1/0 | 0/0 |
J48-gr | – | – | – | 1/0 | 0/0 |
AODE | – | – | – | – | 0/0 |
*Second experiment* | | | | | |
Naive Bayes | – | 1/0 | 1/0 | 1/1 | 1/1 |
J48 | – | – | 1/1 | 1/0 | 0/0 |
J48-gr | – | – | – | 1/0 | 0/0 |
AODE | – | – | – | – | 0/0 |
*Third experiment (all data sets)* | | | | | |
Naive Bayes | – | 1/0 | 1/0 | 1/1 | 1/1 |
J48 | – | – | 1/1 | 1/1 | 0/1 |
J48-gr | – | – | – | 1/0 | 0/1 |
AODE | – | – | – | – | 0/0 |

The Poisson test detects seven significant differences out of the ten comparisons in all three experiments, and the seven detected differences are the same in each experiment. The signed-rank test is less powerful. It detects only three significant differences in the first and in the second experiment. When all data sets are available its power increases and it detects three further differences, reaching six detected differences. Overall the Poisson test is both more powerful and more replicable.

The detected differences are in agreement with what is known in the literature: both AODE and HNB are recognized as significantly more accurate than naive Bayes, and J48-gr is recognized as significantly more accurate than both naive Bayes and J48. The two tests take different decisions when comparing pairs of high-performance classifiers such as HNB, AODE and J48-gr.

### 4.5 Software

At the link www.idsia.ch/~giorgio/poisson/test-package.zip we provide both the Matlab and the R implementation of our test. They can be used by a researcher who wants to compare any two algorithms assessed via cross-validation on multiple data sets. The package also allows reproducing the experiments of this paper.

The procedure can be easily implemented also in other computational environments. The standard \(t\) test is available within every computational package. The frequentist correlated \(t\) test can be implemented by simply changing the statistic of the standard \(t\) test, according to Eq. (1). Under the matching prior, the posterior probability of the null computed by the Bayesian correlated \(t\) test corresponds to the \(p\) value computed by the one-sided frequentist correlated \(t\) test. Once the posterior probabilities are computed on each data set, it remains to compute the Poisson-binomial probability distribution. The Poisson-binomial distribution can be straightforwardly computed via sampling, while exact approaches (Hong 2013) are more difficult to implement.
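The sampling-based computation mentioned above can be sketched in Python as follows. This is a minimal Monte Carlo illustration and not the code of the package: each sample draws one outcome per data set and counts the wins. The function name, the number of samples, the seed and the posterior probabilities are hypothetical.

```python
import random


def prob_majority_by_sampling(probs, n_samples=20000, seed=0):
    """Monte Carlo estimate of P(X > q/2) for independent Bernoulli(p_i) trials."""
    rng = random.Random(seed)
    q = len(probs)
    hits = 0
    for _ in range(n_samples):
        wins = sum(rng.random() < p for p in probs)   # one draw of X
        if wins > q / 2:
            hits += 1
    return hits / n_samples


# Hypothetical posterior probabilities p_i on q = 5 data sets.
estimate = prob_majority_by_sampling([0.9, 0.8, 0.7, 0.6, 0.95])
```

The precision of the estimate improves with the number of samples, so a decision close to the threshold \(1-\alpha \) may call for a larger `n_samples`.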

## 5 Conclusions

To our knowledge, the Poisson test is the first test which compares two classifiers on multiple data sets accounting for the correlation and the uncertainty of the results generated by cross-validation on each individual data set. The test is usually more powerful than the signed-rank test if ten runs of cross-validation are performed, which is anyway common practice. A limit of the approach based on the Poisson-binomial is that its inferences refer to the sample of provided data sets rather than to the population from which the data sets have been drawn. A way to overcome this limit could be the development of a hierarchical test able to make inference on the population of data sets.

## Footnotes

1. Consider performing many experiments in which the data are generated under the null hypothesis. A test executed with size \(\alpha \) is correctly calibrated if its rate of rejection of the null hypothesis is not greater than \(\alpha \).
2. Nadeau and Bengio (2003) refer to this test as the *corrected* \(t\) test. We adopt in this paper the more informative terminology of *correlated* \(t\) test.
3. The proof is as follows. Consider the instances with \(F=f_0\). We have that \(P(c_0|f_0)=\theta >0.5\), so the model always predicts \(c_0\) if \(F=f_0\). This prediction is accurate with probability \(\theta \). Regarding the instances with \(F=f_1\), the most probable class is \(c_1\). This prediction is also accurate with probability \(\theta \). Overall the classifier has probability \(\theta \) of being correct.
4. Available from http://www.cs.waikato.ac.nz/ml/weka/datasets.html.
5. Available from http://www.cs.waikato.ac.nz/ml/weka.

## References

- Benavoli, A., Mangili, F., Corani, G., Zaffalon, M., & Ruggeri, F. (2014). A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014) (pp. 1026–1034).
- Bernardo, J. M., & Smith, A. F. M. (2009). *Bayesian theory* (Vol. 405). Chichester: Wiley.
- Bouckaert, R. R. (2003). Choosing between two learning algorithms based on calibrated tests. In Proceedings of the 20th International Conference on Machine Learning (ICML-03) (pp. 51–58).
- Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. *The Journal of Machine Learning Research*, *7*, 1–30.
- Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. *Neural Computation*, *10*(7), 1895–1923.
- Hong, Y. (2013). On computing the distribution function for the Poisson binomial distribution. *Computational Statistics and Data Analysis*, *59*, 41–51.
- Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial intelligence-Volume 2 (pp. 1137–1143). Morgan Kaufmann Publishers Inc.
- Lacoste, A., Laviolette, F., & Marchand, M. (2012). Bayesian comparison of machine learning algorithms on single and multiple datasets. In Proceeding of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12) (pp. 665–675).
- Murphy, K. P. (2012). *Machine learning: A probabilistic perspective*. Cambridge: MIT press.
- Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. *Machine Learning*, *52*(3), 239–281.
- Otero, J., Sánchez, L., Couso, I., & Palacios, A. (2014). Bootstrap analysis of multiple repetitions of experiments using an interval-valued multiple comparison procedure. *Journal of Computer and System Sciences*, *80*(1), 88–100.
- Press, S. J. (2012). *Applied multivariate analysis: Using Bayesian and frequentist methods of inference*. Mineola: Courier Dover Publications.
- Witten, I. H., Frank, E., & Hall, M. A. (2011). *Data mining: Practical machine learning tools and techniques* (3rd ed.). Burlington, USA: Morgan Kaufmann.