Statistical comparison of classifiers through Bayesian hierarchical modelling
Abstract
Usually one compares the accuracy of two competing classifiers using null hypothesis significance tests. Yet such tests suffer from important shortcomings, which can be overcome by switching to Bayesian hypothesis testing. We propose a Bayesian hierarchical model that jointly analyzes the crossvalidation results obtained by two classifiers on multiple data sets. The model estimates more accurately the difference between classifiers on the individual data sets than the traditional approach of averaging, independently on each data set, the crossvalidation results. It does so by jointly analyzing the results obtained on all data sets, and applying shrinkage to the estimates. The model eventually returns the posterior probability of the accuracies of the two classifiers being practically equivalent or significantly different.
Keywords
Posterior Probability, Posterior Distribution, Hierarchical Model, Maximum Likelihood Estimator, Equivalent Classifier
1 Introduction
The statistical comparison of learning algorithms is fundamental in machine learning; it is typically carried out through hypothesis testing. In this paper we assume that one is interested in comparing the accuracy of two learning algorithms for classification (referred to as classifiers in the following). However our discussion readily applies to any other measure of performance.
Assume that two classifiers have been assessed via crossvalidation on a single data set. The recommended approach for comparing them is the correlated t-test (Nadeau and Bengio 2003). If instead one aims at comparing two classifiers on multiple data sets, the recommended test is the signed-rank test (Demšar 2006). Both tests are based on the frequentist framework of null-hypothesis significance tests (nhst), which has severe drawbacks.
First, the nhst computes the probability of getting the observed (or a larger) difference in the data if the null hypothesis were true. It does not compute the probability of interest, which is the probability of one classifier being more accurate than another given the observed results.
Second, the claimed statistical significance does not necessarily imply practical significance, since null hypotheses can be easily rejected by increasing the number of observations (Wasserstein and Lazar 2016). Thus for instance the signed-rank test can reject the null hypothesis when dealing with two classifiers whose accuracies are nearly equal, but which have been compared on a large number of data sets.
Third, when the null hypothesis is not rejected, we cannot assume the null hypothesis to be true (Kruschke 2015, Chap. 11). Thus nhst tests cannot recognize equivalent classifiers.
These issues can be overcome by switching to Bayesian hypothesis testing (Kruschke 2015, Sec. 11), which has recently been applied also in machine learning (Lacoste et al. 2012; Corani and Benavoli 2015; Benavoli et al., under review).
Let us denote by \(\delta _i\) the actual difference of accuracy between the two classifiers on the ith data set. Usually \(\delta _i\) is estimated via crossvalidation. We propose the first model that represents both the distribution \(p(\delta _i)\) across the different data sets and the distribution of the crossvalidation results on the ith data set given \(\delta _i\).
Following Kruschke (2013) we analyze the results by adopting a region of practical equivalence (rope). In particular we consider two classifiers to be practically equivalent if their difference of accuracy belongs to the interval \((-0.01, 0.01)\). This mitigates the risk of claiming significance because of a thin difference of accuracy in simulation, which is likely to be swamped by other sources of uncertainty when the classifier is adopted in practice (Hand et al. 2006). There are however no correct rope limits; thus other researchers might set the rope differently. Based on the rope we compute the posterior probability of the two classifiers being practically equivalent or significantly different. Such probabilities convey meaningful information even when they do not exceed the 95% threshold: this is a more informative outcome than that of an nhst.
Moreover, the hierarchical model estimates the \(\delta _i\)’s more accurately than the traditional approach of computing, independently on each data set, the mean of the crossvalidation differences. It does so by jointly estimating the \(\delta _i\)’s and shrinking them towards each other. We prove theoretically that such shrinkage yields lower estimation error than the traditional approach.
2 Existing approaches
Let us introduce some notation. We have a collection of q data sets; the actual mean difference of accuracy between the two classifiers on the ith data set is \(\delta _i\). We can think of \(\delta _i\) as the average difference of accuracy that we would obtain by repeating many times the following procedure: sampling from the actual distribution as many instances as there are in the actually available data set, training the two classifiers and measuring the difference of accuracy on a large test set.
Usually \(\delta _i\) is estimated via crossvalidation. Assume that we have performed m runs of k-fold crossvalidation on each data set, using the same folds for both classifiers. The differences of accuracy on each fold of crossvalidation are \({{\varvec{x_i}}}=\{x_{i1},x_{i2},\dots ,x_{in}\}\), where \(n=mk\). The mean and the standard deviation of the results on the ith data set are \({\bar{x}}_i\) and \(s_i\). The mean of the crossvalidation results is also the maximum likelihood estimator (MLE) of \(\delta _i\).
The signed-rank test is instead the recommended method (Demšar 2006) to compare two classifiers on a collection of q different data sets. It is usually applied after having performed crossvalidation on each data set. The test analyzes the mean differences measured on each data set (\({\bar{x}}_1,{\bar{x}}_2,\ldots ,{\bar{x}}_q\)) assuming them to be i.i.d. This is a simplistic assumption: the \({\bar{x}}_i\)’s are not i.i.d. since they are characterized by different uncertainty; indeed their standard errors are typically different.
The two tests discussed so far are null-hypothesis significance tests (nhst) and as such they suffer from the drawbacks discussed in Sect. 1.
Let us now consider the Bayesian approaches. Kruschke (2013) presents a Bayesian t-test for i.i.d. observations, which is thus not suitable for analyzing the correlated crossvalidation results. The Bayesian correlated t-test (Corani and Benavoli 2015) is instead suitable. It computes the posterior distribution of \(\delta _i\) on a single data set, assuming the crossvalidation observations to be sampled from a multivariate normal distribution whose components have the same mean \(\delta _i\), the same standard deviation \(\sigma _i\) and are equally cross-correlated with correlation \(\rho =\frac{1}{k}\).
As for the analysis of multiple data sets, Lacoste et al. (2012) model each data set as an independent Bernoulli trial. The two possible outcomes of the Bernoulli trial are the first classifier being more accurate than the second or vice versa. This approach yields the posterior probability of the first classifier being more accurate than the second classifier on more than half of the q data sets. A shortcoming is that its conclusions apply only to the q available data sets without generalizing to the whole population of data sets.
3 The hierarchical model
The \(\delta _i\)’s are assumed to be drawn from a Student distribution with mean \(\delta _0\), scale factor \(\sigma _0\) and degrees of freedom \(\nu \). The Student distribution is more flexible than the Gaussian, thanks to the additional parameter \(\nu \). When \(\nu \) is small, the Student distribution has heavy tails; when \(\nu \) is above 30, the Student distribution is practically a Gaussian. A Student distribution with low degrees of freedom robustly deals with outliers and for this reason is often used for robust Bayesian estimation (Kruschke 2013).
We assume \(\sigma _i\) to be drawn from a uniform distribution over the interval \((0,{\bar{\sigma }})\). This prior (Gelman 2006) yields inferences which are insensitive to the value of \({\bar{\sigma }}\) if \({\bar{\sigma }}\) is large enough. We adopt \({\bar{\sigma }}= 1000 {\bar{s}}\), where \({\bar{s}}\) is the mean standard deviation observed on the different data sets (\({\bar{s}}=\sum _i^q s_i/q\)).
Equation (4) models the fact that the crossvalidation measures \({{\varvec{x_i}}}=\{x_{i1},x_{i2},\dots ,x_{in}\}\) of the ith data set are generated from a multivariate normal whose components have the same mean (\(\delta _i\)), the same standard deviation (\(\sigma _i\)) and are equally cross-correlated with correlation \(\rho \). Thus the covariance matrix \(\mathbf {\Sigma _i}\) is patterned as follows: each diagonal element equals \(\sigma _i^2\); each off-diagonal element equals \(\rho \sigma _i^2\). Such assumptions are borrowed from the Bayesian correlated t-test (Corani and Benavoli 2015).
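To make the covariance pattern concrete, the following Python sketch builds \(\mathbf {\Sigma _i}\) and computes the variance it implies for the mean of the crossvalidation results. The function names and the value \(\sigma _i=0.02\) are illustrative assumptions; \(\rho =1/k\) is the setting recalled in Sect. 2 for the correlated t-test.

```python
def patterned_covariance(n, sigma, rho):
    """n x n matrix with sigma^2 on the diagonal and rho * sigma^2 elsewhere."""
    var = sigma ** 2
    return [[var if r == c else rho * var for c in range(n)] for r in range(n)]


def variance_of_mean(cov):
    """Variance of the arithmetic mean of the components: sum of all entries / n^2."""
    n = len(cov)
    return sum(sum(row) for row in cov) / n ** 2


# Illustrative values (assumptions): m = 10 runs of k = 10 folds, sigma_i = 0.02.
m, k = 10, 10
n = m * k
cov = patterned_covariance(n, sigma=0.02, rho=1.0 / k)
correlated_var = variance_of_mean(cov)
independent_var = variance_of_mean(patterned_covariance(n, sigma=0.02, rho=0.0))
```

Under this pattern the variance of the mean equals \(\sigma _i^2(1+(n-1)\rho )/n\), so with these illustrative values it is roughly eleven times larger than it would be for independent observations.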
We complete the model with the prior on the parameters \(\delta _0\), \(\sigma _0\) and \(\nu \) of the high-level distribution. We assume \(\delta _0\) to be uniformly distributed between \(-1\) and 1. This choice works for all the measures bounded within ±1, such as accuracy, AUC, precision and recall. Other types of indicators might require different bounds.
For the standard deviation \(\sigma _0\) we adopt the prior \(unif(0,{\bar{s_0}})\), with \({\bar{s_0}}=1000 s_{{\bar{x}}}\), where \(s_{{\bar{x}}}\) is the standard deviation of the \({\bar{x}}_i\)’s.
As for the prior \(p(\nu )\) on the degrees of freedom, there are two proposals in the literature. Kruschke (2013) proposes an exponentially shaped distribution which balances the prior probability of nearly normal distributions (\(\nu > 30\)) and heavy-tailed distributions (\(\nu < 30\)). We reparameterize this distribution as a Gamma(\(\alpha \),\(\beta \)) with \(\alpha =1\), \(\beta = 0.0345\). Juárez and Steel (2010) propose instead \(p(\nu ) = \mathrm {Gamma}(2,0.1)\), assigning larger prior probability to normal distributions, as shown in Table 1.
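Collecting the assumptions stated in this section, the model can be summarized as follows. This display only restates the prose; it does not reproduce the original equation numbering, and \(p(\nu )\) stands for either of the two Gamma priors discussed above.
$$\begin{aligned} \delta _1,\ldots ,\delta _q&\sim t(\delta _0,\sigma _0,\nu ),\\ \sigma _1,\ldots ,\sigma _q&\sim \mathrm {unif}(0,{\bar{\sigma }}),\\ {{\varvec{x_i}}}&\sim MVN(\mathbf {1}\delta _i,\mathbf {\Sigma _i}),\\ \delta _0&\sim \mathrm {unif}(-1,1),\quad \sigma _0\sim \mathrm {unif}(0,{\bar{s_0}}),\quad \nu \sim p(\nu ). \end{aligned}$$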
3.1 The region of practical equivalence
Our knowledge about a parameter is fully represented by the posterior distribution. Yet it is handy to summarize the posterior in order to take decisions. In Corani and Benavoli (2015) we summarized the posterior distribution by reporting the probability of positiveness and negativeness; however in this way we considered only the sign of the differences, neglecting their magnitude.
A more informative summary of the posterior is obtained by introducing a region of practical equivalence (rope), constituted by a range of parameter values that are practically equivalent to the null difference between the two classifiers. We thus summarize the posterior distribution by reporting how much probability lies within the rope, at its left and at its right. The limits of the rope are established by the analyst based on their experience; thus there are no uniquely correct limits for the rope (Kruschke 2015, Chap. 12). In this paper we consider two classifiers to be practically equivalent if their mean difference of accuracy lies within \((-0.01,0.01)\).
The rope yields a realistic null hypothesis that can be verified. If a large mass of posterior probability lies within the rope, we claim the two classifiers to be practically equivalent. A sound approach to detect equivalent classifiers could be very useful in online model selection (Krueger et al. 2015) where one should quickly discard algorithms that are practically equivalent.
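As a minimal sketch of this summary, the Python function below splits a set of posterior draws of a difference of accuracy into the three regions. The function name and the draws are hypothetical; in practice the draws would come from the fitted model.

```python
def rope_summary(samples, rope=(-0.01, 0.01)):
    """Fraction of posterior draws to the left of, inside and to the right of the rope."""
    lower, upper = rope
    n = len(samples)
    n_left = sum(1 for s in samples if s < lower)
    n_right = sum(1 for s in samples if s > upper)
    n_rope = n - n_left - n_right
    return {"left": n_left / n, "rope": n_rope / n, "right": n_right / n}


# Hypothetical posterior draws of a difference of accuracy.
draws = [-0.020, -0.004, 0.001, 0.003, 0.006, 0.008, 0.012, 0.015, 0.002, -0.001]
summary = rope_summary(draws)
```

With these ten draws, one lies to the left of the rope, seven inside it and two to its right.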
3.2 The inference of the test
We focus on estimating the posterior distribution of the difference of accuracy between the two classifiers on a future unseen data set. We compute the probability of left, rope and right being the most probable outcome on the next data set.
Thus we compute the probability with which \(p(left)> \max (p(rope),p(right))\) or \(p(right)> \max (p(rope),p(left))\) or \(p(rope)>\max (p(left),p(right))\). This is similar to the inference carried out by the Bayesian signed-rank test (Benavoli et al. 2014).
 1. Initialize the counters \(n_{left}=n_{rope}=n_{right}=0\).
 2. For \(i=1,2,3,\dots ,N_s\) repeat:
    - sample \(\delta _0, \sigma _0,\nu \) from their posteriors;
    - define the posterior of the mean difference of accuracy on the next data set, i.e., \(t(\delta _{next};\delta _0, \sigma _0,\nu )\);
    - from \(t(\delta _{next};\delta _0, \sigma _0,\nu )\) compute the three probabilities p(left) (integral on \((-\infty ,-r]\)), p(rope) (integral on \([-r,r]\)) and p(right) (integral on \([r,\infty )\)), where \(-r\) and \(r\) denote the lower and upper bound of the rope;
    - determine the highest among p(left), p(rope), p(right) and increment the respective counter \(n_{left},n_{rope},n_{right}\).
 3. Compute \(P(left)=n_{left}/N_s\), \(P(rope)=n_{rope}/N_s\) and \(P(right)=n_{right}/N_s\).
 4. Decision: when \(P(rope)> 1-\alpha \) (\(\alpha \) is the size of the test) declare the two classifiers to be practically equivalent; when \(P(left) > 1-\alpha \) or \(P(right) > 1-\alpha \), declare the two classifiers to be significantly different.
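The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the repository: the posterior draws are hypothetical stand-ins for MCMC output, the function names are ours, and the Student CDF is obtained by Simpson's rule (valid for \(\nu \ge 1\)) only to keep the example free of external libraries; a library routine could be swapped in.

```python
import math


def student_cdf(z, nu, steps=400):
    """CDF of a standard Student distribution with nu >= 1 degrees of freedom.

    The substitution z = sqrt(nu) * tan(u) turns the density into a constant
    times cos(u)^(nu - 1), which is integrated with Simpson's rule.
    """
    theta = math.atan(z / math.sqrt(nu))
    h = theta / steps
    total = 1.0 + math.cos(theta) ** (nu - 1)  # integrand at u = 0 and u = theta
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * math.cos(j * h) ** (nu - 1)
    const = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) / math.sqrt(math.pi)
    return 0.5 + const * total * h / 3


def next_dataset_decision(posterior_draws, r=0.01, alpha=0.05):
    """posterior_draws holds (delta0, sigma0, nu) triples sampled from the posterior."""
    counts = {"left": 0, "rope": 0, "right": 0}
    for delta0, sigma0, nu in posterior_draws:
        below = student_cdf((-r - delta0) / sigma0, nu)  # mass on (-inf, -r]
        up_to = student_cdf((r - delta0) / sigma0, nu)   # mass on (-inf, r]
        probs = {"left": below, "rope": up_to - below, "right": 1.0 - up_to}
        counts[max(probs, key=probs.get)] += 1           # most probable outcome
    n_s = len(posterior_draws)
    P = {name: c / n_s for name, c in counts.items()}
    if P["rope"] > 1 - alpha:
        decision = "practically equivalent"
    elif P["left"] > 1 - alpha or P["right"] > 1 - alpha:
        decision = "significantly different"
    else:
        decision = "no decision"
    return P, decision


# Hypothetical posterior draws, standing in for MCMC output.
draws = [(0.002, 0.003, 5.0), (0.001, 0.004, 8.0), (0.003, 0.002, 3.0), (0.000, 0.003, 12.0)]
P, decision = next_dataset_decision(draws)
```

In each of these four hypothetical draws most of the Student mass falls inside the rope, so the sketch counts rope every time and declares practical equivalence.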
3.3 The shrinkage estimator
The \(\delta _i\)’s of the hierarchical model are independent given the parameters of the higher-level distribution. If such parameters were known, the \(\delta _i\)’s would be conditionally independent and they would be independently estimated. Instead such parameters are unknown, causing the \(\delta _0\) and the \(\delta _i\)’s to be jointly estimated. The hierarchical model jointly estimates the \(\delta _i\)’s by applying shrinkage to the \({\bar{x}}_i\)’s, namely it pulls the estimates close to each other. It is known that the shrinkage estimator achieves a lower error than the MLE in case of uncorrelated data; see Murphy (2012, Sec. 6.3.3.2) and the references therein. However there is currently no analysis of shrinkage with correlated data, such as those yielded by crossvalidation. We study this problem in the following.
Proposition 1
Proposition 2
Now consider the shrinkage estimator \(\hat{\delta }_i({\mathbf {x}}_i)=w{\bar{x}}_i+(1-w)\delta _0\) with \(w \in (0,1)\), which pulls the MLE \({\bar{x}}_i\) towards the mean \(\delta _0\) of the upper-level distribution.
Proposition 3
3.4 Implementation and code availability
We implemented the hierarchical model in Stan (Carpenter et al. 2017), a language for Bayesian inference. In order to improve the computational efficiency, we exploit a quadratic matrix form to compute simultaneously the likelihood of the q data sets. This provides a speedup of about one order of magnitude compared to the naive implementation in which the likelihoods are computed separately on each data set. Inferring the hierarchical model on the results of ten runs of ten-fold crossvalidation on 50 data sets (a total of 5000 observations) takes about three minutes on a standard laptop. For the sake of completeness we recall that the computation of the much simpler signed-rank test is instead immediate.
The Stan code is available from https://github.com/BayesianTestsML/tutorial/tree/master/hierarchical. The same repository provides the R code of all the simulations of Sect. 4.
4 Experiments
4.1 Estimation of the \(\delta _i\)’s under misspecification of p(\(\delta _i\))
According to the proofs of Sect. 3, the shrinkage estimator of the \(\delta _i\)’s has lower mean squared error than the maximum likelihood estimator, constituted by the arithmetic mean of the crossvalidation results. This result holds even if the \(p(\delta _i)\) of the hierarchical model is misspecified: it only requires the hierarchical model to reliably estimate the first two moments of \(p(\delta _i)\).
 1. Sample the \(\delta _i\)’s (\(\delta _1, \delta _2,\ldots ,\delta _q\)) from the bimodal two-component mixture
    $$\begin{aligned} p(\delta _i)= \pi _1 N(\delta _i;\mu _1,\sigma _1) + \pi _2 N(\delta _i;\mu _2,\sigma _2), \end{aligned}$$
    with \(\mu _1\)=0.005, \(\mu _2\)=0.02, \(\sigma _1\)=\(\sigma _2\)=\(\sigma \)=0.001, \(\pi _1=\pi _2=0.5\).
 2. For each \(\delta _i\):
    - implement two classifiers whose actual difference of accuracy is \(\delta _i\), following the procedure given in “Appendix”;
    - perform 10 runs of 10-fold crossvalidation with the two classifiers;
    - measure the mean of the crossvalidation results \({\bar{x}}_i\) (MLE).
 3. Infer the hierarchical model using the results referring to the q data sets.
 4. Obtain the shrinkage estimates of each \(\delta _i\).
 5. Measure \(\mathrm {MSE_{MLE}}\) and \(\mathrm {MSE_{SHR}}\) as defined in Sect. 3.3.
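The following Python sketch mimics the structure of this experiment in a deliberately simplified form. It does not fit the hierarchical model: as a stand-in it uses a plug-in shrinkage estimator with a method-of-moments weight, and it replaces the classifiers and the crossvalidation by Gaussian noise added to each \(\delta _i\). The noise standard deviation (0.019), the number of repetitions and all function names are assumptions made for illustration.

```python
import random
import statistics


def sample_delta(rng):
    """One draw from the two-component mixture used in this section."""
    mu = 0.005 if rng.random() < 0.5 else 0.02
    return rng.gauss(mu, 0.001)


def one_experiment(q, noise_sd, rng):
    """Return (MSE of the MLE, MSE of a plug-in shrinkage estimator) on q data sets."""
    deltas = [sample_delta(rng) for _ in range(q)]
    # Assumption: each crossvalidation mean is the true delta_i plus Gaussian noise.
    x_bar = [rng.gauss(d, noise_sd) for d in deltas]
    centre = statistics.fmean(x_bar)
    # Method-of-moments estimate of the variance of p(delta_i), floored at zero.
    between_var = max(statistics.variance(x_bar) - noise_sd ** 2, 0.0)
    w = between_var / (between_var + noise_sd ** 2)
    shrunk = [w * x + (1.0 - w) * centre for x in x_bar]
    mse_mle = statistics.fmean([(x - d) ** 2 for x, d in zip(x_bar, deltas)])
    mse_shr = statistics.fmean([(s - d) ** 2 for s, d in zip(shrunk, deltas)])
    return mse_mle, mse_shr


def average_mse(q, noise_sd=0.019, repetitions=300, seed=0):
    """Average both errors over independent repetitions of the experiment."""
    rng = random.Random(seed)
    runs = [one_experiment(q, noise_sd, rng) for _ in range(repetitions)]
    return statistics.fmean([r[0] for r in runs]), statistics.fmean([r[1] for r in runs])


results = {q: average_mse(q) for q in (5, 10, 50)}
```

Because the weight is estimated from the spread of the \({\bar{x}}_i\)’s across data sets, this stand-in only needs the first two moments of \(p(\delta _i)\), which is the property the experiment is designed to probe.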
Table 2 Estimation error of the \(\delta _i\)’s
q     Mean squared error
      MLE        Shrinkage
5     0.00036    0.00017
10    0.00036    0.00014
50    0.00036    0.00012
As reported in Table 2, \(\mathrm {MSE_{SHR}}\) is at least 50% lower than \(\mathrm {MSE_{MLE}}\) for every value of q. This confirms our theoretical findings. It also shows that the mean of the crossvalidation estimates is a quite noisy estimator of \(\delta _i\), even if ten repetitions of crossvalidation are performed. The problem is that all such results are correlated and thus they have limited informative content.
Interestingly, the MSE of the shrinkage estimator decreases with q. Thus the presence of more data sets allows a better estimation of the moments of \(p(\delta _i)\), improving the shrinkage estimates as well. Instead the error of the MLE does not vary with q since the parameters of each data set are independently estimated.
4.2 Comparison of equivalent classifiers
In this section we adopt a Cauchy distribution as \(p(\delta _i)\); this is an idealized situation in which the hierarchical model can recover the actual \(p(\delta _i)\). We will relax this assumption in Sect. 4.7.
We simulate the null hypothesis of the signed-rank test by setting the median of the Cauchy to \(\delta _0=0\). We set the scale factor of the distribution to 1/6 of the rope length; this implies that about 80% of the sampled \(\delta _i\)’s lie within the rope, which is thus the most probable outcome.
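The 80% figure can be checked directly from the Cauchy CDF. The short Python sketch below does so for the rope \((-0.01,0.01)\), whose length is 0.02; the function names are ours.

```python
import math


def cauchy_cdf(x, median, scale):
    """CDF of a Cauchy distribution with the given median and scale factor."""
    return 0.5 + math.atan((x - median) / scale) / math.pi


def rope_mass(median, scale, rope=(-0.01, 0.01)):
    """Probability that a Cauchy draw falls inside the rope."""
    lower, upper = rope
    return cauchy_cdf(upper, median, scale) - cauchy_cdf(lower, median, scale)


rope_length = 0.02
scale = rope_length / 6                 # scale factor used in this section
mass_centered = rope_mass(0.0, scale)   # about 0.795
mass_shifted = rope_mass(0.005, scale)  # about 0.743, median as in Sect. 4.3
```

With the median at zero the mass inside the rope is \((2/\pi )\arctan (3)\approx 0.795\); it remains above one half when the median is moved to 0.005, the value used in Sect. 4.3.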

 1. Sample the \(\delta _i\)’s (\(\delta _1, \delta _2, \ldots , \delta _q\)) from \(p(\delta _i)\).
 2. For each \(\delta _i\):
    - implement two classifiers whose actual difference of accuracy is \(\delta _i\), following the procedure given in “Appendix”;
    - perform ten runs of ten-fold crossvalidation with the two classifiers.
 3. Analyze the results through the signed-rank and the hierarchical model.
Moreover in our simulations the hierarchical model never estimated p(left) > 95% or p(right) > 95%, so it made no Type I errors. In fact nhst commits a rate \(\alpha \) of Type I errors under the null hypothesis, while Bayesian estimation with rope typically makes fewer Type I errors (Kruschke 2013).
Running the signed-rank twice? We cannot detect practically equivalent classifiers by running the signed-rank test twice, e.g., once with the null hypothesis \(\delta _0=0.01\) and once with the null hypothesis \(\delta _0 =-0.01\). Even if the signed-rank test does not reject the null in both cases, we still cannot affirm that the two classifiers are equivalent, since non-rejection of the null does not allow claiming that the null is true.
4.3 Comparison of practically equivalent classifiers
We now simulate two classifiers whose actual difference of accuracy is practically irrelevant but different from zero. We consider two classifiers whose average difference is \(\delta _0\)=0.005, thus within the rope.

 1. Set \(p(\delta _i)\) as a Cauchy distribution with \(\delta _0\)=0.005 and the same scale factor as in previous experiments (the rope remains the most probable outcome for the sampled \(\delta _i\)’s).
 2. Sample the \(\delta _i\)’s (\(\delta _1, \delta _2, \ldots , \delta _q\)) from \(p(\delta _i)\).
 3. Implement for each \(\delta _i\) two classifiers whose actual difference of accuracy is \(\delta _i\) and perform ten runs of ten-fold crossvalidation.
 4. Analyze the crossvalidation results through the signed-rank and the hierarchical model.
The behavior of the hierarchical test is far more sensible. The hierarchical test increases the posterior probability of rope (Fig. 3) when the number of data sets in which the classifiers show similar performance increases. It is slightly less effective in recognizing equivalence than in the previous experiment since \(\delta _0\) is now closer to the limit of the rope. When q=50, it declares equivalence with 95% confidence in about 40% of the simulated cases.
The hierarchical test thus effectively detects classifiers that are practically equivalent; this is instead impossible for the signed-rank test.
The hierarchical model is more conservative as it rejects the null hypothesis less easily than the signed-rank test. The price to be paid is that it might be less powerful at claiming significance when comparing two classifiers whose accuracies are truly different. We investigate this setting in the next section.
4.4 Simulation of practically different classifiers
We now simulate two classifiers which are significantly different. We consider different values of \(\delta _0\): \(\{0.015, 0.02, 0.025, 0.03\}\). We set the scale factor of the Cauchy to \(\sigma _0\)=0.01 and the number of data sets to q=50.
We repeat 500 experiments for each value of \(\delta _0\), as in the previous sections. We then check the power of the two tests for each value of \(\delta _0\). The power of the signed-rank is the proportion of simulations in which it rejects the null hypothesis (\(\alpha \)=0.05). The power of the hierarchical test is the proportion of simulations in which it estimates p(right) > 0.95.
As expected, the signed-rank test is indeed more powerful in this setting than the hierarchical model, especially when \(\delta _0\) lies just slightly outside the rope (Fig. 4). The two tests have however similar power when \(\delta _0\) is larger than 0.02.
4.5 Discussion
The main experimental findings so far are as follows. First, the shrinkage estimator of the \(\delta _i\)’s yields a lower mean squared error than the MLE estimator, even under misspecification of \(p(\delta _i)\).
Second, the hierarchical model effectively detects equivalent classifiers, unlike nhst tests.
However, it is also less powerful than the signed-rank when comparing two significantly different classifiers. The difference in power is however not necessarily large, as shown in the previous simulation.
In the next section we discuss how the probabilities returned by the hierarchical model can be interpreted in a more meaningful way than simply checking if they are larger than \((1-\alpha )\).
4.6 Interpreting posterior odds
Table 3 Grades of evidence corresponding to posterior odds
Posterior odds    Evidence
1–3               weak
3–20              positive
>20               strong
Thus even if none of the three probabilities exceeds the 95% threshold, we can still draw meaningful conclusions by interpreting the posterior odds. We will adopt this approach in the following simulations.
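As a small illustration, the Python sketch below computes posterior odds as the ratio of two posterior probabilities, \(o(a,b)=p(a)/p(b)\), and maps them to the grades of evidence listed above. The function names are ours, and the treatment of the boundary values (odds of exactly 3 or 20) is an assumption, since the grades do not specify it. The probabilities in the example are those reported for the pair hnb and j48 in Sect. 4.8.

```python
def posterior_odds(p_a, p_b):
    """Odds of outcome a against outcome b, i.e. p(a) / p(b)."""
    return float("inf") if p_b == 0 else p_a / p_b


def grade(odds):
    """Grade of evidence for the given odds (boundary handling is an assumption)."""
    if odds > 20:
        return "strong"
    if odds >= 3:
        return "positive"
    if odds >= 1:
        return "weak"
    return "none"  # the odds favour the other outcome


# Posterior probabilities reported for hnb versus j48 in Sect. 4.8.
p = {"left": 0.03, "rope": 0.10, "right": 0.87}
right_vs_rope = grade(posterior_odds(p["right"], p["rope"]))  # odds 8.7
right_vs_left = grade(posterior_odds(p["right"], p["left"]))  # odds 29
```

Here p(right) = 0.87 falls short of the 95% threshold, yet the odds still give positive evidence for right over rope and strong evidence for right over left.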
The p values cannot be interpreted in a similar fashion, since they are affected both by sample size and effect size. In particular Wasserstein and Lazar (2016) show that smaller p values do not necessarily imply the presence of larger effects and larger p values do not imply a lack of effect. A tiny effect can produce a small p value if the sample size is large enough, and large effects may produce unimpressive p values if the sample size is small.
4.7 Experiments with Friedman’s functions
The results presented in the previous sections refer to conditions in which the actual \(p(\delta _i)\) (misspecified or not) is known. In this section we perform experiments in which the \(\delta _i\)’s are not sampled from an analytical distribution; rather, they are due to different settings of sample size, noise etc. This is a challenging setting for the hierarchical model, whose \(p(\delta _i)\) is unavoidably misspecified.
We generate data sets via the three functions (\(F\#1\), \(F\#2\) and \(F\#3\)) proposed by Friedman (1991).
Table 4 Settings of the Friedman functions
Function type    \(\sigma _{\epsilon }\)    n                random feats    Tot settings
F#1              {0.5,1,2}                  {30,100,1000}    {0,20}          \(3 \cdot 3 \cdot 2 = 18\)
F#2              {62.5,125,250}             {30,100,1000}    {0,20}          \(3 \cdot 3 \cdot 2 = 18\)
F#3              {0.05,0.1,0.2}             {30,100,1000}    {0,20}          \(3 \cdot 3 \cdot 2 = 18\)
As a pair of classifiers we consider linear discriminant analysis (lda) and classification trees (cart), as implemented in the caret package for R, without any hyperparameter tuning. As a first step we need to measure the actual \(\delta _i\) between the two classifiers in each setting, which then allows us to know the population of the \(\delta _i\)’s.
Our second step will be to check the conclusions of the signed-rank test and of the hierarchical model when they are provided with crossvalidation results referring to a subset of settings.
4.7.1 Measuring \(\delta _i\)
 1. For j = 1, ..., 500:
    - sample training data according to the specifics of the ith setting: <function type, n, \(\sigma _{\epsilon }\), number of random features>;
    - fit lda and cart on the generated training data;
    - sample a large test set (5000 instances) and measure the difference of accuracy \(d_{ij}\) between cart and lda.
 2. Set \(\delta _i \simeq \frac{1}{500} \sum _j d_{ij}\).
4.7.2 Ground truth
We compute the \(\delta _i\) of each setting using the above procedure. The ground truth is that lda is significantly more accurate than cart. More in detail, 65% of the \(\delta _i\)’s belong to the region to the right of the rope (lda being significantly more accurate than cart). Thus right is the most probable outcome of the next \(\delta _i\). Moreover, the mean of the \(\delta _i\)’s is \(\delta _0\)=0.02 (in favor of lda).
4.7.3 Assessing the conclusions of the tests

 1. Randomly select 12 out of 18 settings for each Friedman function, thus selecting 36 settings.
 2. In each setting:
    - generate a data set according to the specifics of the setting;
    - perform ten runs of ten-fold crossvalidation of lda and cart using paired folds.
 3. Analyze the crossvalidation results on the q=36 data sets using the signed-rank and the hierarchical test.
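The first step of this procedure can be sketched in Python as follows: the 18 settings of each function are enumerated from the values listed in the table of settings, and 12 of them are drawn at random without replacement. The variable and function names and the seed are illustrative assumptions; data generation and crossvalidation are not shown.

```python
import itertools
import random

# Values taken from the table of settings of the Friedman functions.
NOISE_SD = {"F#1": (0.5, 1, 2), "F#2": (62.5, 125, 250), "F#3": (0.05, 0.1, 0.2)}
SAMPLE_SIZES = (30, 100, 1000)
RANDOM_FEATURES = (0, 20)


def all_settings():
    """Map each function to its 3 * 3 * 2 = 18 (noise sd, n, random features) settings."""
    return {
        name: list(itertools.product(sds, SAMPLE_SIZES, RANDOM_FEATURES))
        for name, sds in NOISE_SD.items()
    }


def select_settings(rng, per_function=12):
    """Draw per_function settings without replacement for each Friedman function."""
    chosen = []
    for name, settings in all_settings().items():
        for noise_sd, n, feats in rng.sample(settings, per_function):
            chosen.append((name, noise_sd, n, feats))
    return chosen


selected = select_settings(random.Random(0))  # the seed is an arbitrary choice
```

Each element of `selected` identifies one of the q=36 data sets to be generated.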
The two tests have roughly the same power: 28% for the signed-rank and 27.5% for the hierarchical test. In the remaining simulations the signed-rank does not reject \(H_0\); in those cases it conveys no information since the p values cannot be interpreted.
 - in 11% of the simulations both o(right, rope) and o(right, left) are larger than 20, providing strong evidence in favor of lda even though p(right) does not exceed 95%;
 - in a further 33% of the simulations both o(right, rope) and o(right, left) are larger than 3, providing at least positive evidence in favor of lda.
Thus the interpretation of posterior odds allows drawing meaningful conclusions even when the 95% threshold is not exceeded. The probabilities are sensibly estimated, even if \(p(\delta _i)\) is unavoidably misspecified.
4.8 Sensitivity analysis on real-world data sets
We now consider real data sets. In this case we cannot know the actual \(\delta _i\)’s: we could repeat crossvalidation a few hundred times, but the resulting estimates would have large uncertainty as already discussed.
We exploit this setting to perform sensitivity analysis and to further compare the conclusions drawn by the hierarchical model and by the signed-rank test.
We consider 54 data sets taken from the webpage of WEKA data sets. We consider four classifiers: naive Bayes (nbc), hidden naive Bayes (hnb), decision tree (j48), grafted decision tree (j48gr). Witten et al. (2011) provide a summary description of all such classifiers with pointers to the relevant papers. We perform ten runs of ten-fold crossvalidation for each classifier on each data set. We run all experiments using the WEKA software.
A fundamental step of Bayesian analysis is to check how the posterior conclusions depend on the chosen prior and how the model fits the data. The hierarchical model shows some sensitivity to the choice of \(p(\delta _i)\), being instead robust to the other assumptions (see later for further discussion). The Student distribution is more flexible than the Gaussian and we have found that it consistently provides a better fit to the data. Yet, the model conclusions are sometimes sensitive to the prior on the degrees of freedom \(p(\nu )\) of the Student.
Table 5 Posterior probabilities computed by two variants of the hierarchical model
              Hierarchical             Gamma(2,0.1)
pair          left    rope    right    left    rope    right
nbc-hnb       1.00    0.00    0.00     1.00    0.00    0.00
nbc-j48       0.80    0.02    0.18     0.80    0.01    0.20
nbc-j48gr     0.84    0.02    0.14     0.84    0.01    0.15
hnb-j48       0.03    0.10    0.87     0.03    0.02    0.95
hnb-j48gr     0.03    0.07    0.90     0.03    0.02    0.95
j48-j48gr     0.00    1.00    0.00     0.00    1.00    0.00
In some cases the estimates of the two models differ by some points (Table 5). This means that the actual high-level distribution from which the \(\delta _i\)’s are sampled is not a Student (or a Gaussian); otherwise the estimates of the two models would converge.
4.8.1 Sensitivity on the prior on \(\sigma _0\) and \(\sigma _i\)
The model conclusions are moreover robust with respect to the specification of the priors \(p(\sigma _i)\) and \(p(\sigma _0)\). Recall that \(\sigma _i\) is the standard deviation on the ith data set while \(\sigma _0\) is the standard deviation of the high-level distribution.
Our model assumes \(\sigma _i \sim {\mathrm {unif}} (0,{\bar{\sigma }})\) with \({\bar{\sigma }}= 1000{\bar{s}}\), where \({\bar{s}}\) is the average of the sample standard deviations of the different data sets. The posterior distribution of \(\sigma _i\) is however substantially unchanged if we adopt instead \({\bar{\sigma }}= 100{\bar{s}}\).
The same consideration applies to \(\sigma _0\), whose prior is \(p(\sigma _0) = unif(0,{\bar{s_0}})\). We obtain the same posterior distribution for \(\sigma _0\) using as upper bound \({\bar{s_0}}=1000 s_{{\bar{x}}}\) or \({\bar{s_0}}=100 s_{{\bar{x}}}\), where \(s_{{\bar{x}}}\) is the standard deviation of the \({\bar{x}}_i\)’s.
4.9 Comparing the signed-rank and the hierarchical test
Table 6 Posterior probabilities of the hierarchical model and p values of the signed-rank
              Hierarchical             Signed-rank
pair          left    rope    right    p value
nbc-hnb       1.00    0.00    0.00     0.00
nbc-j48       0.80    0.02    0.18     0.46
nbc-j48gr     0.84    0.02    0.14     0.39
hnb-j48       0.03    0.10    0.87     0.07
hnb-j48gr     0.03    0.07    0.90     0.08
j48-j48gr     0.00    1.00    0.00     0.00
Both the signed-rank and the hierarchical test claim with 95% confidence hnb to be significantly more accurate than nbc.
In the following comparisons apart from the last one, the two tests do not draw any conclusion with 95% confidence. The signed-rank does not reject the null hypothesis, while the hierarchical test does not achieve probability larger than 95%.
When the signed-rank test does not reject the null hypothesis, it draws a non-informative conclusion. We can instead always interpret the posterior odds yielded by the hierarchical model. When comparing nbc and j48, there is positive evidence for left (j48 being more accurate than nbc) over right and strong evidence for left over rope. We thus conclude that there is positive evidence of j48 being practically more accurate than nbc. Similarly, we conclude that there is positive evidence of j48gr being practically more accurate than nbc.
The two tests draw opposite conclusions when comparing j48 and j48gr. The signed-rank declares j48gr to be significantly more accurate than j48 (p value 0.00) while the hierarchical model declares them to be practically equivalent, with p(rope)=1. The reason why the two tests achieve opposite conclusions is that the differences have a consistent sign but are small-sized. Most data sets yield a positive difference in favor of j48gr; this leads the signed-rank test to claim significance. Yet the differences lie mostly within the rope (Fig. 7). The hierarchical model shrinks them further towards the overall mean and eventually claims the two classifiers to be practically equivalent. The posterior probabilities remain unchanged even adopting the half-sized rope \((-0.005, 0.005)\). It thus seems fair to conclude that, even if most signs are in favor of j48gr, the accuracies of j48 and j48gr are practically equivalent.
5 Conclusions
The proposed approach is a realistic model of the data generated by cross-validation across multiple data sets. Through the rope it also defines a sensible null hypothesis which can be verified, allowing the test to detect classifiers that are practically equivalent. The interpretation of the posterior odds allows drawing meaningful conclusions even when the posterior probabilities do not exceed 95%. Thanks to shrinkage, the hierarchical model estimates the \(\delta _i\)’s more accurately than the usual approach of averaging (independently on each data set) the cross-validation differences. An interesting research direction is the adoption of a nonparametric approach for the high-level distribution \(p(\delta _i)\). This is a nontrivial task which we leave for future research.
Acknowledgements
The research in this paper has been partially supported by the Swiss NSF grants no. IZKSZ2_162188 and no. 200021_146606.
References
Benavoli, A., Corani, G., Demšar, J., & Zaffalon, M. Time for a change: A tutorial for comparing multiple classifiers through Bayesian analysis. arXiv:1606.04316.
Benavoli, A., Corani, G., Mangili, F., Zaffalon, M., & Ruggeri, F. (2014). A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In Proceedings of the 31st International Conference on Machine Learning (ICML14) (pp. 1026–1034).
Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., et al. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32.
Corani, G., & Benavoli, A. (2015). A Bayesian approach for comparing cross-validated algorithms on multiple data sets. Machine Learning, 100(2), 285–304.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1–67.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3), 515–534.
Hand, D. J., et al. (2006). Classifier technology and the illusion of progress. Statistical Science, 21(1), 1–14.
Juárez, M. A., & Steel, M. F. J. (2010). Model-based clustering of non-Gaussian panel data based on skew-t distributions. Journal of Business & Economic Statistics, 28(1), 52–66.
Krueger, T., Panknin, D., & Braun, M. (2015). Fast cross-validation via sequential testing. Journal of Machine Learning Research, 16, 1103–1155.
Kruschke, J. (2015). Doing Bayesian data analysis: A tutorial with R, Jags and Stan. New York: Academic Press.
Kruschke, J. K. (2013). Bayesian estimation supersedes the t–test. Journal of Experimental Psychology: General, 142(2), 573.
Lacoste, A., Laviolette, F., & Marchand, M. (2012). Bayesian comparison of machine learning algorithms on single and multiple datasets. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS12) (pp. 665–675).
Murphy, K. P. (2012). Machine learning: A probabilistic perspective. Cambridge: MIT Press.
Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52(3), 239–281.
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–164.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129–133.
Witten, I. H., Frank, E., & Hall, M. (2011). Data Mining: Practical machine learning tools and techniques (third ed.). Los Altos: Morgan Kaufmann.