We empirically evaluate the efficiency and investigate the properties of BBC-CV and BBCD-CV, both in controlled settings and on real problems. In particular, we focus on the bias of the performance estimates of the protocols and on computational time. We compare the results to those of three standard approaches: CVT, TT, and NCV. We also examine the tuning (configuration selection) properties of BBC-CV, BBCD-CV, and BBC-CV with repeats, as well as the confidence intervals that these methods construct. WMCS and IPL are not included in this empirical comparison for a variety of reasons, including the need for parametric, metric-specific assumptions (WMCS) and increased computational complexity (IPL) (see Sect. 3, subsection Related Work); in addition, both methods are complex to implement. As the main advantages of the proposed methods lie on a conceptual level (simplicity of the approach and broad applicability to almost any type of performance metric and outcome of interest), such an empirical comparison would probably not be very informative.
Simulation studies
Extensive simulation studies were conducted in order to validate BBC-CV and BBCD-CV and assess their performance. We focus on binary classification tasks and use classification accuracy as the measure of performance, as it is straightforward to simulate models with a prespecified accuracy. We examine multiple settings with varying sample size \(N \in \{20, 40, 60, 80, 100, 500, 1000\}\), number of candidate configurations \(C \in \{50, 100, 200, 300, 500, 1000, 2000\}\), and true performances P of the candidate configurations drawn from different Beta distributions Be(a, b) with \((a, b) \in \{(9, 6), (14, 6), (24, 6), (54, 6)\}\). These Beta distributions yield configurations with mean performances \(\mu \in \{0.6, 0.7, 0.8, 0.9\}\) and corresponding variances of 0.015, 0.01, 0.0052, and 0.0015, respectively. These choices result in a total of 196 different experimental settings. We chose distributions with small variances since these are the most challenging cases, where the models have quite similar performances.
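For reference, the quoted means and variances follow directly from the moments of the Beta distribution:
\[
\mu = \frac{a}{a+b}, \qquad \sigma^2 = \frac{ab}{(a+b)^2(a+b+1)},
\]
so that, e.g., Be(9, 6) gives \(\mu = 9/15 = 0.6\) and \(\sigma^2 = 54/(15^2 \cdot 16) = 0.015\).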
For each setting, we generate a simulated matrix of out-of-sample predictions \(\varPi \). First, a true performance value \(P_j, j = 1,\ldots ,C\), sampled from the same beta distribution, is assigned to each configuration \(c_j\). Then, the sample predictions for each \(c_j\) are produced as \(\varPi _{ij} = \mathbb {1}(r_i < P_j), i = 1,\ldots ,N\), where \(r_i\) are random numbers sampled uniformly from (0, 1), and \(\mathbb {1}(condition)\) denotes the unit (indicator) function. Notice that there is no need to simulate the actual training of the models, just the predictions of these models so that they obtain a prespecified predictive accuracy.
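For illustration, the generation of \(\varPi \) can be sketched in a few lines of Python; this is an illustrative outline, not the Matlab implementation linked below, and the function name simulate_predictions is our own:

```python
import numpy as np

def simulate_predictions(N, C, a, b, rng=None):
    """Simulate an N x C matrix of correct/incorrect out-of-sample predictions.

    Each configuration j gets a true accuracy P_j ~ Beta(a, b); sample i is
    predicted correctly by configuration j iff r_i < P_j, so column j has
    expected accuracy P_j.
    """
    rng = np.random.default_rng(rng)
    P = rng.beta(a, b, size=C)                    # true accuracies of the C configurations
    r = rng.uniform(size=N)                       # one random threshold per sample
    Pi = (r[:, None] < P[None, :]).astype(int)    # indicator 1(r_i < P_j)
    return Pi, P

# Example: 100 samples, 200 configurations with mean true accuracy 0.7 (Be(14, 6))
Pi, P = simulate_predictions(N=100, C=200, a=14, b=6, rng=0)
```

Note that the same \(r_i\) is shared by all configurations on sample i, which induces correlated predictions across configurations, as with real models trained on the same data.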
Then, the BBC-CV, BBCD-CV, CVT, TT, and NCV protocols for tuning and performance assessment of the returned model are applied. We set the number of bootstraps to \(B = 1000\) for the BBC-CV method, and for BBCD-CV we set \(B = 1000\) and the dropping threshold to \(\alpha = 0.99\). We applied the same split of the data into \(K = 10\) folds for all the protocols. Consequently, all of them, with the possible exception of BBCD-CV, select and return the same predictive model, with different estimates of its performance. The internal cross-validation loop of NCV uses \(K = 9\) folds. The whole procedure was repeated 500 times for each setting, leading to a total of 98,000 generated matrices of predictions, on which the protocols were applied. The results presented are averages over the 500 repetitions. The Matlab code implementing the simulation studies can be downloaded from https://github.com/mensxmachina/BBC-CV.
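To make the bootstrap correction concrete, the following Python sketch outlines the BBC-CV computation of Sect. 3 on such a matrix of pooled out-of-sample results (0/1 correctness, so column means are accuracies). It reflects our reading of the method, not the authors' Matlab code, and the helper name bbc_cv is our own:

```python
import numpy as np

def bbc_cv(Pi, B=1000, rng=None):
    """Bootstrap Bias Corrected CV estimate from an N x C matrix of pooled
    out-of-sample results (1 = correct prediction, 0 = wrong).

    In each bootstrap, configuration selection is performed on the in-bag
    rows and the winner is evaluated on the out-of-bag rows; the corrected
    estimate is the average over bootstraps.
    """
    rng = np.random.default_rng(rng)
    N = Pi.shape[0]
    out = []
    for _ in range(B):
        in_bag = rng.integers(0, N, size=N)        # bootstrap row indices (with replacement)
        oob = np.setdiff1d(np.arange(N), in_bag)   # out-of-bag rows
        if oob.size == 0:
            continue
        best = Pi[in_bag].mean(axis=0).argmax()    # select the best configuration in-bag
        out.append(Pi[oob, best].mean())           # evaluate it out-of-bag
    return float(np.mean(out)), np.array(out)      # corrected estimate, bootstrap distribution

estimate, boot = bbc_cv(Pi, B=1000, rng=0)
```

Under this reading, the B out-of-bag performances also provide the bootstrap distribution from which the confidence intervals of Sect. 3.1 are later computed.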
Bias estimation
The bias of the estimation is computed as \(\widehat{Bias} = \hat{P} - P\), where \(\hat{P}\) and P denote the estimated and the true performance of the selected configuration, respectively. A positive bias indicates a lower true performance than the one estimated by the corresponding performance estimation protocol and implies that the protocol is optimistic (i.e. overestimates the performance), whereas a negative bias indicates that the estimated performance is conservative. Ideally, the estimated bias should be 0, although a slightly conservative estimate is also acceptable in practice.
Figure 1 shows the average estimated bias, over 500 repetitions, of the protocols under comparison, for models with average true classification accuracy \(\mu = 0.6\). Each panel corresponds to a different protocol (specified in the title) and shows the bias of its performance estimate relative to the sample size (horizontal axis) and the number of configurations tested (different plotted lines). We omit results for the remaining tested values of \(\mu \) as they are similar.
The CVT estimate of performance is optimistically biased in all settings, with the bias being as high as 0.17 points of classification accuracy. We notice that the smaller the sample size, the more CVT overestimates the performance of the final model. However, as sample size increases, the bias of CVT tends to 0. Finally, we note that the bias of the estimate also grows as the number of configurations under comparison increases, although the effect is relatively small in this experiment. The behaviour of TT varies greatly for small sample sizes (\(\le 100\)) and is highly sensitive to the number of configurations. On average, the protocol is optimistic (not correcting the bias of the CVT estimate) for sample sizes \(N \in \{20, 40\}\), and over-corrects for \(N \in \{60, 80, 100\}\). For larger sample sizes (\(\ge 500\)), TT is systematically conservative, over-correcting the bias of CVT. NCV provides an almost unbiased estimate of performance across all sample sizes. However, recall that it is computationally expensive, since the number of models that need to be trained grows quadratically with the number of folds K.
BBC-CV provides conservative estimates, having low bias which quickly tends to zero as sample size increases. Compared to TT, it is better suited to small sample sizes and produces more accurate estimates overall. In comparison to NCV, BBC-CV is somewhat more conservative, with a difference in bias of 0.013 points of accuracy on average and 0.034 in the worst case (for \(N = 20\)); on the other hand, BBC-CV is more computationally efficient. BBCD-CV displays similar behaviour to BBC-CV, having lower bias which approaches zero faster. It is on par with NCV, having 0.005 points of accuracy higher bias on average, and 0.018 in the worst case. As we show later on, BBCD-CV is up to one order of magnitude faster than CVT, and consequently two orders of magnitude faster than NCV.
In summary, the proposed BBC-CV and BBCD-CV methods produce almost unbiased performance estimates, and perform only slightly worse in small sample settings than the computationally expensive NCV. As expected, CVT is overly optimistic, and thus should not be used for performance estimation purposes. Finally, the use of TT is discouraged, as (a) its performance estimate varies a lot for different sample sizes and numbers of configurations, and (b) it overestimates performance for small sample sizes, which are the cases where bias correction is needed the most.
Real datasets
After examining the behaviour of BBC-CV and BBCD-CV in controlled settings, we investigate their performance on real datasets. Again, we focus on binary classification tasks, but now use the AUC as the performance metric, as it is independent of the class distribution. All of the datasets included in the experiments come from popular data science challenges [NIPS 2003 (Guyon et al. 2004); WCCI 2006 (Guyon et al. 2006); ChaLearn AutoML (Guyon et al. 2015)]. Table 1 summarizes their characteristics. The application domains of the ChaLearn AutoML challenge's datasets are not known; however, the organizers claim that they are diverse and were chosen to span different scientific and industrial fields. gisette (Guyon et al. 2004) and gina (Guyon et al. 2006) are handwritten digit recognition problems, dexter (Guyon et al. 2004) is a text classification problem, and madelon (Guyon et al. 2004) is an artificially constructed dataset characterized by having no single feature that is informative by itself.
The experimental set-up is similar to the one used by Tsamardinos et al. (2015). Each original dataset D was split into two stratified subsets: \(D_{pool}\), which consisted of 30% of the total samples in D, and \(D_{holdout}\), which consisted of the remaining 70% of the samples. For each original dataset, with the exception of dexter, \(D_{pool}\) was used to sample (without replacement) 20 sub-datasets for each sample size \(N \in \{20, 40, 60, 80, 100, 500\}\). For the dexter dataset we sampled 20 sub-datasets for each \(N \in \{20, 40, 60, 80, 100\}\). We thus created a total of \(8 \times 20 \times 6 + 20 \times 5 = 1060\) sub-datasets. \(D_{holdout}\) was used to estimate the true performance of the final, selected model of each of the protocols tested.
Table 1 Datasets’ characteristics. pr / nr denotes the ratio of positive to negative examples in a dataset. \(|D_{pool}|\) refers to the portion of the datasets (30%) from which the sub-datasets were sampled and \(|D_{holdout}|\) to the portion (70%) from which the true performance of a model is estimated
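For completeness, the sub-sampling set-up described above can be outlined in Python as follows; this is an illustrative sketch (the helper name make_subdatasets and the use of scikit-learn's train_test_split are our own choices, and stratification within each sub-dataset is omitted for brevity):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def make_subdatasets(X, y, sizes=(20, 40, 60, 80, 100, 500), n_sub=20, seed=0):
    """Split D into a 30% pool / 70% holdout (stratified), then draw n_sub
    sub-datasets of each size from the pool without replacement."""
    X_pool, X_hold, y_pool, y_hold = train_test_split(
        X, y, train_size=0.30, stratify=y, random_state=seed)
    rng = np.random.default_rng(seed)
    subs = []
    for N in sizes:
        for _ in range(n_sub):
            idx = rng.choice(len(y_pool), size=N, replace=False)
            subs.append((X_pool[idx], y_pool[idx]))
    return subs, (X_hold, y_hold)
```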
The set \(\varTheta \) (i.e. the search grid) explored consists of 610 configurations. These resulted from various combinations of preprocessing, feature selection, and learning methods, and different values for their hyper-parameters. The preprocessing methods included imputation, binarization (of categorical variables), and standardization (of continuous variables), and were used when they could be applied. For feature selection we used the SES algorithm (Lagani et al. 2017) with alpha \(\in \{0.05, 0.01\}\) and \(k \in \{2, 3\}\), and we also examined the case of no feature selection (i.e., a total of 5 cases/choices). The learning algorithms considered were Random Forests (Breiman 2001), SVMs (Cortes and Vapnik 1995), and LASSO (Tibshirani 1996). For Random Forests the hyper-parameters and values tried were numTrees \(= 1000\), minLeafSize \(\in \{1, 3, 5\}\), and numVarToSample \(\in \{(0.5, 1, 1.5, 2) * \sqrt{\textit{numVar}}\}\), where numVar is the number of variables of the dataset. We tested SVMs with linear, polynomial, and radial basis function (RBF) kernels. For their hyper-parameters we examined, wherever applicable, all combinations of degree \(\in \{2, 3\}\), \(gamma \in \{0.01, 0.1, 1, 10, 100\}\), and cost \(\in \{0.01, 0.1, 1, 10, 100\}\). Finally, LASSO was tested with alpha \(\in \{0.001, 0.5, 1.0\}\) (alpha \(= 1\) corresponds to lasso regression, other values to elastic net optimization, and alpha close to 0 approaches ridge regression) and 10 different values for lambda, which were created independently for each dataset using the glmnet library (Friedman et al. 2010).
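As a sanity check on the size of the grid, the count of 610 configurations is consistent with the following back-of-the-envelope enumeration (our own reading of the grid, with the preprocessing choices not multiplied in since they were applied only where applicable):

```python
# Feature-selection choices: SES (alpha x k) plus "no selection"
fs = 2 * 2 + 1                    # = 5

# Learner configurations
rf    = 1 * 3 * 4                 # numTrees x minLeafSize x numVarToSample = 12
svm   = 5 + 2 * 5 * 5 + 5 * 5     # linear (cost) + polynomial (degree x gamma x cost) + RBF (gamma x cost) = 80
lasso = 3 * 10                    # alpha x lambda = 30

print(fs * (rf + svm + lasso))    # 5 * 122 = 610 configurations
```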
We performed tuning and performance estimation of the final model using CVT, TT, NCV, BBC-CV, BBCD-CV, and BBC-CV with 10 repeats (denoted as BBC-CV10) for each of the 1060 created sub-datasets, leading to more than 135 million trained models. We set \(B = 1000\) for the BBC-CV method, and \(B = 1000\), \(\alpha = 0.99\) for the BBCD-CV method. We applied the same split of the data into \(K = 10\) stratified folds for all the protocols. The inner cross-validation loop of NCV uses \(K = 9\) folds. For each protocol, original dataset D, and sample size N, the results are averaged over the 20 randomly sampled sub-datasets.
To compute the AUC (and similar metrics, like the concordance index) within CV-like protocols, one can pool all predictions first and then compute the AUC on the pooled set of predictions. Alternatively, one can compute the AUC on each fold and average over all folds (see also Sect. 3). The final selection of the best configuration and the estimation of performance may differ depending on the method. However, in preliminary experiments (Greasidou 2017) we found that the two methods perform similarly in terms of model performance and bias of estimation. Notice that the pooling method cannot be applied to the TT method, since the latter depends on estimates of performance in each fold individually. In the experiments that follow, all methods use pooling to compute the AUC, except for TT and NCV (as is standard in the literature).
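To make the distinction concrete, the two options can be sketched as follows (an illustrative Python snippet using scikit-learn; the helper names are our own). Both operate on the same out-of-sample predictions of a configuration; only the aggregation differs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# y: true labels of all samples; scores: out-of-sample predictions of one
# configuration; folds: list of index arrays, one per CV fold.
def pooled_auc(y, scores):
    return roc_auc_score(y, scores)                                   # one AUC on the pooled predictions

def averaged_auc(y, scores, folds):
    return np.mean([roc_auc_score(y[f], scores[f]) for f in folds])   # mean of per-fold AUCs
```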
Bias estimation
The bias of estimation is computed as in the simulation studies, i.e., \(\widehat{Bias} = \hat{P} - P\), where \(\hat{P}\) and P denote the estimated and the true performance of the selected configuration, respectively. In Fig. 2 we examine the average bias of the CVT, TT, NCV, BBC-CV, and BBCD-CV estimates of performance, on all datasets, relative to sample size. We notice that the results are in agreement with those of the simulation studies. In particular, CVT is optimistically biased for sample size \(N \le 100\) and its bias tends to zero as N increases. TT over-estimates performance for \(N = 20\), its bias varies with datasets for \(N = 40\), and it over-corrects the bias of CVT for \(N \ge 60\). TT exhibits the worst results among all protocols except CVT.
Both NCV and BBC-CV have low bias (in absolute value) regardless of sample size, though results vary with the dataset. BBC-CV is mainly conservative with the exception of the madeline dataset for \(N = 40\) and the madelon dataset for \(N \in \{60, 80, 100\}\). NCV is slightly optimistic for the dexter and madeline datasets for \(N = 40\) with a bias of 0.033 and 0.031 points of AUC respectively. BBCD-CV has, on average, greater bias than BBC-CV for \(N \le 100\). For \(N = 500\), its bias shrinks and becomes identical to that of BBC-CV and NCV.
Relative performance and speed up of BBCD-CV
We have shown that for large sample sizes (\(N = 500\)) BBCD-CV provides accurate estimates of performance of the model it returns, comparable to those of BBC-CV and NCV. How well does this model perform though? In this section, we evaluate the effectiveness of BBCD-CV in terms of its tuning (configuration selection) properties, and its efficiency in reducing the computational cost of CVT.
Figure 3 shows the relative average true performance of the models returned by the BBCD-CV and CVT protocols, plotted against sample size. We remind the reader that, for each of the 20 sub-datasets of sample size \(N \in \{20, 40, 60, 80, 100, 500\}\) sampled from \(D_{pool}\), the true performance of the returned model is estimated on the \(D_{holdout}\) set. We notice that, for \(N \le 100\), the loss in performance varies greatly with the dataset and can be quite significant: up to \(9.05\%\) in the worst case (dexter dataset, \(N = 40\)). For \(N = 500\), however, there is negligible to no loss in performance. Specifically, for the sylvine, philippine, madeline, christine, and gina datasets there is no loss in performance when applying BBCD-CV, while there is a loss of 0.44 and \(0.15\%\) for the gisette and jasmine datasets, respectively. madelon exhibits the highest average loss of \(1.4\%\). We expect the difference in performance between BBCD-CV and CVT to shrink even further with larger sample sizes.
We investigated the reason for the performance loss of BBCD-CV at low sample sizes (\(N \le 100\)). We observed that, in most cases, the majority of configurations (\(> 95\%\)) were dropped very early within the CV procedure (in the first couple of iterations). With 10-fold CV and \(N \le 100\) samples, the number of out-of-sample predictions available after the first fold ranges from 2 to 10, which is not sufficient for the bootstrap test to reliably identify under-performing configurations. This observation leads to some practical considerations and recommendations. For small sample sizes, we recommend starting to drop configurations with BBCD-CV only after an adequate number of out-of-sample predictions becomes available. An exact number is hard to determine, as it depends on many factors, such as the analyzed dataset and the set of configurations tested. Given that with \(N = 500\) BBCD-CV incurs almost no loss in performance, we recommend a minimum of 50 out-of-sample predictions before dropping configurations, although a smaller number may suffice. For example, with \(N = 100\), this would mean that dropping starts after the fifth iteration. Finally, we note that dropping is mostly useful for larger sample sizes (i.e., computationally costly scenarios), which are also the cases where BBCD-CV is on par with BBC-CV and NCV in terms of tuning and performance estimation.
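For concreteness, the dropping test can be sketched as follows; this is our own illustrative reading of the BBCD-CV criterion of Sect. 3 (each configuration is compared against the currently best-performing one over bootstraps of the rows available so far), not the authors' implementation:

```python
import numpy as np

def drop_configurations(Pi_sofar, active, B=1000, alpha=0.99, rng=None):
    """Bootstrap-based early dropping (BBCD-CV), sketched for 0/1 accuracy.

    Pi_sofar: n x C matrix of out-of-sample results accumulated so far.
    active:   list of column indices of configurations still in the race.
    A configuration is dropped if, in at least a fraction alpha of the
    bootstrap samples, it performs worse than the current best configuration.
    """
    rng = np.random.default_rng(rng)
    n = Pi_sofar.shape[0]
    best = Pi_sofar[:, active].mean(axis=0).argmax()     # current best (index into `active`)
    worse_counts = np.zeros(len(active))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap the available rows
        perf = Pi_sofar[idx][:, active].mean(axis=0)
        worse_counts += perf < perf[best]
    keep = worse_counts / B < alpha                      # keep configurations not confidently worse
    return [c for c, k in zip(active, keep) if k]
```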
Next, we compare the computational cost of BBCD-CV to that of CVT, in terms of the total number of models trained. The results for \(N = 500\) are shown in Fig. 4. We focus only on the \(N = 500\) case, as it is the only case where both protocols produce models of comparable performance. We observe that BBCD-CV typically achieves a speed-up of 2 to 5. For the gisette dataset, the speed-up is very close to the theoretical maximum of this experimental setup, which equals the number of folds K and is attained when almost all configurations are dropped after the first fold. Overall, if the sample size is sufficiently large, dropping is recommended to speed up CVT without a loss of performance.
Finally, we note that we have also run experiments for \(\alpha \in \{0.90, 0.95\}\), which are included in the Master's thesis of one of the authors (see Greasidou 2017). In terms of tuning, the results (accuracy of the final model selected) were not significantly different from those for \(\alpha = 0.99\); however, for some datasets and sample sizes, the number of trained models was larger for larger \(\alpha \). We chose to present only the results for \(\alpha = 0.99\) in this work, since this is the value we suggest using in the general case (favoring being conservative and trying a larger number of configurations over computational efficiency).
Multiple repeats
We repeated the previous experiments, running BBC-CV with 10 repeats of partitioning into different folds (called BBC-CV10 hereafter). First, we compare the true performance of the models returned by BBC-CV and BBC-CV10, as well as the bias of their estimates. Ideally, using multiple repeats should result in a better performing model, as the variance of the performance estimate (used by CVT for tuning) due to a specific split of the data is reduced when multiple splits are considered. This comes at the cost of increased computational overhead, which in the case of 10 repeats is similar to that of the NCV protocol. To determine which of the approaches is preferable, we also compare the performance of the final models produced by BBC-CV10 and NCV.
Figure 5 (left) shows the relative average true performance of BBC-CV10 to BBC-CV with increasing sample size N. We notice that for \(N = 20\) the results vary with the dataset; however, for \(N \ge 40\), BBC-CV10 systematically returns an equally good or (in most cases) better performing model than BBC-CV. In terms of the bias of the performance estimates of the two methods, we found them to be similar.
Similarly, Fig. 5 (right) shows the comparison between BBC-CV10 and NCV. We see again that for sample size \(N = 20\) the relative average true performance of the returned models varies with the dataset. BBC-CV10 outperforms NCV for \(N \ge 40\), except for the philippine and jasmine datasets, for which results vary with sample size. Thus, if computational time is not a limiting factor, it is still beneficial to use BBC-CV with multiple repeats instead of NCV.
To summarize, we have shown that using multiple repeats increases the quality of the resulting models while maintaining the accuracy of the performance estimate. We note that the number 10 was chosen mainly to compare BBC-CV to NCV with \(K=10\) folds on equal grounds (same number of trained models). If time permits, we recommend using as many repeats as possible, especially for low sample sizes. For larger sample sizes, one or a few repeats usually suffice.
Confidence intervals
The bootstrap-based estimation of performance allows for easy computation of confidence intervals (CIs), as described in Sect. 3.1. We investigated the accuracy (calibration) of the CIs produced by the proposed BBC-CV, BBCD-CV, and BBC-CV10 protocols. To this end, we computed the coverage of the \(\{50\%, 55\%, \dots , 95\%, 99\%\}\) CIs estimated by the protocols, defined as the proportion of the computed CIs that contain the corresponding true performances of the produced models. For a given sample size, the coverage of a CI was computed over all 20 sub-datasets and 9 datasets. To further examine the effect of multiple repeats on the CIs, we computed their average width (over all 20 sub-datasets) for each dataset and different numbers of repeats (1–10).
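Concretely, given the bootstrap distribution of out-of-bag performances (e.g., the boot vector returned by the illustrative bbc_cv sketch above), a percentile CI is obtained as follows (an illustrative snippet; the helper name is our own):

```python
import numpy as np

def percentile_ci(boot, level=0.95):
    """Two-sided percentile confidence interval from the bootstrap distribution."""
    lo = np.percentile(boot, 100 * (1 - level) / 2)
    hi = np.percentile(boot, 100 * (1 + level) / 2)
    return lo, hi

# e.g., using the bootstrap performances returned by the bbc_cv sketch above
ci_low, ci_high = percentile_ci(boot, level=0.95)
```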
Figure 6 shows the estimated coverage of the CIs constructed with the use of the percentile method relative to the expected coverage for the BBC-CV, BBCD-CV, and BBC-CV10 protocols. We present results for sample sizes \(N = 20\) (left), \(N = 100\) (middle), and \(N = 500\) (right). Figure 7 shows, for the same values for N and for each dataset, the average width of the CIs with increasing number of repeats.
We notice that for \(N = 20\) the CIs produced by BBC-CV are conservative, that is, they are wider than they ought to be. As sample size increases (\(N \ge 100\)), BBC-CV returns better calibrated CIs, which are still conservative. The use of 10 repeats (BBC-CV10) greatly shrinks the width of the CIs and improves their calibration (i.e., their true coverage is closer to the expected one). The same holds when dropping under-performing configurations (BBCD-CV). For \(N = 500\) the intervals appear not to be conservative. After closer inspection, we found that this is caused by two datasets (madeline and jasmine), for which the majority of the true performances are higher than the upper bound of the CI. We note that these are the datasets with the highest negative bias (see Fig. 2 for \(N = 500\)), which implicitly causes the CIs to also be biased downwards, thus failing to capture true performances above the CI limits.
In conclusion, the proposed BBC-CV method provides mainly conservative CIs of the true performance of the returned models, which become more accurate with increasing sample size. The use of multiple repeats improves the calibration of the CIs and shrinks their width for small sample sizes (< 100). Three to four repeats seem to suffice; further repeats provide little added value in CI estimation.