1 Introduction

Many classifiers are able to output not only the predicted class label, but also a probability distribution over the possible classes. Such probabilistic predictions have many obvious uses; one example is to filter out unlikely or very uncertain predictions. Another generic scenario is when the probability estimates are used as the basis for a decision, typically by comparing the utility of different options. Naturally, probabilistic prediction requires that the probability estimates are well-calibrated, i.e., the predicted class probabilities must reflect the true, underlying probabilities. If this is not the case, the predicted probabilities become misleading.

There exist a number of general methods for calibrating probabilistic predictions, but the two most frequently used are Platt scaling (Platt 1999) and isotonic regression (Zadrozny and Elkan 2001). Both techniques have been successfully applied in conjunction with many different learning algorithms, including support-vector machines, boosted decision trees and naïve Bayes (Niculescu-Mizil and Caruana 2005). However, for single decision trees, as well as bagged trees and random forests, these calibration techniques have turned out to be less effective (Niculescu-Mizil and Caruana 2005), which can partly be explained by their requirement for large calibration sets. Boström (2008) showed that this problem can be mitigated when employing bagging, e.g., as in the random forest algorithm, by utilizing out-of-bag predictions, in effect allowing all training instances to be used for calibration. In this work, we investigate the use of Venn predictors (Vovk et al. 2004) and Venn-Abers predictors (Vovk and Petej 2012) as alternative approaches to calibrating probabilities from random forests. Venn predictors (and the special case of Venn-Abers predictors) are, under the standard i.i.d. assumption, automatically valid multiprobability predictors, i.e., their probability estimates will be perfectly calibrated in the long run. The price paid for this rather remarkable property is, however, that all probabilistic predictions from a Venn predictor come in the form of intervals.

A formal description of Venn predictors and Venn-Abers predictors is given in Sects. 3 and 3.1, but the overall procedure can be described as follows: before the actual prediction, instances are divided into categories. When predicting, we first find the category to which the test instance belongs, and then tentatively classify it as every possible label, one at a time. For each tentative label, the frequencies of labels in the chosen category (including the tentative label of the test instance) are used as estimates of the test label probabilities. Since every possible label is tried, the Venn predictor outputs several (two, for a two-class problem) probability distributions for the test instance.

Venn predictors can be applied on top of any classifier, as long as it returns not only the class label, but also some score associated with the confidence in that prediction. In this paper, we focus on using the state-of-the-art predictive modeling technique random forests (Breiman 2001) as underlying models for Venn predictors. In the empirical investigation, the quality of the probability estimates from the Venn predictors will be compared both to the raw estimates from the random forests and to standard calibration techniques. In addition, different versions of Venn predictors will be compared against each other with regard to accuracy and informativeness. Specifically, since random forests are used as the underlying model, the option to calibrate the estimates on the so-called out-of-bag set, instead of setting aside a separate data set for calibration, will be investigated.

Previous evaluations of Venn predictors, such as Lambrou et al. (2015), use very few data sets, thus precluding statistical analysis, i.e., they serve mainly as proofs of concept. In this paper, we present the first large-scale empirical investigation in which Venn predictors are compared to state-of-the-art methods for calibration of probabilistic predictions, on 22 publicly available data sets.

In the next section, we first define probabilistic prediction and describe random forests, before presenting some standard calibration techniques. The Venn predictors, including the special case of Venn-Abers predictors, are described in Sect. 3. In Sect. 4, we outline the experimental setup, which is followed by the experimental results presented in Sect. 5. Finally, we summarize the main conclusions in Sect. 6.

2 Background

2.1 Probabilistic prediction

In probabilistic prediction, the task is to predict the probability distribution of the label, given the training set and the test object. The goal is to obtain a valid predictor. In general, validity means that the probability distributions from the predictor must perform well against statistical tests based on subsequent observation of the labels. In particular, we are interested in calibration, i.e., we want:

$$\begin{aligned} p(c_j \mid p^{c_j})=p^{c_j}, \end{aligned}$$
(1)

where \(p^{c_j}\) is the probability estimate for class \(c_j\). It must be noted that validity cannot be achieved for probabilistic prediction in a general sense, see e.g., Gammerman et al. (1998).

2.2 Random forests

A random forest (Breiman 2001) is an ensemble consisting of random trees, which are decision trees generated in a specific way to obtain diversity among the trees. Each random tree is trained on a bootstrap replicate, i.e., a sample obtained from n training instances by randomly selecting n instances with replacement. This procedure is referred to as bagging. Moreover, only a randomly selected subset of the available attributes is considered when choosing each interior split. The instances not included in the bootstrap replicate of a specific tree are said to be out-of-bag for that tree. The random forest algorithm has frequently been demonstrated to achieve state-of-the-art predictive performance, see e.g., Caruana and Niculescu-Mizil (2006) and Delgado et al. (2014) for large-scale comparisons. Random forests can be used for several different tasks, including classification, regression, ranking and probability estimation, see e.g., Boström (2012). In addition to the strong predictive performance, the learning algorithm, being embarrassingly parallel, lends itself to efficient implementation on multi-core platforms, see e.g., Boström (2011) and Jansson et al. (2014).
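As a concrete illustration, the following minimal sketch trains a bagged forest and retrieves out-of-bag estimates using scikit-learn in Python. This is not the implementation used in the experiments (which rely on the MATLAB TreeBagger, see Sect. 4); the data set and parameter values are chosen only for the example.

```python
# Minimal sketch: training a random forest and obtaining out-of-bag (OOB)
# estimates with scikit-learn. Illustration only; the experiments in this
# paper use the MATLAB TreeBagger implementation (see Sect. 4).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,      # number of random trees in the ensemble
    max_features="sqrt",   # random subset of attributes at each split
    bootstrap=True,        # each tree is trained on a bootstrap replicate
    oob_score=True,        # keep track of out-of-bag predictions
    random_state=0,
)
forest.fit(X, y)

# Class probability estimates from the out-of-bag predictions, i.e., each
# training instance is predicted only by trees for which it is out-of-bag.
oob_probabilities = forest.oob_decision_function_
print("OOB accuracy:", forest.oob_score_)
```

Each training instance is thus predicted by roughly the third of the trees for which it is out-of-bag, which is the mechanism exploited for calibration in Sect. 3.2.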

2.3 Platt scaling

Platt scaling (Platt 1999) was originally introduced as a method for calibrating support-vector machines. The method maximizes the likelihood of the training set by finding parameters for the sigmoid function:

$$\begin{aligned} \hat{p}(c \mid s)=\frac{1}{1+e^{As+B}}, \end{aligned}$$
(2)

where \(\hat{p}(c \mid s)\) gives the probability that an example belongs to class c, given that it has obtained the score s, and where A and B are parameters, which are found by gradient descent search, minimizing a particular loss function that was devised by Platt (1999).
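A simplified sketch of this calibration step is given below, where the sigmoid in Eq. (2) is fitted by (effectively unregularized) logistic regression on the scores of a calibration set; Platt's original procedure uses slightly regularized targets and a dedicated optimizer, which is omitted here for brevity. The data values are purely illustrative.

```python
# Simplified sketch of Platt scaling: fit the sigmoid in Eq. (2) to
# (score, label) pairs from a calibration set via logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores, labels):
    """Return a function mapping a raw score s to a calibrated P(class=1 | s)."""
    lr = LogisticRegression(C=1e6)  # effectively unregularized
    lr.fit(np.asarray(scores).reshape(-1, 1), labels)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

# Usage: scores produced by the underlying model on the calibration set
calibrate = fit_platt(np.array([0.1, 0.4, 0.35, 0.8]), np.array([0, 0, 1, 1]))
print(calibrate([0.5]))  # calibrated probability for a new score
```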

2.4 Isotonic regression

Zadrozny and Elkan (2001) suggested using isotonic regression for calibrating probabilities. It can be seen as a binning approach that does not require the number of bins or the bin sizes to be specified. The calibration function, which is assumed to be isotonic, i.e., non-decreasing, is a piecewise constant regression function, which can be learned by an algorithm known as the pair-adjacent violators (PAV) algorithm. Starting from the original scores, the algorithm repeatedly merges adjacent score intervals whenever a lower interval has a relative frequency of positively labeled examples that is higher than or equal to that of the interval above it. When eventually no such pair of intervals can be found, the algorithm outputs a function that, for each score interval, returns the relative frequency of positively labeled examples with a score in that interval. For a detailed description of the algorithm, see Niculescu-Mizil and Caruana (2005).
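The following is a minimal sketch of the PAV algorithm as described above (scikit-learn's IsotonicRegression provides an equivalent fit); the merging rule, including merging on equality, matches the description, and the example data are illustrative only.

```python
# Minimal sketch of the pair-adjacent violators (PAV) algorithm for isotonic
# calibration: scores are sorted, and adjacent intervals are merged whenever a
# lower interval has a relative frequency of positives that is not lower than
# the one above it.
import numpy as np

def pav_calibrate(scores, labels):
    order = np.argsort(scores)
    s, y = np.asarray(scores)[order], np.asarray(labels, dtype=float)[order]
    # Each block: [sum of labels, count, right-most score of the block]
    merged = []
    for yi, si in zip(y, s):
        merged.append([yi, 1.0, si])
        # Merge while the previous block's positive rate >= the current one's
        while len(merged) > 1 and \
                merged[-2][0] / merged[-2][1] >= merged[-1][0] / merged[-1][1]:
            top = merged.pop()
            merged[-1][0] += top[0]
            merged[-1][1] += top[1]
            merged[-1][2] = top[2]
    thresholds = np.array([b[2] for b in merged])
    values = np.array([b[0] / b[1] for b in merged])
    # Step function: each score is mapped to the relative frequency of
    # positives in the interval it falls into.
    return lambda x: values[np.minimum(np.searchsorted(thresholds, x), len(values) - 1)]

g = pav_calibrate([0.2, 0.3, 0.5, 0.7, 0.9], [0, 1, 0, 1, 1])
print(g(np.array([0.25, 0.8])))
```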

3 Venn predictors

Venn predictors, as introduced by Vovk et al. (2004), are multi-probabilistic predictors with proven validity properties. The impossibility result mentioned earlier for probabilistic prediction is circumvented in two ways: (i) multiple probabilities for each label are output, with one of them being the valid one; (ii) the statistical tests for validity are restricted to calibration. More specifically, the probabilities must be matched by observed frequencies. As an example, if we make a number of predictions with the probability estimate 0.90, these predictions should be correct in about 90% of the cases.

Venn predictors are related to the more well-known Conformal Prediction (CP) framework, which was introduced as an approach for associating predictions with confidence measures, see e.g. Gammerman et al. (1998) and Saunders et al. (1999). Conformal predictors are applied to the predictions from models built using classical machine learning algorithms, often referred to as the underlying models, and complement the predictions with measures of confidence.

The CP framework produces valid region predictions, i.e., the prediction region contains the true target with a pre-defined probability, when examples are drawn according to a fixed underlying distribution. In classification, a region prediction is a (possibly empty) subset of all possible labels.

Similar to CP, Venn predictors were introduced in a transductive setting, which, however, is computationally inefficient, requiring one underlying model to be trained for every possible label of each new test object. Again, as for CP, a more efficient inductive version, which requires the training of only one underlying model, has been developed (Lambrou et al. 2015). We now describe inductive Venn predictors, and the concept of multiprobability prediction, following the ideas by Lambrou et al. (2015).

To construct an inductive Venn predictor, the available labeled training examples are split into two parts: the proper training set, used to train an underlying model, and a calibration set, used to estimate label probabilities for each new test example.

Assume we have a training set of the form \(\{z_1, \dots , z_l\}\), where each instance \(z_i=(x_i,y_i)\) consists of two parts: an object \(x_i\) and a label \(y_i\). In the inductive setting, this training set is divided into the proper training set \(\{z_1, \dots , z_q\}\) and the calibration set \(\{z_{q+1}, \dots , z_l\}\). When presented with a new test object \(x_{l+1}\), the aim of Venn prediction is to estimate the probability that \(y_{l+1}=Y_j\), for each label in the set of possible labels \(\{Y_1,\dots ,Y_c\}\). The key idea of inductive Venn prediction is to divide all calibration examples into a number of categories and use the relative frequency of each label \(Y_j\in \{Y_1,\dots ,Y_c\}\) in a category to estimate the label probabilities for test instances falling into that category. The categories are defined using a Venn taxonomy, and every taxonomy leads to a different Venn predictor. Typically, the taxonomy is based on the underlying model, trained on the proper training set, and for each calibration and test object \(x_i\), the output of this model is used to assign \((x_i,y_i)\) to one of the categories. One basic Venn taxonomy, which can be used with any kind of classification model, simply puts all examples predicted with the same label into the same category.

When estimating label probabilities for a test instance, the category of that instance is first determined using the underlying model, in the same way as for the calibration instances. Then, the label frequencies of the calibration instances in that category are used to calculate the label probabilities. In addition, again as in CP, the test instance \(z_{l+1}\) is included in this calculation. However, since the true label \(y_{l+1}\) is not known for the test object \(x_{l+1}\), all possible labels \(Y_j\in \{Y_1,\dots ,Y_c\}\) are used to create a set of label probability distributions. Instead of dealing directly with these distributions, a commonly employed compact representation is to use the lower \(L(Y_j)\) and upper \(U(Y_j)\) probability estimates for each label \(Y_j\). Let k be the category assigned to the test object \(x_{l+1}\) by the Venn taxonomy, and \(Z_k\) be the set of calibration instances belonging to category k. Then the lower and upper probability estimates are defined by:

$$\begin{aligned} L(Y_j) = \frac{|\{(x_m,y_m) \in Z_k \mid y_m = Y_j\}|}{|Z_k|+1} \end{aligned}$$
(3)

and:

$$\begin{aligned} U(Y_j) = \frac{|\{(x_m,y_m) \in Z_k \mid y_m = Y_j\}|+1}{|Z_k|+1} \end{aligned}$$
(4)

In order to make a prediction \(\hat{y}_{l+1}\) for \(x_{l+1}\) using the lower and upper probability estimates, the following procedure is employed:

$$\begin{aligned} \hat{y}_{l+1} = \mathop {\arg \max }\limits _{Y_j \in \{Y_1, \ldots , Y_c\}} L(Y_j) \end{aligned}$$
(5)

The output of a Venn predictor is the above prediction \(\hat{y}_{l+1}\) together with the probability interval:

$$\begin{aligned}{}[L(\hat{y}_{l+1}),U(\hat{y}_{l+1})] \end{aligned}$$
(6)

It is proven by Vovk et al. (2005) that the multiprobability predictions produced by Venn predictors are automatically valid, regardless of the taxonomy used. Still, the choice of taxonomy matters, since it affects both the accuracy of the Venn predictor and the size of the prediction interval. Obviously, smaller probability intervals are more informative, and the probability estimates should preferably be as close to zero or one as possible.
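The procedure in Eqs. (3)-(6) can be summarized in the following minimal Python sketch, here assuming the basic taxonomy that groups examples by their predicted label and an underlying model with a scikit-learn-style predict() method; the use of NumPy arrays and the function name venn_predict are assumptions made only for this illustration.

```python
# Sketch of an inductive Venn predictor (Eqs. 3-6) with the basic taxonomy
# that places all examples with the same predicted label in the same category.
# `model` is any classifier with a predict() method, trained on the proper
# training set; X_cal, y_cal and x_test are assumed to be NumPy arrays.
import numpy as np

def venn_predict(model, X_cal, y_cal, x_test, labels):
    cal_categories = model.predict(X_cal)         # taxonomy: predicted label
    k = model.predict(x_test.reshape(1, -1))[0]   # category of the test object
    y_k = y_cal[cal_categories == k]              # calibration labels in category k
    n_k = len(y_k)
    lower = {c: np.sum(y_k == c) / (n_k + 1) for c in labels}        # Eq. (3)
    upper = {c: (np.sum(y_k == c) + 1) / (n_k + 1) for c in labels}  # Eq. (4)
    y_hat = max(labels, key=lambda c: lower[c])   # Eq. (5)
    return y_hat, (lower[y_hat], upper[y_hat])    # Eq. (6)
```

Note that the tentative inclusion of the test instance only affects the denominator and the +1 in the numerator of the upper estimate, since the category assignment of this taxonomy depends on the predicted label alone.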

3.1 Venn-Abers predictors

One challenge with Venn predictors is to identify the most suitable taxonomy to use. Venn-Abers predictors (Vovk and Petej 2012) are Venn predictors applicable to two-class problems, where the taxonomy is automatically optimized using isotonic regression. Thus, the Venn-Abers predictor inherits the validity guarantee of Venn predictors.

Many classifiers are scoring classifiers, i.e., when they make a prediction for a test object, the output is a prediction score \(s(x)\). In a two-class problem, with labels 0 and 1, the actual prediction is obtained by comparing the score to a fixed threshold c, predicting the label of x to be 1 if \(s(x)>c\). An alternative to using a fixed threshold c is to apply an increasing function g to s(x) to calibrate the scores. After calibration, g(s(x)) should be interpreted as the probability that the label of x is 1.

Venn-Abers predictors use isotonic regression, as described in Sect. 2.4, for the calibration. A multiprobabilistic prediction from a Venn-Abers predictor is, in the inductive setting, produced as follows: let \(s_0\) be the scoring function for \(\{z_{q+1},\dots ,z_l,(x_{l+1},0)\}\), \(s_1\) be the scoring function for \(\{z_{q+1},\dots ,z_l,(x_{l+1},1)\}\), \(g_0\) be the isotonic calibrator for

$$\begin{aligned} \{(s_0(x_{q+1}),y_{q+1}),\dots ,(s_0(x_l),y_l),(s_0(x_{l+1}),0)\} \end{aligned}$$
(7)

and \(g_1\) be the isotonic calibrator for

$$\begin{aligned} \{(s_1(x_{q+1}),y_{q+1}),\dots ,(s_1(x_l),y_l),(s_1(x_{l+1}),1)\} \end{aligned}$$
(8)

Then the probability interval for \(y_{l+1}=1\) is

$$\begin{aligned}{}[g_0(s_0(x_{l+1})),g_1(s_1(x_{l+1}))] \end{aligned}$$
(9)

3.2 Out-of-bag calibration

Although inductive Venn predictors remedy the computational inefficiency of their transductive counterparts, by requiring the training of only one underlying model, this typically comes at the expense of informational efficiency, i.e., less accurate models. Since only part of the data can be used to train the underlying model, it will tend to produce less accurate predictions, from which the taxonomies are constructed, causing the taxonomy categories to become less homogeneous with respect to the true class labels. Similarly, since fewer examples are available for calibration, each taxonomy category will contain fewer calibration examples, leading to less fine-grained probability estimates.

Regarding the size of the prediction intervals, previous research has shown that inductive Venn predictors will produce significantly tighter intervals, compared to the transductive approach, see e.g., Lambrou et al. (2015). The reason for this is straightforward; when using the transductive approach, the model is actually re-trained for each new test instance and class, leading to quite unstable models. In the inductive approach, though, the model is both trained and applied to the calibration set only once, i.e., the test instance does not affect the model at all, and only moderately impacts the prediction intervals.

This problem of having to trade informational efficiency for computational efficiency exists also within conformal prediction (Vovk et al. 2005), where a solution has been proposed for scenarios where an ensemble of bagged models is used, see Johansson et al. (2014). Here, calibration is performed using out-of-bag estimation, thus allowing all training data to be used both for training and calibration, without the need to retrain the underlying model for every new test instance. Due to the similarities between Venn prediction and conformal prediction, this out-of-bag calibration technique can easily be extended also to Venn predictors constructed using ensembles of bagged classifiers.

Let \(\textit{OOB}_{z_i}\) be the set of trees in the underlying random forest for which the instance \(z_i = (x_i,y_i)\) is out-of-bag, i.e., not included in the bootstrap training sample. Let \(f_{\textit{OOB}_{z_i}}(x_i)\) be the out-of-bag prediction for \(x_i\), i.e., the combined prediction of the ensemble members in \(\textit{OOB}_{z_i}\) on the object \(x_i\). We can now assign taxonomy categories to the training instances based on \(f_{\textit{OOB}_{z_i}}(x_i)\), instead of using the full underlying random forest. This means that each calibration instance is assigned a category based on a (possibly unique) sub-ensemble, containing on average about a third of the ensemble members.
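A hand-rolled sketch of this bookkeeping is shown below, where the bootstrap samples are made explicit so that \(\textit{OOB}_{z_i}\) and \(f_{\textit{OOB}_{z_i}}(x_i)\) can be computed directly; the use of scikit-learn decision trees, NumPy arrays and majority voting for combining the sub-ensemble is an illustrative assumption, not the implementation used in the experiments.

```python
# Sketch: track, for each training instance z_i, the set OOB_{z_i} of trees
# for which z_i is out-of-bag, and form the out-of-bag prediction
# f_{OOB_{z_i}}(x_i) by majority vote over that sub-ensemble.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees_with_oob(X, y, n_trees=300, random_state=0):
    rng = np.random.RandomState(random_state)
    n = len(X)
    trees, oob_sets = [], [[] for _ in range(n)]
    for t in range(n_trees):
        boot = rng.randint(0, n, n)              # bootstrap replicate (indices)
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
        tree.fit(X[boot], y[boot])
        trees.append(tree)
        for i in set(range(n)) - set(boot):      # instances out-of-bag for tree t
            oob_sets[i].append(t)
    return trees, oob_sets

def oob_prediction(trees, oob_set, x):
    """Majority vote of the sub-ensemble OOB_{z_i} on the object x."""
    votes = [trees[t].predict(x.reshape(1, -1))[0] for t in oob_set]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```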

In order to retain validity, it is essential that the calibration and test instances are treated equally by the underlying model: in a transductive Venn predictor, all of them are included in the training set, and in an inductive Venn predictor, none of them are. In out-of-bag calibration, however, all calibration instances are used for training, whereas the test instances are not. Hence, some special considerations need to be made when assigning a category to a test object. One important observation is that, while calibration instances are used during training, the prediction used to assign a category to a calibration instance is always made using a sub-ensemble for which the calibration instance was not used in the bootstrap sample, i.e., the underlying (sub-)models do not have any inherent bias towards the calibration set. Still, the two most straightforward ways of assigning a category to the test instance will not retain exchangeability:

  1. Use the entire ensemble (random forest) as the underlying model. This is, perhaps, the most natural choice, since any test instance is by default out-of-bag for all ensemble members. This, however, violates the exchangeability assumption in a fairly obvious way; the full ensemble is expected to provide more accurate predictions than the out-of-bag sub-ensembles used to make predictions for the calibration set, meaning that there is a qualitative difference in the way predictions are made, and hence categories assigned, for the calibration and test instances.

  2. Use a randomly selected sub-ensemble (containing approximately one third of the ensemble members). This approach results in a prediction and category assignment for the test object that more closely resembles that of the calibration instances. However, exchangeability is still not guaranteed, as calibration instances are assigned categories based on sub-ensembles where (at most) \(l-1\) training examples were used as training data, whereas a randomly selected sub-ensemble will be trained using at most l examples (the full training set). Hence, there is still a small qualitative difference between predictions made for calibration and test instances.

Instead, in order to retain exchangeability between calibration and test instances, we re-use an out-of-bag sub-ensemble from one of the calibration examples. By randomly selecting a calibration example \(z_r\), and using its out-of-bag sub-ensemble \(\textit{OOB}_{z_r}\) to make predictions for the test object \(x_{l+1}\), we ensure that predictions for test objects are qualitatively identical to those made for the calibration instances. This is evident due to the fact that both \(z_r\) and \(z_{l+1}\) are, by definition, out-of-bag for \(\textit{OOB}_{z_r}\), and hence, \(\textit{OOB}_{z_r}\) is expected to perform identically on both instances. In total, all predictions for the instances \(z_1, \dots , z_{r-1}, z_{r+1}, \dots , z_{l+1}\) are made using sub-ensembles of approximately equal size, trained using (at most) \(l-1\) examples; in all cases, the prediction made for any \(z_i\) is done using a sub-ensemble for which \(z_i\) was not included in the training set.

So, during prediction, a random index \(r \in [1, l]\) is selected, where l is the size of the training set. A category k is assigned to the test instance \(x_{l+1}\) based on \(f_{\textit{OOB}_{z_r}}(x_{l+1})\), i.e., the prediction for \(x_{l+1}\) made by the out-of-bag sub-ensemble for \(z_r\). Lower and upper probability estimates are then computed by not including \(z_r\) in the calibration set \(Z_k\):

$$\begin{aligned} L(Y_j) = \frac{|\{(x_m,y_m) \in Z_k\setminus \{z_r\} \mid y_m = Y_j\}|}{|Z_k\setminus \{z_r\}|+1} \end{aligned}$$
(10)

and:

$$\begin{aligned} U(Y_j) = \frac{|\{(x_m,y_m) \in Z_k \setminus \{z_r\} \mid y_m = Y_j\}|+1}{|Z_k\setminus \{z_r\}|+1} \end{aligned}$$
(11)

A full proof of how this procedure maintains exchangeability between calibration and test instances is provided by Boström et al. (2017).
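Putting the pieces together, the following sketch (reusing bagged_trees_with_oob() and oob_prediction() from the sketch above, and again using the predicted-label taxonomy) computes an out-of-bag-calibrated Venn prediction according to Eqs. (10) and (11); the function and variable names are illustrative assumptions.

```python
# Sketch of the out-of-bag-calibrated Venn predictor (Eqs. 10-11), reusing
# bagged_trees_with_oob() and oob_prediction() from the sketch above.
import numpy as np

def venn_predict_oob(trees, oob_sets, X_train, y_train, x_test, labels, rng):
    n = len(X_train)
    # Category of every calibration instance from its own OOB sub-ensemble
    cal_categories = np.array([
        oob_prediction(trees, oob_sets[i], X_train[i]) for i in range(n)])
    r = rng.randint(0, n)                               # random calibration index
    k = oob_prediction(trees, oob_sets[r], x_test)      # category of the test object
    in_k = (cal_categories == k) & (np.arange(n) != r)  # Z_k \ {z_r}
    y_k = y_train[in_k]
    n_k = len(y_k)
    lower = {c: np.sum(y_k == c) / (n_k + 1) for c in labels}        # Eq. (10)
    upper = {c: (np.sum(y_k == c) + 1) / (n_k + 1) for c in labels}  # Eq. (11)
    y_hat = max(labels, key=lambda c: lower[c])
    return y_hat, (lower[y_hat], upper[y_hat])
```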

4 Method

In the empirical investigation, we look at different ways of utilizing random forests for probabilistic prediction. All experiments were performed in MATLAB, and the random forests were generated using the MATLAB implementation of the algorithm, TreeBagger. All parameter values were left at their defaults, with the exception of using 300 trees in the forest.

The 22 data sets used are all two-class problems, publicly available from either the UCI repository (Bache and Lichman 2013) or the PROMISE Software Engineering Repository (Shirabad and Menzies 2005). In the experimentation, standard \(10\times 10\)-fold cross-validation was used.

The taxonomy used for the standard Venn predictors in this study is the label prediction of the underlying model, i.e., all instances predicted with the same label are put into one category. Since all problems are two-class, the resulting taxonomy contains only two categories.

For the actual calibration, we compared using standard Venn predictors and Venn-Abers predictors to Platt scaling and isotonic regression, as well as using no external calibration, i.e., the raw estimates from the forest. In addition, we compared calibrating on the out-of-bag instances to using a separate labeled data set (the calibration set) not used for learning the trees. When using a calibration set, 2/3 of the training instances were used for the tree induction and 1/3 for the calibration. It must be noted that when calibrating on the out-of-bag instances, we follow the procedure proposed by Boström et al. (2017), as described above, i.e., using a subset of the forest for each prediction to guarantee exchangeability between calibration and test instances. For approaches that employ a separate calibration set, however, the entire forest is used for the predictions. In summary, we compare the following ten approaches:

  • RF-cal: The raw estimates from the forest estimated from a separate calibration set.

  • RF-oob: The raw estimates from the forest estimated from the out-of-bag set.

  • Platt-cal: Standard Platt scaling where the logistic regression model was learned on the calibration set.

  • Platt-oob: Platt scaling calibrating on the out-of-bag set.

  • Iso-cal: Standard isotonic regression based on the calibration set.

  • Iso-oob: Isotonic regression calibrated on the out-of-bag set.

  • VP-cal: A Venn predictor calibrated on a separate data set and using the predicted label from the underlying model as the category.

  • VP-oob: A Venn predictor calibrated on the out-of-bag set and using the predicted label from the underlying model as the category.

  • VAP-cal: A Venn-Abers predictor calibrated on a separate data set.

  • VAP-oob: A Venn-Abers predictor calibrated on the out-of-bag set.

In the experimentation, we want to evaluate different criteria. For all ten setups, we compare the probability estimates to the true observed accuracies. Specifically, we evaluate the quality of the probability estimates using the Brier score (Brier 1950). For two-class problems, let \(y_i\) denote the response variable (class) of instance i, where \(y_i\) = 0 or 1, and let \(p_i\) denote the probability estimate that instance i belongs to class 1. The Brier score is then defined as

$$\begin{aligned} \textit{Brier Score}=\sum _{i=1}^N (y_i-p_i)^2, \end{aligned}$$
(12)

where N is the number of instances. The Brier score is consequently the sum of the squared differences between the true class and the predicted probability over all instances. The Brier score can be further decomposed into three terms called uncertainty, resolution and reliability. In practice, this is done by dividing the range of probability values, i.e., [0, 1], into K intervals and representing each interval \(1, 2, \dots , K\) by a corresponding typical probability value \(r_k\), see Murphy (1973). Here, the reliability term measures how close the probability estimates are to the true probabilities, i.e., it directly measures how well-calibrated the estimates are. The reliability is defined as

$$\begin{aligned} \textit{Reliability}=\frac{1}{N}\sum _{k=1}^K n_k(r_k-\phi _k)^2, \end{aligned}$$
(13)

where \(n_k\) is the number of instances in interval k, \(r_k\) is the mean probability estimate for the positive class over the instances in interval k, \(\phi _k\) is the proportion of instances in interval k actually belonging to the positive class, and N is the total number of instances. It must be noted that, contrary to what the term suggests, a lower reliability score is better. In the experimentation, the number of intervals K was set to 100. When calculating the probability estimate for the positive class from the Venn predictors and Venn-Abers predictors, the center point of the corresponding prediction interval was used. Another option for producing a single probability estimate from a Venn predictor interval is suggested by Vovk and Petej (2012). While that method is theoretically sound, providing a regularized value where the estimate is moved towards the neutral value 0.5, the differences between the two methods are most often very small in practice.
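A minimal sketch of these two evaluation metrics is given below; the Brier score is computed as the sum in Eq. (12) (not a mean), and the reliability term of Eq. (13) is computed over K equal-width intervals of [0, 1]. The equal-width binning and the function names are assumptions made for this illustration.

```python
# Sketch of the evaluation metrics: the Brier score as defined in Eq. (12)
# and the reliability term of its decomposition, Eq. (13), using K
# equal-width probability intervals.
import numpy as np

def brier_score(y, p):
    """Eq. (12): sum of squared differences between labels (0/1) and estimates."""
    return float(np.sum((np.asarray(y) - np.asarray(p)) ** 2))

def reliability(y, p, K=100):
    """Eq. (13): lower values indicate better-calibrated estimates."""
    y, p = np.asarray(y, dtype=float), np.asarray(p)
    N = len(y)
    # Assign each estimate to one of K equal-width intervals of [0, 1]
    bins = np.minimum((p * K).astype(int), K - 1)
    rel = 0.0
    for k in range(K):
        idx = bins == k
        n_k = np.sum(idx)
        if n_k == 0:
            continue
        r_k = np.mean(p[idx])      # mean estimate in interval k
        phi_k = np.mean(y[idx])    # observed fraction of positives in interval k
        rel += n_k * (r_k - phi_k) ** 2
    return rel / N
```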

For the Venn predictors and the Venn-Abers predictors, we also check the validity by making sure that the observed accuracies, i.e., the percentage of correctly predicted test instances, actually fall in (or at least are close to) the intervals. In addition to the quality of the estimates, there are two additional important metrics when comparing the Venn predictors and the Venn-Abers predictors:

  • Interval size: The tighter the interval, the more informative it is.

  • Accuracy: The predictive performance of the model is, of course, vital in all predictive modeling.

5 Results

We start by investigating the overall quality of the estimates. Table 1 shows the differences between the estimates (averaged over all instances for each data set) and the corresponding accuracies. Looking first at the raw frequencies from the random forests, we see that while the estimates are fairly accurate, they tend to be too pessimistic. Averaged over all data sets, the difference between the estimates and the actual accuracies is approximately 1.5 percentage points. Using Platt scaling and a separate calibration set, on the other hand, produces too optimistic estimates. Platt scaling on the out-of-bag set is clearly better, but still systematically optimistic. Interestingly enough, the same holds for isotonic regression, but the estimates are generally worse than for Platt scaling. The Venn predictor, though, when looking at these aggregated results, appears to be exceptionally well-calibrated. The Venn-Abers predictor, finally, is rather well-calibrated when using out-of-bag calibration, but too optimistic when calibrated on a separate data set. Even if the differences may appear rather small in absolute numbers (approximately 1.5 percentage points on average), Platt scaling, isotonic regression and the Venn-Abers predictor turned out to be inherently optimistic, while the raw random forest estimates were systematically pessimistic. Consequently, one could argue that they must all be considered misleading in this study. The standard Venn predictor, on the other hand, appears to be well-calibrated; specifically, there is no inherent tendency to overestimate or underestimate the accuracy. Most importantly, it should be noted that calibrating on the out-of-bag set improved the quality of the estimates for all setups evaluated.

Table 1 Quality of estimates

In order to perform a more detailed analysis, Table 2 shows the reliability scores for the different techniques. As described above, this is a direct measurement of the quality of the probability estimates. The last row shows the average rank for that setup over all data sets.

Table 2 Reliability of estimates

First it must be noted that in this study, isotonic regression performs clearly worse than using the raw estimates from the random forest. Looking at the mean ranks, Platt scaling is slightly better than using the raw estimates from an out-of-bag calibration set, but clearly worse than calibrating the forest estimates using a separate calibration set. So, based on these results, there is little to gain from using the standard techniques Platt scaling and isotonic regression for calibrating a random forest. Turning to the Venn predictors, however, we see from the mean ranks that all four setups obtained better estimates, i.e., lower reliability scores, compared to the raw estimates from the random forest. Overall, the Venn predictor again showed the most accurate estimates, but when looking at this more detailed level, we see that calibrating on a separate data set was actually slightly better than using the out-of-bag set.

In order to determine whether the observed differences are statistically significant, we used the procedure recommended by García and Herrera (2008) and performed a Friedman test (Friedman 1937), followed by the Bergmann–Hommel dynamic procedure (Bergmann and Hommel 1988) to establish all pairwise differences. With ten setups and just 22 data sets, only a few differences are actually significant at the \(\alpha =.05\) level, see Table 3, where a 'v' indicates that the row setup obtained significantly more reliable estimates than the column setup.

Table 3 Statistically significant differences for reliability \(\alpha =.05\)

Analyzing the different Venn predictors, Table 4 shows the probability intervals, their sizes and the actual accuracies on each data set. An underlined accuracy means that it is outside the prediction interval.

Table 4 Venn predictor and Venn-Abers predictor intervals

First of all, we see that all setups are valid, i.e., for each setup, the empirical accuracy falls inside the produced interval for a very large majority of the data sets. Actually, even for the rare data sets where the accuracy is not within the interval, it is most often very close. While we expect the Venn and Venn-Abers predictors to be well-calibrated, it must be noted that the intervals produced are much smaller than what is typically the case when using the original transductive approach, see e.g., Papadopoulos (2013). This is also consistent with the findings of Lambrou et al. (2015).

Comparing the sizes of the intervals between the different setups, we observe fairly large differences. On average over all data sets, the intervals for VP-oob are smaller than one percentage point (.006), while for VAP-cal the average is .075, i.e., over seven percentage points. Looking at the mean ranks, we find that there is a clear ordering, which is the same for every data set; VP-oob produced the smallest intervals, followed by VP-cal, VAP-oob and finally VAP-cal. Obtaining such tight (and still valid) intervals for the probabilistic predictions is of course a very strong result for the Venn predictor.

Turning to the accuracies, we see that VP-oob is again the best choice. Here, however, VAP-oob is the second best, indicating that the use of an out-of-bag calibration set results in more accurate models. Interestingly enough, models calibrated on the out-of-bag set were more accurate than the corresponding models using a separate calibration set for all setups and on every data set. While it should be no surprise that using all data for generating the models (which is possible when calibrating on the out-of-bag set) results in higher accuracies, we must remember that in this setup, the ensemble used for the actual prediction is a subset of the original ensemble. Consequently, the results actually show that a much smaller ensemble (approximately 100 trees), bagging from all data, will generally be more accurate than a larger ensemble (300 trees) with access to less data (2/3 of the original training set) for the bagging.

For the statistical testing, we again used a Friedman test (Friedman 1937), followed by Bergmann–Hommel’s dynamic procedure (Bergmann and Hommel 1988) to establish the pairwise differences, which are shown in Table 5. Here we see that all setups produced significantly higher accuracy than VAP-cal. In addition, VP-oob was significantly more accurate than VP-cal. Looking finally at VP-oob versus VAP-oob, the p value is .07, i.e., while the difference is not significant at \(\alpha =.05\), it is still a strong result for VP-oob.

Table 5 Statistically significant differences for accuracy \(\alpha =.05\)

Extending the analysis of the predictive performance, Table 6 also includes the accuracies obtained by the standard setups, i.e., using the raw estimates from the forest, Platt scaling and isotonic regression.

Table 6 Accuracy for all setups

In Table 6, we can make at least two very important observations: (i) every setup using out-of-bag calibration was more accurate than all setups requiring a separate calibration set, and (ii) while the differences are small in absolute numbers, the Venn predictor calibrated on out-of-bag instances is actually the most accurate setup overall.

6 Concluding remarks

This paper has presented the first large-scale comparison of Venn predictors and Venn-Abers predictors to existing techniques for utilizing random forests in probabilistic prediction. Specifically, the novel option to perform the calibration on the out-of-bag instances has been evaluated.

Regarding calibration, as evaluated using the reliability metric, the results show that the standard techniques Platt scaling and isotonic regression were generally ineffective for calibrating a random forest. All four Venn predictors and Venn-Abers predictors, on the other hand, were better calibrated than both the raw estimates from the random forest, and the standard techniques Platt scaling and isotonic regression.

When comparing the intervals produced by the Venn predictors and the Venn-Abers predictors to the empirical accuracies, it is obvious that all evaluated setups are valid. The interval sizes, however, varied substantially between the different setups. In fact, the ordering between the setups was identical for all data sets, and the best choice was to use a standard Venn predictor calibrated on out-of-bag instances. The intervals produced using that setup were very tight, on average just over .5 percentage points.

Also when considering model accuracy, the best option was to use a standard Venn predictor, and calibrate on out-of-bag instances. That setup was significantly more accurate than both the Venn predictor and the Venn-Abers predictor calibrated on a separate data set. In addition, it was substantially more accurate than Venn-Abers calibrated on out-of-bag.

Generally, it must be noted that calibrating on out-of-bag instead of using a separate calibration set was extremely successful for both Venn predictors and Venn-Abers predictors, resulting in tighter intervals and more accurate models on every data set.