1 Introduction

Many classifiers are able to output not only the predicted class label, but also a probability distribution over the possible classes. Such probabilistic predictions have many obvious uses; one example is to filter out unlikely or very uncertain predictions. Another generic scenario is when the probability estimates are used as the basis for a decision, typically by comparing the utility of different options. Naturally, probabilistic prediction requires that the probability estimates are well-calibrated, i.e., the predicted class probabilities must reflect the true, underlying probabilities. If this is not the case, the predicted probabilities become misleading.

There exist a number of general methods for calibrating probabilistic predictions, but the two most frequently used are Platt scaling (Platt 1999) and isotonic regression (Zadrozny and Elkan 2001). Both techniques have been successfully applied in conjunction with many different learning algorithms, including support-vector machines, boosted decision trees and naïve Bayes (Niculescu-Mizil and Caruana 2005). However, for single decision trees, as well as bagged trees and random forests, these calibration techniques have turned out to be less effective (Niculescu-Mizil and Caruana 2005), which can partly be explained by their requirement for large calibration sets. Boström (2008) showed that this problem can be mitigated when employing bagging, e.g., as in the random forest algorithm, by utilizing out-of-bag predictions, in effect allowing all training instances to be used for calibration. In this work, we investigate the use of Venn predictors (Vovk et al. 2004) and Venn-Abers predictors (Vovk and Petej 2012) as alternative approaches to calibrating probabilities from random forests. Venn predictors (and the special case of Venn-Abers predictors) are, under the standard i.i.d. assumption, automatically valid multiprobability predictors, i.e., their probability estimates will be perfectly calibrated in the long run. The price paid for this rather remarkable property is, however, that all probabilistic predictions from a Venn predictor come in the form of intervals.

A formal description of Venn predictors and Venn-Abers predictors is given in Sects. 3 and 3.1, but the overall procedure can be described as follows: before the actual prediction, instances are divided into categories. When predicting, we first find the category to which the test instance belongs, and then tentatively classify it as every possible label, one at a time. For each tentative label, the frequencies of labels in the chosen category (including the tentative label of the test instance) are used as estimates of the test label probabilities. Since every possible label is tried, the Venn predictor outputs several (two, for a two-class problem) probability distributions for the test instance.

Venn predictors can be applied on top of any classifier, as long as it returns not only the class label, but also some score associated with the confidence in that prediction. In this paper, we focus on using the state-of-the-art predictive modeling technique random forests (Breiman 2001) as underlying models for Venn predictors. In the empirical investigation, the quality of the probability estimates from the Venn predictors will be compared both to the raw estimates from the random forests and to standard calibration techniques. In addition, different versions of Venn predictors will be compared against each other with regard to accuracy and informativeness. Specifically, since random forests are used as the underlying model, the option to calibrate the estimates on the so-called out-of-bag set, instead of setting aside a separate data set for calibration, will be investigated.

Previous evaluations of Venn predictors, such as Lambrou et al. (2015), use very few data sets, thus precluding statistical analysis, i.e., they serve mainly as proofs of concept. In this paper, we present the first large-scale empirical investigation in which Venn predictors are compared to state-of-the-art methods for calibration of probabilistic predictions, on 22 publicly available data sets.

In the next section, we first define probabilistic prediction and describe random forests, before presenting some standard calibration techniques. The Venn predictors, including the special case of Venn-Abers predictors, are described in Sect. 3. In Sect. 4, we outline the experimental setup, which is followed by the experimental results presented in Sect. 5. Finally, we summarize the main conclusions in Sect. 6.

2 Background

2.1 Probabilistic prediction

In probabilistic prediction, the task is to predict the probability distribution of the label, given the training set and the test object. The goal is to obtain a valid predictor. In general, validity means that the probability distributions from the predictor must perform well against statistical tests based on subsequent observation of the labels. In particular, we are interested in calibration, i.e., we want:

$$\begin{aligned} p(c_j \mid p^{c_j})=p^{c_j}, \end{aligned}$$
(1)

where \(p^{c_j}\) is the probability estimate for class \(c_j\). It must be noted that validity cannot be achieved for probabilistic prediction in a general sense, see e.g., Gammerman et al. (1998).

2.2 Random forests

A random forest (Breiman 2001) is an ensemble consisting of random trees, which are decision trees generated in a specific way to obtain diversity among the trees. Each random tree is trained on a bootstrap replicate, i.e., a sample obtained from n training instances by randomly selecting n instances with replacement. This procedure is referred to as bagging. Moreover, only a randomly selected subset of the available attributes is considered when choosing each interior split. The instances not included in the bootstrap replicate of a specific tree are said to be out-of-bag for that tree. The random forest algorithm has frequently been demonstrated to achieve state-of-the-art predictive performance, see e.g., Caruana and Niculescu-Mizil (2006) and Delgado et al. (2014) for large-scale comparisons. Random forests can be used for several different tasks, including classification, regression, ranking and probability estimation, see e.g., Boström (2012). In addition to the strong predictive performance, the learning algorithm, being embarrassingly parallel, lends itself to efficient implementation on multi-core platforms, see e.g., Boström (2011) and Jansson et al. (2014).
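As a concrete illustration, the following minimal sketch trains a bagged forest and retrieves out-of-bag estimates using scikit-learn in Python. This is not the implementation used in the experiments (which rely on the MATLAB TreeBagger, see Sect. 4); the data set and parameter values are chosen only for the example.

```python
# Minimal sketch: training a random forest and obtaining out-of-bag (OOB)
# estimates with scikit-learn. Illustration only; the experiments in this
# paper use the MATLAB TreeBagger implementation (see Sect. 4).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,      # number of random trees in the ensemble
    max_features="sqrt",   # random subset of attributes at each split
    bootstrap=True,        # each tree is trained on a bootstrap replicate
    oob_score=True,        # keep track of out-of-bag predictions
    random_state=0,
)
forest.fit(X, y)

# Class probability estimates from the out-of-bag predictions, i.e., each
# training instance is predicted only by trees for which it is out-of-bag.
oob_probabilities = forest.oob_decision_function_
print("OOB accuracy:", forest.oob_score_)
```

Each training instance is thus predicted by roughly the third of the trees for which it is out-of-bag, which is the mechanism exploited for calibration in Sect. 3.2.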

2.3 Platt scaling

Platt scaling (Platt 1999) was originally introduced as a method for calibrating support-vector machines. The method maximizes the likelihood of the training set by finding parameters for the sigmoid function:

$$\begin{aligned} \hat{p}(c \mid s)=\frac{1}{1+e^{As+B}}, \end{aligned}$$
(2)

where \(\hat{p}(c \mid s)\) gives the probability that an example belongs to class c, given that it has obtained the score s, and where A and B are parameters, which are found by gradient descent search, minimizing a particular loss function that was devised by Platt (1999).
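A simplified sketch of this calibration step is given below, where the sigmoid in Eq. (2) is fitted by (effectively unregularized) logistic regression on the scores of a calibration set; Platt's original procedure uses slightly regularized targets and a dedicated optimizer, which is omitted here for brevity. The data values are purely illustrative.

```python
# Simplified sketch of Platt scaling: fit the sigmoid in Eq. (2) to
# (score, label) pairs from a calibration set via logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores, labels):
    """Return a function mapping a raw score s to a calibrated P(class=1 | s)."""
    lr = LogisticRegression(C=1e6)  # effectively unregularized
    lr.fit(np.asarray(scores).reshape(-1, 1), labels)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

# Usage: scores produced by the underlying model on the calibration set
calibrate = fit_platt(np.array([0.1, 0.4, 0.35, 0.8]), np.array([0, 0, 1, 1]))
print(calibrate([0.5]))  # calibrated probability for a new score
```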

2.4 Isotonic regression

Zadrozny and Elkan (2001) suggested using isotonic regression for calibrating probabilities. It can be seen as a binning approach that does not require the number of bins or the bin sizes to be specified. The calibration function, which is assumed to be isotonic, i.e., non-decreasing, is a piecewise constant regression function, which can be learned by an algorithm known as the pair-adjacent violators (PAV) algorithm. Starting from the original scores, the algorithm repeatedly merges adjacent score intervals whenever a lower interval has a relative frequency of positively labeled examples that is higher than or equal to that of the interval above it. When eventually no such pair of intervals can be found, the algorithm outputs a function that, for each score interval, returns the relative frequency of positively labeled examples with a score in that interval. For a detailed description of the algorithm, see Niculescu-Mizil and Caruana (2005).
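The following is a minimal sketch of the PAV algorithm as described above (scikit-learn's IsotonicRegression provides an equivalent fit); the merging rule, including merging on equality, matches the description, and the example data are illustrative only.

```python
# Minimal sketch of the pair-adjacent violators (PAV) algorithm for isotonic
# calibration: scores are sorted, and adjacent intervals are merged whenever a
# lower interval has a relative frequency of positives that is not lower than
# the one above it.
import numpy as np

def pav_calibrate(scores, labels):
    order = np.argsort(scores)
    s, y = np.asarray(scores)[order], np.asarray(labels, dtype=float)[order]
    # Each block: [sum of labels, count, right-most score of the block]
    merged = []
    for yi, si in zip(y, s):
        merged.append([yi, 1.0, si])
        # Merge while the previous block's positive rate >= the current one's
        while len(merged) > 1 and \
                merged[-2][0] / merged[-2][1] >= merged[-1][0] / merged[-1][1]:
            top = merged.pop()
            merged[-1][0] += top[0]
            merged[-1][1] += top[1]
            merged[-1][2] = top[2]
    thresholds = np.array([b[2] for b in merged])
    values = np.array([b[0] / b[1] for b in merged])
    # Step function: each score is mapped to the relative frequency of
    # positives in the interval it falls into.
    return lambda x: values[np.minimum(np.searchsorted(thresholds, x), len(values) - 1)]

g = pav_calibrate([0.2, 0.3, 0.5, 0.7, 0.9], [0, 1, 0, 1, 1])
print(g(np.array([0.25, 0.8])))
```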

3 Venn predictors

Venn predictors, as introduced by Vovk et al. (2004), are multi-probabilistic predictors with proven validity properties. The impossibility result mentioned earlier for probabilistic prediction is circumvented in two ways: (i) multiple probabilities for each label are output, with one of them being the valid one; (ii) the statistical tests for validity are restricted to calibration. More specifically, the probabilities must be matched by observed frequencies. As an example, if we make a number of predictions with the probability estimate 0.90, these predictions should be correct in about 90% of the cases.

Venn predictors are related to the more well-known Conformal Prediction (CP) framework, which was introduced as an approach for associating predictions with confidence measures, see e.g. Gammerman et al. (1998) and Saunders et al. (1999). Conformal predictors are applied to the predictions from models built using classical machine learning algorithms, often referred to as the underlying models, and complement the predictions with measures of confidence.

The CP framework produces valid region predictions, i.e., the prediction region contains the true target with a pre-defined probability, when examples are drawn according to a fixed underlying distribution. In classification, a region prediction is a (possibly empty) subset of all possible labels.

Similar to CP, Venn predictors were introduced in a transductive setting, which, however, is computationally inefficient, requiring one underlying model to be trained for every possible label of each new test object. Again, as for CP, a more efficient inductive version, which requires the training of only one underlying model, has been developed (Lambrou et al. 2015). We now describe inductive Venn predictors, and the concept of multiprobability prediction, following the ideas by Lambrou et al. (2015).

To construct an inductive Venn predictor, the available labeled training examples are split into two parts: the proper training set, used to train an underlying model, and a calibration set, used to estimate label probabilities for each new test example.

Assume we have a training set of the form \(\{z_1, \dots , z_l\}\), where each instance \(z_i=(x_i,y_i)\) consists of two parts: an object \(x_i\) and a label \(y_i\). In the inductive setting, this training set is divided into the proper training set \(\{z_1, \dots , z_q\}\) and the calibration set \(\{z_{q+1}, \dots , z_l\}\). When presented with a new test object \(x_{l+1}\), the aim of Venn prediction is to estimate the probability that \(y_{l+1}=Y_j\), for each label in the set of possible labels \(\{Y_1,\dots ,Y_c\}\). The key idea of inductive Venn prediction is to divide all calibration examples into a number of categories and use the relative frequency of each label \(Y_j\in \{Y_1,\dots ,Y_c\}\) in a category to estimate the label probabilities for test instances falling into that category. The categories are defined using a Venn taxonomy, and every taxonomy leads to a different Venn predictor. Typically, the taxonomy is based on the underlying model, trained on the proper training set, and for each calibration and test object \(x_i\), the output of this model is used to assign \((x_i,y_i)\) to one of the categories. One basic Venn taxonomy, which can be used with any kind of classification model, simply puts all examples predicted with the same label into the same category.

When estimating label probabilities for a test instance, the category of that instance is first determined using the underlying model, in the same way as for the calibration instances. Then, the label frequencies of the calibration instances in that category are used to calculate the label probabilities. In addition, again as in CP, the test instance \(z_{l+1}\) is included in this calculation. However, since the true label \(y_{l+1}\) is not known for the test object \(x_{l+1}\), all possible labels \(Y_j\in \{Y_1,\dots ,Y_c\}\) are used to create a set of label probability distributions. Instead of dealing directly with these distributions, a commonly employed compact representation is to use the lower \(L(Y_j)\) and upper \(U(Y_j)\) probability estimates for each label \(Y_j\). Let k be the category assigned to the test object \(x_{l+1}\) by the Venn taxonomy, and \(Z_k\) be the set of calibration instances belonging to category k. Then the lower and upper probability estimates are defined by:

$$\begin{aligned} L(Y_j) = \frac{|\{(x_m,y_m) \in Z_k \mid y_m = Y_j\}|}{|Z_k|+1} \end{aligned}$$
(3)

and:

$$\begin{aligned} U(Y_j) = \frac{|\{(x_m,y_m) \in Z_k \mid y_m = Y_j\}|+1}{|Z_k|+1} \end{aligned}$$
(4)

In order to make a prediction \(\hat{y}_{l+1}\) for \(x_{l+1}\) using the lower and upper probability estimates, the following procedure is employed:

$$\begin{aligned} \hat{y}_{l+1} = \mathop {\arg \max }\limits _{Y_j \in \{Y_1, \ldots , Y_c\}} L(Y_j) \end{aligned}$$
(5)

The output of a Venn predictor is the above prediction \(\hat{y}_{l+1}\) together with the probability interval:

$$\begin{aligned}{}[L(\hat{y}_{l+1}),U(\hat{y}_{l+1})] \end{aligned}$$
(6)

It is proven by Vovk et al. (2005) that the multiprobability predictions produced by Venn predictors are automatically valid, regardless of the taxonomy used. Still, the choice of taxonomy matters, since it affects both the accuracy of the Venn predictor and the size of the prediction interval. Obviously, smaller probability intervals are more informative, and the probability estimates should preferably be as close to zero or one as possible.
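The procedure in Eqs. (3)-(6) can be summarized in the following minimal Python sketch, here assuming the basic taxonomy that groups examples by their predicted label and an underlying model with a scikit-learn-style predict() method; the use of NumPy arrays and the function name venn_predict are assumptions made only for this illustration.

```python
# Sketch of an inductive Venn predictor (Eqs. 3-6) with the basic taxonomy
# that places all examples with the same predicted label in the same category.
# `model` is any classifier with a predict() method, trained on the proper
# training set; X_cal, y_cal and x_test are assumed to be NumPy arrays.
import numpy as np

def venn_predict(model, X_cal, y_cal, x_test, labels):
    cal_categories = model.predict(X_cal)         # taxonomy: predicted label
    k = model.predict(x_test.reshape(1, -1))[0]   # category of the test object
    y_k = y_cal[cal_categories == k]              # calibration labels in category k
    n_k = len(y_k)
    lower = {c: np.sum(y_k == c) / (n_k + 1) for c in labels}        # Eq. (3)
    upper = {c: (np.sum(y_k == c) + 1) / (n_k + 1) for c in labels}  # Eq. (4)
    y_hat = max(labels, key=lambda c: lower[c])   # Eq. (5)
    return y_hat, (lower[y_hat], upper[y_hat])    # Eq. (6)
```

Note that the tentative inclusion of the test instance only affects the denominator and the +1 in the numerator of the upper estimate, since the category assignment of this taxonomy depends on the predicted label alone.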

3.1 Venn-Abers predictors

One challenge with Venn predictors is to identify the most suitable taxonomy to use. Venn-Abers predictors (Vovk and Petej 2012) are Venn predictors applicable to two-class problems, where the taxonomy is automatically optimized using isotonic regression. Thus, the Venn-Abers predictor inherits the validity guarantee of Venn predictors.

Many classifiers are scoring classifiers, i.e., when they make a prediction for a test object, the output is a prediction score \(s(x)\). In a two-class problem, with labels 0 and 1, the actual prediction is obtained by comparing the score to a fixed threshold c, predicting the label of x to be 1 if \(s(x)>c\). An alternative to using a fixed threshold c is to apply an increasing function g to s(x) to calibrate the scores. After calibration, g(s(x)) should be interpreted as the probability that the label of x is 1.

Venn-Abers predictors use isotonic regression, as described in Sect. 2.4, for the calibration. A multiprobabilistic prediction from a Venn-Abers predictor is, in the inductive setting, produced as follows: let \(s_0\) be the scoring function for \(\{z_{q+1},\dots ,z_l,(x_{l+1},0)\}\), \(s_1\) be the scoring function for \(\{z_{q+1},\dots ,z_l,(x_{l+1},1)\}\), \(g_0\) be the isotonic calibrator for

$$\begin{aligned} \{(s_0(x_{q+1}),y_{q+1}),\dots ,(s_0(x_l),y_l),(s_0(x_{l+1}),0)\} \end{aligned}$$
(7)

and \(g_1\) be the isotonic calibrator for

$$\begin{aligned} \{(s_1(x_{q+1}),y_{q+1}),\dots ,(s_1(x_l),y_l),(s_1(x_{l+1}),1)\} \end{aligned}$$
(8)

Then the probability interval for \(y_{l+1}=1\) is

$$\begin{aligned}{}[g_0(s_0(x_{l+1})),g_1(s_1(x_{l+1}))] \end{aligned}$$
(9)

3.2 Out-of-bag calibration

Although inductive Venn predictors remedy the computational inefficiency of their transductive counterparts, by requiring the training of only one underlying model, this typically comes at the expense of informational efficiency, i.e., less accurate models. Since only part of the data can be used to train the underlying model, it will tend to produce less accurate predictions, from which the taxonomies are constructed, causing the taxonomy categories to become less homogeneous with respect to the true class labels. Similarly, since fewer examples are available for calibration, each taxonomy category will contain fewer calibration examples, leading to less fine-grained probability estimates.

Regarding the size of the prediction intervals, previous research has shown that inductive Venn predictors will produce significantly tighter intervals, compared to the transductive approach, see e.g., Lambrou et al. (2015). The reason for this is straightforward; when using the transductive approach, the model is actually re-trained for each new test instance and class, leading to quite unstable models. In the inductive approach, though, the model is both trained and applied to the calibration set only once, i.e., the test instance does not affect the model at all, and only moderately impacts the prediction intervals.

This problem of having to trade informational efficiency for computational efficiency exists also within conformal prediction (Vovk et al. 2005), where a solution has been proposed for scenarios where an ensemble of bagged models is used, see Johansson et al. (2014). Here, calibration is performed using out-of-bag estimation, thus allowing all training data to be used both for training and calibration, without the need to retrain the underlying model for every new test instance. Due to the similarities between Venn prediction and conformal prediction, this out-of-bag calibration technique can easily be extended also to Venn predictors constructed using ensembles of bagged classifiers.

Let \(\textit{OOB}_{z_i}\) be the set of trees in the underlying random forest for which the instance \(z_i = (x_i,y_i)\) is out-of-bag, i.e., not included in the bootstrap training sample. Let \(f_{\textit{OOB}_{z_i}}(x_i)\) be the out-of-bag prediction for \(x_i\), i.e., the combined prediction of the ensemble members in \(\textit{OOB}_{z_i}\) on the object \(x_i\). We can now assign taxonomy categories to the training instances based on \(f_{\textit{OOB}_{z_i}}(x_i)\), instead of using the full underlying random forest. This means that each calibration instance is assigned a category based on a (possibly unique) sub-ensemble, containing on average about a third of the ensemble members.
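A hand-rolled sketch of this bookkeeping is shown below, where the bootstrap samples are made explicit so that \(\textit{OOB}_{z_i}\) and \(f_{\textit{OOB}_{z_i}}(x_i)\) can be computed directly; the use of scikit-learn decision trees, NumPy arrays and majority voting for combining the sub-ensemble is an illustrative assumption, not the implementation used in the experiments.

```python
# Sketch: track, for each training instance z_i, the set OOB_{z_i} of trees
# for which z_i is out-of-bag, and form the out-of-bag prediction
# f_{OOB_{z_i}}(x_i) by majority vote over that sub-ensemble.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees_with_oob(X, y, n_trees=300, random_state=0):
    rng = np.random.RandomState(random_state)
    n = len(X)
    trees, oob_sets = [], [[] for _ in range(n)]
    for t in range(n_trees):
        boot = rng.randint(0, n, n)              # bootstrap replicate (indices)
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
        tree.fit(X[boot], y[boot])
        trees.append(tree)
        for i in set(range(n)) - set(boot):      # instances out-of-bag for tree t
            oob_sets[i].append(t)
    return trees, oob_sets

def oob_prediction(trees, oob_set, x):
    """Majority vote of the sub-ensemble OOB_{z_i} on the object x."""
    votes = [trees[t].predict(x.reshape(1, -1))[0] for t in oob_set]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```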

In order to retain validity, it is essential that the calibration and test instances are treated equally by the underlying model: in a transductive Venn predictor, all of them are included in the training set, and in an inductive Venn predictor, none of them are. In out-of-bag calibration, however, all calibration instances are used for training, whereas the test instances are not. Hence, some special considerations need to be made when assigning a category to a test object. One important observation is that, while calibration instances are used during training, the prediction used to assign a category to a calibration instance is always made using a sub-ensemble for which the calibration instance was not used in the bootstrap sample, i.e., the underlying (sub-)models do not have any inherent bias towards the calibration set. Still, the two most straightforward ways of assigning a category to the test instance will not retain exchangeability:

  1. Use the entire ensemble (random forest) as the underlying model. This is, perhaps, the most natural choice, since any test instance is by default out-of-bag for all ensemble members. This, however, violates the exchangeability assumption in a fairly obvious way; the full ensemble is expected to provide more accurate predictions than the out-of-bag sub-ensembles used to make predictions for the calibration set, meaning that there is a qualitative difference in the way predictions are made, and hence categories assigned, for the calibration and test instances.

  2. Use a randomly selected sub-ensemble (containing approximately one third of the ensemble members). This approach results in a prediction and category assignment for the test object that more closely resembles that of the calibration instances. However, exchangeability is still not guaranteed, as calibration instances are assigned categories based on sub-ensembles where (at most) \(l-1\) training examples were used as training data, whereas a randomly selected sub-ensemble will be trained using at most l examples (the full training set). Hence, there is still a small qualitative difference between predictions made for calibration and test instances.

Instead, in order to retain exchangeability between calibration and test instances, we re-use an out-of-bag sub-ensemble from one of the calibration examples. By randomly selecting a calibration example \(z_r\), and using its out-of-bag sub-ensemble \(\textit{OOB}_{z_r}\) to make predictions for the test object \(x_{l+1}\), we ensure that predictions for test objects are qualitatively identical to those made for the calibration instances. This is evident due to the fact that both \(z_r\) and \(z_{l+1}\) are, by definition, out-of-bag for \(\textit{OOB}_{z_r}\), and hence, \(\textit{OOB}_{z_r}\) is expected to perform identically on both instances. In total, all predictions for the instances \(z_1, \dots , z_{r-1}, z_{r+1}, \dots , z_{l+1}\) are made using sub-ensembles of approximately equal size, trained using (at most) \(l-1\) examples; in all cases, the prediction made for any \(z_i\) is done using a sub-ensemble for which \(z_i\) was not included in the training set.

So, during prediction, a random index \(r \in [1, l]\) is selected, where l is the size of the training set. A category k is assigned to the test instance \(x_{l+1}\) based on \(f_{\textit{OOB}_{z_r}}(x_{l+1})\), i.e., the prediction for \(x_{l+1}\) made by the out-of-bag sub-ensemble for \(z_r\). Lower and upper probability estimates are then computed by not including \(z_r\) in the calibration set \(Z_k\):

$$\begin{aligned} L(Y_j) = \frac{|\{(x_m,y_m) \in Z_k\setminus \{z_r\} \mid y_m = Y_j\}|}{|Z_k\setminus \{z_r\}|+1} \end{aligned}$$
(10)

and:

$$\begin{aligned} U(Y_j) = \frac{|\{(x_m,y_m) \in Z_k \setminus \{z_r\} \mid y_m = Y_j\}|+1}{|Z_k\setminus \{z_r\}|+1} \end{aligned}$$
(11)

A full proof of how this procedure maintains exchangeability between calibration and test instances is provided by Boström et al. (2017).
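Putting the pieces together, the following sketch (reusing bagged_trees_with_oob() and oob_prediction() from the sketch above, and again using the predicted-label taxonomy) computes an out-of-bag-calibrated Venn prediction according to Eqs. (10) and (11); the function and variable names are illustrative assumptions.

```python
# Sketch of the out-of-bag-calibrated Venn predictor (Eqs. 10-11), reusing
# bagged_trees_with_oob() and oob_prediction() from the sketch above.
import numpy as np

def venn_predict_oob(trees, oob_sets, X_train, y_train, x_test, labels, rng):
    n = len(X_train)
    # Category of every calibration instance from its own OOB sub-ensemble
    cal_categories = np.array([
        oob_prediction(trees, oob_sets[i], X_train[i]) for i in range(n)])
    r = rng.randint(0, n)                               # random calibration index
    k = oob_prediction(trees, oob_sets[r], x_test)      # category of the test object
    in_k = (cal_categories == k) & (np.arange(n) != r)  # Z_k \ {z_r}
    y_k = y_train[in_k]
    n_k = len(y_k)
    lower = {c: np.sum(y_k == c) / (n_k + 1) for c in labels}        # Eq. (10)
    upper = {c: (np.sum(y_k == c) + 1) / (n_k + 1) for c in labels}  # Eq. (11)
    y_hat = max(labels, key=lambda c: lower[c])
    return y_hat, (lower[y_hat], upper[y_hat])
```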

4 Method

In the empirical investigation, we look at different ways of utilizing random forests for probabilistic prediction. All experiments were performed in MATLAB, and the random forests were generated using the MATLAB implementation of the algorithm, TreeBagger. All parameter values were left at their defaults, with the exception of using 300 trees in the forest.

The 22 data sets used are all two-class problems, publicly available from either the UCI repository (Bache and Lichman 2013) or the PROMISE Software Engineering Repository (Shirabad and Menzies 2005). In the experimentation, standard \(10\times 10\)-fold cross-validation was used.

The taxonomy used for the standard Venn predictors in this study is the label prediction of the underlying model, i.e., all instances predicted with the same label are put into one category. Since all problems are two-class, the resulting taxonomy contains only two categories.

For the actual calibration, we compared using standard Venn predictors and Venn-Abers predictors to Platt scaling and isotonic regression, as well as using no external calibration, i.e., the raw estimates from the forest. In addition, we compared calibrating on the out-of-bag instances to using a separate labeled data set (the calibration set) not used for learning the trees. When using a calibration set, 2/3 of the training instances were used for the tree induction and 1/3 for the calibration. It must be noted that when calibrating on the out-of-bag instances, we follow the procedure proposed by Boström et al. (2017), as described above, i.e., using a subset of the forest for each prediction to guarantee exchangeability between calibration and test instances. For approaches that employ a separate calibration set, however, the entire forest is used for the predictions. In summary, we compare the following ten approaches:

  • RF-cal: The raw estimates from the forest estimated from a separate calibration set.

  • RF-oob: The raw estimates from the forest estimated from the out-of-bag set.

  • Platt-cal: Standard Platt scaling where the logistic regression model was learned on the calibration set.

  • Platt-oob: Platt scaling calibrating on the out-of-bag set.

  • Iso-cal: Standard isotonic regression based on the calibration set.

  • Iso-oob: Isotonic regression calibrated on the out-of-bag set.

  • VP-cal: A Venn predictor calibrated on a separate data set and using the predicted label from the underlying model as the category.

  • VP-oob: A Venn predictor calibrated on the out-of-bag set and using the predicted label from the underlying model as the category.

  • VAP-cal: A Venn-Abers predictor calibrated on a separate data set.

  • VAP-oob: A Venn-Abers predictor calibrated on the out-of-bag set.

In the experimentation, we want to evaluate different criteria. For all ten setups, we compare the probability estimates to the true observed accuracies. Specifically, we evaluate the quality of the probability estimates using the Brier score (Brier 1950). For two-class problems, let \(y_i\) denote the response variable (class) of instance i, where \(y_i\) = 0 or 1, and let \(p_i\) denote the probability estimate that instance i belongs to class 1. The Brier score is then defined as

$$\begin{aligned} \textit{Brier Score}=\sum _{i=1}^N (y_i-p_i)^2, \end{aligned}$$
(12)

where N is the number of instances. The Brier score is consequently the sum of the squared differences between the true class and the predicted probability over all instances. The Brier score can be further decomposed into three terms called uncertainty, resolution and reliability. In practice, this is done by dividing the range of probability values, i.e., [0, 1], into K intervals and representing each interval \(1, 2, \dots , K\) by a corresponding typical probability value \(r_k\), see Murphy (1973). Here, the reliability term measures how close the probability estimates are to the true probabilities, i.e., it directly measures how well-calibrated the estimates are. The reliability is defined as

$$\begin{aligned} \textit{Reliability}=\frac{1}{N}\sum _{k=1}^K n_k(r_k-\phi _k)^2, \end{aligned}$$
(13)

where \(n_k\) is the number of instances in interval k, \(r_k\) is the mean probability estimate for the positive class over the instances in interval k, \(\phi _k\) is the proportion of instances in interval k actually belonging to the positive class, and N is the total number of instances. It must be noted that, contrary to what the term suggests, a lower reliability score is better. In the experimentation, the number of intervals K was set to 100. When calculating the probability estimate for the positive class from the Venn predictors and Venn-Abers predictors, the center point of the corresponding prediction interval was used. Another option for producing a single probability estimate from a Venn predictor interval is suggested by Vovk and Petej (2012). While that method is theoretically sound, providing a regularized value where the estimate is moved towards the neutral value 0.5, the differences between the two methods are most often very small in practice.
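A minimal sketch of these two evaluation metrics is given below; the Brier score is computed as the sum in Eq. (12) (not a mean), and the reliability term of Eq. (13) is computed over K equal-width intervals of [0, 1]. The equal-width binning and the function names are assumptions made for this illustration.

```python
# Sketch of the evaluation metrics: the Brier score as defined in Eq. (12)
# and the reliability term of its decomposition, Eq. (13), using K
# equal-width probability intervals.
import numpy as np

def brier_score(y, p):
    """Eq. (12): sum of squared differences between labels (0/1) and estimates."""
    return float(np.sum((np.asarray(y) - np.asarray(p)) ** 2))

def reliability(y, p, K=100):
    """Eq. (13): lower values indicate better-calibrated estimates."""
    y, p = np.asarray(y, dtype=float), np.asarray(p)
    N = len(y)
    # Assign each estimate to one of K equal-width intervals of [0, 1]
    bins = np.minimum((p * K).astype(int), K - 1)
    rel = 0.0
    for k in range(K):
        idx = bins == k
        n_k = np.sum(idx)
        if n_k == 0:
            continue
        r_k = np.mean(p[idx])      # mean estimate in interval k
        phi_k = np.mean(y[idx])    # observed fraction of positives in interval k
        rel += n_k * (r_k - phi_k) ** 2
    return rel / N
```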

For the Venn predictors and the Venn-Abers predictors, we also check the validity by making sure that the observed accuracies, i.e., the percentage of correctly predicted test instances, actually fall in (or at least are close to) the intervals. In addition to the quality of the estimates, there are two additional important metrics when comparing the Venn predictors and the Venn-Abers predictors:

  • Interval size: The tighter the interval, the more informative it is.

  • Accuracy: The predictive performance of the model is, of course, vital in all predictive modeling.

5 Results

We start by investigating the overall quality of the estimates. Table 1 shows the differences between the estimates (averaged over all instances for each data set) and the corresponding accuracies. Looking first at the raw frequencies from the random forests, we see that while the estimates are fairly accurate, they tend to be too pessimistic. Averaged over all data sets, the difference between the estimates and the actual accuracies is approximately 1.5 percentage points. Using Platt scaling and a separate calibration set, on the other hand, produces too optimistic estimates. Platt scaling on the out-of-bag set is clearly better, but still systematically optimistic. Interestingly enough, the same holds for isotonic regression, but the estimates are generally worse than for Platt scaling. The Venn predictor, though, when looking at these aggregated results, appears to be exceptionally well-calibrated. The Venn-Abers predictor, finally, is rather well-calibrated when using out-of-bag calibration, but too optimistic when calibrated on a separate data set. Even if the differences may appear rather small in absolute numbers (approximately 1.5 percentage points on average), Platt scaling, isotonic regression and the Venn-Abers predictor turned out to be inherently optimistic, while the raw random forest estimates were systematically pessimistic. Consequently, one could argue that they must all be considered misleading in this study. The standard Venn predictor, on the other hand, appears to be well-calibrated; specifically, there is no inherent tendency to overestimate or underestimate the accuracy. Most importantly, it should be noted that calibrating on the out-of-bag set improved the quality of the estimates for all setups evaluated.

Table 1 Quality of estimates

In order to perform a more detailed analysis, Table 2 shows the reliability scores for the different techniques. As described above, this is a direct measurement of the quality of the probability estimates. The last row shows the average rank for that setup over all data sets.

Table 2 Reliability of estimates

First it must be noted that in this study, isotonic regression performs clearly worse than using the raw estimates from the random forest. Looking at the mean ranks, Platt scaling is slightly better than using the raw estimates from an out-of-bag calibration set, but clearly worse than calibrating the forest estimates using a separate calibration set. So, based on these results, there is little to gain from using the standard techniques Platt scaling and isotonic regression for calibrating a random forest. Turning to the Venn predictors, however, we see from the mean ranks that all four setups obtained better estimates, i.e., lower reliability scores, compared to the raw estimates from the random forest. Overall, the Venn predictor again showed the most accurate estimates, but when looking at this more detailed level, we see that calibrating on a separate data set was actually slightly better than using the out-of-bag set.

In order to determine whether the observed differences are statistically significant, we used the procedure recommended by García and Herrera (2008) and performed a Friedman test (Friedman 1937), followed by the Bergmann–Hommel dynamic procedure (Bergmann and Hommel 1988) to establish all pairwise differences. With ten setups and just 22 data sets, only a few differences are actually significant at the \(\alpha =.05\) level, see Table 3, where a 'v' indicates that the row setup obtained significantly more reliable estimates than the column setup.

Table 3 Statistically significant differences for reliability \(\alpha =.05\)

Analyzing the different Venn predictors, Table 4 shows the probability intervals, their sizes and the actual accuracies on each data set. An underlined accuracy means that it is outside the prediction interval.

Table 4 Venn predictor and Venn-Abers predictor intervals

First of all, we see that all setups are valid, i.e., for each setup, the empirical accuracy falls inside the produced interval for a very large majority of the data sets. Actually, even for the rare data sets where the accuracy is not within the interval, it is most often very close. While we expect the Venn and Venn-Abers predictors to be well-calibrated, it must be noted that the intervals produced are much smaller than what is typically the case when using the original transductive approach, see e.g., Papadopoulos (2013). This is also consistent with the findings of Lambrou et al. (2015).

Comparing the sizes of the intervals between the different setups, we observe fairly large differences. On average over all data sets, the intervals for VP-oob are smaller than one percentage point (.006), while for VAP-cal the average is .075, i.e., over seven percentage points. Looking at the mean ranks, we find that there is a clear ordering, which is the same for every data set; VP-oob produced the smallest intervals, followed by VP-cal, VAP-oob and finally VAP-cal. Obtaining such tight (and still valid) intervals for the probabilistic predictions is of course a very strong result for the Venn predictor.

Turning to the accuracies, we see that VP-oob is again the best choice. Here, however, VAP-oob is the second best, indicating that the use of an out-of-bag calibration set results in more accurate models. Interestingly enough, models calibrated on the out-of-bag set were more accurate than the corresponding models using a separate calibration set for all setups and on every data set. While it should be no surprise that using all data for generating the models (which is possible when calibrating on the out-of-bag set) results in higher accuracies, we must remember that in this setup, the ensemble used for the actual prediction is a subset of the original ensemble. Consequently, the results actually show that a much smaller ensemble (approximately 100 trees), bagging from all data, will generally be more accurate than a larger ensemble (300 trees) with access to less data (2/3 of the original training set) for the bagging.

For the statistical testing, we again used a Friedman test (Friedman 1937), followed by Bergmann–Hommel’s dynamic procedure (Bergmann and Hommel 1988) to establish the pairwise differences, which are shown in Table 5. Here we see that all setups produced significantly higher accuracy than VAP-cal. In addition, VP-oob was significantly more accurate than VP-cal. Looking finally at VP-oob versus VAP-oob, the p value is .07, i.e., while the difference is not significant at \(\alpha =.05\), it is still a strong result for VP-oob.

Table 5 Statistically significant differences for accuracy \(\alpha =.05\)

Extending the analysis of the predictive performance, Table 6 also includes the accuracies obtained by the standard setups, i.e., using the raw estimates from the forest, Platt scaling and isotonic regression.

Table 6 Accuracy for all setups

In Table 6, we can make at least two very important observations: (i) every setup using out-of-bag calibration was more accurate than all setups requiring a separate calibration set, and (ii) while the differences are small in absolute numbers, the Venn predictor calibrated on out-of-bag instances is actually the most accurate setup overall.

6 Concluding remarks

This paper has presented the first large-scale comparison of Venn predictors and Venn-Abers predictors to existing techniques for utilizing random forests in probabilistic prediction. Specifically, the novel option to perform the calibration on the out-of-bag instances has been evaluated.

Regarding calibration, as evaluated using the reliability metric, the results show that the standard techniques Platt scaling and isotonic regression were generally ineffective for calibrating a random forest. All four Venn predictors and Venn-Abers predictors, on the other hand, were better calibrated than both the raw estimates from the random forest, and the standard techniques Platt scaling and isotonic regression.

When comparing the intervals produced by the Venn predictors and the Venn-Abers predictors to the empirical accuracies, it is obvious that all evaluated setups are valid. The interval sizes, however, varied substantially between the different setups. In fact, the ordering between the setups was identical for all data sets, and the best choice was to use a standard Venn predictor calibrated on out-of-bag instances. The intervals produced using that setup were very tight, on average just over .5 percentage points.

Also when considering model accuracy, the best option was to use a standard Venn predictor, and calibrate on out-of-bag instances. That setup was significantly more accurate than both the Venn predictor and the Venn-Abers predictor calibrated on a separate data set. In addition, it was substantially more accurate than Venn-Abers calibrated on out-of-bag.

Generally, it must be noted that calibrating on out-of-bag instead of using a separate calibration set was extremely successful for both Venn predictors and Venn-Abers predictors, resulting in tighter intervals and more accurate models on every data set.