1 Introduction

One of the main motivations for using the conformal prediction (CP) framework (Vovk et al. 2006) is that it provides guarantees for the prediction error; the probability of making incorrect predictions is bounded by a user-provided confidence threshold. In contrast to other learning frameworks that provide similar types of guarantees, e.g., PAC learning (Valiant 1984), CP makes it possible to assess the uncertainty of each single prediction. Hence, rather than just providing a bound on the prediction error for the entire distribution, CP allows for providing different bounds for different instances, something which may be very valuable in many practical applications. For example, knowing that the error of a certain model for predicting the stock price is bounded by 100 dollars with \(95~\%\) probability is not as informative as knowing that for the specific stock we are interested in, the prediction error is, in fact, bounded by 10 dollars, i.e., this particular stock is actually easier to predict than the average one. Similarly, in the medical domain, it is of course important to be able to assess the confidence in predictions related to individual patients instead of groups of patients.

CP employs some underlying predictive model, which may have been generated by any standard learning algorithm, for obtaining prediction regions rather than single point predictions. A prediction region corresponds to a set of class labels in a classification context, and to an interval in a regression context. A prediction error, in this framework, occurs when the correct label of a (test) instance is not included in the prediction region. The guarantee given by the conformal prediction framework, under the standard i.i.d. assumption, is that the probability of making an error is bounded by a predetermined confidence level. This means that the number of errors can be controlled: the error level can typically be reduced by increasing the sizes of the prediction regions, and vice versa. There is an obvious resemblance between the conformal prediction framework and standard statistical hypothesis testing, where the type I and II errors are controlled by the choice of significance level.

Since all conformal predictors are valid, i.e., the probability of excluding the correct label is bounded by the confidence level, the main criterion when comparing different conformal predictors is their efficiency, i.e., the sizes of output prediction regions. Efficiency is, for classification, often measured as the (average) number of labels present in the prediction sets, and for regression as the (average) size of the intervals.

CP relies on real-valued functions, called nonconformity functions, that provide estimates for how different a new example is from a set of old examples. In a predictive modeling scenario, nonconformity functions use the underlying model to determine how strange the relationship between the feature vector (the input) and an output value for a certain new instance is, compared to a set of previously observed instances.

It is possible to design many different nonconformity functions for a specific predictive model, and each of them will define a different conformal predictor. All of these conformal predictors will be valid, but there may be significant differences in terms of efficiency. In the extreme case, even a function that returns the same nonconformity score for all examples will be valid, but the prediction regions will be very wide.

CP was originally introduced as a transductive approach for support vector machines (Gammerman et al. 1998). Transductive CP requires learning a new model for each new test instance to be predicted, which of course may be computationally prohibitive. For this reason, inductive conformal prediction (ICP) was suggested (Vovk et al. 2006). In ICP, which is the focus of this study, only one model is induced from the training data and that model is then used for predicting all test instances. In ICP, however, the calculation of the nonconformity scores requires a separate data set (called the calibration set) that was not used by the algorithm when learning the model. Consequently, it becomes very important how the training data is divided into the proper training set and the calibration set; using too few calibration instances will result in imprecise confidence values, while too few proper training instances may lead to weaker underlying models.

Looking specifically at ICP regression, there are very few published papers providing a systematic evaluation of different underlying models and nonconformity functions. As a matter of fact, most studies so far have focused on one specific underlying model and used a very limited number of data sets, making them serve mainly as proofs-of-concept; see e.g., Papadopoulos et al. (2002); Papadopoulos and Haralambous (2011). With this in mind, there is an apparent need for larger studies, explicitly evaluating techniques for producing efficient conformal predictors. Such studies should preferably explore various learning algorithms and use a sufficiently large number of data sets to allow for statistical inference, thus making it possible to establish best practices. In this paper, we compare using random forests (Breiman 2001) as the underlying model for conformal prediction regression to existing state-of-the-art conformal regressors, which are based on artificial neural networks (ANN) and k-nearest neighbors (kNN). We investigate a number of nonconformity functions, and we specifically examine the option to use out-of-bag estimates for the necessary calibration.

In summary, the main contributions of this paper are:

  • a novel method for regression conformal prediction, which utilizes random forests together with a nonconformity function that exploits out-of-bag examples as a calibration set;

  • the first large-scale empirical investigation of methods for regression conformal prediction, which includes state-of-the-art learning algorithms and multiple nonconformity functions that are evaluated on a large number of data sets;

  • significant findings concerning the relative efficiency of different conformal predictors, which provide new evidence for what may be considered best practices for regression conformal prediction.

In the next section, we formalize the conformal prediction framework and discuss related work. In Sect. 3, we describe the proposed approach for regression conformal prediction using random forests as well as competing state-of-the-art approaches. The setup for, and the results from, the empirical investigation are presented in Sect. 4. Finally, we summarize the main conclusions from the study and outline directions for future work in Sect. 5.

2 Background

In this section, we first provide a formalization of inductive conformal prediction, which is the theoretical foundation for this paper. We then briefly discuss its relation to alternative frameworks and summarize the main related previous studies upon which our study builds.

2.1 Inductive conformal prediction

An inductive conformal classifier or regressor only needs to be trained once, using the following scheme:

  1. Divide the training set \(Z = \{(x_1, y_1),\ldots , (x_l, y_l)\}\) into two disjoint subsets \(Z^t\) (a proper training set) and \(Z^c\) (a calibration set):

    • \(Z^t = \{(x_1, y_1),\ldots , (x_m, y_m)\}\)

    • \(Z^c = \{(x_{m+1}, y_{m+1}),\ldots , (x_l, y_l)\}\)

  2. Train the underlying model \(h_Z\) using \(Z^t\).

  3. For each calibration instance \((x_i, y_i) \in Z^c\):

    • let \(h_Z\) predict the output value for \(x_i\) so that \(\hat{y}_i = h_Z(x_i)\) and

    • calculate the nonconformity score \(\alpha _i\) using the nonconformity function.

For a novel (test) instance, the input pattern \(x_j\) is supplied to the underlying model, resulting in a prediction \(\hat{y}_j\). Then a nonconformity score \(\alpha _j^{\tilde{y}}\) is produced for every tentative target value \(\tilde{y}\). The \(p\)-value of each tentative target \(\tilde{y}\) is then calculated by comparing \(\alpha _j^{\tilde{y}}\) to the nonconformity scores of the calibration set \(S = \{\alpha _{1},\ldots , \alpha _{q}\}\):

$$\begin{aligned} p(\tilde{y}) = \frac{\#\{ z_i \in Z^c \mid \alpha _i \ge \alpha _j^{\tilde{y}} \} + 1}{\left| Z^c \right| + 1} \, . \end{aligned}$$
(1)

If \(p(\tilde{y}) < \delta \), the probability for \(\tilde{y}\) being the true target for \(x_j\) is smaller than \(\delta \), i.e., \(\tilde{y}\) can be excluded from the prediction region, at that confidence level. In classification, \(\tilde{y}\) represents a possible class label, and all possible labels are tested one at a time. In regression, we cannot consider every possible output value in that manner, so a conformal regressor will instead directly establish the prediction interval, for each test instance, given the confidence level. In regression, the nonconformity function is most often simply the absolute error, see e.g., Papadopoulos and Haralambous (2011); Papadopoulos et al. (2002, 2011):

$$\begin{aligned} \alpha _i = | y_i - \hat{y}_i | \, . \end{aligned}$$
(2)

Then, given a significance level \(\delta \) and a set of calibration scores \(S = \{\alpha _{1},\ldots , \alpha _{q}\}\), we locate the smallest \(\alpha _{s(\delta )} \in S\) that satisfies the equation

$$\begin{aligned} \frac{\#\{ z_i \in Z^c \mid \alpha _i < \alpha _{s(\delta )} \} +1}{\left| Z^c \right| + 1} \ge 1-\delta \, . \end{aligned}$$
(3)

Since it is not possible to consider each \(\tilde{y}\) in regression, it is also not possible to calculate the nonconformity scores \(\alpha _j^{\tilde{y}}\) for the test instance \(x_j\). Instead, \(\alpha _{s(\delta )}\) forms a probabilistic bound for the nonconformity scores at significance level \(\delta \); that is, with probability \(1-\delta \), the nonconformity of \(x_j\) will be at most \(\alpha _{s(\delta )}\). Thus, at significance \(\delta \), we can reject any label for which \(\alpha _j^{\tilde{y}} > \alpha _{s(\delta )}\), and must conversely include all labels for which \(\alpha _j^{\tilde{y}} \le \alpha _{s(\delta )}\). Using (2), \(\alpha _j^{\tilde{y}} = \alpha _{s(\delta )}\) exactly when \(|\tilde{y} - \hat{y}_j| = \alpha _{s(\delta )}\); hence, by formulating the prediction region as

$$\begin{aligned} \hat{Y}_j^\delta = \hat{y}_j \pm \alpha _{s(\delta )}, \end{aligned}$$
(4)

where \(s(\delta )\) is found from (3) above, \(\hat{Y}_j^{\delta }\) will cover the true output \(y_j\) with probability \(1-\delta \). It must be noted that when using (2) and (4), the conformal regressor will, for any specific significance level \(\delta \), always produce prediction intervals of the same size for every \(x_j\); i.e., it does not consider the difficulty of a certain instance \(x_j\) in order to provide as informative predictions as possible, which often is a key motivation for using conformal prediction in the first place. It is, however, possible to employ normalized nonconformity functions, where the absolute error is scaled using the expected accuracy of the underlying model; see e.g., Papadopoulos and Haralambous (2011); Papadopoulos et al. (2011). The motivation for this, from a conformal prediction standpoint, is that if two instances have identical nonconformity scores using (2), but the prediction for the first is expected to be more accurate than that for the second, then the second instance is actually stranger (more nonconforming) than the first. Using a normalized nonconformity function, the resulting prediction intervals will be smaller for instances that are deemed “easy” and larger for “harder” instances. When using a normalized nonconformity function, nonconformity scores are calculated using:

$$\begin{aligned} \alpha _i = \frac{| y_i - \hat{y}_i |}{\sigma _i}, \end{aligned}$$
(5)

where \(\sigma _i\) is an estimate of the accuracy of the underlying model for \(\hat{y}_i\). Naturally, there are several ways to estimate the accuracy; one suggestion is to train another model for predicting the errors; see e.g., Papadopoulos and Haralambous (2011). Other approaches use properties of the underlying model; see e.g., Papadopoulos et al. (2011). With normalized nonconformity functions, the prediction interval for \(\hat{Y}_j^\delta \) is:

$$\begin{aligned} \hat{Y}_j^\delta = \hat{y}_j \pm \alpha _{s(\delta )}\sigma _j, \end{aligned}$$
(6)

where \(\sigma _j\) is an estimate of the accuracy of the underlying model, for that instance.
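
To make the procedure above concrete, the following sketch implements standard and normalized ICP regression in Python. It is an illustrative reading of Eqs. (2)–(6), not the implementation used in this paper; the function names, the tie handling and the choice of underlying model in the usage comment are our own assumptions.

```python
# Illustrative sketch of inductive conformal regression, Eqs. (2)-(6).
# Names and details are assumptions, not the authors' implementation.
import numpy as np

def calibration_scores(y_cal, y_hat_cal, sigma_cal=None):
    """Nonconformity scores: Eq. (2), or Eq. (5) if difficulty estimates are given."""
    alphas = np.abs(y_cal - y_hat_cal)
    return alphas if sigma_cal is None else alphas / sigma_cal

def alpha_s(alphas, delta):
    """Smallest calibration score satisfying Eq. (3)."""
    sorted_alphas = np.sort(alphas)
    q = len(sorted_alphas)
    rank = int(np.ceil((1 - delta) * (q + 1)))      # 1-indexed rank in the sorted scores
    return sorted_alphas[min(rank, q) - 1]

def prediction_intervals(y_hat_test, alphas, delta, sigma_test=None):
    """Intervals according to Eq. (4), or Eq. (6) in the normalized case."""
    a = alpha_s(alphas, delta)
    half_width = a if sigma_test is None else a * sigma_test
    return np.column_stack([y_hat_test - half_width, y_hat_test + half_width])

# Usage sketch: train any regressor h on the proper training set, then
#   alphas = calibration_scores(y_cal, h.predict(X_cal))
#   intervals = prediction_intervals(h.predict(X_test), alphas, delta=0.05)
```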

2.2 Related work

As mentioned in the introduction, there are other machine learning frameworks that provide some sort of guarantee of the prediction error. Specifically, PAC-learning (Valiant 1984) will provide upper bounds on the probability of its error with respect to some confidence level. PAC theory only assumes that the instances are generated independently by some completely unknown distribution, but for the resulting bounds to be interesting in practice, the data set must be quite clean. Unfortunately, this is rarely the case for real-world data, which will lead to very loose bounds, see e.g., Nouretdinov et al. (2001), where the crudeness of PAC theory is demonstrated. In addition, the PAC bounds are for the overall error and not for individual predictions. The Bayesian framework can, on the other hand, be used to complement individual predictions with probabilistic measures of their quality. These measures are, however, based on some a priori assumption about the underlying distribution. When the assumed prior is violated, there is no guarantee that the resulting intervals produced by the Bayesian methods actually contain the true target as often as indicated by the confidence level, i.e., the resulting predictions are not valid. In Papadopoulos et al. (2011), CP is compared to the popular Bayesian method called Gaussian Processes (GP) (Rasmussen and Williams 2005). The results show that when the (artificial) data set satisfied the GP prior, the intervals produced by GP-regression were valid, and slightly tighter than the corresponding intervals produced by CP. On a number of real-world data sets, however, the predictive regions produced by GP-regression were no longer valid, i.e., they may become misleading when the correct prior is not known.

The CP framework has been applied to classification using several popular learning algorithms, such as ANNs (Papadopoulos 2008), kNN (Nguyen and Luo 2012), SVMs (Devetyarov and Nouretdinov 2010; Makili et al. 2011), decision trees (Johansson et al. 2013a), random forests (Bhattacharyya 2011; Devetyarov and Nouretdinov 2010) and evolutionary algorithms (Johansson et al. 2013b; Lambrou et al. 2011). Although we consider regression tasks in this study, there is some overlap with previous studies on classification when it comes to design choices. Specifically, in Johansson et al. (2013a), the underlying learning algorithm is also decision trees. However, in the previous study, the focus was on how properties of the algorithm, e.g., the split evaluation metric, pruning and the smoothing function, affect the efficiency of classification trees. In this paper, we instead study forests of regression trees, and investigate both standard and normalized nonconformity functions in this context. Moreover, the use of out-of-bag estimates for the calibration was suggested for random forests in Devetyarov and Nouretdinov (2010), and was also used for bagged ANNs in Löfström et al. (2013). None of these studies, however, evaluate efficiency in a systematic way while considering different underlying models and nonconformity functions. In particular, no normalized nonconformity functions were evaluated in these studies.

There are also a number of studies on conformal prediction for regression, using, for instance, ridge regression (Papadopoulos et al. 2002) and ANNs (Papadopoulos and Haralambous 2010). Two interesting and fairly recent studies do in fact evaluate normalized nonconformity functions for ANNs (Papadopoulos and Haralambous 2011) and k-Nearest Neighbors (Papadopoulos et al. 2011). Unfortunately, both studies use very few data sets, thus precluding statistical analysis. Despite these shortcomings, the suggested approaches must be regarded as state-of-the-art for ICP regression, making them natural benchmarks to compare our proposed methods against.

Conformal prediction has also been successfully used in a number of applications where confidence in the predictions is of concern, including prediction of space weather parameters (Papadopoulos and Haralambous 2011), estimation of software project effort (Papadopoulos et al. 2009b), early diagnostics of ovarian and breast cancers (Devetyarov et al. 2012) and diagnosis of acute abdominal pain (Papadopoulos et al. 2009a).

3 Methods

In this section, we first describe the proposed method for utilizing random forests for regression conformal prediction, including some variants, and then describe the competing state-of-the-art approaches.

3.1 Regression conformal prediction using random forests

A random forest (Breiman 2001) is a set of decision trees (Breiman et al. 1984; Quinlan 1986), where each tree is generated in a specific way to introduce diversity among the trees, and where predictions of the forest are formed by voting. A decision tree is a tree-structured (directed, acyclic and connected) graph, where each internal (non-leaf) node is labeled with a test on some attribute, with one arc leading to a unique (child) node for each possible outcome of the test, and where the leaf nodes of the tree are labeled with values to be predicted. If the predicted values are (categorical) class labels, the decision tree is called a classification tree, while if the predicted values are numeric, the tree is called a regression tree. When using a decision tree to predict a value for an example, starting at the root node, the test at the current internal node is performed and the arc corresponding to the outcome of the test is followed, until a leaf node is reached, for which the corresponding predicted value is returned. For a forest of classification trees, the predicted value is typically formed by selecting the majority among the predictions of the individual trees, while for a forest of regression trees, the resulting prediction is typically formed by taking the average of the individual predictions.

The standard procedure to generate a decision tree is to employ a recursive partitioning, or divide-and-conquer, strategy. Starting with all training examples at the root node of the tree, all available tests to partition the examples are evaluated, and the test that maximizes some evaluation metric, e.g., information gain for classification trees or variance reduction for regression trees, is chosen; the current node is labeled with this test and the examples are partitioned according to its outcome. The tree is then built recursively with each resulting subset, until some termination criterion is met, e.g., that all examples in the subset have the same value on the target attribute, at which point a value to predict is formed from the examples in the subset.

In order to introduce the necessary diversity among the trees in a random forest, each tree is trained on a bootstrap replicate of the original training set (Breiman 1996), i.e., a new training (multi-)set, or bag, of the same size as the original set is formed by randomly selecting examples with replacement from the original set. This means that some of the original examples may be duplicated in the bootstrap replicate, while other examples are excluded. The latter ones, for a specific tree, are said to be out-of-bag for that tree. To further increase diversity, each tree in the forest is created using the random subspace method (Ho 1998), i.e., only a randomly selected subset of all available attributes is evaluated when choosing the split at each internal node during the construction of the decision tree.
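
The bagging and out-of-bag mechanics described above can be illustrated with a few lines of Python. The sketch below is illustrative only (the study itself used MatLab's TreeBagger); scikit-learn's DecisionTreeRegressor stands in for the individual trees, with max_features implementing the random subspace selection, and the parameter values are arbitrary.

```python
# Miniature random forest illustrating bootstrap replicates, out-of-bag
# bookkeeping and averaged predictions. Illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees, oob_masks = [], []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)    # bootstrap replicate: sample n examples with replacement
        oob = np.ones(n, dtype=bool)
        oob[idx] = False                    # examples never drawn are out-of-bag for this tree
        tree = DecisionTreeRegressor(max_features=1/3)   # random subspace at each split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
        oob_masks.append(oob)
    return trees, oob_masks

def forest_predict(trees, X):
    # regression forest: average the individual tree predictions
    return np.mean([t.predict(X) for t in trees], axis=0)
```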

In this study, the implementation of the random forest algorithm from the MatLab statistics toolbox, called TreeBagger, was used. The parameters were set to the default values for regression trees, i.e., the number of attributes to evaluate at each internal node was set to one third of the total number of attributes and mean square error was used as the split criterion.

Given that random forests are frequently observed to result in state-of-the-art predictive performance, see e.g., Caruana and Niculescu-Mizil (2006), random forest models can be expected to be more accurate than the underlying models that are currently used in regression conformal predictors, i.e., kNN and ANN models. It is, however, not obvious whether or not the use of random forests will result in smaller prediction intervals, when the models are used as the basis for CP. Another important question is whether or not anything could be gained from using out-of-bag estimates for the calibration, something which is an option for random forests, but not for the previous model types, which have to resort to using a separate calibration set.

In this study, we investigate nonconformity functions that are based on absolute errors (2). The first two nonconformity functions that will be considered for random forests use no normalization, i.e., the intervals are produced using (4). The first approach, called RFi, employs standard ICP, i.e., a separate calibration set is used. In the second approach, called RFo, out-of-bag instances are instead used for the calibration. This, of course, makes it possible to use all training instances for both the training and the calibration. More specifically, when producing the nonconformity score for a calibration instance \(z_i\), the ensemble used for producing the prediction \(\hat{y}_i\) consists of all trees that were not trained using \(z_i\), i.e., \(z_i\) was out-of-bag for those trees.
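
Continuing the forest sketch above, the RFo calibration can be expressed as follows: each training instance is predicted by the sub-forest for which it is out-of-bag, and its absolute error, Eq. (2), becomes its calibration score. The fallback for instances that happen to be in-bag for every tree is our own addition; scikit-learn users can obtain the same out-of-bag predictions directly from RandomForestRegressor(oob_score=True) via its oob_prediction_ attribute.

```python
# Sketch of RFo: out-of-bag predictions supply the calibration scores, so no
# separate calibration set is needed. Continues fit_forest/forest_predict above.
import numpy as np

def oob_nonconformity(trees, oob_masks, X, y):
    n = len(y)
    scores = np.empty(n)
    for i in range(n):
        # the trees for which example i was out-of-bag
        sub = [t for t, oob in zip(trees, oob_masks) if oob[i]]
        if sub:
            y_hat_i = np.mean([t.predict(X[i:i + 1])[0] for t in sub])
        else:                                # rare: example was in-bag for every tree
            y_hat_i = forest_predict(trees, X[i:i + 1])[0]
        scores[i] = abs(y[i] - y_hat_i)
    return scores

# Test instances are predicted by the whole forest (Eq. (8)); the intervals then
# follow from Eq. (4) using these out-of-bag calibration scores.
```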

It should be noted that when using out-of-bag instances instead of a separate calibration set, the actual underlying model, i.e., the random forest, is no longer used when calculating the nonconformity scores and \(p\)-values. In fact, various subsets of the forest are used for the out-of-bag instances, but the entire forest is used for the test instances. In other words, the nonconformity functions applied to the calibration and test instances are defined differently as

$$\begin{aligned} \alpha _{calibration}&= \left| y - h_\theta (x)\right| \end{aligned}$$
(7)
$$\begin{aligned} \alpha _{test}&= \left| y - h(x) \right| , \end{aligned}$$
(8)

where \(\theta \) is a random factor determining the subset of trees for which \(x\) is out-of-bag. In general, the use of different nonconformity functions could clearly cause the resulting conformal predictor to become invalid, i.e., the probability of excluding the true target value would no longer be bounded by the provided confidence level. However, we argue that the conformal predictor in our particular setting, i.e., when using out-of-bag estimates for the calibration, must be valid. In principle, the same random component may also be used when predicting the target value for the test instance (by only considering a random subset of the forest when predicting the target of the test instance), and in that case the same nonconformity function (7) would obviously be used for all instances, hence not violating the assumptions underlying the ICP framework. When instead using the whole forest for the test instance, as proposed here, one would expect the predicted values to be closer to the true target than when using a random subset of the trees. In fact, it is well known that out-of-bag error estimates tend to overestimate the actual error made by a random forest, simply because a larger forest is normally a stronger model. This bias is not eliminated until the random forest is so large that the randomized sub-ensembles are as accurate as the entire forest. Consequently, the expected nonconformity of a test instance is less than (or for a very large forest equal to) the expected nonconformity of a calibration instance, i.e., the probability of including nonconforming targets in the prediction region is unchanged or increased when using the whole forest. Hence, rather than increasing the risk for generating an invalid conformal predictor, one would expect the conformal predictor using out-of-bag instances to be conservative. Therefore, the proposed setup should be, if anything, less efficient than if the whole forest was used together with additional calibration instances. Naturally, the validity will be investigated in the experimentation in order to support this reasoning empirically.

We also investigate three normalized nonconformity functions, i.e., the prediction regions may vary for different test instances. RFia and RFoa both use an additional (linear) ANN to predict the logarithm of the error of the underlying model, for each instance. RFia is identical to the procedure used in Papadopoulos and Haralambous (2011) and Papadopoulos and Haralambous (2010), but of course uses a random forest as the underlying model instead of an ANN. The resulting nonconformity function is:

$$\begin{aligned} \alpha _i = \frac{| y_i - \hat{y}_i |}{exp(\mu _i)+\beta }, \end{aligned}$$
(9)

where \(\mu _i\) is the prediction of the value \(ln(| y_i - \hat{y}_i |)\) produced by the linear ANN, and \(\beta \) is a parameter used to control the sensitivity of the nonconformity measure. Naturally, this ANN was trained on all pairs \((x_j, ln(| y_j - \hat{y}_j |))\) from the proper training set. Using this nonconformity function, the prediction intervals become:

$$\begin{aligned} \hat{Y}_j^\delta = \hat{y}_j \pm \alpha _{s(\delta )}(exp(\mu _j)+\beta ) \, . \end{aligned}$$
(10)

The only difference between RFia and the novel setup RFoa is that RFia uses a separate calibration set, while RFoa uses the out-of-bag instances for the calibration, i.e., the additional ANN is trained using the logarithm of the out-of-bag errors as targets. RFok, finally, is another novel setup, which, instead of employing an additional ANN for predicting the logarithm of the error, considers the average out-of-bag error (normalized with the Euclidean distance) for the \(k\) closest instances. That is, RFok is based on the same Eq. (9), but here, \(\mu _i\) is defined as the logarithm of the average out-of-bag error of the \(k\) nearest neighbors. The motivation for this novel, and quite straightforward, nonconformity function is that if neighboring instances have small out-of-bag errors, the prediction for the new instance should be accurate, i.e., that instance should be considered relatively easy. The exact number of neighbors to use is optimized (between \(1\) and \(45\)) for each fold based on the average interval size of the resulting conformal regressor. This fitting is analogous to training the ANN to learn the out-of-bag errors. Naturally, since both RFoa and RFok utilize the out-of-bag estimates for the calibration, they make it possible to use all available data as a proper training set for the random forest.
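
A sketch of the RFok difficulty estimate is given below. It reflects our reading of the description above: \(\mu_i\) is the logarithm of the average out-of-bag error of the \(k\) nearest training instances, here as a plain average (the paper normalizes with the Euclidean distance) and with a fixed \(k\) rather than the per-fold optimization; the small constant added before taking the logarithm is our own guard against zero errors.

```python
# Simplified sketch of the RFok difficulty estimate used in Eqs. (9)-(10):
# the log of the average out-of-bag error of the k nearest training instances.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_log_oob_error(X_train, oob_errors, X_query, k=5, drop_self=False, eps=1e-8):
    nn = NearestNeighbors(n_neighbors=k + int(drop_self)).fit(X_train)
    _, idx = nn.kneighbors(X_query)
    if drop_self:
        idx = idx[:, 1:]    # a calibration instance is not counted as its own neighbor
    return np.log(oob_errors[idx].mean(axis=1) + eps)

# mu_cal  = knn_log_oob_error(X_train, oob_errors, X_train, drop_self=True)
# mu_test = knn_log_oob_error(X_train, oob_errors, X_test)
# alpha_i  = |y_i - y_hat_i| / (exp(mu_i) + beta)        -- Eq. (9)
# interval = y_hat_j +/- alpha_s * (exp(mu_j) + beta)    -- Eq. (10)
```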

Since the normalization functions used in RFoa and RFok depend on the out-of-bag errors of the calibration instances, one may again raise concerns on the validity of the corresponding conformal regressors. Starting with RFok, we claim that this normalization is unbiased. More specifically, the difficulty of any instance (calibration or test) is estimated in the same manner, using its \(k\) closest calibration set neighbors, not counting a calibration instance as its own closest neighbor. Hence, for both test and calibration instances, the difficulty estimate is based on a set of examples that does not include the instance to which the estimate applies. Consequently, there is no reason to suspect that the estimated difficulty of a test instance is any less accurate than that of a calibration instance. The error rate of RFok is thus not expected to be affected by the normalization function used, and RFok is therefore expected to keep the (slightly conservative) validity of the RFo nonconformity function it is based on. In RFoa, on the other hand, the difficulty-estimating ANN model has been trained on the out-of-bag error of all calibration instances, i.e., when predicting the difficulty of some calibration instance, that particular instance will have been used in the training of the ANN, whereas the same does not apply to any test instance. So, in this case, there is indeed a bias towards the calibration set.

Consequently, for RFoa there are two forces working in opposite directions; the inherent conservatism in using out-of-bag estimates and the bias towards the calibration set when estimating the difficulty of an instance. When the latter bias is small, e.g., if the ANN is relatively weak, the resulting error rate will most likely be smaller than the confidence level, but for a larger bias, the error rate may actually be higher than the confidence level. With this in mind, it is important to recognize that RFoa is the only setup evaluated for which there is a known risk that validity is not guaranteed. Again, the empirical investigation will study how the error rate is affected by these nonconformity functions in practice.

3.2 Competing approaches

In the empirical evaluation, we compare the different variants of our suggested method to the state-of-the-art techniques from Papadopoulos and Haralambous (2011) and Papadopoulos et al. (2011). In both these papers, ICP methods were used, i.e., separate calibration sets were required. The first competing method, suggested and described in detail in Papadopoulos and Haralambous (2011), uses an ANN as the underlying model. In the most basic format (here referred to as ANN), it uses the standard nonconformity function (2) and produces intervals using (4). When using a normalized nonconformity function, the method, which is here referred to as ANNa, uses a linear ANN to predict the logarithm of the errors, and produces intervals using (9) and (10).

The second competing method is based on distance-weighted \(k\)-nearest neighbor regressors, and is suggested and described in Papadopoulos et al. (2011). In the basic format, this method (referred to as kNN) also uses the standard way of calculating nonconformity scores (2) and prediction intervals (4). In Papadopoulos et al. (2011), the authors evaluate a number of novel normalized nonconformity functions. The most efficient (here called kNNc) combines two different aspects of kNN, in order to produce as good estimates of the accuracy as possible. More specifically, the prediction from a kNN regressor is deemed to be more accurate, for a specific instance, if (i) the \(k\) nearest neighbors are close to the current test instance and (ii) the \(k\) nearest neighbors agree in their predictions. The resulting nonconformity function is

$$\begin{aligned} \alpha _i = \frac{| y_i - \hat{y}_i |}{exp(\gamma \lambda _i)+exp(\rho \xi _i)}, \end{aligned}$$
(11)

where \(\lambda \) and \(\xi \) are the measures of accuracy (difficulty) while \(\gamma \) and \(\rho \) are parameters controlling the sensitivity of each measure. Consequently, the prediction intervals are calculated using

$$\begin{aligned} \hat{Y}_j^\delta = \hat{y}_j \pm \alpha _{s(\delta )}(exp(\gamma \lambda _j)+exp(\rho \xi _j)) \, . \end{aligned}$$
(12)

It must be noted that an internal normalization of the measures is used to make sure that the two measures are of the same magnitude and robust over all data sets. For a detailed discussion on these and other difficulty estimators, see Papadopoulos et al. (2011).
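
For completeness, a rough sketch of how the two terms enter Eq. (11) is given below. The exact, internally normalized definitions of \(\lambda_i\) and \(\xi_i\) are given in Papadopoulos et al. (2011); the stand-ins used here, the mean distance to the \(k\) nearest neighbors and the standard deviation of their target values, are our own simplifications and serve only to illustrate the structure of the nonconformity function.

```python
# Hedged sketch of the kNNc nonconformity function, Eq. (11). The definitions of
# lam and xi are simplified stand-ins; see Papadopoulos et al. (2011) for the
# exact, internally normalized measures.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knnc_difficulty(X_train, y_train, X_query, k=25):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, idx = nn.kneighbors(X_query)
    lam = dist.mean(axis=1)          # neighbors far away  -> instance deemed harder
    xi = y_train[idx].std(axis=1)    # neighbors disagree  -> instance deemed harder
    return lam, xi

def knnc_nonconformity(y, y_hat, lam, xi, gamma=1.0, rho=1.0):
    return np.abs(y - y_hat) / (np.exp(gamma * lam) + np.exp(rho * xi))
```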

4 Empirical evaluation

In this section, we first describe the experimental setup, i.e., what algorithms, data sets and performance metrics have been chosen, and then report the results from the experiment.

4.1 Experimental setup

In the first (main) experiment, the competing methods were re-implemented, and a large-scale study, using \(33\) publicly available data sets, was performed. The considered data sets are small to medium sized, ranging from approximately \(500\) to \(10,000\) instances. All but one data set are from the UCI (Bache and Lichman 2013), Delve (Rasmussen et al. 1996) or KEEL (Alcalá-Fdez et al. 2011) repositories. The data sets are described in Table 1, where #inst. is the number of instances, #attrib. is the number of input attributes and #calInst is the number of instances used for calibration in the standard ICP settings.

Table 1 Data set characteristics

In the evaluation, we look at standard and normalized nonconformity functions separately. Naturally, the normalized nonconformity functions are the most important, since they provide prediction intervals of different sizes. In the second experiment, we employed the exact same settings as in the previous studies, including using only a handful of data sets, and compared our results directly to the published results.

In Experiment 1, a 10\(\times \)10-fold cross-validation scheme was used. The number of calibration instances was set to

$$\begin{aligned} q = 100\times \left\lfloor \frac{|Z|}{400}\right\rfloor - 1, \end{aligned}$$
(13)

where \(Z\) is the full training set, i.e., starting at \(99\) calibration instances for data sets with 400–799 examples, and adding \(100\) calibration instances for every additional \(400\) examples in the full training set. Before the experimentation, all target values were normalized to \([0,1]\), in order to obtain more readable efficiency comparisons across data sets. With this scaling, the size of a prediction interval, of course, expresses the fraction of the target range covered by the interval.
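
As a concrete illustration of (13), the following small helper (the naming is ours) reproduces the calibration set sizes described above.

```python
# Worked examples of Eq. (13): number of calibration instances q given the size
# of the full training set Z.
def n_calibration(z_size):
    return 100 * (z_size // 400) - 1

# n_calibration(500)   ->   99   (400-799 training examples)
# n_calibration(1000)  ->  199   (800-1199 training examples)
# n_calibration(5000)  -> 1199
```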

Regarding parameter values, we elected to use identical settings over all data sets and, when applicable, methods. Specifically, all random forests consisted of 500 random trees. For kNN regressors, \(k\) was set to \(25\), since some preliminary experiments showed that this actually produced higher efficiency than selecting different \(k\)-values based on internal cross-validation results. Similarly, all ANNs had exactly \(20\) hidden units. The sensitivity parameters had to be adjusted based on the much smaller normalized target ranges. Again, some preliminary experiments showed that the exact values were not vital, so the following values were used in Experiment 1: \(\beta = 0.01\) and \(\gamma = \rho = 1.0\).

Two things were measured for each method and data set in the experiments: the error rate, i.e., the fraction of target values in the test set that fall outside the predicted regions, and the efficiency, i.e., the size of the predicted intervals. For valid conformal predictors, the error rates should not (in the long run) exceed the chosen confidence threshold. Hence, by investigating the error rates, we may confirm (or reject) that a certain conformal predictor actually is valid. Note that this is here considered to be a binary property, i.e., we do not consider one method to be more valid than another. Given that we have a set of valid regression conformal predictors, perhaps the most interesting aspect to compare is the size of the predicted regions, as this directly corresponds to how informative these regions are. Such a comparison could be done in different ways, e.g., comparing extreme values, but we have opted for comparing the average sizes over all prediction regions. In fact, we report the median value from the ten runs of ten-fold cross-validation.
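
Both measures can be summarized in a few lines; the sketch below assumes that the prediction regions are represented as an \(n \times 2\) array of lower and upper interval bounds, which is our own choice of representation.

```python
# Error rate and efficiency as used in the evaluation: the fraction of true
# targets falling outside the predicted intervals, and the average interval size.
import numpy as np

def error_rate(intervals, y_true):
    lower, upper = intervals[:, 0], intervals[:, 1]
    return np.mean((y_true < lower) | (y_true > upper))

def efficiency(intervals):
    return np.mean(intervals[:, 1] - intervals[:, 0])
```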

In order to be able to do a direct comparison with published results, we used the same settings for our methods in Experiment 2 as originally employed for the specific data sets, with regard to the number of folds and the number of calibration instances. In addition, in this experiment, the targets were not normalized. It may be noted that parameters like \(k\) in kNN and the number of hidden neurons in the ANNs were, in the original studies, optimized based on accuracy results using internal cross-validation. All sensitivity parameters (\(\beta \), \(\gamma \) and \(\rho \)) were, however, somewhat ad hoc set to \(0.5\) in those studies, despite the fact that the importance of these parameters is heavily affected by the actual range of the target values. Consequently, we too set \(\beta = 0.5\) for all our methods in Experiment 2.

All experimentation was performed in MatLab, in particular using the Neural network and the Statistics toolboxes.

4.2 Experimental results

Table 2 demonstrates validity for the methods utilizing standard nonconformity functions. Looking at the error rates, i.e., the fraction of test instances for which the true target value falls outside the predicted region, it is reassuring to see that the empirical results for each and every data set are very close to the predetermined confidence levels. In addition, it can be noted that RFo tends to be slightly conservative, which supports the reasoning about validity in Sect. 3.1.

Table 2 Error rates for standard nonconformity functions

Looking at the interval sizes tabulated in Table 3, while remembering that the output was normalized so that an interval size of \(1.0\) would cover the entire range of the target values, it can be seen from the averaged values that the methods at the \(90~\%\) confidence level returned valid prediction intervals covering, approximately, \(25~\%\) of the range. The corresponding average values for the 95 and \(99~\%\) confidence levels are (approximately) 30 and \(50~\%\), respectively. Clearly, these valid prediction intervals must be considered informative.

Table 3 Efficiency for standard nonconformity functions

In order to compare the efficiency of the five different techniques, and to find out if there are any statistically significant differences, we used the procedure recommended in García and Herrera (2008) and performed a Friedman test (Friedman 1937), followed by the Bergmann–Hommel dynamic procedure (Bergmann and Hommel 1988) to establish all pairwise differences. Table 4 shows the adjusted p values. The most important result is that RFo is either significantly or substantially more efficient than all three competing methods, at the different confidence levels. Specifically, it should be noted that RFo outperformed RFi, clearly showing that the use of out-of-bag instances for the calibration is beneficial.

Table 4 Standard nonconformity functions

Table 5 demonstrates the validity for the methods utilizing normalized nonconformity functions. Again we see that all methods, including RFoa, produced valid and well-calibrated conformal predictors. Actually, when using 10\(\times \)10-fold cross-validation, the empirical error rates for most individual data sets are very close to the confidence level. Clearly, these results support our argumentation above that the suggested setup RFok will produce valid (possibly slightly conservative) conformal regressors. Although RFoa was argued above to be associated with a risk of producing non-valid predictions due to a bias in the difficulty estimation function, using the specific settings and parameter values employed here, this bias turned out to be compensated for by the conservative out-of-bag error estimates, thus resulting in empirical error rates below the confidence threshold even for this nonconformity function.

Table 5 Error rates for normalized nonconformity functions

Table 6 shows the interval widths for the normalized nonconformity functions. Comparing these to the results in Table 3, it is obvious that the prediction intervals here are much smaller, i.e., applying normalized nonconformity functions not only provides tuned prediction intervals for each specific test instance, but also results in intervals that are substantially tighter on average. Looking at individual methods, the mean ranks identify three groups, which are the same for all confidence levels: RFok is the most efficient method, followed by RFoa and then the other three methods. Clearly, this is a very strong result in favor of using random forests with out-of-bag examples for calibration.

Table 6 Efficiency for normalized nonconformity functions

Studying the adjusted p values in Table 7, we see that RFok is indeed significantly more efficient (for \(\alpha =0.05\)) than all other methods, with the exception of RFoa, on all three confidence levels. In addition, RFoa is either significantly or substantially more efficient than the existing methods utilizing separate calibration sets. Again, it is important to note that the two setups utilizing the out-of-bag instances clearly outperformed using a separate calibration set.

Table 7 Normalized nonconformity functions

Summarizing the main experiment, we see that all variants of the novel method produced empirically valid conformal predictors. Most importantly, the suggested approach, i.e., using random forests as the underlying model and utilizing out-of-bag instances for the calibration, clearly outperformed the existing alternatives with regard to efficiency. Finally, when comparing the specific nonconformity functions, the novel, theoretically sound and quite straightforward method to estimate the accuracy of the underlying model based on out-of-bag errors for neighboring instances, actually turned out to be the most efficient.

In order to analyze and explain the results further, the left part of Table 8 shows the accuracy (measured using Root Mean Square Error) for the different underlying models. Looking at mean values and ranks, the most obvious result is that the kNN models are the weakest. We also see that the random forests are generally the most accurate, and that there is a small but systematic advantage in using all the data for the training. That the ANN has a better mean rank, but a worse average error, than RFi is explained by the two random forest setups having very similar accuracy.

Table 8 Analysis of underlying models and estimators

The right part of Table 8 shows the quality of the difficulty estimators. More specifically, the numbers tabulated are the correlations between the estimated difficulty and the actual error made by the underlying model on the test instances. It should be noted that these estimations are calculated in quite different ways by the different setups. ANNa, RFia and RFoa all train a separate model (a linear ANN) to predict the actual error for each instance, while RFok uses the average out-of-bag error from the \(k\) nearest neighbors. kNNc, finally, does not explicitly use or model the errors of the underlying model; instead an instance is deemed to be easier if the \(k\) nearest neighbors are (relatively) close and agree in their predictions.

Comparing the different estimators, we see that the estimates produced by RFok have the highest correlation with the model errors. The second best is actually kNNc, followed by the approaches using a separate ANN model as estimator.

From this analysis, and the comparison between normalized and standard nonconformity functions above, it is obvious that although the accuracy of the underlying model is very important, the quality of the nonconformity function is vital for the efficiency. Specifically, using normalized nonconformity functions increases the efficiency, and the quality of the difficulty estimates has a clear impact on the efficiency obtained.

In order to determine the importance of parameter values, a limited post-hoc analysis was performed, see Table 9. As described above, the parameters \(\beta , \rho \) and \(\gamma \) balance the difficulty estimations against the error in the nonconformity functions. In previous studies, all parameter values were set to \(0.5\), which is not a very robust choice since the importance of the parameter value is heavily affected by the range of the target variable. In this study, as described in Sect. 4.1, the target variables were normalized to \([0,1]\), so both errors and error estimates take on much smaller values. With this in mind, it was obvious that \(\beta \) too had to be smaller. In the experiments, we set \(\beta = 0.01\), based on some initial trials. When looking at the efficiencies obtained using different values for \(\beta \), we see that \(0.01\) is actually the best choice, even if the difference when compared to \(\beta = 0\) is marginal, on most data sets. Larger values for \(\beta \), on the other hand, clearly reduce the efficiency. So, the conclusion is that although the parameter value \(\beta \) is very important and dependent on the target range, all reasonable values produce conformal predictors with similar efficiency. Looking at the kNNc setup, it should be noted that there are actually two different parameters, making it possible to balance the two different parts of the difficulty estimation. In this study, however, this was not evaluated; instead \(\rho \) and \(\gamma \) were both set to \(1.0\). From the post-hoc analysis, we can see that while larger values (i.e., \(1.25\) or \(1.5\)) would have increased the efficiency slightly for the lower confidence levels, the resulting differences are almost never large enough to change the ordering of the evaluated setups for specific data sets in the main experiment. In addition, for the confidence level of \(99~\%\), \(\rho = \gamma = 1.0\) was actually the best parameter setting.

Table 9 The importance of parameter values

In order to directly compare the efficiency of our methods to the published results in Papadopoulos and Haralambous (2011) and Papadopoulos et al. (2011), Table 10 compares the interval sizes obtained in Experiment 2 to the interval sizes published. Starting with the standard nonconformity functions, we immediately see that RFi and RFo almost always produced smaller prediction intervals than kNN. For several data sets and confidence levels, the differences are quite large. From a direct comparison, it is quite obvious that both RFo and RFi were more efficient than kNN, winning five of six data sets on all confidence levels. When compared to ANN, however, the results vary over the different confidence levels.

Table 10 Comparison to published results

For normalized nonconformity functions, we see that RFoa and, in particular, RFok are the most efficient over all data sets and confidence levels. Counting wins and losses, RFoa and RFok outperformed both kNNc and ANNa on a majority of data sets. To summarize this comparison with published results, we see that the results from the main experiment are confirmed, i.e., conformal regressors based on random forests, especially when utilizing out-of-bag instances for the calibration, outperform the existing techniques, both when using standard and normalized nonconformity functions.

5 Concluding remarks

In this paper, the use of random forests has been proposed as a strong candidate for regression conformal prediction, since it allows for the necessary calibration to be performed on the out-of-bag examples, thus making it possible to utilize all available data as a proper training set. In one of the largest empirical evaluations to date on regression conformal prediction, the random forest approach was compared to existing state-of-the-art approaches, based on ANN and kNN, for both the standard and normalized settings, i.e., when generating prediction intervals of uniform and varying sizes, respectively. The results show that the suggested approach, on almost all confidence levels and using both standard and normalized nonconformity functions, produced significantly more efficient conformal predictors than the existing alternatives. In particular, the most efficient setup overall was found to be one suggested in this paper, i.e., a random forest conformal predictor calibrated using a normalized nonconformity function based on out-of-bag errors of neighboring instances. The empirical evidence hence strongly suggests that random forests in conjunction with out-of-bag calibration constitute a highly competitive approach to conformal regression.

There are several possible directions for future research. One direction concerns the type of model to use for estimating the difficulty of each instance in the normalized setting. Currently, fairly simple models have been evaluated, i.e., kNN and linear ANNs, and gains are to be expected from considering more elaborate techniques, as well as from performing further parameter tuning. Another direction concerns investigating the application of other state-of-the-art machine learning algorithms, e.g., SVMs, to the regression conformal prediction framework and comparing the resulting conformal predictors to random forests with out-of-bag calibration.