Ensemble of optimal trees, random forest and random projection ensemble classification

The predictive performance of a random forest ensemble is highly associated with the strength of the individual trees and their diversity. An ensemble of a small number of accurate and diverse trees also reduces the computational burden, provided prediction accuracy is not compromised. We investigate the idea of integrating trees that are accurate and diverse. For this purpose, we use the out-of-bag observations of the training bootstrap samples as a validation sample to choose the best trees based on their individual performance, and then assess these trees for diversity using the Brier score on an independent validation sample. Starting from the first best tree, a tree is selected for the final ensemble if its addition to the forest reduces the error of the trees that have already been added. Unlike random projection ensemble classification, our approach does not use an implicit dimension reduction for each tree. A total of 35 benchmark problems on classification and regression are used to assess the performance of the proposed method and compare it with random forest, random projection ensemble, node harvest, support vector machine, kNN and classification and regression tree. We compute unexplained variances or classification error rates for all the methods on the corresponding data sets. Our experiments reveal that the size of the ensemble is reduced significantly and better results are obtained in most of the cases. Results of a simulation study are also given, where four tree-style scenarios are considered to generate data sets with several structures.

Extending this notion, Breiman (2001) suggested growing a large number, $T$ say, of classification and regression trees. The trees are grown on bootstrap samples from a given training data set $L = (X, Y) = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. The $x_i$ are observations on $d$ features, and the $y_i$ are real numbers in the case of regression and labels from a set of known classes $\{1, 2, 3, \ldots, K\}$ in the case of classification. Breiman called this method bagging and, when combined with a random selection of features at each node, random forest (Breiman 2001).
As the number of trees in a random forest is often very large, significant work has been done on minimizing this number to reduce computational cost without decreasing prediction accuracy (Bernard et al. 2009; Meinshausen 2010; Oshiro et al. 2012; Latinne et al. 2001a).
The overall prediction error of a random forest is highly associated with the strength of the individual trees and their diversity in the forest. This idea is backed by the upper bound of Breiman (2001) for the overall prediction error of a random forest, given by

$$\mathrm{Err} \le \bar{\rho}\, \mathrm{err}_j,$$

where $j = 1, 2, 3, \ldots, T$, $T$ denotes the number of all trees, $\mathrm{Err}$ is the overall prediction error of the forest, $\bar{\rho}$ represents the weighted correlation between residuals from two independent trees, i.e. the mean (expected) value of their correlation over the entire ensemble, and $\mathrm{err}_j$ is the average prediction error of the $j$th tree in the forest.

Based on the above discussion, our paper proposes to select the best trees, in terms of individual strength (i.e. accuracy) and diversity, from a large ensemble grown by random forest. Using 35 benchmark data sets, the results from the new method are compared with those of random forest, random projection ensemble (classification case only), node harvest, support vector machine, kNN and classification and regression tree (CART). For further verification, a simulation study is also given, where data sets with many tree structures are generated. The rest of the paper is organized as follows. The proposed method, the underlying algorithm and some other related approaches are given in Sect. 2; experiments and results based on benchmark and simulated data sets are given in Sect. 3. Finally, Sect. 4 gives the conclusion of the paper.

OTE: optimal trees ensemble
Random forest refines bagging by introducing additional randomness in the base models, the trees, by drawing subsets of the predictor set for partitioning the nodes of a tree (Breiman 2001). This article investigates the possibility of further refinement by proposing a method of tree selection on the basis of individual accuracy and diversity, using unexplained variance and the Brier score (Brier 1950) in the cases of regression and classification, respectively. To this end, we partition the given training data $L = (X, Y)$ randomly into two non-overlapping parts, $L_B = (X_B, Y_B)$ and $L_V = (X_V, Y_V)$. We grow $T$ classification or regression trees on $T$ bootstrap samples from the first part, $L_B = (X_B, Y_B)$. While doing so, a random sample of $p < d$ features is selected from the entire set of $d$ predictors at each node of the trees. This injects additional randomness into the trees. Due to bootstrapping, some observations are left out of each sample; these are called out-of-bag (OOB) observations. These observations take no part in the training of the tree and can be utilized in two ways:
1. In the case of regression, out-of-bag observations are used to estimate the unexplained variance of each tree grown on a bootstrap sample by the method of random forest (Breiman 2001). Trees are then ranked in ascending order with respect to their unexplained variances and the top ranked $M$ trees are chosen.
2. In the case of classification, out-of-bag observations are used to estimate the error rates of the trees grown by the method of random forest (Breiman 2001). Trees are then ranked in ascending order with respect to their error rates and the top ranked $M$ trees are chosen.
A diversity check is then carried out as follows:
1. Starting from the two top ranked trees, successively ranked trees are added one by one to see how they perform on the independent validation data. This is done until the last, $M$th, tree is tested.
2. Tree $\hat{L}_k$, $k = 1, 2, 3, \ldots, M$, is selected if its inclusion in the ensemble without the $k$th tree satisfies the following criteria, given for regression and classification respectively.
(a) In the regression case, let $U.EXP^{k-}$ be the unexplained variance of the ensemble not having the $k$th tree and $U.EXP^{k+}$ be the unexplained variance of the ensemble with the $k$th tree included. Then tree $\hat{L}_k$ is chosen if

$$U.EXP^{k+} < U.EXP^{k-}.$$

(b) In the classification case, let $\widehat{BS}^{k-}$ be the Brier score of the ensemble not having the $k$th tree and $\widehat{BS}^{k+}$ be the Brier score of the ensemble with the $k$th tree included. Then tree $\hat{L}_k$ is chosen if

$$\widehat{BS}^{k+} < \widehat{BS}^{k-},$$

where

$$\widehat{BS} = \frac{\sum_{i} \left(y_i - \hat{P}(y_i \mid X)\right)^2}{\text{total \# of test instances}},$$

$y_i$ is the observed state of the response for observation $i$ in 0/1 form and $\hat{P}(y_i \mid X)$ is the binary response probability estimated by the ensemble given the features.
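As a minimal illustration of the two performance measures and the inclusion rule, the following Python sketch may help. The function names are hypothetical and are not taken from the OTE package. The Brier score follows the definition in the text; the unexplained variance is assumed here to be the mean squared error divided by the variance of the response, which is one common convention and is not spelled out in the text.

```python
def brier_score(y_true, p_hat):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    if not y_true or len(y_true) != len(p_hat):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)


def unexplained_variance(y_true, y_hat):
    """Assumed definition: mean squared error divided by the variance of y."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    mse = sum((y - f) ** 2 for y, f in zip(y_true, y_hat)) / n
    var_y = sum((y - mean_y) ** 2 for y in y_true) / n
    return mse / var_y


def keep_tree(score_without, score_with):
    """Inclusion rule: keep the k-th tree only if it lowers the ensemble score."""
    return score_with < score_without


# Toy validation data: the second ensemble has the lower Brier score,
# so the candidate tree that produced it would be kept.
y_val = [1, 0, 1, 0]
without_k = brier_score(y_val, [0.8, 0.3, 0.6, 0.4])
with_k = brier_score(y_val, [0.7, 0.2, 0.75, 0.3])
print(keep_tree(without_k, with_k))
```

The same `keep_tree` comparison applies in the regression case, with `unexplained_variance` in place of `brier_score`.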
These trees, named as optimal trees, are then combined and are allowed to vote, in case of classification, or average, in case of regression, for new/test data. The resulting ensemble is named as optimal trees ensemble, OTE.

The Algorithm
The steps of the proposed algorithm, for both regression and classification, are:
1. Take $T$ bootstrap samples from the given portion of the training data $L_B = (X_B, Y_B)$.
2. Grow regression/classification trees on all the bootstrap samples using the random forest method.
3. Rank the trees in ascending order with respect to their prediction error on out-of-bag data. Choose the first $M$ trees with the smallest individual prediction error.
4. Add the $M$ selected trees one by one and select a tree if it improves performance on the validation data, $L_V = (X_V, Y_V)$, using unexplained variance and Brier score in the cases of regression and classification as the respective performance measures.
5. Combine and allow the trees to vote, in the case of classification, or average, in the case of regression, for new/test data.
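Steps 3 and 4 can be sketched in Python for the classification case. This is a rough sketch under stated assumptions, not the implementation in the OTE package: the trees are assumed to be already grown, each tree's out-of-bag error and its predicted class-1 probabilities on the validation data are assumed to be given, the ensemble probability is assumed to be the plain average of the tree probabilities, and the selection starts from the single best tree. All function and argument names are hypothetical.

```python
def brier_score(y_true, p_hat):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)


def select_optimal_trees(oob_errors, val_probs, y_val, m_fraction=0.2):
    """Return the indices of the trees kept in the final ensemble.

    oob_errors[t]   : out-of-bag error of tree t (step 3)
    val_probs[t][i] : tree t's class-1 probability for validation case i
    y_val[i]        : observed 0/1 class of validation case i
    m_fraction      : share of trees kept after ranking (M as a fraction of T)
    """
    n_trees = len(oob_errors)
    m = max(1, int(round(m_fraction * n_trees)))
    # Step 3: rank by OOB error (ascending) and keep the M best trees.
    ranked = sorted(range(n_trees), key=lambda t: oob_errors[t])[:m]

    # Step 4: forward pass over the ranked trees, starting from the best one.
    selected = [ranked[0]]
    prob_sum = list(val_probs[ranked[0]])
    best_score = brier_score(y_val, prob_sum)
    for t in ranked[1:]:
        cand_sum = [s + p for s, p in zip(prob_sum, val_probs[t])]
        size = len(selected) + 1
        cand_score = brier_score(y_val, [s / size for s in cand_sum])
        if cand_score < best_score:  # keep the tree only if the score drops
            selected.append(t)
            prob_sum = cand_sum
            best_score = cand_score
    return selected


# Toy example with five trees and four validation cases.
oob = [0.30, 0.10, 0.20, 0.40, 0.15]
probs = [
    [0.9, 0.1, 0.9, 0.1],
    [0.8, 0.3, 0.6, 0.4],
    [0.2, 0.8, 0.3, 0.9],
    [0.5, 0.5, 0.5, 0.5],
    [0.6, 0.1, 0.9, 0.2],
]
print(select_optimal_trees(oob, probs, [1, 0, 1, 0], m_fraction=0.8))
```

For regression, the same loop would average the tree predictions and compare unexplained variances instead of Brier scores.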
An illustrative flow chart of the proposed algorithm can be seen in Fig. 1. An algorithm based on a similar idea has previously been proposed at the European Conference on Data Analysis 2014, where probability estimation trees are used instead of classification trees (Khan et al. 2016). That ensemble of probability estimation trees is used for estimating class membership probabilities in binary class problems. The present paper, OTE, focuses on regression and classification and evaluates performance by the standard measures of unexplained variance and classification error rate. The optimal trees ensemble given in Khan et al. (2016), on the other hand, focuses on probability estimation and compares the benchmark results by Brier score. Moreover, we include a comparison of OTE and the method of Khan et al. (2016), OTE.Prob, when evaluated by classification error rates, in the analysis of benchmark problems in the last two columns of Table 5 of this paper.
Ensemble selection for kNN classifiers has also been proposed recently, where, in addition to selecting on individual accuracy, the kNN models are built on random subsets of the feature set instead of the entire feature set (Gul et al. 2016a, b).

Related approaches
Significant work has been done by various authors on reducing the number of trees in random forests. One possibility for limiting the number of trees in a random forest is to determine a priori the least number of trees to combine that gives prediction performance very similar to that of a complete random forest, as proposed by Latinne et al. (2001b). The main idea of this method is to avoid overfitting trees in the ensemble. It uses the McNemar test of significance to decide between the predictions given by two forests having different numbers of trees. Bernard et al. (2009) proposed a method of shrinking the size of the forest by using two well known selection methods, sequential forward selection and sequential backward selection, for finding sub-optimal forests. Li et al. (2010) proposed the idea of tree weighting for random forest to learn from high-dimensional data sets. They used out-of-bag samples for weighting the trees in the forest. Adler et al. (2016) have recently considered ensemble pruning to address the class imbalance problem by using AUC and Brier score for glaucoma detection. Oshiro et al. (2012) examined the performance of random forests with different numbers of trees on 29 different data sets and concluded that there is no significant gain in the prediction accuracy of a random forest from adding more than a certain number of trees. Zhang and Wang (2009) considered the similarity of outcomes between the trees and removed the trees that were similar, thus reducing the size of the forest. They called this method the "By similarity" method. However, this method was not able to compete with their proposed "By prediction" method. Motivated by the idea of downsizing ensembles, this work proposes optimal tree selection for classification and regression that could reduce computational costs and achieve promising prediction accuracy.

Simulation
This section presents four simulation scenarios, each consisting of various tree structures. The aim is to make the recognition problem slightly difficult for classifiers like kNN and CART, and to provide a challenging task for the most complex methods like SVM and random forest. In each of the scenarios, four different complexity levels are considered by changing the weights $\eta_{ijk}$ of the tree nodes. Consequently, four different values of the Bayes error are obtained, where the lowest Bayes error indicates a data set with strong patterns and the highest Bayes error a data set with weak patterns. Table 1 gives the values of $\eta_{ijk}$ used in Scenarios 1, 2, 3 and 4. Node weights for obtaining the complexity levels are listed in four columns of the table, for $k = 1, 2, 3, 4$, for each model. A generic equation produces the class probabilities of the Bernoulli response $Y = \text{Bernoulli}(p)$ given the $n \times 3T$ dimensional vector $X$ of $n$ iid observations from Uniform(0, 1). In that equation, $c_1$ and $c_2$ are arbitrary constants, $m = 1, 2, 3, 4$ is the scenario number and the $Z_m$ are $n \times 1$ probability vectors. $T$ is the total number of trees used in a scenario and the $\hat{p}_t$ are class probabilities for a particular response in $Y$. These probabilities are generated by tree structures such as

$$\hat{p}_4 = \eta_{41k} \times 1_{(x_{10} \le 0.5 \,\&\, x_{11} \le 0.5)} + \eta_{42k} \times 1_{(x_{10} \le 0.5 \,\&\, x_{11} > 0.5)} + \eta_{43k} \times 1_{(x_{10} > 0.5 \,\&\, x_{12} \le 0.5)} + \eta_{44k} \times 1_{(x_{10} > 0.5 \,\&\, x_{12} > 0.5)},$$

$$\hat{p}_5 = \cdots + \eta_{54k} \times 1_{(x_{13} > 0.5 \,\&\, x_{15} > 0.5)},$$

where $0 < \eta_{ijk} < 1$ are the weights given to the nodes of the trees and $k = 1, 2, 3, 4$.

Scenario 1
This scenario consists of 3 tree components, each grown on 3 variables, with $T = 3$ and $Z_1 = \sum_{t=1}^{3} \hat{p}_t$, and $X$ becomes an $n \times 9$ dimensional vector.

Scenario 2
In this scenario we take a total of $T = 4$ trees, where $Z_2 = \sum_{t=1}^{4} \hat{p}_t$, such that $X$ becomes an $n \times 12$ dimensional vector.

Scenario 3
This scenario is based on $T = 5$ trees, such that $Z_3 = \sum_{t=1}^{5} \hat{p}_t$ and $X$ becomes an $n \times 15$ dimensional vector.

Scenario 4
This scenario consists of 6 tree components, so that $T = 6$, $Z_4 = \sum_{t=1}^{6} \hat{p}_t$ and $X$ becomes an $n \times 18$ dimensional vector.
To understand how the trees are grown in the above simulation scenarios, a tree used in simulation Scenario 1 is given in Fig. 2.
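The tree components and the sums $Z_m$ can be illustrated with a short Python sketch. It rests on assumptions that go beyond the text: the split pattern of the fourth tree (root on $x_{10}$, left branch on $x_{11}$, right branch on $x_{12}$, all at 0.5) is generalized so that tree $t$ uses $x_{3t-2}$, $x_{3t-1}$ and $x_{3t}$, and the node weights are illustrative values rather than the entries of Table 1. The step that turns $Z_m$ into a Bernoulli probability through $c_1$ and $c_2$ is not included. All names are hypothetical.

```python
import random


def tree_probability(x, t, weights):
    """Class probability from the t-th tree component (t is 1-based).

    x       : feature values x_1 .. x_{3T}, stored 0-based in a list
    weights : (eta_t1, eta_t2, eta_t3, eta_t4) for the four terminal nodes
    Assumption: tree t splits on x_{3t-2} at the root, on x_{3t-1} in the
    left branch and on x_{3t} in the right branch, always at 0.5.
    """
    root, left, right = x[3 * t - 3], x[3 * t - 2], x[3 * t - 1]
    if root <= 0.5:
        return weights[0] if left <= 0.5 else weights[1]
    return weights[2] if right <= 0.5 else weights[3]


def z_sum(x, n_trees, weights):
    """Z_m for one observation: the sum of the T tree probabilities."""
    return sum(tree_probability(x, t, weights) for t in range(1, n_trees + 1))


# Illustrative weights: 0.9 on the two extreme nodes, 0.1 on the inner nodes.
WEIGHTS = (0.9, 0.1, 0.1, 0.9)

rng = random.Random(0)
x_row = [rng.random() for _ in range(9)]  # Scenario 1: T = 3, so 9 features
print(z_sum(x_row, 3, WEIGHTS))
```

Using one weight tuple for every tree keeps the sketch short; per-tree weights would only require passing a list of tuples.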
The values of $c_1$ and $c_2$ are fixed at 0.5 and 15, respectively, in all the scenarios for all variants. A total of $n = 1000$ observations are generated using the above setup. kNN, CART, random forest, node harvest, SVM and OTE are trained using 90% of the data as training data (of which, in the case of OTE, 90% is used for bootstrapping and 10% for the diversity check), and the remaining 10% of the data is used as test data. For OTE, $T = 1000$ trees are grown as the initial ensemble. Experiments are repeated 1000 times in each scenario, giving a total of 1000 realizations per scenario. The final results are obtained by averaging the outcomes over the 1000 realizations and are given in Table 2. Node weights are changed in a manner that makes the patterns in the data less meaningful, thus giving a higher Bayes error. This can be observed in the fourth column of Table 2, where each scenario has four values of the Bayes error, one for each complexity level. It can be observed in the simulation that the Bayes error of a scenario can be regulated by changing either the number of trees in the scenario or the node weights of the trees, or both. For example, weights of 0.9 and 0.1 assigned to the extreme nodes (rightmost and leftmost) and the inner nodes, respectively, lead to a less complex tree than one with weights of 0.8 and 0.2. The tree given in Fig. 2 is the least complex tree used in the simulation in terms of Bayes error. As anticipated, kNN and tree classifiers have the highest percentage errors in all four scenarios. Random forest and OTE performed quite similarly, with slight variations in a few cases. In cases where the models have the highest Bayes error, the results of random forest are better than or comparable with those of OTE. In all the remaining cases, where the Bayes error is smaller, OTE is better than or comparable with random forest. SVM performed very similarly to kNN and tree.
The percentage reduction in the ensemble size of OTE compared to random forest is also shown in the last column of the table. A 90% reduction in size would mean that OTE uses only 10 trees to achieve the performance level of a random forest of 100 trees. This means that OTE could be very helpful in decreasing the size of the ensemble, thus reducing storage costs. The box plots given in Fig. 3 reveal that the best results of OTE are observed in Fig. 3a, where a data set with meaningful tree structures is generated. Figure 3d is the worst example for OTE, where the Bayes error is the highest (i.e. 33%) and the data have no meaningful tree structures.

Benchmark problems
For assessing the performance of OTE on benchmark problems, we have considered 35 data sets, of which 14 are regression and 21 are classification problems. A brief summary of the data sets is given in Table 3. The upper portion of Table 3 lists the regression problems, whereas the lower portion lists the classification problems.

Experimental setup for benchmark data sets
Experiments carried out on the 35 data sets are designed as follows. Each data set is divided into two parts, a training part and a testing part. The training part consists of 90% of the total data while the testing part consists of the remaining 10%. A total of $T = 1500$ independent classification and regression trees are grown on bootstrap samples from 90% of the training data, with $p$ features selected at random for splitting the nodes of the trees. The remaining 10% of the training data is used for the diversity check. In the cases of both regression and classification, the number $p$ of features is kept constant at $p = \sqrt{d}$ for all data sets. The best of the total $T$ trees are selected by using the method given in Sect. 2 and are used as the final ensemble ($M$ is taken as 20% of $T$). The final ensemble is then applied to the testing part of the data, and a total of 1000 runs are carried out for each data set. The final result is the average over these 1000 runs. The same setting is used for the optimal trees ensemble in Khan et al. (2016), i.e. OTE.Prob.
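The arithmetic of this setup can be made concrete with a small Python sketch. The function name, the rounding of the split sizes and the rounding of $\sqrt{d}$ to an integer are assumptions made for illustration; the data set dimensions in the example are hypothetical.

```python
import math


def ote_setup(n_obs, n_features, n_trees=1500, m_fraction=0.2,
              train_fraction=0.9, grow_fraction=0.9):
    """Sizes implied by the experimental design for one data set.

    n_obs, n_features : hypothetical data set dimensions (n and d)
    Rounding to whole numbers is an assumption of this sketch.
    """
    n_train = round(train_fraction * n_obs)   # 90% of the data for training
    n_grow = round(grow_fraction * n_train)   # 90% of training for bootstrapping
    return {
        "n_test": n_obs - n_train,            # remaining 10% for testing
        "n_grow": n_grow,
        "n_validation": n_train - n_grow,     # 10% of training for diversity check
        "p": max(1, round(math.sqrt(n_features))),  # features tried per split
        "M": round(m_fraction * n_trees),     # trees kept after OOB ranking
    }


print(ote_setup(n_obs=1000, n_features=16))
```

With $T = 1500$ and $M$ taken as 20% of $T$, 300 trees enter the diversity check, whatever the size of the data set.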
For tuning various parameters of CART, we used the R-Function "tune.rpart" available within the R-Package "e1071" (Meyer et al. 2014). We tried various values, (5,10,15,20,25,30) for finding the optimal number of splits and the minimal optimal depth of the trees.
The only parameter in the node harvest estimator is the number of nodes in the initial ensemble, and for large values of this parameter the results are insensitive to it (Meinshausen 2010). Meinshausen (2010) showed for various data sets that an initial ensemble size greater than 1000 yields almost the same results. In our experiments we kept this value fixed at 1500. In the case of SVM, the automatic estimation of sigma available within the R package "kernlab" was used. The rest of the parameters are kept at their default values. Four kernels, Radial, Linear, Bessel and Laplacian, are used for SVM. kNN is tuned by using the R function "tune.knn" within the R library "e1071" for various values of the number of nearest neighbours, i.e. $k = 1, \ldots, 10$. A recently proposed method, the random projection (RP) ensemble (Cannings and Samworth 2017), has also been considered for comparison purposes using the "RPEnsemble" (Cannings and Samworth 2016) R package. Due to computational constraints we have used $B_1 = 30$ and $B_2 = 5$. Linear discriminant analysis, base = "LDA", and quadratic discriminant analysis, base = "QDA", are used as the base classifiers along with d = 5 and projmethod = "Haar", keeping the rest of the parameters at their default values. We did not use the k-NN base as it has been shown to be outperformed by LDA and QDA (Cannings and Samworth 2017).
The same set of training and test data is used for tree, random forest, node harvest, SVM and our proposed method. Average unexplained variances and classification errors, for regression and classification respectively, are recorded for all the methods on the data sets. All the experiments are done using R version 3.0.2 (R Core Team 2014). The results are given in Tables 4 and 5 for regression and classification, respectively.

Discussion
The results given in Tables 4 and 5 show that the proposed method performs better than the other methods on many of the data sets. In the case of regression problems, our method gives better results than the other methods considered on 7 of the 14 data sets, whereas on 2 data sets, Wine and Abalone, random forest gives the best performance. On 5 of the data sets SVM gave the best results: with the radial kernel on Bone, Galaxy, Friedman and Ozone, and with the Bessel kernel on Concrete. Tree and kNN are unsurprisingly the worst performers among all the methods, with the exception of the Stock data set, where kNN is the best.
In the case of classification problems, the new method gives better results than the other methods considered on 9 of the 21 data sets and is comparable to random forest on 1 data set. On 3 data sets, random forest gives the best performance. On three of the data sets, Mammographic, Appendicitis and SAHeart, the node harvest classifier gives the best result among all methods. SVM is better than the others on 3 data sets. Random projection ensemble gave better results on 3 data sets.
Moreover, the optimal trees ensemble in Khan et al. (2016), OTE.Prob, when evaluated by classification error rates, also gives results very close to those of OTE. This can be seen in the last two columns of Table 5, where the result of OTE.Prob is italicised when it performed better than OTE.
Overall, the proposed method gave better results on 13 data sets and comparable results on 2 data sets.
We kept all our parameters in the ensemble fixed for the sake of simplicity. Searching for the optimal total number $T$ of trees grown before the selection process, the percentage $M$ of best trees selected at the first phase, the node size and the number of features for splitting the nodes might further improve our results. Large values are recommended for the size of the initial set, subject to the available computational resources, and a value of $T \ge 1500$ is expected to work well in general. This can be seen in Fig. 4, which shows the effect of the number of trees in the initial set on (a) unexplained variance and (b) misclassification error for the data sets given, using OTE.

In Table 4, the unexplained variance of the best performing method for the corresponding data set is shown in bold.

Table 5 Classification error rates of kNN, tree, random forest, node harvest, SVM, random projection with linear and quadratic discriminant analyses, OTE and OTE.Prob (Khan et al. 2016)

One important parameter of our method is the number $M$ of best trees selected at the first phase for the final ensemble. Various values of $M$ reveal different behaviour of the method. We considered the effect of $M$ = (1%, 5%, 10%, 20%, ..., 70%) of the total $T$ trees on the method for both regression and classification, as shown in Fig. 5. It is clear from Fig. 5 that the highest accuracy is obtained by using only a small portion, 1-10%, of the total trees that are individually strong, which is further reduced in the second phase. This may significantly decrease the storage costs of the ensemble while increasing, or at least not losing, accuracy. On the other hand, having a large number of trees may not only increase the storage costs of the resulting ensemble but also decrease its overall prediction accuracy. This can be seen in Fig. 5 in the cases of the Concrete, WPBC and Ozone data sets, where the best results are obtained with less than about 5% of the total trees selected at the first phase. This might be because, in such cases, the possibility of having poor trees is high if the ensemble is large and the trees are simply grown without considering their individual and collective behaviours.

Fig. 6 Effect of the number of features (on x-axis) selected at random for splitting the nodes of the trees on the unexplained variance (a), and error rate (b) for the data sets shown using OTE
We also looked at the effect of various numbers $p = \sqrt{d}, \frac{d}{5}, \frac{d}{4}, \frac{d}{3}, \frac{d}{2}$ of features selected at random for splitting the nodes of the trees on the unexplained variances and classification errors, in the cases of regression and classification respectively, for some data sets. The graph is shown in Fig. 6. The only reason that random forest is considered an improvement over bagging is the inclusion of additional randomness by randomly selecting a subset of features for splitting the nodes of the tree. The effect of this randomness can be seen in Fig. 6, where different values of $p$ result in different unexplained variances/classification errors for the data sets. For example, in the case of the Ozone data, selecting a higher value of $p$ adversely affects the performance. For some data sets, Sonar for example, selecting a large $p$ results in better performance.

Conclusion
The possibility of selecting best trees from an original ensemble of a large number of trees, and combining them together to vote/average for the response is considered. The new method is applied on 35 data sets consisting of 14 regression problems and 21 classification problems. The ensemble performed better than kNN, tree, random forest, node harvest and SVM on many of the data sets. The intuition for the better performance of the new method is that if the base learners in the ensemble are individually accurate and diverse, then their ensemble must give better or at least comparable results as compared to the one consisting of all weak learners. This might also be due to the reason that there could be various different meaningful structures present in the data that could not be captured by an ordinary algorithm. Our method tries to find these meaningful structures in the data and ignore those that only increase the error.
Our simulation reveals that the method can find meaningful patterns in the data as effectively as other complex methods might do.
Even results that are merely comparable to those based upon thousands of weak base learners should be welcomed if they can be obtained by using a few strong and diverse base learners. This might be very helpful in reducing the associated storage costs of tree forests with little or no loss of prediction accuracy.
The method is implemented in the R-Package "OTE" (Khan et al. 2014). A practical challenge for OTE arises when we have a relatively small number of observations in the data. The trees are grown on 90% of the training data, leaving the remaining 10% for internal validation. This might result in missing some important information to learn from while training OTE. On the other hand, the rest of the methods use the whole training data. Solving this issue might further improve the results of OTE. One way to solve this issue could be to use the out-of-bag data from the bootstrap samples again in a clever way while adding the corresponding trees for collective performance.
The use of some variable selection methods (Hapfelmeier and Ulm 2013; Mahmoud et al. 2014a, b; Brahim and Limam 2017; Janitza et al. 2015) in conjunction with our method might lead to further improvements. Using the idea of random projection ensembles (Cannings and Samworth 2016, 2017) with the proposed method may also allow further improvements.