D2TS: a dual diversity tree selection approach to pruning of random forests

Random Forest is one of the most effective classification techniques. It is an ensemble technique, typically with decision trees as its classifiers. Each tree votes for an outcome when a new instance is being classified, and a majority vote decides the final output. Two main factors play an essential role in a Random Forest's performance, namely the diversity among the trees in the forest and their number. Higher diversity increases prediction accuracy, whereas a lower number of trees results in faster predictions. This paper aims at optimizing these two factors by using clustering analysis of trees in order to prune correlated trees while keeping outlier trees to maintain diversity. We group the trees into clusters and only take a number of representatives from each cluster, while also keeping some or all of the outliers to preserve diversity. The resulting subset of trees constitutes a random forest of a reduced size. We use the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm for clustering. DBSCAN is one of the most commonly used clustering techniques and is robust to outliers. We use DBSCAN to (i) group trees in clusters based on their prediction behaviour and (ii) identify outliers. Both the clustered and the outlier trees bring an element of diversity into the pruned random forest, thus giving our approach its dual diversity aspect. Our approach achieved up to a 99% pruning level while resulting in similar, or even better, accuracy compared to the original forests for 19 public datasets with varying properties. Our source code is publicly available on GitHub.


Introduction
Data mining and machine learning have been very active research areas in recent years. This is due to the amount of data being collected nowadays and the availability of computational resources that enable researchers to apply machine learning techniques to big datasets. Those techniques are generally divided into several categories, including the supervised and unsupervised approaches [15]. The distinction between the two approaches is based on whether a desired output is used at training time or not. Classification is a supervised learning technique in which we try to predict a class or a label for a given instance after acquiring knowledge from the training samples [24]. Clustering, on the other hand, is an unsupervised learning technique in which similar instances are grouped in clusters which can be thought of as classes [17].
Decision trees are one of the most commonly used classifiers due to their simplicity and good performance [35]. Random Forest is a classification method which combines multiple decision trees into one ensemble and classifies new instances based on the majority vote of the individual trees [6]. To diversify trees, Random Forest uses bagging, where each tree is built using a sample drawn from the training data with replacement that has the same size as the training data, resulting in some instances being drawn more than once and others ignored altogether (approximately 37% ignored). Also, only a fraction of the features is considered at each node split of a tree (typically √N, where N is the total number of features in the dataset).
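For illustration only, a standard bagged forest with √N feature sampling can be set up as in the hedged sketch below; scikit-learn and its bundled breast-cancer dataset are used purely as stand-ins and are not the implementation described later in this paper.

```python
# Minimal sketch of a standard Random Forest with bagging and sqrt(N) feature
# sampling (scikit-learn used only for illustration).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,      # forest sizes of 100-500 are typical, as discussed below
    bootstrap=True,        # bagging: each tree sees a bootstrap sample of the training data
    max_features="sqrt",   # consider sqrt(N) features at each node split
    random_state=0,
)
forest.fit(X_train, y_train)
print("Hold-out accuracy:", forest.score(X_test, y_test))
```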
Random Forests were found to be the best classifier among 179 classifiers belonging to 17 classifier families in a 2014 survey [14]. This highlights the importance of Random Forests and motivates further optimization of their performance.
Since its inception, many enhancements have been proposed for Random Forest to improve its classification accuracy. These enhancements include techniques like changing the voting mechanism [33,37], using different attribute evaluation metrics [33], and pruning [2,13,26]. Pruning can be done by removing correlated, and to some extent redundant, decision trees from a Random Forest, which makes it faster and reduces the classification error [11,12]. In this paper, we propose D2TS, a novel clustering-based tree selection method to prune Random Forests using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [10]. DBSCAN groups instances based on the distance between them and the density of the resulting clusters. We place similar trees in clusters, then choose representative trees from each cluster (first source of diversity) and from the outliers (second source of diversity). This procedure removes redundant and correlated trees.

Novelty
Clustering-Based Diverse Random Forest (CLUB-DRF) [13] used the k-Modes clustering technique to group correlated trees and then select a single representative per cluster. Our approach brings two main improvements over CLUB-DRF:

1. The number of clusters is not predefined in DBSCAN, which results in a more natural definition of clusters.
2. Contrary to CLUB-DRF, we use the outliers produced by DBSCAN to increase the diversity in the resulting forest. As we discuss later, our results show clear improvements in the pruned Random Forest's performance when outliers are used.
The results we obtained show that D2TS matches or outperforms the original unpruned random forest on virtually all the datasets we tested on. We achieved pruning levels of up to 99% while maintaining the same or better accuracy than the original classifier.
We also compared our results to CLUB-DRF and showed that our approach performs better on average over the 19 datasets we tested on.

Data availability
For result generation and comparison to CLUB-DRF, we used 19 different publicly available datasets from the UCI repository [8] and Kaggle [38]. Our source code is available on GitHub at https://github.com/yassinza/D2TS. The paper is organized as follows. We review notable related work in Sect. 2. Section 3 describes our approach and Sect. 4 our experimental study results. We critically discuss them in Sect. 5. Section 6 concludes our work and proposes future directions of research.

Related work
In this section, we investigate notable work for pruning ensemble models, particularly Random Forest, then critically review them. We will start by briefly reviewing the DBSCAN clustering technique as well as Random Forests.

DBSCAN
Clustering is a technique used in Machine Learning and Data Mining to extract patterns from unlabelled data. Clustering breaks down the input into smaller groups of similar data based on a predefined similarity measure [17]. Several clustering algorithms exist and are used depending on the nature of the data and objectives. One of the first steps when performing clustering is defining a meaningful similarity measure or distance metric, since the objective is to group similar instances in the same group, far away from dissimilar instances.
Once a distance metric has been chosen, and depending on the chosen clustering techniques, the next step in a clustering process might be to define the number of clusters. For example, K-Means algorithm [31] requires a number of clusters to be provided. On the other hand, both Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [10] and Ordering Points To Identify the Clustering Structure (OPTICS) [1] clustering algorithms can be used if the number of clusters cannot be estimated. They do not require a predefined number of clusters.
DBSCAN has many advantages over other clustering techniques, namely its insensitivity to outliers, its ability to find clusters with arbitrary shapes, and not requiring a predefined number of clusters [10]. These advantages are beneficial when domain knowledge is unavailable or the data is very noisy. These are the main reasons for which we have chosen DBSCAN in our approach.
DBSCAN is based on two parameters: the first one is minPts, which is the minimum number of neighbours required to form a cluster. The second one is ε, which is the maximum distance between two points in a neighbourhood. Based on those two parameters, DBSCAN classifies points into three types: (i) core points, which have at least minPts neighbours within distance ε; (ii) border points, which lie within an ε radius of a core point but have fewer than minPts neighbours within their own ε radius; and (iii) outliers, which are neither core nor border points. Outliers are left unclustered.
DBSCAN also defines a reachability notion which states that: 1. A point is directly reachable from a core point if they lie within distance ε of each other. 2. A point is reachable from a core point p if there is a connected path of core points from p such that each two consecutive core points along the path are directly reachable from each other.
DBSCAN starts by discovering the neighbours of a random point within distance ε. If the point has at least minPts neighbours, then it is marked as a core point. This is done for all points, and all core points that are within distance ε of each other are grouped in the same cluster. Each border point is then assigned to the closest cluster(s) containing a directly reachable point. Outliers are left unclustered. One of DBSCAN's strengths is its ability to find clusters with arbitrary shapes; this comes as a result of the minPts parameter and the reachability notion.
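As a hedged illustration of how ε and minPts behave (not part of our methodology), the snippet below runs scikit-learn's DBSCAN on a small, made-up precomputed distance matrix:

```python
# Toy DBSCAN run on a precomputed distance matrix; eps and min_samples play the
# roles of epsilon and minPts described above. Distances are arbitrary examples.
import numpy as np
from sklearn.cluster import DBSCAN

D = np.array([
    [0, 1, 1, 9, 9],
    [1, 0, 1, 9, 9],
    [1, 1, 0, 9, 9],
    [9, 9, 9, 0, 8],
    [9, 9, 9, 8, 0],
], dtype=float)

labels = DBSCAN(eps=2.0, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)  # [0 0 0 -1 -1]: points 0-2 form one cluster, points 3-4 are outliers
```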

Random forests
Decision Trees are one of the most commonly used classifiers in data mining due to their efficiency on large datasets and their intuitive tree-like structure, which results in simple interpretations. Each decision tree consists of a root, internal nodes, and leaves which represent the classes. C4.5 [32] is the most commonly used implementation of decision trees, and our research uses this implementation. C4.5 uses Gain Ratio to split at each node and construct the tree branches. It follows a greedy approach by choosing the feature with the highest Gain Ratio as the root and then applies the same criterion at each level. The technique terminates once all instances belong to the same class or there are no more features to split on. One advantage of C4.5 is that it can handle both numerical and categorical features with multiple values. C4.5 also performs a post-pruning process by replacing sub-trees with leaf nodes if this minimizes the classification error.
Ensemble models consist of several models from the same family or different types. They were proposed to overcome the limited predictive performance of a single classifier [34]. Each model in the ensemble produces an output, and the final result is decided according to a voting scheme. Diversity in the ensemble model plays a vital role in the accuracy of the model against unseen data [40]. If two models have different error rates when tested against unseen data, then we say they are diverse.
Random Forests use bagging to construct a training set when growing each decision tree in addition to selecting a random subset of features for each tree [19], hence their name. The two main factors that affect the classification error are [6]: 1. Trees correlation: correlation between trees increases the error, thus diversity is very crucial as mentioned before. 2. Performance of individual trees: the higher the accuracy of individual trees, the smaller the error rate.
The main strengths of random forests are their robustness against outliers and overfitting. The number of trees in a forest is usually in the range of 100-500, which makes Random Forest less convenient for real-time applications [39]. Another drawback is that when the number of features is very high, the correlation among trees increases, thus increasing the error rate [5,6]. Our approach, presented in Sect. 3, aims to minimize the correlation between trees while reducing the number of trees at the same time, making Random Forests both more accurate (due to diversity) and faster at making decisions (due to the reduced number of trees). But we will first review some notable proposals with similar objectives in the next subsection.

Optimizing random forests
Random Forest optimization techniques can be divided into two categories, namely pruning techniques and non-pruning techniques. Non-pruning techniques try to optimize the performance of the forest by tweaking the parameters of the Random Forest, or by changing the voting mechanism, for example. Pruning techniques, on the other hand, take an ensemble and reduce the number of models in it to reduce correlation and increase speed. We will review both approaches in the rest of this section. Table 1 highlights the differences between notable proposals under the two approaches, including their performance.

Pruning techniques
The objective of pruning ensemble models is twofold: producing a smaller ensemble to reduce computational and memory requirements, and boosting performance by eliminating correlated models [43]. Pruning techniques can be classified into three types: ranking-based, optimization-based, and clustering-based [36].
In ranking-based techniques, the models in the ensemble are ranked based on an evaluation measure like Cohen's Kappa statistic, which measures the inter-rater agreement [30]. One advantage of the Kappa statistic is that it takes into account the agreement that could happen by chance. The models are then ranked in ascending order and pruned according to a predefined threshold. El Habib et al. used the correlation feature selection (CFS) algorithm to select the best trees [9]. They measured the tree-to-tree and tree-to-labels correlation. Correlation was measured as a symmetrical uncertainty correlation coefficient.
Optimization-based techniques use local search algorithms [16] and genetic programming to find an optimal subset of the models that achieves an objective function [25]. This objective function can be accuracy, receiver operating characteristic (ROC), root-mean-squared error (RMSE), or other evaluation metrics. Hill Climbing was used in [7] to traverse a search space of states to find a local optimum, converging when no adjacent state achieves a better score. Each state is a subset of the ensemble, and each adjacent state is reached by adding or removing a model from the ensemble.
In Clustering-Based Techniques, the first step is to cluster similar models in the same cluster based on their prediction behaviour. The pruning is then done by choosing a single, or multiple representatives for each cluster. Many mechanisms for selecting a representative exist, such as selecting the centroid of the cluster as target values to construct a new model [3]. Another approach is choosing the model with the highest distance from all other clusters; this way achieves more diversity among models [18]. Lastly, Lazarevic and Obradovic used a greedy approach where the models are selected according to their accuracy [27]. Fawagreh and Gaber proposed to use an evolutionary approach to evolve clusters to an optimal point [11]. They did so by creating a set of clusters and updating them using replicator dynamics [20] by adding and removing trees. Their experiments on 10 healthcare datasets showed that their approach significantly outperformed the original unpruned random forests.
One difficulty with clustering techniques like k-means and k-modes is determining the number of clusters; we overcome this issue by using DBSCAN.

Other random forests enhancements
Several non-pruning techniques have been proposed such as changing the voting method from majority voting to a weighted voting scheme [33]. This way, the instances similar to the test instance are identified by the internal estimates. Trees that perform well on these instances are given a higher weight than the other trees. This approach performed better than the original Random Forest implementation on multiple datasets as stated in [33]. Decreasing the correlation between trees in the forest was also proposed in [33]. This was achieved by using multiple attribute evaluation metrics which led to a better performance too.
A genetic algorithm-based approach has been shown to outperform several classification techniques [2]. Genetic algorithms are descendants of evolutionary algorithms; they use operations like mutation, selection, inheritance, and crossover to find an optimal subset of trees. In this approach, each tree is represented with a bit, which indicates whether it is in the subset or not. An initial Random Forest of size n is generated, and then a number of sub-forests are randomly drawn to form the initial population. Genetic algorithm operations are then used to optimize the objective function, which is the classification accuracy of the sub-forest. This approach outperformed several classification techniques including Random Forests. Gaber et al. proposed to use evolutionary replicator dynamics to form a subset of trees with optimal performance. They applied their approach on 10 healthcare datasets and showed that it outperformed the original random forest at both classification and regression. Sequential Forward Search (SFS) and Sequential Backward Search (SBS) were used in [4] to add trees to the forest incrementally. The objective was not to find an optimal forest, but rather to prove that a subset of the forest might give a better performance. Their results showed that 50% or even 16% of the trees in the forest give better results than the original forest.

Table 1 Notable Random Forest optimization proposals (technique; pruning; performance; description):
- Weighted voting [33]; no pruning; better over several datasets; the voting mechanism is a weighted one instead of majority voting.
- Several feature evaluation measures [33]; no pruning; better performance; the correlation between trees was decreased, which led to a better performance.
- Genetic algorithms [2]; pruning; outperformed several classifiers including Random Forests; genetic algorithms used to find optimal sub-forests.
- SFS and SBS [4]; pruning; better performance with pruning levels up to 84%; search for a subset of trees that performs better than the original forest using SFS and SBS.
- Prune by prediction and by similarity [42]; pruning; both performed better than Random Forests, with prediction-based outperforming similarity-based; eliminate trees one by one based on the sub-forest's accuracy or the tree's similarity to the forest.
- McNemar non-parametric test [26]; pruning; similar accuracy to original Random Forests with less computational resources; a priori determine the smallest number of trees in a forest.
- CLUB-DRF [13]; pruning; at least as good as Random Forest, with 92% up to 99% pruning levels; cluster trees using k-modes based on the similarity between output vectors.
Zhang and Wang proposed two other approaches that rely on overproducing the forest and then removing trees based on different measures [42]. The first one is prediction-based, where the accuracy of the original forest with n trees is calculated, and then the accuracy of each forest with n − 1 trees is calculated. The nth tree associated with the weakest (least accurate) n − 1 trees is then removed. This continues until the accuracy converges and removing trees no longer produces a better forest. The second approach, similarity-based, follows a similar procedure. The correlation between the outputs of each pair of trees is calculated, and the average correlation of a tree with the rest of the forest is its overall similarity with the forest. The tree with the highest similarity is then removed. The results showed that the prediction-based approach gave better results than the similarity-based approach.
The McNemar non-parametric test of significance was used in [26] to determine a priori the smallest number of trees in a forest that achieves a similar accuracy to a larger forest. This method also achieved high classification speed and required less memory.

CLUB-DRF
Fawagreh et al. proposed CLUB-DRF, a clustering-based approach that uses k-modes to prune Random Forests [13]. The trees are clustered based on the distance between their output vectors, produced by applying each tree on the training set. The distance is simply the number of mismatches between the output vectors: the higher the distance, the more dissimilar the trees. The authors tested different values for the number of clusters, and reported pruning levels in the range of 92-99%. They also tried different strategies for selecting a representative of a cluster, and the results showed that this approach performed the same as or better than the original forest while reducing the number of trees.
Because k-modes clusters all its input records and does not produce outliers, CLUB-DRF cannot distinguish outlier trees, which hold a significant amount of information due to their difference from the rest of the trees, from other highly correlated trees. By using DBSCAN, we aim to overcome that limitation. Fundamentally, the output of both CLUB-DRF and D2TS is a pruned random forest. However, CLUB-DRF does not consider outlier trees in its tree selection process, given the clustering mechanism it adopts. D2TS, on the other hand, does consider outlier trees using the outlier cluster (i.e., unclustered instances) generated by DBSCAN.

D2TS: the proposed dual diversity tree selection pruning methodology
In this section, we discuss the proposed pruning method in detail. We propose to cluster the trees of a Random Forest and use cluster representatives to cast votes on behalf of their respective clusters. Clustering the trees achieves two objectives: firstly, it helps identify redundancy between the trees, and using a subset of the trees in each cluster reduces that redundancy, which increases the overall Random Forest diversity and hence accuracy. Secondly, the use of cluster representatives reduces the number of trees that cast a vote, thus reducing classification time once the model is deployed. Figure 1 shows the highlights of our approach. We start with a labelled dataset and train a random forest on a subset of it; the rest of the dataset is left for validation. We then predict the labels of the validation dataset using the trained random forest, and use the predictions as a new dataset. The new dataset, shown at the center of the figure, has a row for each tree, and the row's contents are the predictions made by that tree. So, for example, the cell in the fourth row and sixth column of the new dataset contains the prediction made by the fourth tree on the sixth row of the validation dataset. We then apply DBSCAN on the new dataset to cluster it row-wise (i.e., tree-wise). The new clusters group similar trees together, and we then select a subset of trees from each cluster, as well as outliers, and group them together to become a pruned random forest.
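A hedged sketch of this step is shown below, assuming a scikit-learn forest; the helper names (tree_prediction_matrix, cluster_trees) are illustrative and not the authors' exact code.

```python
# Build the "new dataset" of Fig. 1 (one row per tree, holding that tree's
# predictions on the validation set) and cluster its rows with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier

def tree_prediction_matrix(forest: RandomForestClassifier, X_val) -> np.ndarray:
    """Row k contains the predictions of tree k on every row of X_val."""
    return np.vstack([tree.predict(X_val) for tree in forest.estimators_])

def cluster_trees(pred_matrix: np.ndarray, eps: float, min_pts: int) -> np.ndarray:
    """Cluster trees by their prediction behaviour; label -1 marks outlier trees.

    Note: scikit-learn's 'hamming' metric is the *fraction* of mismatching
    positions, so eps must be given as a value in [0, 1] here.
    """
    return DBSCAN(eps=eps, min_samples=min_pts, metric="hamming").fit_predict(pred_matrix)
```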
We present two algorithms. We use Algorithm 1 to evaluate and validate our approach, whereas Algorithm 2 is the algorithm to be used when applying our approach to a real-life case. The difference between the two algorithms comes from the difference in their purposes: Algorithm 1 requires the addition of 10-fold cross validation to get a better estimate of the gain in accuracy, if any. Algorithm 1 also outputs ten pruned random forests that are then evaluated to get an average gain. Algorithm 2 outputs only one pruned random forest along with its gain, so the end-user can make an informed decision on the benefits of using the pruned random forest.
Algorithm 1 summarizes our methodology for the evaluation of our approach. To simplify our discussion, let's say that the input dataset has 2000 rows and 50 columns. Assume that we want to apply a pruning percentage of 95%, i.e., only 5% of the trees will be retained, and an outliers percentage of 20%. For example, if we start with a random forest of 500 trees, then we will only retain 25 trees (using our DBSCAN clustering as explained below). Out of those trees, 20% will be outliers, i.e., 5 trees will be outliers; the remaining 20 trees will be from the clusters. Using the above example, the algorithm can be summarized as follows. Since we are using 10-fold cross validation (CV), we do 10 times the following: we start by training a random forest of 500 trees on 90% of the data (the remaining 10% are kept for validation). The trained 500 trees are each applied to the training dataset and their outputs placed in rows to produce an output matrix of 500 rows (a row per tree) and 1800 columns (a column for each of the dataset rows, 1800 is 90% of 2000). We then cluster the rows (trees) of this matrix using DBSCAN. DBSCAN potentially identifies a set of clusters and some outliers. We then pick 5 trees from the outliers, and 20 trees from the clusters. We pick the 5 outlier trees that obtained the highest accuracies on the training dataset. We pick the 20 clustered trees from the clusters in a similar way with a minor addition. Let's say that DBSCAN generated 2 clusters of 300 and 100 trees respectively and we want to pick 20 trees from them. The number of trees we take from a cluster is proportional to the cluster size. This results in taking 15 trees from the first cluster and 5 from the second. In summary, our pruned random forest contains the best 15 trees in cluster 1, 5 best trees in cluster 2, and 5 best trees in the outliers. Our code handles the cases where there are not enough trees in the clusters or in the outliers. Lastly, we evaluate both the original and pruned random forests on the validation dataset to measure the accuracy difference between them. We average that difference over the 10 iterations of the 10-fold cross validation. The random forest model that scored the highest on its validation datasets should be used when deploying the model and making predictions.
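The selection step just described might look like the following hedged sketch, assuming DBSCAN labels and per-tree training accuracies are already available; names and details are illustrative rather than the authors' exact code.

```python
# Keep the most accurate outlier trees, then fill the remaining budget with the
# most accurate trees of each cluster, proportionally to cluster sizes.
import numpy as np

def select_trees(labels, tree_accuracies, n_keep, outlier_frac):
    labels = np.asarray(labels)
    acc = np.asarray(tree_accuracies)

    n_outliers = int(round(n_keep * outlier_frac))
    outliers = np.where(labels == -1)[0]
    chosen = list(outliers[np.argsort(-acc[outliers])][:n_outliers])  # best outliers first

    cluster_ids = [c for c in np.unique(labels) if c != -1]
    sizes = {c: int(np.sum(labels == c)) for c in cluster_ids}
    total = sum(sizes.values())
    budget = n_keep - len(chosen)
    for c in cluster_ids:
        share = int(round(budget * sizes[c] / total)) if total else 0
        members = np.where(labels == c)[0]
        chosen.extend(members[np.argsort(-acc[members])][:share])  # best trees per cluster
    return chosen[:n_keep]

# Example from the text: 500 trees, 95% pruning (25 kept), 20% outliers -> 5 outlier
# trees, and clusters of 300 and 100 trees contribute 15 and 5 representatives.
```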
Note that we are doing 10-fold cross validation only for the purpose of evaluating our approach and proving its value. For an actual deployment, there is no need to perform 10-fold cross validation. So, the input of the algorithm would be the dataset and the parameters, and its output would be one pruned random forest, as shown in Algorithm 2. While our approach does cluster the trees based on their output, it is an unsupervised approach since it is based on an unsupervised clustering technique. The trees' output classes are used as input data records, not as actual labels. For example, the first row of the dataset used at the clustering stage is the full list of labels output by the first tree when applied on the validation dataset. The second row is the outputs of the second tree, and so on. When pruning the Random Forest, we select trees in descending order of accuracy. So, for example, say we have a Random Forest of 500 trees, in which we have one cluster of 400 trees and the remaining trees are outliers. If we decide to apply a pruning level of 95% and keep a rate of 40% outliers, then the final pruned Random Forest will contain 25 trees in total (5% of 500), out of which 10 (40%) will be outliers and the remaining 15 will be from the cluster. Contrary to CLUB-DRF [13], D2TS can detect outlier trees thanks to the use of DBSCAN. We use those outliers to cast votes along with cluster representatives, and this led to better results. We also explored the impact of varying the percentage of outliers inside the pruned Random Forest on its performance, as we will see in Sect. 5.
For the purpose of clustering trees, we represent each tree of the original Random Forest of $S$ trees as an array of values $\{t_{k,i}\}_{1 \le i \le N}$, $1 \le k \le S$, where $t_{k,i}$ is the result of applying tree $T_k$ on the $i$-th row, aka record or individual, of a dataset $D = \{d_{i,j}\}_{1 \le i \le N, 1 \le j \le M}$.
We measure the distances between trees using the Hamming distance; clustering thus groups trees that make similar predictions. In the example above, the 15 trees taken from the cluster are selected based on their accuracy on the training dataset; the same applies to the outliers.

Experimental study
We present our experimental study steps and results in this section. The first subsection introduces the datasets we applied D2TS on. The next two subsections cover the two main steps of our algorithm, namely training the random forests and clustering their trees. The last subsection is dedicated to a discussion of why it was common to get one large cluster during our experiments.

Datasets
We used 18 different datasets from the UCI repository [8] and the Spotify recommendation dataset from Kaggle [38], for a total of 19 datasets used in the comparison. Table 2 summarizes the main features of each dataset.
We used 10-fold cross validation, essentially doing the following ten times: use one tenth of the dataset as a validation dataset and the rest for training. The training dataset was used to train a random forest of 500 trees (the "original forest"), whereas the validation dataset was used to compare the performance of pruned forests to the original.
We will mainly focus on the Glass dataset as an example to illustrate our approach. We obtained similar results for the other datasets and will mention them where appropriate.

Training the random forests
The first step in our pipeline is training the Random Forest model for each dataset. Our methodology was implemented using Python. We used information gain ratio as the split criterion. The advantage of information gain ratio is its ability to reduce the bias towards choosing multi-valued features by including the number of branches [23].
The number of trees in the forest is also an important factor in enhancing the accuracy of the model. It has been shown both empirically and theoretically that increasing the number of trees beyond a certain limit will not always enhance the accuracy [29]. Hence, Random Forests are usually grown with a number of trees in the range of 100-500. In our study, each Random Forest consists of 500 trees in order to achieve high diversity before pruning the resulting model. Each tree $T_k$ is then represented as an array $\{t_{k,i}\}_{1 \le i \le N}$, where $t_{k,i}$ is the result of applying $T_k$ on the $i$-th data point. The resulting N-cell classification output vector is used at the next stage, clustering. Trees that tend to cast similar decisions are likely to end up in the same cluster, and we claim that choosing a subset from that cluster will be representative enough of the whole cluster. Table 3 shows the accuracy of the random forest for each dataset.

Clustering
As mentioned in Sect. 2.1, we use the Hamming distance to measure similarity between the trees. Each tree is represented by the vector of labels it predicts for all records in the training dataset. For example, consider a dataset that has four rows and for which the available labels (classes) are $C_1$, $C_2$ and $C_3$. Let the output vector of a tree $T_1$ be $[C_1, C_2, C_1, C_3]$ and the output vector of another tree $T_2$ be $[C_2, C_2, C_1, C_2]$; then the mismatch distance between the two trees is 2, as they disagree on two instances: the first and the fourth. Table 4 shows the minimum, maximum, standard deviation (σ) and mean (μ) distance between all pairs for four of the datasets we use.
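This mismatch distance can be computed directly, as in the small illustrative snippet below.

```python
# Hamming (mismatch) distance for the worked example: T1 and T2 disagree on the
# first and fourth instances, so the distance is 2.
t1 = ["C1", "C2", "C1", "C3"]
t2 = ["C2", "C2", "C1", "C2"]
print(sum(a != b for a, b in zip(t1, t2)))  # 2
```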
The next step after calculating the distance matrix is clustering the trees using DBSCAN. As mentioned in Sect. 2.1, DBSCAN uses two parameters, ε and minPts. We did a brute-force exploration of these two parameters to check their effect on the number of produced clusters. For each dataset, we set ε to range between the minimum and maximum distance between all pairs (values taken from Table 4). As for the minimum number of points minPts, we used all values between 2 and 50.

Table 5 DBSCAN parameter combinations that yield more than one cluster on the Glass dataset (ε; minPts; outliers; clusters; trees in clusters):
17; 2; 454; 8; 46
16; 2; 484; 6; 16
18; 2; 424; 5; 76
20; 2; 341; 4; 159
15; 2; 493; 3; 7
16; 3; 490; 3; 10
17; 4; 473; 3; 27
17; 3; 464; 3; 36
18; 9; 459; 3; 41
17; 7; 484; 2; 16
17; 6; 484; 2; 16
17; 5; 484; 2; 16
18; 13; 473; 2; 27
18; 12; 473; 2; 27
18; 11; 473; 2; 27
18; 10; 473; 2; 27
18; 7; 450; 2; 50
19; 22; 449; 2; 51
19; 21; 449; 2; 51
19; 20; 449; 2; 51
19; 19; 449; 2; 51
19; 18; 449; 2; 51
18; 6; 449; 2; 51
19; 17; 438; 2; 62
19; 16; 428; 2; 72
21; 2; 286; 2; 214
23; 2; 168; 2; 332

Table 5 shows the parameters that result in more than one cluster for the Glass dataset. As shown in Fig. 2, only a handful of parameter combinations yield more than one cluster, with very few trees in each cluster and the majority of trees marked as outliers. We note that those parameters usually had a medium-range value of ε (all between 15 and 23, mostly 19 and less), and a low to average value of minPts (mostly less than 10). The remaining combinations of ε and minPts led to one of the following: 1. All trees are marked as outliers when ε is very small compared to the mean distance. 2. All trees are in one cluster with no outliers when ε is relatively large. 3. The trees are divided between outliers and a single cluster for the remaining values of ε. As we saw in Table 5, only a few values of minPts can break this pattern and lead to more than one cluster.
All other combinations led the trees to be grouped into a single cluster of variable size, generally containing more than 50% of the trees, as we will see in Sect. 5. This problem did not arise in the CLUB-DRF [13] approach due to the nature of k-modes, where the user specifies the number of clusters beforehand. k-modes then clusters the points regardless of whether a natural clustering exists or not. Specifying the number of clusters forces the trees into being separated into several clusters, whether they exhibit different trends or not.
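A hedged sketch of the brute-force exploration described above is given below; `pred_matrix` is the tree-by-prediction matrix from Sect. 3, and the helper names are illustrative.

```python
# Sweep (epsilon, minPts) combinations and record the resulting number of
# clusters and outliers, as summarised in Table 5 and Fig. 2.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

def sweep_dbscan_parameters(pred_matrix: np.ndarray):
    # Raw mismatch counts: scikit-learn's hamming distance is a fraction, so we
    # multiply by the number of predictions to match the counts of Tables 4 and 5.
    dist = pairwise_distances(pred_matrix, metric="hamming") * pred_matrix.shape[1]
    results = []
    # Epsilon ranges from the minimum to the maximum pairwise distance (Table 4).
    eps_values = np.arange(dist[dist > 0].min(), dist.max() + 1)
    for eps in eps_values:
        for min_pts in range(2, 51):
            labels = DBSCAN(eps=eps, min_samples=min_pts,
                            metric="precomputed").fit_predict(dist)
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            n_outliers = int(np.sum(labels == -1))
            results.append((eps, min_pts, n_clusters, n_outliers))
    return results
```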

One cluster fits all!
Having one large cluster is not an issue, since we can still choose representatives from it and combine them with representatives from the outliers. That being said, we still wanted to investigate why getting only one cluster was so common. In order to do that, we first looked at the distribution of the distances between all pairs and found that the distances are approximately normally distributed, as shown in Fig. 3. 89.06% of the distances lie between μ − 1.64σ and μ + 1.64σ, as expected in a normal distribution. This means that 89.06% of the trees are very close to each other, which explains the one-cluster problem. We also tried to visualize our points using T-Distributed Stochastic Neighbor Embedding (T-SNE) [28]. T-SNE is a nonlinear algorithm that is used to visualize high-dimensional datasets in 2-D or 3-D space. It places similar points near each other and away from dissimilar points based on the distance between them. This technique helps in detecting whether a natural clustering exists or not.
We applied this technique to the trees generated for each of our datasets, and they all show similar results: there is no natural clustering and the trees are uniformly distributed. Fig. 4 shows the results of T-SNE for the Glass dataset. It shows that some trees are very close to each other and form what seems like a cluster. However, we notice that there is no clear separation between clusters, although DBSCAN can detect clusters with random shapes as long as there are points within ε distance in the neighbourhood, as mentioned in Sect. 2.1. This confirms our findings from analyzing the distribution of the distances.
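These diagnostics can be reproduced with the hedged sketch below, assuming the tree-by-prediction matrix from Sect. 3; names are illustrative and the t-SNE settings are defaults rather than the exact configuration used for Fig. 4.

```python
# Check how many pairwise tree distances fall within mu +/- 1.64*sigma and
# project the trees to 2-D with t-SNE for visual inspection.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.manifold import TSNE

def one_cluster_diagnostics(pred_matrix: np.ndarray):
    distances = pdist(pred_matrix, metric="hamming")
    mu, sigma = distances.mean(), distances.std()
    within = np.mean(np.abs(distances - mu) <= 1.64 * sigma)
    print(f"Pairwise distances within mu +/- 1.64*sigma: {within:.2%}")

    # 2-D embedding of the trees; similar trees should land near each other.
    return TSNE(n_components=2, metric="hamming", init="random",
                random_state=0).fit_transform(pred_matrix)
```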
The reason we want to find dense clusters is to prune correlated trees more accurately. Tiny clusters do not represent a significant portion of the forest and thus they are not of great value.

Discussion
In this section, we will discuss our pruning parameters and present the results of D2TS for all datasets using the same steps mentioned in Sect. 3. We will test different pruning percentages and percentages of outliers to keep and see the effect on the accuracy of the pruned model. Indeed, changing the percentage of outliers will highlight the effect of increasing the diversity. We will also compare our results with the results of CLUB-DRF [13].

Pruning parameters
Our goal is to find a balance between the number of clustered trees and outliers. From the results in Table 5 and Fig. 2, we notice that when ε is greater than the mean distance, all trees form a single cluster with no outliers.
We therefore set ε to four special values: μ, μ − σ, μ − 1.5σ, and μ − 2σ. Table 6 shows that ε = μ − 2σ gives a balance between clustered trees and outliers.
We obtained similar results for the other datasets. The value of minPts does not affect this balance; however, we set its value to 20 as a balanced choice, as can be seen in Fig. 2.
The pruning level is the percentage of trees to be removed from the forest. The pruning level will be in the range of 80-99%. Out of the retained 1-20% of trees, we choose a percentage of outliers to keep. For example, say that out of the 500 trees in the original forest we only keep 2%, i.e., 10 trees. We can then apply a wide range of values for the percentage of outliers. For example, with an outlier percentage of 40%, our pruned forest would contain 6 representative trees from the clustered trees and 4 representatives from the outliers. We set the percentage of outliers to all values between 10% and 100% in 10% increments when possible.
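The arithmetic behind these parameters is simple; the hedged sketch below enumerates an illustrative grid for a 500-tree original forest (the exact grid used in our experiments may differ).

```python
# For each pruning level and outlier percentage, compute how many trees the
# pruned forest keeps and how many of them come from the outliers.
N_TREES = 500

def kept_tree_counts(pruning_level: float, outlier_pct: float):
    n_keep = round(N_TREES * (1 - pruning_level))   # e.g. 98% pruning -> 10 trees
    n_outliers = round(n_keep * outlier_pct)        # e.g. 40% of 10 -> 4 outlier trees
    return n_keep, n_outliers, n_keep - n_outliers  # (total, outliers, cluster reps)

for pruning in (0.80, 0.90, 0.95, 0.96, 0.98, 0.99):   # illustrative 80-99% levels
    for outliers in (i / 10 for i in range(1, 11)):     # 10% .. 100%
        print(pruning, outliers, kept_tree_counts(pruning, outliers))
```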

Pruning results on a COVID19 dataset
We applied our approach on a COVID19 dataset [41]. We generated a random forest of 500 trees on that dataset and pruned the random forest using all combinations of pruning levels and outliers' percentages mentioned in Sect. 5.1. Table 7 shows the best results of the COVID19 dataset along with the parameters used to achieve them.
The best accuracy achieved for the COVID19 dataset is 100% which is an improvement of 2.15% compared to the accuracy of the original forest (97.85%). This accuracy was achieved with pruning up to 99% (with 20% outliers). The 99%-pruned random forest contains 5 trees, out of which 2 are outliers.

Parameter tuning on the glass dataset
We also investigated the Glass dataset further, using a 70/30 split instead of 10-fold cross validation for efficiency; Table 8 shows the results for all combinations of pruning levels and outlier percentages on the Glass dataset. We can clearly see that increasing the percentage of outliers enhances the overall accuracy, which concords with our hypothesis. For example, in the fifth row (96% pruning, i.e., keeping only 20 of the original 500 trees), we notice that in the third column, where outliers are 30%, the accuracy goes down by 3.1%. But when the outliers are 60%, there is no loss in accuracy. On the other hand, for higher percentages of outliers, the accuracy is improved by 3.1% (between 70% and 90% outliers) or even 6.2% (for 100% outliers). We also notice that, in general, the 95% and 96% pruning levels yield the best results. Table 9 contains the best results achieved for each dataset using our approach. Those results were achieved with pruning levels of up to 99%. It is worth noting that, for the datasets where the original random forests (of 500 trees) achieved an accuracy of 100%, the pruning did not bring that accuracy down. Furthermore, while the accuracy improvement on some datasets was less than 3%, our approach still has the advantage of requiring much less processing power to make predictions once deployed. Indeed, the pruning level for all datasets was at least 80%, and for about half of them (9) it was at least 90%. Three datasets got better results using a 99% pruning level (in addition to the COVID19 dataset discussed above).

Approach evaluation
As mentioned in our evaluation criteria in Sect. 3, achieving an accuracy that is similar to the original forest is an important factor in judging the effectiveness of D2TS. Hence, we can say that D2TS did succeed given the results in Table 9. The second factor is the pruning level which would reduce the computational requirements. We achieved different pruning levels ranging from 80 to 99%. Table 10 shows a comparison of the accuracy gains achieved by D2TS and CLUB-DRF. Both approaches performed pretty well on all datasets. Let's now investigate if there is a significant difference between D2TS and CLUB-DRF, and if so, then which one performed better.

Statistical validation
We perform a two-stage statistical validation; the first stage is a Friedman test [22], and the second is a post-hoc analysis using a Holm test. The Friedman test is a non-parametric statistical test that can be used to compare three or more treatments applied to the same items [22]. We use it to detect whether the difference in accuracy obtained from (i) the original random forests, (ii) the D2TS-pruned random forests, and (iii) the CLUB-DRF-pruned random forests is statistically significant. Our hypothesis H0 is that all three treatments (a treatment is one set of accuracy results) are from the same distribution. We used an α of 0.05. The input data were standardised by subtracting the respective original accuracy value from each row.
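A hedged sketch of this two-stage validation is given below; the accuracy arrays are random placeholders (one value per dataset), pairwise Wilcoxon signed-rank tests stand in for the post-hoc comparisons, and the exact procedure may differ from our implementation.

```python
# Friedman test over the three sets of accuracies, followed by Holm-corrected
# pairwise comparisons (strict Holm stops at the first non-significant result).
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
n_datasets = 19
original = rng.uniform(0.80, 0.95, n_datasets)           # placeholder accuracies
d2ts = original + rng.uniform(0.00, 0.06, n_datasets)    # placeholder gains
club_drf = original + rng.uniform(0.00, 0.04, n_datasets)

stat, p = friedmanchisquare(original, d2ts, club_drf)
print(f"Friedman: statistic={stat:.3f}, p-value={p:.4f}")

pairwise_p = {
    "D2TS vs original": wilcoxon(d2ts, original).pvalue,
    "CLUB-DRF vs original": wilcoxon(club_drf, original).pvalue,
    "D2TS vs CLUB-DRF": wilcoxon(d2ts, club_drf).pvalue,
}
alpha = 0.05
for rank, (name, pval) in enumerate(sorted(pairwise_p.items(), key=lambda kv: kv[1])):
    threshold = alpha / (len(pairwise_p) - rank)   # Holm step-down thresholds
    verdict = "significant" if pval < threshold else "not significant"
    print(f"{name}: p={pval:.4f} (threshold {threshold:.4f}) -> {verdict}")
```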
The test rejected H0, i.e., the three sets of accuracies are not from the same distribution. The mean accuracy of D2TS was higher than that of the original random forest: D2TS achieved an average accuracy gain of 4.595%, compared to an average gain of 3.323% for CLUB-DRF, i.e., D2TS's gain was 38.28% higher than CLUB-DRF's. This shows that D2TS outperforms both the original random forest and CLUB-DRF.
We then performed a Holm test [21] to justify the outcome of our first stage. The Holm test indeed confirmed that there is a significant difference between the accuracies obtained from the original random forests, D2TS-pruned random forests, and CLUB-DRF-pruned random forests, with an α of 0.07. The p-values obtained from the test are shown in Table 11.

Conclusion and future work

We proposed D2TS, a novel approach to prune Random Forests in order to reduce their computational power requirements for inference while maintaining or even improving their accuracy. D2TS uses the DBSCAN clustering technique to group correlated trees in clusters. As discussed in the previous section, most of the configuration parameters produced one cluster. We investigated the data in depth and showed that such results were expected given how the distances were distributed. Getting a single cluster is an interesting phenomenon and might be useful in judging the quality of the random forest and the diversity between the trees in the forest. We achieved an improvement in accuracy for most of the datasets with a pruning level that ranges between 80% and 99%. We also observed that increasing the percentage of representatives coming from the outliers leads to better accuracy.
For future work, it would be interesting to investigate the impact of using different similarity or correlation measures, like Pearson correlation, on the number of clusters we get and how representative they are of the nature of the trees. Furthermore, Ordering Points To Identify the Clustering Structure (OPTICS) is a variation of DBSCAN with one advantage over it: detecting clusters when there is high variation in the dataset density. Applying OPTICS to this particular problem is another interesting direction to investigate.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.