Identification of representative trees in random forests based on a new tree-based distance measure

In life sciences, random forests are often used to train predictive models. However, gaining any explanatory insight into the mechanics leading to a specific outcome is rather complex, which impedes the implementation of random forests in clinical practice. By simplifying a complex ensemble of decision trees to a single, most representative tree, it should be possible to observe common tree structures, the importance of specific features, and variable interactions. Thus, representative trees could also help to understand interactions between genetic variants. Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. We therefore developed a new tree-based distance measure, which incorporates more of the underlying tree structure than other metrics. We compared our new method with existing metrics in an extensive simulation study and applied it to predict the age at onset based on a set of genetic risk factors in a clinical data set. In our simulation study, we were able to show the advantages of our weighted splitting variable approach. Our real data application revealed that representative trees are not only able to replicate the results of a recent genome-wide association study, but can also provide additional explanations of the genetic mechanisms. Finally, we implemented all compared distance measures in R and made them publicly available in the R package timbR (https://github.com/imbs-hl/timbR).


Introduction
Predicting future outcomes for patients is a central aspect of modern medicine. Therefore, accurate and robust prognostic and predictive models are needed to predict, for instance, which individual treatment leads to the highest success probability. However, with the gaining importance of precision medicine and accompanying phenotype-rich data sets, the underlying mechanisms become too complex to handle with classical regression-based models. Thus, machine learning approaches become attractive. One established machine learning method is the random forest, which has become popular due to several advantages. First, random forests can be trained on a wide range of low- and high-dimensional data. Also, they can be applied in classification, regression as well as probability estimation settings (Breiman, 2001) and can even be used to model survival outcomes (Ishwaran et al., 2008; Hothorn et al., 2006). Finally, by aggregating single decision trees into a random forest, the prediction performance can usually be drastically improved. However, whereas single decision trees are easy to interpret, it is relatively difficult to understand and communicate the prediction of a random forest. Thus, similar to other complex machine learning approaches, random forests are often described as predictive and explanatory black boxes. This often hinders their translation into clinical practice, given that it is more likely for a clinician to base his or her decision on a prediction model he or she understands. Hence, one of the key features of the usability of prediction models is that they are easy to interpret (Wyatt and Altman, 1995; Heinze et al., 2018). To gain insight into a random forest model, a standard approach is to utilize variable importance measures (Breiman, 2001; Strobl et al., 2007; Nembrini et al., 2018). These evaluate how important each independent variable is for the performance of the model.
However, this does not capture the complex tree structure of the model, including interactions between variables. A completely different way to enhance the interpretability of a random forest is to identify a single tree which best represents the forest. Thus, instead of theoretically having to inspect hundreds of trees in the entire forest, only one representative tree needs to be interpreted to understand the overall model. Banerjee et al. (2012) proposed to select those representative trees based on the highest similarity to all other trees in the forest. To measure the pairwise similarity of trees, they derived three distance metrics which cover different aspects of similarity of trees, namely, the similarity of predictions, the clustering in the terminal nodes, and the selection of splitting variables. However, these metrics are relatively simple and do not, for example, capture the underlying tree structure. Also, it should not only be of interest whether a variable was selected for splitting, but also at which point in the tree it was selected. We therefore developed a new distance measure for decision trees to identify the most representative trees in random forests, based on the selected splitting variables but incorporating the level at which they were selected within the tree. We additionally incorporate the prioritization of variables that are selected multiple times in a tree, assuming that these are more important. Section 2 of this paper begins with a short introduction to decision trees and random forests, followed by a description of the metrics introduced by Banerjee et al. (2012). We then develop our new weighted splitting variable (WSV) metric and describe how to extract the most representative tree from the forest based on any tree distance. In Section 3, we investigate the behavior of the different tree-based distance measures in an extensive simulation study.
We apply our new method to a real data set analyzing the genetic background of X-linked dystonia-parkinsonism (XDP) in Section 4. There, we aim to predict the age at onset (AAO) of XDP patients by known genetic risk factors (Westenberger et al., 2019; Laabs et al., 2021). Finally, we discuss our results and give recommendations on the use of the different metrics in Section 5.

Decision trees and random forests
Random forests (Breiman, 2001) are based on a combination of classification and regression trees (CART; Breiman et al., 1984) and bagging (Breiman, 1996). They are a supervised learning algorithm, meaning that the true outcome is known for all observations in the training data set. As a general setting, we assume a model to predict an outcome y, which may be categorical, ordinal or continuous, by a set of independent variables x_p (p = 1, ..., P). To train a random forest model, a training data set D containing outcomes y_n (n = 1, ..., N) and independent variables x_{n,p} of N observations is used. From this, num.trees subsets D_b (b = 1, ..., num.trees) of D are generated by bootstrapping or subsampling. Based on every subset D_b, a decision tree is trained by iteratively splitting D_b into binary subgroups. Unlike in the original CART algorithm, every split is based on just a subset of size mtry of the available independent variables, to increase the dissimilarity of the trees in a forest. Beginning with D_b acting as the root of the tree T_b, the number of possible splits is given by the combination of splitting variables and splitting values. For each possible split s of a node k, the decrease of impurity is calculated as

∆imp(s, k) = imp(k) − (n_{k_l}/n_k) · imp(k_l) − (n_{k_r}/n_k) · imp(k_r),

where imp(k) is any impurity measure, and n_k, n_{k_l} and n_{k_r} are the numbers of observations in node k and in its left child node k_l and right child node k_r, respectively. For a categorical outcome variable y, usually the Gini index is used as impurity measure, while the mean squared error (MSE) is used for continuous outcomes. Based on this, the optimal split s* is the split which maximizes ∆imp(s, k). The training algorithm stops when either all leaves (terminal nodes) are pure or a pre-defined stopping criterion (e.g., minimal node size, maximal complexity, etc.) is fulfilled. The outcome of a new sample can be predicted by dropping the sample down the trees.
Finally, the predictions of the num.trees trees are aggregated by majority vote for classification and by mean for regression trees, respectively.
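The impurity-decrease criterion described above can be sketched as follows. This is an illustrative Python sketch of the CART splitting criterion, not the implementation used by ranger or timbR; the function names are our own:

```python
import numpy as np

def gini(y):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(y, left_mask, imp=gini):
    """Decrease of impurity for a candidate binary split of the samples y in node k:
    imp(k) - n_kl/n_k * imp(k_l) - n_kr/n_k * imp(k_r).
    left_mask is a boolean array marking the samples sent to the left child."""
    n = len(y)
    y_left, y_right = y[left_mask], y[~left_mask]
    return imp(y) - len(y_left) / n * imp(y_left) - len(y_right) / n * imp(y_right)
```

For a node with two classes in equal proportions, a split that separates the classes perfectly reduces the Gini impurity from 0.5 to 0, so the impurity decrease is 0.5. The tree-growing algorithm evaluates this quantity for every candidate split and keeps the maximizer.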
Distance Measures by Banerjee et al. (2012)
To select a limited subset of representative trees from the forest, Banerjee et al. (2012) derived three tree-based distance measures based on different aspects of the tree structure:

1. Similarity of predictions (d_pred): This compares the arbitrarily scaled predictions ŷ_{i,n} and ŷ_{j,n} of two decision trees T_i and T_j for observation n of an independent test data set, leading to the distance metric

d_pred(T_i, T_j) = (1/N_test) Σ_{n=1}^{N_test} (ŷ_{i,n} − ŷ_{j,n})²,

where N_test is the number of observations in the test data set. By comparing the predictions of trees, this distance measure focuses on selecting the tree with the most similar predictions to the complete ensemble, while it does not necessarily give the best explanation of the hidden tree mechanics.
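A minimal Python sketch of the prediction-based distance, under the assumption that the prediction differences of the two trees on the test data are aggregated as a mean of squared differences (illustrative only, not the timbR implementation):

```python
import numpy as np

def d_pred(pred_i, pred_j):
    """Prediction-based distance between two trees: mean squared difference
    of their predictions on the same independent test data set."""
    pred_i = np.asarray(pred_i, dtype=float)
    pred_j = np.asarray(pred_j, dtype=float)
    return float(np.mean((pred_i - pred_j) ** 2))
```

Two trees that predict identically on the test data have distance 0, as in the Figure 1 example below; the distance grows with every test observation on which the trees disagree.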

2. Similarity of clustering in terminal nodes (d_clust):
This evaluates whether two samples of an independent test data set end in the same leaf of T_j if they ended in the same leaf of T_i, leading to the metric

d_clust(T_i, T_j) = Σ_{n<m} |I_i(n, m) − I_j(n, m)| / (N_test (N_test − 1)/2),

where n and m are observations of the test data set and I_i(n, m) indicates whether n and m end in the same leaf of T_i. In contrast to the first metric, this one focuses on how the trees separate the feature space, leading to a most representative tree that has the biggest overlap in the separation of the feature space and thus a high explanatory power. But like the first metric, an independent data set is necessary to estimate the distance, making it highly dependent on the kind of data set available for calculations.
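The co-clustering comparison can be sketched in Python as follows, assuming each tree is summarized by the terminal-node label that every test observation reaches (an illustrative sketch, not the timbR implementation):

```python
from itertools import combinations

def d_clust(leaf_i, leaf_j):
    """Clustering-based distance between two trees: fraction of test-observation
    pairs (n, m) on which the trees disagree about whether n and m share a leaf.
    leaf_i[n] is the terminal node observation n reaches in tree T_i."""
    pairs = list(combinations(range(len(leaf_i)), 2))
    mismatches = sum(
        (leaf_i[n] == leaf_i[m]) != (leaf_j[n] == leaf_j[m])
        for n, m in pairs
    )
    return mismatches / len(pairs)
```

With four test observations there are six pairs; if the trees disagree on two of them (as for o_2 in the Figure 1 example), the distance is 2/6 = 1/3.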

3. Similarity of splitting variables (d_split):
This distance of two trees T_i and T_j is defined as

d_split(T_i, T_j) = (# of split variable mismatches between T_i and T_j) / (# of independent variables in the data set),

where one split variable mismatch occurs for an independent variable if one of the trees T_i and T_j uses this variable for splitting and the other does not. In contrast to the aforementioned methods, the similarity of splitting variables can be estimated directly from each tree and thus does not need an independent data set, making this method promising when no such data set is available. But since it only compares whether a variable is selected or not, its ability to cover the complex architecture of decision trees, in which a variable can be selected multiple times, is limited.
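Because d_split only needs the sets of variables each tree splits on, it reduces to a symmetric-difference count (an illustrative Python sketch, not the timbR implementation):

```python
def d_split(vars_i, vars_j, n_variables):
    """Splitting-variable distance: number of variables used by exactly one
    of the two trees, divided by the total number of independent variables.
    vars_i and vars_j are the sets of split variables of trees T_i and T_j."""
    mismatches = set(vars_i) ^ set(vars_j)  # symmetric difference
    return len(mismatches) / n_variables
```

Two trees that use the same variables, however differently arranged, obtain distance 0 here, which is exactly the weakness the WSV metric below addresses.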
Figure 1: Comparison of the estimated pairwise tree distances based on four different metrics. Two trees T_1 and T_2 were constructed which use two dichotomous predictor variables x_1 and x_2 to predict a dichotomous outcome variable. Since both trees use exactly the same splitting variables, d_split is zero, although the two trees use the two variables in different ways. Therefore, d_WSV takes into account on which level of the tree a variable was used, resulting in a distance of 1/8. Based on a constructed test data set consisting of four observations o_i = (y_i, x_{1i}, x_{2i}), o_1 = (−, 0, 0), o_2 = (−, 1, 0), o_3 = (+, 0, 1) and o_4 = (−, 1, 1), d_pred also leads to a distance of zero, since both trees give the same predictions on the test data set. Finally, the clusterings by the two trees differ in whether o_2 ends in a terminal node with o_1 or o_4, leading to a distance d_clust of 1/3.

In Figure 1 we built an example to highlight the differences between these methods. For this, we assumed a simple model with two dichotomous variables x_1 and x_2 explaining a dichotomous outcome y. Two non-identical trees T_1 and T_2 (upper half of Figure 1) were constructed and used to predict the class of four observations o_n = (y_n, x_{1n}, x_{2n}), o_1 = (−, 0, 0), o_2 = (−, 1, 0), o_3 = (+, 0, 1) and o_4 = (−, 1, 1) (lower half of Figure 1). The pairwise tree distances for all three metrics as given above are shown. For d_pred, the predictions are identical in both trees for all observations o_1, ..., o_4, leading to a distance of 0. For d_clust, we observe that the observation o_2 ends in a terminal node together with o_4 in T_1 and together with o_1 in T_2. Thus, there are two differences in clustering, while we have six possible pairs according to the metric, leading to a distance of 1/3.
The metric d_split finally leads to a distance of 0, since both trees use the same split variables, but in a different way, which is ignored by d_split. This example illustrates that the metrics d_pred and d_split proposed by Banerjee et al. (2012) can lead to a distance of 0 even for non-identical trees, because both do not exploit the tree structure in its entire complexity. The last metric, d_clust, uses the underlying tree structure for calculation but requires an independent test data set.

Similarity of weighted splitting variables (WSV) (d_WSV)
The WSV metric compares the structures of the trees in more detail without using an independent test data set. Specifically, we combine the information that a splitting variable v is used with the information about how much the tree relies on this splitting variable. The latter is assessed by a usage score where k indicates the K i nodes of T i . This leads to the distance measure where V is the set of all independent variables. Thus, the WSV metric counts the number of times a specific splitting variable is used within a tree, weighted by the depth it occurs in. The weight becomes smaller the later in the tree the splitting variable is used, so that the first splitting variable receives the highest weight, assuming that the first split is the most important. The resulting vector of usage scores of all splitting variables is compared between two trees. Returning to the example in Figure 1, our new distance measure leads to usage scores U 1 (x 1 ) = 1 2 , U 1 (x 2 ) = 1 4 , U 2 (x 1 ) = 1 4 and U 2 (x 2 ) = 1 2 , resulting in a distance of d W SV = 1 8 . Therefore, this new metric is promising when no additional data set is available for estimating tree distances, since it covers more of the complex tree architecture of decision trees.
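The usage score and the resulting WSV distance can be sketched in a few lines of Python; here a tree is represented simply as a list of (splitting variable, level) pairs for its inner nodes, with the root at level 1 (an illustrative sketch, not the timbR implementation):

```python
def usage_score(tree, variable):
    """U_i(v): sum of 2^(-level) over all inner nodes of the tree that
    split on the given variable (root = level 1)."""
    return sum(2.0 ** -level for v, level in tree if v == variable)

def d_wsv(tree_i, tree_j, variables):
    """WSV distance: sum of squared usage-score differences over all variables."""
    return sum(
        (usage_score(tree_i, v) - usage_score(tree_j, v)) ** 2
        for v in variables
    )

# Figure 1 example: T_1 splits on x1 at the root and x2 at level 2;
# T_2 uses the same variables in the reverse order.
t1 = [("x1", 1), ("x2", 2)]
t2 = [("x2", 1), ("x1", 2)]
print(d_wsv(t1, t2, ["x1", "x2"]))  # → 0.125, i.e. the 1/8 from Figure 1
```

This reproduces the usage scores from the text (U_1(x_1) = 1/2, U_1(x_2) = 1/4, and vice versa for T_2) and the resulting distance of 1/8, whereas d_split would assign these two trees a distance of 0.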

Selection of the most representative tree
All of the tree-based distance metrics defined above measure the pairwise distances of trees based on different aspects of similarity. Given a specific distance metric d, Banerjee et al. (2012) proposed to select the most representative tree from the random forest as the tree T_b minimizing the distance score

D(T_b) = Σ_{b'=1}^{num.trees} d(T_b, T_{b'}).
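Given a matrix of pairwise tree distances from any of the metrics above, the selection step reduces to an argmin over row sums (an illustrative Python sketch; timbR itself is implemented in R):

```python
import numpy as np

def most_representative(dist_matrix):
    """Index of the most representative tree: the tree with the smallest
    summed pairwise distance to all other trees in the forest.
    dist_matrix[i][j] holds the distance between trees T_i and T_j."""
    D = np.asarray(dist_matrix, dtype=float)
    return int(np.argmin(D.sum(axis=1)))
```

For a forest of 500 trees this requires all 500 · 499 / 2 pairwise distances, which is why the per-pair computation times reported in the computation-time section below matter in practice.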

Simulation study
Simulation design
The aim of our simulation study was to evaluate the properties of representative trees selected from a random forest based on different distance measures. For this, we defined the following data generating mechanism: We assumed a continuous outcome y which can be expressed by y_n = x_n β + ε_n, where y_n is the true outcome of observation n, x_n = (x_{n,p})_{p=1,...,100} with x_{n,p} ∼ B(0.5) are independent variables with effect sizes β_p, and ε_n ∼ N(0, 1) is a normally distributed error term. For each replicate of our simulations we simulated a training data set of 1000 observations (y_n, x_n) for training of random forests, a test data set of 100 observations for the calculation of distance metrics requiring independent test data, and an additional validation data set of 1000 observations to compare tree and random forest predictions ŷ_n with the simulated y_n. We defined five different scenarios to be evaluated in our simulation study:

1. Few large main effects: In this simplest scenario, only five independent variables have an effect of β_p = 2, while the other 95 independent variables have an effect of zero.
2. Many small main effects: In contrast to the first scenario, here 50 independent variables have an effect of β_p = 0.2 and the remaining 50 independent variables have no influence on the outcome y.
3. Correlated variables: While in the first two scenarios all independent variables were uncorrelated, we included correlated independent variables in the third scenario. Here, again we used five independent variables with β_p = 2, but this time, for each of the effect variables, four additional variables were simulated which have a correlation of 0.3 with the effect variable but no effect on the outcome themselves. Therefore, the complete data set consists of five variables with an effect, 20 correlated variables without main effects and 75 uncorrelated variables without an influence on the outcome.

4. Interaction effects:
In this scenario, again five independent variables have β_p = 2, but this time five additional pairs of variables have an interaction effect of the same size, but no main effect.

5. Binary and continuous variables:
While in all other scenarios all independent variables were binary, here we added five continuous variables sampled from a standard normal distribution to the data set alongside five binary variables, all having β_p = 2. Finally, the data set was filled with 90 binary variables without any influence on the outcome.
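The data generating mechanism of scenario one can be sketched as follows. The study itself was run in R; this Python sketch with an arbitrary seed is only meant to make the setup concrete, and the choice of the first five columns as effect variables is ours:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

def simulate_scenario1(n, p=100, n_effect=5, beta=2.0):
    """Scenario 1 sketch: p Bernoulli(0.5) predictors, n_effect of them
    with effect size beta, plus standard normal noise on the outcome."""
    x = rng.binomial(1, 0.5, size=(n, p))
    betas = np.zeros(p)
    betas[:n_effect] = beta          # five large main effects, rest zero
    y = x @ betas + rng.normal(0.0, 1.0, size=n)
    return x, y

# one replicate: training (1000), test (100) and validation (1000) data
x_train, y_train = simulate_scenario1(1000)
x_test, y_test = simulate_scenario1(100)
x_val, y_val = simulate_scenario1(1000)
```

The other scenarios modify this template by shrinking the effect sizes, adding correlated or continuous predictors, or introducing pairwise interaction terms.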
For each of the five scenarios we simulated 100 replicates. On each resulting training data set we trained four different random forests using ranger (Wright and Ziegler, 2017) with bootstrapping, mtry = √P = 10 and varying minimal node sizes of 10, 50, 100 and 200. For every random forest, all four distance measures defined above were calculated, and the most representative trees were selected. Additionally, we trained single decision trees using rpart with the same minimal node sizes of 10, 50, 100 and 200.
To assess the properties of the distance measures and resulting representative trees, we defined three different estimands with accompanying performance measures.
1. Representation of the random forest prediction: Here, we assess the ability of the most representative trees to represent the prediction of the entire ensemble. We selected for each metric the most representative tree with the lowest distance to all other trees, and evaluated the similarity of the predictions of these trees with those of the entire random forest by the mean squared difference

(1/N) Σ_{n=1}^{N} (ŷ_n − ŷ*_n)²,

where ŷ_n is the prediction of the complete random forest on the validation data set, while ŷ*_n is the prediction of the selected most representative tree on the same data set.

2. Generalizability of most representative trees:
Here, we evaluate the prediction performance of the most representative tree on a new independent data set. In the validation data the prediction accuracy of a most representative tree is investigated and compared with the accuracy based on the entire random forest. For this, we selected a most representative tree for each distance measure and compared its mean squared error with that of the entire random forest.

3. Fraction of covered effect variables:
Ideally the selected most representative tree includes all effect variables. Therefore, we estimated how many of the simulated effect variables were at least once included in the most representative tree.
Results on the representation of the random forest prediction

Figure 2: Representation of random forest predictions by single most representative trees selected based on four different distance measures, with respect to the minimal node size of the original random forest. Deviation of the representative tree prediction from the original random forest prediction was measured by the mean squared difference between the prediction of a most representative tree and the prediction of the complete random forest based on the same validation data set.

In Figure 2 the results for the representation of the random forest prediction are shown for all five scenarios. In all settings, the predictions of the most representative trees become more similar to those of the entire forest with increasing minimal node sizes. This is related to the fact that a larger minimal node size leads to smaller trees, which partition the prediction space more coarsely, so that a single tree can cover the information of the forest more easily. Additionally, the prediction based distance measure leads to the most similar predictions in all scenarios. This is not surprising, since this distance measure defines the tree whose predictions are most similar to those of all other trees as most representative. All other distance measures lead to rather similar results, but the WSV approach is the best of the remaining three metrics. In general, most representative trees are more similar to the random forest than a decision tree trained on the same data set; but in a data set with many small main effects, the decision tree leads to remarkably better results than the most representative trees.
Results on the generalizability of most representative trees

The relative mean squared error of the most representative tree on a new validation data set is depicted in Figure 3. In scenario one, the WSV approach leads to the best MSE when the minimal node size is small and the trees are therefore more complex. For minimal node sizes above one hundred, the distance measure using the clustering in the terminal nodes becomes slightly better. Of note, the prediction based distance measure is the second best approach for the very small minimal node size of ten, but is the worst for all other minimal node sizes. In contrast, the prediction based distance measure leads to the best results in scenario two, where many small effects influence the outcome, while all other metrics lead to similar results that are slightly worse. When correlated independent variables or interaction effects are in the data set (scenarios three and four), the WSV approach is again the best for random forests with a minimal node size of ten and the second best for higher minimal node sizes. In contrast to scenario one, this time the splitting variable approach is the best for higher minimal node sizes, but all approaches except the prediction based one are comparatively good. Finally, when a mixture of binary and continuous variables has an effect on the outcome, the prediction based approach leads to the best results when the minimal node size is ten, followed by the WSV approach. In all other cases, the WSV distance is again only the second best, while the splitting variable distance yields the best results. Again, except for scenario two, most representative trees lead to better results than a single decision tree.

Computation time
While the computation time for each metric was roughly constant across the different settings, the methods differ drastically from each other. Specifically, the splitting variable distance is the fastest, needing only 0.87 seconds, while the WSV metric is only slightly slower at 1.88 seconds. The prediction based metric is also comparatively fast with 2.93 seconds, while the clustering based approach is far slower, taking 488 seconds to calculate all pairwise distances for 500 trees.
Proportion of covered effect variables

In scenario one, all trees cover all effect variables for a minimal node size of ten, but while the single decision tree still covers 90% of the effect variables for a minimal node size of 200, representative trees selected based on similarity of predictions drop below 20% for the same minimal node size. Overall, the similarity of predictions leads to representative trees that usually cover the fewest effect variables, which indicates that these trees may be good for prediction but not for explanation. Also, the single decision trees seem to have a lower coverage for many small main effects, interaction effects and the combination of continuous and dichotomous variables. Again, the similarity of weighted splitting variables leads to the trees with the second best coverage, but no other method performs consistently better.

Real data application
Data set
To apply our newly developed distance measure to real data, we used data from 351 genetically confirmed XDP patients provided by the ProtectMove Research Unit (DFG: FOR2488). Participant enrollment and data analysis were approved by the Ethics Committees of the University of Lübeck, Germany, Massachusetts General Hospital, Boston, USA, Metropolitan Medical Center, Manila, Philippines, and Jose Reyes Medical Center, Manila, Philippines. XDP is a neurodegenerative movement disorder which combines symptoms of dystonia and Parkinson's disease and is limited to the Philippines (Lee et al., 2011). Whereas the genetic cause of XDP is well understood (Aneichyk et al., 2018; Rakovic et al., 2018), the aim of current research is to better understand the basis of the varying age at onset (AAO) in the affected patients. Two recent studies (Bragg et al., 2017; Westenberger et al., 2019) revealed that 50% of the variability in AAO can be explained by the repeat number (RN) of a hexanucleotide repeat within the ~2.6-kb SINE-VNTR-Alu (SVA) retrotransposon insertion in intron 32 of TAF1 on the X chromosome. In our study population the AAO ranges between 21 and 67 years (mean = 41.7 years, standard deviation (sd) = 8.4 years) and RN ranges between 30 and 55 units (mean = 41.6 units, sd = 4.1 units). For all patients the AAO was determined in a standardized interview with movement disorder specialists. For each observation the data set includes not only AAO and RN, but also a set of 82 additional single nucleotide polymorphisms (SNPs). Of these, we selected three SNPs which were recently identified to modify AAO in XDP (Laabs et al., 2021), located in MSH3 and PMS2. Additionally, we selected the SNPs with the lowest p-values in the genome-wide association study by Laabs et al. (2021) within nine different genes in the MSH3/PMS2 pathway.
Since XDP patients show symptoms similar to dystonia (DYT) and Parkinson's disease (PD), we also included the SNPs with the lowest p-values in 12 PD and 12 DYT genes as well as 36 PD-related SNPs (Nalls et al., 2019) and 8 DYT-related SNPs (Mok et al., 2014; Lohmann et al., 2014).
Finally, two SNPs associated with Huntington's disease (HD) were included (Moss et al., 2017; Chao et al., 2018).

We aimed to extract a single most representative tree for the given problem of modeling AAO in XDP patients and to examine its tree structure. Therefore, we trained a single random forest on the complete available data set with 500 trees, mtry = P (so that at each node the best variable can be selected), bootstrapping and a minimal node size of 20% of the training data set. From the resulting random forest we selected the most representative tree (Figure 5). In the upper splits of this tree, no additional genetic information was used besides the SVA RN. After the split on rs245013, the tree uses variants not identified by Laabs et al. (2021) for the first and only time. Here we can see that being homozygous for the alternative allele of rs11060180, a variant in the CCDC62 gene which was previously associated with Parkinson's disease, was associated with a strong decrease in AAO, making it a potential risk factor. The fact that all known SNPs associated with the AAO of XDP were selected in different branches of the tree indicates that they tag modifiers that do not interact with each other, but act individually. Additionally, it is likely that each of the tagged modifiers interacts with the RN.

Discussion
Our work aimed to develop a new tree-based distance measure capable of using the underlying tree architecture to estimate the similarity of decision trees. This is necessary to select most representative trees from random forests, which have considerable potential for the visualization and interpretation of random forests and may facilitate the incorporation of machine learning into clinical practice. Therefore, we constructed a usage score which estimates how a given tree uses a specific variable by increasing the score for each occurrence of the variable in the tree, weighted by the level of the tree at which it occurs. Thus, the earlier a variable is selected for splitting, the higher the score. Our distance metric then compares the usage scores for all variables between two trees and sums up their squared differences.
To investigate the validity of the WSV metric in comparison with other distance measures, we designed an extensive simulation study in which we considered five different data scenarios. In these, we compared the prediction of the most representative tree with the complete random forest prediction. Here, the prediction based metric by Banerjee et al. (2012), which requires an additional test data set for estimating the distance, performed best, but the WSV metric performed second best in most scenarios. Additionally, we compared the prediction of the most representative tree with the known true outcome of the observations in an additional validation data set. In this case, the prediction based distance metric leads to the best results when the minimal node size is very small and the trees are therefore more complex. In all other cases, the distance measures using the splitting variables or the clustering in terminal nodes lead to the best results. Of note, the WSV approach was again the second best approach in most of the cases. Although the most representative trees selected by the WSV metric were only the second best in nearly all scenarios, they showed the most robust results, while every other metric sometimes led to the best and sometimes to the worst results. Another advantage of the WSV approach is that, unlike other metrics, it does not rely on an additional data set to estimate distances, making it valuable for situations without additional available data. Of note, the additional test data sets generated in our simulation study used the exact same generating mechanism as the original training data set. In practice, it can be expected that an additional data set follows a slightly different generating mechanism, leading to an expected decrease of the explanatory power of most representative trees that rely on such data sets. Finally, WSV is comparably fast and reliably leads to the trees which include the most effect variables.
In an application to real data predicting the AAO of XDP patients, we were able to replicate the results by Westenberger et al. (2019) and Laabs et al. (2021). As a major advantage, our approach takes the position of a variable within the tree into account and is, therefore, better able to mirror the tree architecture. However, it still does not incorporate the sequence or interaction of variables. Thus, further studies that may show how the WSV distance metric can be improved in a time-efficient way are warranted. We expect that improvements in the calculation of pairwise tree distances are further steps towards improving the interpretability and therefore the practical use of random forests.