1 Introduction

Predicting future outcomes for patients is a central aspect of modern medicine. Therefore, accurate and robust prognostic and predictive models are needed to predict, for instance, which individual treatment leads to the highest success probability. However, with the growing importance of precision medicine and the accompanying phenotype-rich data sets, the underlying mechanisms become too complex to handle with classical regression-based models. Thus, machine learning approaches become attractive.

One established machine learning method is the random forest, which has become popular due to several advantages. First, random forests can be trained on a wide range of low- and high-dimensional data. Also, they can be applied in classification, regression as well as probability estimation settings (Breiman 2001) and can be used to model survival outcomes (Ishwaran et al. 2008; Hothorn et al. 2006). Recently, random forests were even applied to the complex task of text classification (Salles et al. 2020). Finally, by aggregating single decision trees into a random forest, the prediction performance can usually be improved drastically. However, whereas single decision trees are easy to interpret, it is relatively difficult to understand and communicate the prediction of a random forest. Thus, similar to other complex machine learning approaches, random forests are often described as predictive and explanatory black boxes. This often hinders their translation into clinical practice, given that clinicians are more likely to base their decisions on a prediction model they understand. Hence, one of the key features determining the usability of prediction models is that they are easy to interpret (Wyatt and Altman 1995; Heinze et al. 2018).

A further motivation comes from recent efforts to optimize ensemble methods. For example, reducing an ensemble such as a random forest by excluding trees with low predictive power or redundant information can increase the overall prediction performance of the ensemble (Adler et al. 2016; Kahn et al. 2019, 2021).

To gain insight into a random forest model, a standard approach is to utilize variable importance measures (Breiman 2001; Strobl et al. 2007; Nembrini et al. 2018). These evaluate how important each independent variable is for the performance of the model. However, this does not capture the complex tree structure of the model, including interactions between variables. Building on this, Molnar et al. (2021) proposed a method to obtain statistical estimators for the data-generating process based on partial dependence plots and permutation feature importance. However, they found that these estimators are always biased and can only reflect the trained model. Alternatively, relative feature importance (Koenig et al. 2020) can be used to calculate feature importances relative to any other subset of variables, which should be able to cover variable importances in complex tree structures. Nevertheless, it cannot explain how the prediction of a tree structure is influenced by a certain variable.

A completely different way to enhance the interpretability of a random forest is to identify a single tree that best represents the forest. Thus, instead of theoretically having to inspect hundreds of trees in the entire forest, only one representative tree needs to be interpreted to understand the overall model. Banerjee et al. (2012) proposed to select such representative trees based on the highest similarity to all other trees in the forest. To measure the pairwise similarity of trees, they derived three distance metrics which cover different aspects of the similarity of trees, namely, the similarity of predictions, the clustering in the terminal nodes, and the selection of splitting variables. However, these metrics are relatively simple and do not, for example, capture the underlying tree structure. Also, it should not only be of interest whether a variable was selected for splitting, but also at which point in the tree it was selected.

We therefore developed a new distance measure for decision trees to identify the most representative trees in random forests, based on the selected splitting variables but incorporating the level at which they were selected within the tree. Additionally, variables that are selected multiple times within a tree are prioritized, assuming that these are more important.

Section 2 of this paper begins with a short introduction to decision trees and random forests, followed by a description of the metrics introduced by Banerjee et al. (2012). We then develop our new weighted splitting variable (WSV) metric and describe how to extract the most representative tree from the forest based on any tree distance. In Sect. 3, we investigate the behavior of the different tree-based distance measures in an extensive simulation study. We apply our new method to a real data set analyzing the genetic background of X-linked dystonia-parkinsonism (XDP) in Sect. 4. In this application, we aim to predict the age at onset of XDP patients from known genetic risk factors (Westenberger et al. 2019; Laabs et al. 2021). Finally, we discuss our results and give recommendations on the use of the different metrics in Sect. 5.

2 Methods

2.1 Decision trees and random forests

Random forests (Breiman 2001) are based on a combination of classification and regression trees (CART, Breiman et al. 1984) and bagging (Breiman 1996). Random forests are a supervised learning algorithm, meaning that the true outcome is known for all observations in the training data set. As a general setting, we assume a model to predict an outcome y, which may be categorical, ordinal or continuous, by a set of independent variables \(x_p\ (p = 1,\ldots , P) \).

To train a random forest model, a training data set D containing outcomes \(y_n\ (n = 1,\ldots , N)\) and independent variables \(x_{n,p}\) of N observations is used. From this, num.trees subsets \(D_b\ (b = 1, \ldots , num.trees)\) of D are generated by bootstrapping or subsampling. Based on every subset \(D_b\), a decision tree is trained by iteratively splitting \(D_b\) into binary subgroups. Unlike in the original CART algorithm, every split is based on just a subset of size mtry of the available independent variables, to increase the dissimilarity of the trees in a forest. Beginning with \(D_b\) acting as the root of the tree \(T_b\), the number of possible splits is given by the combination of splitting variables and splitting values. For each possible split s, the decrease of impurity

$$\begin{aligned} \Delta imp(s, k) = imp(k) - \left( \frac{n_{k_l}}{n_k} imp(k_l) + \frac{n_{k_r}}{n_k} imp(k_r) \right) \end{aligned}$$

is calculated. Here, \(imp(k)\) is any impurity measure, and \(n_k\), \(n_{k_l}\) and \(n_{k_r}\) are the numbers of observations in node k and in its left and right child nodes \(k_l\) and \(k_r\), respectively. For a categorical outcome variable y, the Gini index is usually used as impurity measure, while the mean squared error (MSE) is used for continuous outcomes. Based on this, the optimal split \(s^*\) is the split that maximizes \(\Delta imp(s,k)\). The training algorithm stops when either all leaves (terminal nodes) are pure or a pre-defined stopping criterion (e.g., minimal node size, maximal complexity) is fulfilled. The outcome of a new sample can be predicted by dropping the sample down the trees. Finally, the predictions of the num.trees trees are aggregated, by majority vote for classification and by the mean for regression trees, respectively.
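To make the split criterion concrete, the following minimal R sketch evaluates the decrease of impurity for a single candidate split of a continuous outcome, using the MSE as impurity measure. The helper names imp_mse and delta_imp are our own and not part of any package.

```r
## Minimal sketch: decrease of impurity for one candidate split (variable x,
## split value c) with the MSE as impurity measure for a continuous outcome y.
imp_mse <- function(y) mean((y - mean(y))^2)

delta_imp <- function(y, x, c) {
  left <- x <= c                       # observations sent to the left child node
  n_k  <- length(y)
  n_kl <- sum(left)
  n_kr <- n_k - n_kl
  imp_mse(y) - (n_kl / n_k * imp_mse(y[left]) +
                n_kr / n_k * imp_mse(y[!left]))
}

## Example: evaluate a split of a binary variable at 0.5
set.seed(1)
x <- rbinom(100, 1, 0.5)
y <- 2 * x + rnorm(100)
delta_imp(y, x, 0.5)
```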

2.2 Distance measures by Banerjee et al. (2012)

To select a limited subset of representative trees from the forest, Banerjee et al. (2012) derived three tree-based distance measures based on different aspects of the tree structure:

1. Similarity of predictions (\(d_{pred}\)):

    This compares the arbitrarily scaled predictions \({\hat{y}}_{i,n}\) and \({\hat{y}}_{j,n}\) of two decision trees \(T_i\) and \(T_j\) for observation n of an independent test data set, leading to the distance metric

    $$\begin{aligned} d_{pred}(T_i, T_j) = \frac{1}{N_{test}} \sum _{n = 1}^{N_{test}} ({\hat{y}}_{i,n} - {\hat{y}}_{j,n})^2, \end{aligned}$$

    where \(N_{test}\) is the number of observations in the test data set. By comparing the predictions of trees, this distance measure focuses on selecting the tree whose predictions are most similar to those of the complete ensemble, while it does not necessarily give the best explanation of the hidden tree mechanics. Of note, this metric requires a test data set for the estimation of predictions. Apart from the question of availability of such a test data set, its size also affects the computation time of this method in a linear fashion.

2. Similarity of clustering in terminal nodes (\(d_{clust}\)):

    This evaluates whether two samples of an independent test data set end in the same leaf of \(T_j\) if they end in the same leaf of \(T_i\), leading to the metric

    $$\begin{aligned} d_{clust}(T_i, T_j) = \frac{\sum _{n > m} \sum _m \mid C_{T_i}(n,m) - C_{T_j}(n,m) \mid }{\left( {\begin{array}{c}N_{test}\\ 2\end{array}}\right) }, \end{aligned}$$

    with

    $$\begin{aligned} C_{T_i}(n,m) = {\left\{ \begin{array}{ll} 1 &{}, \text {if }n \text { and }m\text { end in the same leaf of }T_i\\ 0 &{}, \text {else} \end{array}\right. }, \end{aligned}$$

    where n and m are observations of the test data set. In contrast to the first metric, this one focuses on how the trees partition the feature space, so that the most representative tree is the one with the largest overlap in this partition and thus a high explanatory power. However, like the first metric, an independent data set is necessary to estimate the distance, making it highly dependent on the kind of data set available for the calculations. Also, the computation time again increases with the number of samples in the test data set, this time in a quadratic fashion.

3. Similarity of splitting variables (\(d_{split}\)):

    This distance of two trees \(T_i\) and \(T_j\) is defined as

    $$\begin{aligned} d_{split}(T_i, T_j) = \frac{\# \text {of split variable mismatches between }T_i \text { and }T_j}{\# \text {of independent variables in the data set}}, \end{aligned}$$

    where one split variable mismatch occurs for an independent variable if one of the trees \(T_i\) and \(T_j\) uses this variable for splitting and the other does not. In contrast to the aforementioned methods, the similarity of splitting variables can be estimated directly from each tree and thus does not require an independent data set, making this method promising when no such data set is available. However, since it only compares whether a variable is selected or not, its ability to capture the complex architecture of decision trees, in which a variable can be selected multiple times, is limited. A code sketch illustrating all three metrics is given after this list.
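All three distances can be estimated from a fitted ranger forest. The following R sketch is a minimal illustration under the assumption that rf is a ranger regression forest trained via the formula interface and test is an independent test data set containing the predictor columns; the helper names d_pred, d_clust and d_split are our own, not part of any package.

```r
library(ranger)

## Assumptions: `rf` is a fitted ranger regression forest, `test` is an
## independent test data set with the predictor columns.
pred_all <- predict(rf, data = test, predict.all = TRUE)$predictions   # obs x trees
term_all <- predict(rf, data = test, type = "terminalNodes")$predictions

d_pred <- function(i, j) mean((pred_all[, i] - pred_all[, j])^2)

d_clust <- function(i, j) {
  ## C(n, m) = 1 if observations n and m end in the same leaf
  C_i <- outer(term_all[, i], term_all[, i], "==")
  C_j <- outer(term_all[, j], term_all[, j], "==")
  sum(abs(C_i - C_j)[lower.tri(C_i)]) / choose(nrow(term_all), 2)
}

d_split <- function(i, j) {
  used_i <- unique(na.omit(as.character(treeInfo(rf, tree = i)$splitvarName)))
  used_j <- unique(na.omit(as.character(treeInfo(rf, tree = j)$splitvarName)))
  ## mismatches: variables used for splitting in exactly one of the two trees
  n_mismatch <- length(setdiff(used_i, used_j)) + length(setdiff(used_j, used_i))
  n_mismatch / length(rf$forest$independent.variable.names)
}

## Example: distances between the first two trees of the forest
c(d_pred(1, 2), d_clust(1, 2), d_split(1, 2))
```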

Fig. 1 Comparison of the estimated pairwise tree distances based on four different metrics. Two trees \(T_1\) and \(T_2\) were constructed which use two dichotomous predictor variables \(x_1\) and \(x_2\) to predict a dichotomous outcome variable. Since both trees use exactly the same splitting variables, the metric comparing splitting variables (\(d_{split}\)) is zero, although both trees use the two variables in different ways. Therefore, our weighted splitting variables approach (\(d_{WSV}\)) takes into account on which level of the tree a variable was used, resulting in a distance of \(\frac{1}{8}\). Based on a constructed test data set consisting of four observations \(o_i = (y_i, x_{1i}, x_{2i}), o_1 = (-,0,0), o_2 = (-,1,0), o_3 = (+,0,1)\) and \( o_4 = (-,1,1)\), the prediction-based metric (\(d_{pred}\)) also leads to a distance of zero, since both trees give the same predictions on the test data set. Finally, the clustering by the two trees differs in whether \(o_2\) ends in a terminal node with \(o_1\) or \(o_4\), leading to a clustering-based distance (\(d_{clust}\)) of \(\frac{1}{3}\)

In Fig. 1, we constructed an example to highlight the differences between these methods. For this, we assumed a simple model with two dichotomous variables \(x_1\) and \(x_2\) explaining a dichotomous outcome y. Two non-identical trees \(T_1\) and \(T_2\) (upper half of Fig. 1) were constructed and used to predict the class of four observations \(o_n = (y_n, x_{1n}, x_{2n}), o_1 = (-,0,0), o_2 = (-,1,0), o_3 = (+,0,1)\) and \( o_4 = (-,1,1)\) (lower half of Fig. 1). The pairwise tree distances for all three metrics as given above are shown. For \(d_{pred}\), the predictions are identical in both trees for all observations \(o_1, \ldots , o_4\), leading to a distance of 0. For \(d_{clust}\), we observe that observation \(o_2\) ends in a terminal node together with \(o_4\) in \(T_1\) and together with \(o_1\) in \(T_2\). Thus, there are two differences in clustering out of the 6 possible pairs considered by the metric, leading to a distance of \(\frac{1}{3}\). The metric \(d_{split}\), finally, leads to a distance of 0, since both trees use the same splitting variables, but in a different way, which is ignored by \(d_{split}\).

This example illustrates that the metrics \(d_{pred}\) and \(d_{split}\) proposed by Banerjee et al. (2012) can lead to a distance of 0 even for non-identical trees, because neither exploits the tree structure in its entire complexity. The metric \(d_{clust}\) uses the underlying tree structure for the calculation but requires an independent test data set.

2.3 Similarity of weighted splitting variable (WSV) (\(d_{WSV}\))

The WSV metric compares the structures of the trees in more detail without using an independent test data set. Specifically, we combine the information that a splitting variable v is used with the information about how much the tree relies on this splitting variable. The latter is assessed by a usage score

$$\begin{aligned} U_i(v) = \frac{\sum _{k=1}^{K_i} \frac{\mathbbm {1} _{ \left[ s_i(k) = v \right] }}{2^{l_i(k) - 1}}}{\max _k(l_i(k))}, \end{aligned}$$

where k indicates the \(K_i\) nodes of \(T_i\), \(l_i(k)\) represents the level of the tree \(T_i\) in which node k occurs, and \( \mathbbm {1} _{ \left[ s_i(k) = v \right] } \) indicates whether the splitting variable selected at node k is the variable v. This leads to the distance measure

$$\begin{aligned} d_{WSV}(T_i,T_j) = \sum _{v \in V} \left( U_i(v) - U_j(v) \right) ^2, \end{aligned}$$

where V is the set of all independent variables. Thus, the WSV metric counts the number of times a specific splitting variable is used within a tree, weighted by the depth at which it occurs. The weight becomes smaller the later in the tree the splitting variable is used, so that the first splitting variable receives the highest weight, assuming that the first split is the most important. The resulting vector of usage scores of all splitting variables is then compared between two trees. Returning to the example in Fig. 1, our new distance measure leads to usage scores \(U_1(x_1) = \frac{1}{2}, U_1(x_2) = \frac{1}{4}, U_2(x_1) = \frac{1}{4} \) and \(U_2(x_2) = \frac{1}{2}\), resulting in a distance of \(d_{WSV} = \frac{1}{8}\). Therefore, this new metric is promising when no additional data set is available for estimating tree distances, since it covers more of the complex architecture of decision trees.
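As a sketch of how \(d_{WSV}\) can be computed from a fitted ranger forest rf: the helpers usage_scores and d_WSV below are our own, and we take the denominator \(\max_k(l_i(k))\) as the level of the deepest split node, an assumption that reproduces the usage scores of the example in Fig. 1.

```r
library(ranger)

## Usage scores U_i(v) for all variables in tree i of a ranger forest `rf`.
## Node levels are reconstructed from treeInfo(); in ranger, child nodes
## always have larger node IDs than their parents, so a single forward pass
## over the node table is sufficient.
usage_scores <- function(rf, i) {
  info  <- treeInfo(rf, tree = i)
  level <- rep(NA_integer_, nrow(info))
  level[1] <- 1                                   # root node (nodeID 0)
  for (k in seq_len(nrow(info))) {
    if (!info$terminal[k]) {
      level[info$leftChild[k]  + 1] <- level[k] + 1
      level[info$rightChild[k] + 1] <- level[k] + 1
    }
  }
  splits  <- !info$terminal
  vars    <- rf$forest$independent.variable.names
  scores  <- setNames(numeric(length(vars)), vars)
  contrib <- tapply(1 / 2^(level[splits] - 1),
                    as.character(info$splitvarName[splits]), sum)
  scores[names(contrib)] <- contrib
  scores / max(level[splits])                     # deepest split level (assumption)
}

## Squared Euclidean distance of the usage score vectors of trees i and j
d_WSV <- function(rf, i, j) sum((usage_scores(rf, i) - usage_scores(rf, j))^2)
```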

2.4 Selection of the most representative tree

All of the tree-based distance metrics defined above measure the pairwise distances of trees based on different aspects of similarity. Given a specific distance metric, Banerjee et al. (2012) proposed to select the most representative tree from the random forest according to the distance score

$$\begin{aligned} D(T_i) = \frac{\sum _{j \ne i} d(T_i, T_j)}{num.trees - 1} \end{aligned}$$

of every tree \(T_i\). From all trees, the tree with the lowest \(D(T_i)\) is selected.
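Given a matrix of pairwise distances, the selection step reduces to an argmin over row means; a minimal sketch, assuming dist_mat is a precomputed num.trees \(\times\) num.trees matrix with zero diagonal (e.g., filled with one of the functions sketched above):

```r
## dist_mat: assumed num.trees x num.trees matrix of pairwise tree distances
D_score <- rowSums(dist_mat) / (ncol(dist_mat) - 1)   # D(T_i); diagonal is zero
most_representative <- which.min(D_score)              # index of the selected tree
```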

3 Simulation study

3.1 Simulation design

The aim of our simulation study was to evaluate the properties of representative trees selected from a random forest based on different distance measures. For this, we defined the following data-generating mechanism: We assumed a continuous outcome y which can be expressed by \(y_n = {\textbf{x}}_n' \beta + \epsilon _n\), where \(y_n\) is the true outcome of observation n, \({\textbf{x}}_n = (x_{n,p})_{p = 1,\ldots , 100}\) with \(x_{n,p} \sim {\mathcal {B}}(0.5)\) are independent variables with effect sizes \(\beta _p\), and \(\epsilon _n \sim {\mathcal {N}}(0,1)\) is a normally distributed error term. For each replicate of our simulations, we simulated a training data set of 1000 observations \((y_n, {\textbf{x}}_n)\) for training the random forests, a test data set of 100 observations for the calculation of distance metrics requiring independent test data, and an additional validation data set of 1000 observations to compare tree and random forest predictions \({\hat{y}}_n\) with the simulated \(y_n\). Of note, the smaller sample size for the test data set was chosen to prevent extremely long computation times for the metrics relying on these data sets.

We defined five different scenarios to be evaluated in our simulation study:

1. Few large main effects: In this simplest scenario, only five independent variables have an effect of \(\beta _p = 2\), while the other 95 independent variables have an effect of zero.

2. Many small main effects: In contrast to the first scenario, here 50 independent variables have an effect of \(\beta _p = 0.2\), while the remaining 50 independent variables have no influence on the outcome y.

3. Correlated variables: While in the first two scenarios all independent variables were uncorrelated, we included correlated independent variables in the third scenario. Here, we again used five independent variables with \(\beta _p = 2\), but this time, for each of the effect variables, four additional variables were simulated which have a correlation of 0.3 with the effect variable but no effect on the outcome themselves. Therefore, the complete data set consists of five variables with an effect, 20 correlated variables without main effects and 75 uncorrelated variables without an influence on the outcome. The correlation value of 0.3 was chosen to be comparable with the real genetic data set, since in genetic studies variants with a correlation higher than 0.3 are usually excluded from the analyses.

4. Interaction effects: In this scenario again five independent variables have \(\beta _p = 2\), but this time five additional pairs of variables have an interaction effect of the same size, but no main effect.

5. Binary and continuous variables: While in all other scenarios all independent variables were binary, here we added five continuous variables sampled from a standard normal distribution to the data set alongside five binary variables, all having \(\beta _p = 2\). Finally, the data set was filled with 90 binary variables without any influence on the outcome.

For each of the five scenarios we simulated 100 replicates. On each resulting training data set, we trained four different random forests consisting of 500 trees using ranger (Wright and Ziegler 2017) with bootstrapping and \(mtry = \sqrt{P} = 10\). Here, a minimal node size of 100 would represent the default value of 10\(\%\) of the number of samples in the training data set. Since the choice of this hyperparameter is assumed to have the biggest impact on the different metrics, we decided to use varying minimal node sizes of ten, 50, 100, and 200. With 200 we obtain relatively small trees, whose structure should be relatively easy to capture even by the simpler approaches, while a minimal node size of ten results in nearly fully grown, very complex trees. For every random forest, all four distance measures defined above were calculated, and the most representative trees were selected. Additionally, we trained single decision trees using rpart with the same minimal node sizes of ten, 50, 100, and 200.
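As an illustration of the simulation setup, the following R sketch generates data for scenario 1 and trains the four random forests with varying minimal node sizes. The variable names (e.g., sim_data) and the seed are our own; the exact implementation details of the original study are not reproduced here.

```r
library(ranger)

## Sketch of scenario 1 ("few large main effects")
set.seed(42)
P    <- 100
beta <- c(rep(2, 5), rep(0, P - 5))          # five effect variables with beta = 2

sim_data <- function(n) {
  x <- matrix(rbinom(n * P, size = 1, prob = 0.5), nrow = n)
  data.frame(y = as.vector(x %*% beta) + rnorm(n), x)
}
train <- sim_data(1000)   # for training the random forests
test  <- sim_data(100)    # for distance metrics that need independent data
valid <- sim_data(1000)   # for comparing tree and forest predictions

## Four forests with 500 trees, bootstrapping and mtry = sqrt(P) = 10
forests <- lapply(c(10, 50, 100, 200), function(mns)
  ranger(y ~ ., data = train, num.trees = 500, mtry = 10,
         min.node.size = mns, replace = TRUE))
```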

To assess the properties of the distance measures and resulting representative trees, we defined three different estimands with accompanying performance measures.

1. Representation of the random forest prediction:

    Here, we assess the ability of the most representative trees to represent the prediction of the entire ensemble. For each metric, we selected the most representative tree, i.e., the tree with the lowest mean distance to all other trees, and evaluated the similarity of its predictions to those of the entire random forest by

    $$\begin{aligned} MSE_{rf} = \frac{1}{1000} \sum _{n = 1}^{1000}({\hat{y}}_n - {\hat{y}}_n^*)^2, \end{aligned}$$

    where \({\hat{y}}_n\) is the prediction of the complete random forest on the validation data set, while \({\hat{y}}_n^*\) is the prediction of the selected most representative tree on the same data set.

2. Generalizability of most representative trees:

    Here, we evaluate the prediction performance of the most representative tree on a new independent data set. For each distance measure, we selected the most representative tree and compared its prediction accuracy on the validation data, measured by the mean squared error, with that of the entire random forest.

3. Fraction of covered effect variables:

    Ideally, the selected most representative tree includes all effect variables. Therefore, we estimated how many of the simulated effect variables were included at least once in the most representative tree. A code sketch of all three performance measures is given after this list.
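The three performance measures can be computed as follows. This sketch assumes a fitted ranger forest rf for scenario 1, the index best of the selected most representative tree, and the simulated validation data valid from the sketch above; helper names are our own.

```r
library(ranger)

## Assumptions: `rf` is a ranger forest for scenario 1, `best` is the index of
## the most representative tree, `valid` is the simulated validation data set.
pred_forest <- predict(rf, data = valid)$predictions
pred_tree   <- predict(rf, data = valid, predict.all = TRUE)$predictions[, best]

## 1. Representation of the random forest prediction (MSE_rf)
mse_rf <- mean((pred_forest - pred_tree)^2)

## 2. Generalizability: MSE of tree and forest against the true outcome
mse_tree_truth   <- mean((valid$y - pred_tree)^2)
mse_forest_truth <- mean((valid$y - pred_forest)^2)

## 3. Fraction of simulated effect variables covered by the tree
effect_vars <- paste0("X", 1:5)   # effect variables in scenario 1
used_vars   <- unique(na.omit(as.character(treeInfo(rf, tree = best)$splitvarName)))
coverage    <- mean(effect_vars %in% used_vars)
```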

Fig. 2 Representation of random forest predictions by single most representative trees selected based on four different distance measures, with respect to the minimal node size of the original random forest. Deviation of the representative tree prediction from the original random forest prediction was measured by the mean squared difference between the prediction of a most representative tree and the prediction of the complete random forest based on the same validation data set

3.2 Results on the representation of the random forest prediction

In Fig. 2, the results for the representation of the random forest prediction are shown for all five scenarios. In all settings, the predictions of the most representative trees become more similar to those of the entire forest with increasing minimal node sizes. This is related to the fact that a larger minimal node size leads to smaller trees, which partition the prediction space more coarsely, so that a single tree can cover the information of the forest more easily. Additionally, the prediction-based distance measure leads to the most similar predictions in all scenarios. This is not surprising, since this measure defines the tree whose predictions are most similar to those of all other trees as the most representative. All other distance measures lead to rather similar results, but the WSV approach is the best of the remaining three metrics. In general, most representative trees are more similar to the random forest than a decision tree that was trained on the same data set; however, in the scenario with many small main effects, the decision tree leads to remarkably better results than the most representative trees. The reason for this could be that a random forest is unable to cover such a complex model (when many small effects influence the outcome) or that a single representative tree is not able to cover these complex structures.

Fig. 3 Mean squared error of the prediction of the most representative tree selected by each method (color) on an independent test data set, with respect to the minimal node size of the original random forest

3.3 Results on the generalizability of most representative trees

The relative mean squared error of the most representative tree on a new validation data set is depicted in Fig. 3. Again, most representative trees are better than a decision tree in most scenarios, except when there are many small main effects in combination with large trees. In scenario one, the WSV approach leads to the best MSE when the minimal node size is small and the trees are therefore more complex. For minimal node sizes above one hundred, the distance measure using the clustering in the terminal nodes becomes slightly better, although it was systematically worse for larger trees. In contrast to this, the prediction-based distance measure leads to the best results of the representative trees in scenario two, where many small effects influence the outcome, while all other metrics lead to similar results that are slightly worse than those of the prediction-based metric. When correlated independent variables or interaction effects are in the data set (scenarios three and four), the WSV approach is again the best for random forests with a minimal node size of ten and the second best for higher minimal node sizes. In contrast to scenario one, this time the splitting variable approach is the best for higher minimal node sizes, but all approaches except the prediction-based one are comparatively good. Finally, when a mixture of binary and continuous variables has an effect on the outcome, the prediction-based approach leads to the best results when the minimal node size is ten, followed by the WSV approach. In all other cases, the WSV distance is again only the second best, while the splitting variable distance yields the best results.

Over all scenarios, it is firstly observable that the prediction-based approach is least influenced by the size of the trees in the random forest, because it is the only metric that does not incorporate any information on the internal tree structures, but only the resulting prediction. Secondly, the splitting variable approach nearly always has a slight advantage when small trees are built, because it is able to cover most of the underlying tree architecture with its rather simple approach. In such a scenario, the resulting trees consist of fewer splits, so that the splitting variable approach has a chance to cover all splits adequately. But when the trees become more complex and multiple splits on the same variable occur, our WSV metric is superior to the splitting variable approach because it is able to incorporate more of these complex structures. Finally, the clustering-based approach seems to have some problems with larger trees, which is probably caused by the relatively small test data set leading to a lower coverage of the terminal nodes when the trees are more complex.

3.4 Computation time

While the computation time for each metric was constant across the different settings, the methods differ drastically. Specifically, the splitting variable distance is the fastest and needs only 0.87 s to calculate, while the WSV metric is only slightly slower with 1.88 s. The prediction-based metric is also comparatively fast with 2.93 s, while the clustering-based approach is far slower, requiring 488 s to calculate all pairwise distances for 500 trees. Of note, the splitting variable approach and our WSV metric do not depend on an additional test data set, while the other two metrics do; thus, the computation time will increase linearly with the number of samples in the test data set for the prediction-based approach and quadratically for the clustering-based approach.

Fig. 4 Proportion of simulated effect variables that were covered by a single tree

3.5 Proportion of covered effect variables

Figure 4 depicts the mean proportion of the simulated effect variables that were included in the single generated trees. In scenario one, all trees cover all effect variables for a minimal node size of ten, but while the single decision tree still covers 90\(\%\) of the effect variables for a minimal node size of 200, representative trees selected based on the similarity of predictions drop below \(20\%\) for the same minimal node size. Overall, it is observable that the similarity of predictions leads to representative trees that usually cover the fewest effect variables, which indicates that these trees may be good for prediction but not for explanation. Also, the single decision trees seem to have a lower coverage for many small main effects, interaction effects and the combination of continuous and dichotomous variables. Again, the similarity of weighted splitting variables leads to the trees with the second best coverage, but no other method performs consistently better.

Fig. 5 Most representative tree selected by (A) our newly developed distance metric and (B) the splitting variable approach

4 Real data application

4.1 Data set

To apply our newly developed distance measure to real data, we used data from 351 genetically confirmed XDP patients, which were provided by the ProtectMove Research Unit (DFG: FOR2488). Participant enrollment and data analysis were approved by the Ethics Committees of the University of Lübeck, Germany, Massachusetts General Hospital, Boston, USA, Metropolitan Medical Center, Manila, Philippines, and Jose Reyes Medical Center, Manila, Philippines. XDP is a neurodegenerative movement disorder which combines symptoms of dystonia and Parkinson’s disease and is limited to the Philippines (Lee et al. 2011). It is inherited in an X-linked recessive manner and thus mainly affects men, who usually show first symptoms between the ages of 20 and 40. Whereas the genetic cause of XDP is well understood (Aneichyk et al. 2018; Rakovic et al. 2018), the aim of current research is to better understand the basis of the varying age at onset in the affected patients. Two recent studies (Bragg et al. 2017; Westenberger et al. 2019) revealed that \(50\%\) of the variability in age at onset can be explained by the repeat number of a hexanucleotide repeat within the \(\sim \)2.6-kb SINE-VNTR-Alu (SVA) retrotransposon insertion in intron 32 of TAF1 on the X chromosome. In our study population, the age at onset ranges between 21 and 67 years (mean = 41.7 years, standard deviation (sd) = 8.4 years) and the repeat number ranges between 30 and 55 units (mean = 41.6 units, sd = 4.1 units). For all patients, the age at onset was determined in a standardized interview with movement disorder specialists. For each observation, the data set includes not only age at onset and repeat number, but also a set of 82 additional single nucleotide polymorphisms (SNPs). Specifically, we selected three SNPs located in MSH3 and PMS2 which were recently identified to modify age at onset in XDP (Laabs et al. 2021). Additionally, we selected the SNPs with the lowest p-values in the genome-wide association study by Laabs et al. (2021) within nine different genes in the MSH3/PMS2 pathway. Since XDP patients show symptoms similar to dystonia (DYT) and Parkinson’s disease (PD), we also included the SNPs with the lowest p-values in 12 PD and 12 DYT genes as well as 36 PD-related SNPs (Nalls et al. 2019) and 8 DYT-related SNPs (Mok et al. 2014; Lohmann et al. 2014). Finally, two SNPs associated with Huntington’s disease (HD) were included (Moss et al. 2017; Chao et al. 2018), since Laabs et al. (2021) revealed a possible relationship between HD and XDP. Details on the collection and processing of the genetic data can be found in Laabs et al. (2021).

In total, this leads to a data set of 351 samples and 83 independent variables. Of note, the sample size of 351 seems relatively small in the context of a genetic analysis, but since Laabs et al. (2021) found strong genetic effects influencing the age at onset in XDP, we consider this sample size sufficient for a data set of the given dimension.

4.2 Real data results

We aimed to extract a single most representative tree for the given problem of modeling age at onset in XDP patients and to examine its tree structure. Therefore, we trained a single random forest on the complete available data set with 500 trees, \(mtry = P\) (so that at each node the best of all variables can be selected, leading to a bagged decision tree ensemble), bootstrapping and a minimal node size of 20\(\%\) of the training data set. From the resulting random forest we selected the most representative tree using the WSV metric (Fig. 5A). Here, the first four splits were all made on the SVA repeat number, separating the observations into groups with increasing repeat number. This is not surprising, since Westenberger et al. (2019) described a negative linear influence of the repeat number on the age at onset. In further concordance with Westenberger et al. (2019), the predicted age at onset was higher for the groups with smaller repeat number. Interestingly, for the group with the lowest repeat number (\(\le 36.5\) units), no further genetic variant was selected for splitting. Another surprising fact was that the three variants reported by Laabs et al. (2021) never occurred together in one branch of the tree. The variant rs33003 in MSH3 was only used for splitting in patients with lower repeat number (\(>36.5\) but \(\le 39.5\) units), while the other variant in MSH3 (rs245013) was selected for splitting in patients with higher repeat number (\(>39.5\) but \(\le 43.5\) units). Finally, the variant in PMS2 (rs62456190) was of importance for patients with repeat numbers between 43.5 and 50.5 units. For the patients with the highest repeat number (\(>50.5\) units), again no additional genetic information was used besides the SVA repeat number. After the split on rs245013, the tree uses variants not identified by Laabs et al. (2021) for the first and only time. Here, we can see that being homozygous for the alternative allele of rs11060180, a variant in the CCDC62 gene which was previously associated with Parkinson’s disease, was associated with a strong decrease in age at onset, making it a potential risk factor. The fact that all known SNPs associated with the age at onset of XDP were selected in different branches of the tree indicates that they tag modifiers that do not interact with each other, but act individually. Additionally, it is likely that each of the tagged modifiers interacts with the repeat number.
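Schematically, the forest training and tree selection described above could be reproduced as follows; xdp_data is an assumed data frame with the outcome age_at_onset and the 83 predictors (names are hypothetical), and d_WSV is the sketch from Sect. 2.3.

```r
library(ranger)

## Assumed: `xdp_data` is a data frame with outcome `age_at_onset` and the
## 83 genetic predictors; variable names here are hypothetical.
p  <- ncol(xdp_data) - 1
rf <- ranger(age_at_onset ~ ., data = xdp_data, num.trees = 500,
             mtry = p, replace = TRUE,
             min.node.size = round(0.2 * nrow(xdp_data)))

## Pairwise WSV distances (d_WSV as sketched in Sect. 2.3) and tree selection;
## precomputing the usage scores per tree would be faster than this naive loop.
dist_mat <- outer(1:rf$num.trees, 1:rf$num.trees,
                  Vectorize(function(i, j) d_WSV(rf, i, j)))
best <- which.min(rowSums(dist_mat) / (rf$num.trees - 1))
```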

In contrast, the most representative tree selected based on the splitting variable approach (Fig. 5B) is far smaller. Here, again the first four splits were made on the SVA repeat number, but after that, only the first variant from the GWAS (rs33003) occurs in a single branch. The other metrics could not be computed for this example due to the lack of a test data set.

5 Discussion

Our work aimed to develop a new tree-based distance measure capable of using the underlying tree architecture to estimate the similarity of decision trees. This is necessary to select most representative trees from random forests, which have considerable potential for the visualization and interpretation of random forests and may facilitate the incorporation of machine learning into clinical practice. Therefore, we constructed a usage score which estimates how a given tree uses a specific variable by increasing the score for each occurrence of the variable in the tree, weighted by the level of the tree at which it occurs. Thus, the earlier a variable is selected for splitting, the higher the score. Our distance metric then compares the usage scores of all variables between two trees and sums their squared differences.

To investigate the validity of the WSV metric in comparison with other distance measures, we designed an extensive simulation study in which we considered five different data scenarios. In these, we compared the prediction of the most representative tree with the prediction of the complete random forest. Here, the prediction-based metric by Banerjee et al. (2012), which requires an additional test data set for estimating the distance, performed best, but the WSV metric performed second best in most scenarios. Additionally, we compared the prediction of the most representative tree with the known true outcome of the observations in an additional validation data set. In this case, the prediction-based distance metric leads to the best results when the minimal node size is very small and the trees are therefore more complex. In all other cases, the distance measures using the splitting variables or the clustering in terminal nodes lead to the best results. Of note, the WSV approach was again the second best approach in most of the cases.

Although the most representative trees selected by the WSV metric are only the second best in nearly all scenarios, they showed the most robust results, while each of the other metrics sometimes led to the best and sometimes to the worst results. Another advantage of the WSV approach is that, unlike the prediction- and clustering-based metrics, it does not rely on an additional data set to estimate distances, making it valuable for situations in which no additional data are available. Of note, the additional test data sets generated in our simulation study used exactly the same generating mechanism as the original training data set. In practice, it can be expected that an additional data set follows a slightly different generating mechanism, which would be expected to decrease the explanatory power of most representative trees that rely on such data sets. Finally, WSV is comparatively fast and reliably leads to the trees that include the most effect variables.

One limitation of our simulation study could be the selection of a relatively small sample size (n = 100) for the test data set, which could be a disadvantage for metrics using these data. However, we assume that the limited general availability of independent test data and the dramatic increase in computation time with larger test data sets will hinder the application of these methods overall.

In an application to real data predicting the age at onset of XDP patients, we were able to replicate the results of Westenberger et al. (2019) and Laabs et al. (2021) using our new WSV metric, while the replication of all results from Laabs et al. (2021) was not possible using the splitting variable approach. As a major advantage, our approach takes the position of a variable within the tree into account and is, therefore, better able to mirror the tree architecture. However, it still does not incorporate the sequence or interaction of variables. Thus, further studies investigating how the WSV distance metric may be improved in a time-efficient way are warranted.

Overall, the impact of our method is two-fold. On the one hand, the improvement in the selection of most representative trees is a contribution to explainable machine learning, by gaining more insight into random forests and improving their interpretability. On the other hand, a further application of our distance metric could be the optimization of random forests. The distances could, for example, be used to identify outlying trees that potentially reduce the overall prediction performance.

The general method proposed by Banerjee et al. (2012) is only meant to select a single tree from the ensemble. Selecting multiple trees would very likely result in a selection of rather similar trees. However, it is very likely that a single tree cannot reflect the diversity of a large random forest. Therefore, future research should focus on the selection of multiple trees while maintaining the diversity of the selected trees.

In summary, we expect that the improvement in the calculation of pairwise tree distances is a further step towards improving the interpretability and therefore the practical use of random forests.