Detecting treatment-subgroup interactions in clustered data with generalized linear mixed-effects model trees

Fokkema, M.; Smits, N.; Zeileis, A.; Hothorn, T.; Kelderman, H.

doi:10.3758/s13428-017-0971-x

Open access
Published: 25 October 2017

Volume 50, pages 2016–2034 (2018)

Behavior Research Methods

M. Fokkema ORCID: orcid.org/0000-0002-9252-8325¹,
N. Smits²,
A. Zeileis³,
T. Hothorn⁴ &
…
H. Kelderman¹

Abstract

Identification of subgroups of patients for whom treatment A is more effective than treatment B, and vice versa, is of key importance to the development of personalized medicine. Tree-based algorithms are helpful tools for the detection of such interactions, but none of the available algorithms allow for taking into account clustered or nested dataset structures, which are particularly common in psychological research. Therefore, we propose the generalized linear mixed-effects model tree (GLMM tree) algorithm, which allows for the detection of treatment-subgroup interactions, while accounting for the clustered structure of a dataset. The algorithm uses model-based recursive partitioning to detect treatment-subgroup interactions, and a GLMM to estimate the random-effects parameters. In a simulation study, GLMM trees show higher accuracy in recovering treatment-subgroup interactions, higher predictive accuracy, and lower type II error rates than linear-model-based recursive partitioning and mixed-effects regression trees. Also, GLMM trees show somewhat higher predictive accuracy than linear mixed-effects models with pre-specified interaction effects, on average. We illustrate the application of GLMM trees on an individual patient-level data meta-analysis on treatments for depression. We conclude that GLMM trees are a promising exploratory tool for the detection of treatment-subgroup interactions in clustered datasets.

Introduction

In research on the efficacy of treatments for somatic and psychological disorders, the one-size-fits-all paradigm is slowly losing ground, and personalized or stratified medicine is becoming increasingly important. Stratified medicine presents the challenge of discovering which patients respond best to which treatments. This can be referred to as the detection of treatment-subgroup interactions (e.g., Doove, Dusseldorp, Van Deun, & Van Mechelen, 2014). Often, treatment-subgroup interactions are studied using linear models, such as factorial analysis of variance techniques, in which potential moderators have to be specified a priori, have to be checked one at a time, and continuous moderator variables have to be discretized. This may hamper identification of which treatment works best for whom, especially when there are no a priori hypotheses about treatment-subgroup interactions. As noted by Kraemer, Frank, and Kupfer (2006), there is a need for methods that generate instead of test such hypotheses.

Tree-based methods are such hypothesis-generating methods. Tree-based methods, also known as recursive partitioning methods, split observations repeatedly into groups so that they become increasingly similar with respect to the outcome within each group. Several tree-based methods take the mean of a continuous dependent variable or the majority class of a categorical dependent variable as the outcome, one of the earliest and most well-known examples being the classification and regression tree (CART) approach of Breiman, Friedman, Olshen, and Stone (1984). Other tree-based methods take the estimated parameters of a more complex model, of which the RECPAM (recursive partition and amalgamation) approach of Ciampi (1991) is the earliest example.

Due to the recursive nature of the splitting, the rectangular regions of a recursive partition can be graphically depicted as nodes in a decision tree, as shown in the artificial example in Fig. 1. The partition in Fig. 1 is rather simple, based on the values of two predictor variables: duration and anxiety. The resulting tree has a depth of two, as the longest path travels along two splits. Each of the splits in the tree is defined by a splitting variable and value. The first split in the tree separates the observations into two subgroups, based on the duration variable and a splitting value of eight, yielding two rectangular regions, represented by node 2 and node 5. Node 2 is an inner node, as the observations in this node are further split into terminal nodes 3 and 4, based on the anxiety variable. The observations in node 5 are not further split and this is therefore a terminal node.

If the partition in Fig. 1 were used for prediction of a new observation, the new observation would be assigned to one of the terminal nodes according to its values on the splitting variables. The prediction is then based on the estimated distribution of the outcome variable within that terminal node. For example, the prediction may be the node-specific mean of a single continuous variable. In the current paper, we focus on trees where the terminal nodes consist of a linear (LM) or generalized linear model (GLM), in which case the predicted value for a new observation is determined by the node-specific parameter estimates of the (G)LM, while also adjusting for random effects.

Tree-based methods are particularly useful for exploratory purposes because they can handle many potential predictor variables at once and can automatically detect (higher-order) interactions between predictor variables (Strobl, Malley, & Tutz, 2009). As such, they are preeminently suited to the detection of treatment-subgroup interactions. Several tree-based algorithms for the detection of treatment-subgroup interactions have already been developed (Dusseldorp, Doove, & Van Mechelen, 2016; Dusseldorp & Meulman, 2004; Su, Tsai, Wang, Nickerson, & Li, 2009; Foster, Taylor, & Ruberg, 2011; Lipkovich, Dmitrienko, Denne, & Enas, 2011; Zeileis, Hothorn, & Hornik, 2008; Seibold, Zeileis, & Hothorn, 2016; Athey & Imbens 2016). Also, Zhang, Tsiatis, Laber, and Davidian (2012b) and Zhang, Tsiatis, Davidian, Zhang, and Laber (2012a) have developed a flexible classification-based approach, allowing users to select from a range of statistical methods, including trees.

In many instances, researchers may want to detect treatment-subgroup interactions in clustered or nested datasets, for example in individual-level patient data meta-analyses, where datasets of multiple clinical trials on the same treatments are pooled. In such analyses, the nested or clustered structure of the dataset should be taken into account by including study-specific random effects in the model, prompting the need for a mixed-effects model (e.g., Cooper & Patall 2009; Higgins, Whitehead, Turner, Omar, & Thompson, 2001). In linear models, ignoring the clustered structure may lead, for example, to biased inference due to underestimated standard errors (e.g., Bryk & Raudenbush, 1992). For tree-based methods, ignoring the clustered structure has been found to result in the detection of spurious subgroups and inaccurate predictor variable selection (e.g., Sela & Simonoff, 2012; Martin, 2015). However, none of the purely tree-based methods for treatment-subgroup interaction detection allow for taking into account the clustered structure of a dataset. Therefore, in the current paper, we present a tree-based algorithm that can be used for the detection of interactions and non-linearities in GLMM-type models: generalized linear mixed-effects model trees, or GLMM trees.

The GLMM tree algorithm builds on model-based recursive partitioning (MOB, Zeileis et al., 2008), which offers a flexible framework for subgroup detection. For example, GLM-based MOB has been applied to detect treatment-subgroup interactions for the treatment of depression (Driessen et al., 2016) and amyotrophic lateral sclerosis (Seibold et al., 2016). In contrast to other purely tree-based methods (e.g., Zeileis et al., 2008; Su et al., 2009; Dusseldorp et al., 2016), GLMM trees allow for taking into account the clustered structure of datasets. In contrast to previously suggested regression trees with random effects (e.g., Hajjem, Bellavance, & Larocque, 2011; Sela & Simonoff, 2012), GLMM trees allow for treatment effect estimation, with continuous as well as non-continuous response variables.

The remainder of this paper is structured into four sections: In the first section, we introduce the GLMM tree algorithm using an artificial motivating dataset with treatment-subgroup interactions. In the second section, we compare the performance of GLMM trees with that of three other methods: MOB trees without random effects, mixed-effects regression trees (MERTs) and linear mixed-effects models with pre-specified interactions. In the third section, we apply the GLMM tree algorithm to an existing dataset of a patient-level meta-analysis on the effects of psycho- and pharmacotherapy for depression. In the fourth and last section, we summarize the results and discuss limitations and directions for future research. In the Appendix, we provide a glossary explaining abbreviations and mathematical notation used in the current paper. Finally, a tutorial on how to fit GLMM trees using the R package glmertree is included as supplementary material. In the tutorial, the artificial motivating dataset is used, allowing users to recreate the trees and models to be fitted in the next section.

GLMM tree algorithm

Artificial motivating dataset

We will use an artificial motivating dataset with treatment-subgroup interactions to introduce the GLMM tree algorithm. This dataset consists of a set of observations on N = 150 patients, who were randomly assigned to one of two treatment alternatives (Treatment 1 or Treatment 2). The treatment outcome is represented by the variable depression, quantifying post-treatment depressive symptomatology. The potential moderator variables are duration, age, and anxiety. Duration reflects the number of months the patient has been suffering from depression prior to treatment, age reflects patients’ age in years at the start of treatment, and anxiety reflects patients’ total scores on an anxiety inventory administered before treatment. Summary statistics of these variables are provided in Table 1. Each patient was part of one of ten clusters, each having a different value for the random intercept, which were generated from a standard normal distribution and uncorrelated with the partitioning variables.

Table 1 Summary statistics for partitioning and outcome variables in the artificial motivating dataset

The outcome variable was generated such that there are three subgroups with differential treatment effectiveness, corresponding to the terminal nodes in Fig. 1: For the first subgroup of patients (node 3) with short duration (≤ 8 months) and low anxiety scores (≤ 10), Treatment 1 leads to lower post-treatment depression than Treatment 2 (true mean difference = 2). For the second subgroup of patients (node 4) with short duration but high anxiety scores (> 10), post-treatment depression is about equal in both treatment conditions (true mean difference = 0). For the third subgroup of patients (node 5) with long duration (> 8 months), Treatment 2 leads to lower post-treatment depression than Treatment 1 (true mean difference = −2.5). Thus, duration and anxiety are true partitioning or moderator variables, whereas age is not. Anticipating the final results of our analyses, the treatment-subgroup interactions are depicted in Fig. 4, which shows the GLMM tree that accurately recovered the treatment-subgroup interactions.
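The subgroup structure just described can be written down as a small lookup function. This is only a sketch: the function name is hypothetical, and the sign convention (mean of Treatment 2 minus mean of Treatment 1) is inferred from the reported differences rather than stated explicitly in the text.

```python
def true_treatment_difference(duration, anxiety):
    """True mean difference in post-treatment depression for one patient.

    Assumed sign convention: mean under Treatment 2 minus mean under
    Treatment 1, so a positive value favours Treatment 1.
    """
    if duration <= 8:
        if anxiety <= 10:
            return 2.0    # node 3: short duration, low anxiety
        return 0.0        # node 4: short duration, high anxiety
    return -2.5           # node 5: long duration


# Age is deliberately not an argument: it is not a true moderator.
print(true_treatment_difference(duration=6, anxiety=9))
```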

Model-based recursive partitioning

The rationale behind MOB is that a single global GLM (or other parametric model) may not describe the data well, and when additional covariates are available it may be possible to partition the dataset with respect to these covariates, and find better-fitting models in each cell of the partition. For example, to assess the effect of treatment, we may first fit a global GLM where the treatment indicator has the same effect/coefficient on the outcome for all observations. Subsequently, the data may be partitioned recursively with respect to other covariates, leading to separate models with different treatment effects/coefficients in each subsample.

More formally, in a single global GLM, the expectation $\mu_{i}$ of outcome $y_{i}$ given the treatment regressor $x_{i}$ is modeled through a linear predictor and suitable link function:

$$\begin{array}{@{}rcl@{}} E[y_{i} | x_{i}] & = & \mu_{i}, \end{array} $$

(1)

$$\begin{array}{@{}rcl@{}} g(\mu_{i}) & = & x_{i}^{\top}\beta, \end{array} $$

(2)

where $x_{i}^{\top}\beta$ is the linear predictor for observation i and g is the link function. β is a vector of fixed-effects regression coefficients. For simplicity, in the current paper we focus on two treatment groups and no further covariates in the GLM, so that in our illustrations $x_{i}$ and β both have length 2. For the continuous response variable in the motivating data set, we employ the identity link function and assume a normal distribution for the error (denoted by $\epsilon_{i} = y_{i} - \mu_{i}$) with mean zero and variance $\sigma_{\epsilon}^{2}$. Thus, the first element of β corresponds to the mean of the linear predictor in the first treatment group and the second element corresponds to the mean difference in the linear predictor between the first and second treatment groups. However, the model can easily accommodate additional treatment conditions and covariates, as well as binary or count/Poisson outcome variables.
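For the identity-link case with a single 0/1 treatment dummy, the least-squares estimate of β reduces to group means, which makes the interpretation of its two elements concrete. The following is a minimal sketch; the function name and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
def fit_treatment_lm(y, treatment):
    """Least-squares fit of y on an intercept and a 0/1 treatment dummy.

    With one dummy regressor the solution is available in closed form:
    beta0 is the mean in group 0, beta1 the difference between group means.
    """
    group0 = [yi for yi, t in zip(y, treatment) if t == 0]
    group1 = [yi for yi, t in zip(y, treatment) if t == 1]
    beta0 = sum(group0) / len(group0)
    beta1 = sum(group1) / len(group1) - beta0
    residuals = [yi - (beta0 + beta1 * t) for yi, t in zip(y, treatment)]
    # Unbiased estimate of the error variance (two estimated coefficients).
    sigma2 = sum(e * e for e in residuals) / (len(y) - 2)
    return beta0, beta1, sigma2


beta0, beta1, sigma2 = fit_treatment_lm([10.0, 12.0, 7.0, 9.0], [0, 0, 1, 1])
print(beta0, beta1, sigma2)
```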

Obviously, such a simple, global GLM will not fit the data well, especially in the presence of moderators. For expository purposes, however, we take it as a starting point to illustrate MOB. The global GLM fitted to the motivating example dataset is depicted in Fig. 2. As the boxplots show, there is little difference between the global effects of the two treatments and there is considerable residual variance.

The MOB algorithm can be used to partition the dataset using additional covariates and find better-fitting local models. To this end, the MOB algorithm tests for parameter stability with respect to each of a set of auxiliary covariates, also called partitioning variables, which we will denote by U. When the partitioning is based on a GLM, instabilities are differences in $\hat {\beta }$ across partitions of the dataset, which are defined by one or more auxiliary covariates not included in the linear predictor. To find these partitions, the MOB algorithm cycles iteratively through the following steps (Zeileis et al., 2008): (1) fit the parametric model to the dataset, (2) statistically test for parameter instability with respect to each of a set of partitioning variables, (3) if there is some overall parameter instability, split the dataset with respect to the variable associated with the highest instability, (4) repeat the procedure in each of the resulting subgroups.

In step (2) a test statistic quantifying parameter instability is calculated for every potential partitioning variable. As the distribution of these test statistics under the null hypothesis of parameter stability is known, a p value for every partitioning variable can be calculated. Note that a more in-depth discussion of the parameter stability tests is beyond the scope of this paper, but can be found in Zeileis and Hornik (2007) and Zeileis et al. (2008).

If at least one of the partitioning variables yields a p value below the pre-specified significance level α, the dataset is partitioned into two subsets in step (3). This partition is created using $U_{k^{*}}$, the partitioning variable with the minimal p value in step (2). The split point for $U_{k^{*}}$ is selected by taking the value that minimizes the instability as measured by the sum of the values of two loss functions, one for each of the resulting subgroups. In other words, the loss function is minimized separately in the two subgroups resulting from every possible split point and the split point yielding the minimum sum of the loss functions is selected. In step (4), steps (1) through (3) are repeated in each partition, until the null hypothesis of parameter stability can no longer be rejected (or the subsets become too small).
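The split-point search in step (3) can be sketched as an exhaustive scan over candidate values. The sketch below assumes a Gaussian model with a 0/1 treatment dummy, so that the residual sum of squares serves as the loss; it omits the parameter stability tests of step (2) entirely, and the function names and the `min_size` argument are hypothetical.

```python
def node_loss(y, treatment):
    """Residual sum of squares of a treatment-dummy linear model in one node."""
    loss = 0.0
    for level in (0, 1):
        group = [yi for yi, t in zip(y, treatment) if t == level]
        if group:
            mean = sum(group) / len(group)
            loss += sum((yi - mean) ** 2 for yi in group)
    return loss


def best_split(u, y, treatment, min_size=2):
    """Return the split value on u that minimizes the summed loss of both subgroups."""
    best_value, best_loss = None, float("inf")
    for value in sorted(set(u))[:-1]:          # candidate rule: u <= value
        left = [i for i, ui in enumerate(u) if ui <= value]
        right = [i for i, ui in enumerate(u) if ui > value]
        if len(left) < min_size or len(right) < min_size:
            continue
        # The model is refitted separately in each subgroup; losses are added.
        loss = (node_loss([y[i] for i in left], [treatment[i] for i in left])
                + node_loss([y[i] for i in right], [treatment[i] for i in right]))
        if loss < best_loss:
            best_value, best_loss = value, loss
    return best_value, best_loss
```

In this toy version the treatment effect is refitted from scratch for every candidate value, which is simple but quadratic in the number of observations.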

The partition resulting from application of MOB can be depicted as a decision tree. If the partitioning is based on a GLM, the result is a GLM tree, with a local fixed-effects regression model in every j-th (j = 1,…,J) terminal node:

$$ g(\mu_{ij}) = x_{i}^{\top}\beta_{j} $$

(3)

To illustrate, we fitted a GLM tree on the artificial motivating dataset. In addition to the treatment indicator and treatment outcome used to fit the earlier GLM, we specified the anxiety, duration and age variables as potential partitioning variables. Figure 3 shows the resulting GLM tree. MOB partitioned the observations into four subgroups, each with a different estimate $\beta_{j}$. Age was correctly not identified as a partitioning variable and the left- and rightmost nodes are in accordance with the true treatment-subgroup interactions described above. However, the two nodes in the middle represent an unnecessary split and thus do not represent true subgroups, possibly due to the dependence of observations within clusters not being taken into account.

Including random effects

For datasets containing observations from multiple clusters (e.g., trials or research centers), application of a mixed-effects model would be more appropriate. The GLM in Eq. 2 is then extended to include cluster-specific, or random effects:

$$ g(\mu_{i}) = x_{i}^{\top}\beta + z_{i}^{\top}b $$

(4)

For a random-intercept only model, $z_{i}$ is a unit vector of length M, of which the m-th element takes a value of 1, and all other elements take a value of 0; m (m = 1,…,M) denotes the cluster which observation i is part of. Further, b is a random vector of length M, each m-th element corresponding to the random intercept for cluster m. For simplicity, we employ a cluster-specific intercept only, but further random effects can easily be included in $z_{i}$. Furthermore, within the GLMM it is assumed that b is normally distributed, with mean zero and variance $\sigma_{b}^{2}$ and that the errors $\epsilon$ have constant variance across clusters. The parameters of the GLMM can be estimated with, for example, maximum likelihood (ML) or restricted ML (REML).
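As a small worked illustration of the random-intercept term (the values M = 3 and m = 2 are chosen here purely for illustration and do not come from the paper), an observation i that belongs to the second of three clusters has

$$ z_{i} = (0, 1, 0)^{\top}, \qquad b = (b_{1}, b_{2}, b_{3})^{\top}, \qquad z_{i}^{\top}b = b_{2}, $$

so that Eq. 4 reduces to $g(\mu_{i}) = x_{i}^{\top}\beta + b_{2}$ for this observation.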

Although the random-effects part of the GLMM in Eq. 4 accounts for the nested structure of the dataset, the global fixed-effects part $x_{i}^{\top}\beta$ may not describe the data well. Therefore, we propose the GLMM tree model, in which the fixed-effects part may be partitioned as in Eq. 3 while still adjusting for random effects:

$$ g(\mu_{ij}) = x_{i}^{\top}\beta_{j} + z_{i}^{\top}b $$

(5)

In the GLMM tree model, the fixed effects $\beta_{j}$ are local parameters, their value depending on terminal node j, but the random effects b are global. To estimate the parameters of this model, we take an approach similar to that of the mixed-effects regression tree (MERT) approach of Hajjem et al. (2011) and Sela and Simonoff (2012). In the MERT approach, the fixed-effects part of a GLMM is replaced by a CART tree with constant fits in the nodes, and the random-effects parameters are estimated as usual. To estimate a MERT, an iterative approach is taken, alternating between (1) assuming random effects known, allowing for estimation of the CART tree, and (2) assuming the CART tree known, allowing for estimation of the random-effects parameters.

For estimating GLMM trees, we take this approach two steps further: (1) Instead of a CART tree with constant fits to estimate the fixed-effects part of the GLMM, we use a GLM tree. This allows not only for detection of differences in intercepts across terminal nodes but also for detection of differences in slopes such as treatment effects. (2) By using generalized linear (mixed) models, the response may also be a binary or count variable instead of a continuous variable. The GLMM tree algorithm takes the following steps to estimate the model in Eq. 5:

Step 0: Initialize by setting $r$ and all values $\hat{b}_{(r)}$ to 0.
Step 1: Set $r = r + 1$. Estimate a GLM tree using $z_{i}^{\top}\hat{b}_{(r-1)}$ as an offset.
Step 2: Fit the mixed-effects model $g(\mu_{ij}) = x_{i}^{\top}\beta_{j} + z_{i}^{\top}b$ with terminal node $j(r)$ from the GLM tree estimated in Step 1. Extract posterior predictions $\hat{b}_{(r)}$ from the estimated model.
Step 3: Repeat Steps 1 and 2 until convergence.

The algorithm initializes by setting b to 0, since the random effects are initially unknown. In every iteration, the GLM tree is re-estimated in step (1) and the fixed- and random-effects parameters are re-estimated in step (2). Note that the random effects are not partitioned, but estimated globally. Only the fixed effects are estimated locally, within the cells of the partition. Convergence of the algorithm is monitored by computing the log-likelihood criterion of the mixed-effects model in Eq. 5. Typically, this converges if the tree does not change from one iteration to the next.
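The alternation between Steps 1 and 2 can be illustrated with a deliberately simplified toy version. It is not the glmertree implementation and rests on several assumptions: the "tree" is a single split on one partitioning variable chosen without stability tests, the outcome is Gaussian, and the mixed-model fit in Step 2 is replaced by a shrinkage solution that treats the ratio of error variance to random-intercept variance (the hypothetical argument `ratio`) as known instead of estimating it by (RE)ML. Convergence is declared when the split no longer changes, not by monitoring the log-likelihood.

```python
def cell_means(values, keys):
    """Mean of the values within each distinct key."""
    sums, counts = {}, {}
    for v, k in zip(values, keys):
        sums[k] = sums.get(k, 0.0) + v
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}


def fit_stump(u, y, treatment, min_size=2):
    """Stand-in for the GLM tree: one split on u, treatment-specific means per node."""
    best = None
    for value in sorted(set(u))[:-1]:
        nodes = [int(ui > value) for ui in u]
        if min(nodes.count(0), nodes.count(1)) < min_size:
            continue
        keys = list(zip(nodes, treatment))
        means = cell_means(y, keys)
        loss = sum((yi - means[k]) ** 2 for yi, k in zip(y, keys))
        if best is None or loss < best[0]:
            best = (loss, value, nodes)
    if best is None:
        raise ValueError("no admissible split")
    return best[1], best[2]


def glmm_tree_sketch(u, y, treatment, cluster, ratio=1.0, max_iter=20, sweeps=25):
    b_hat = {m: 0.0 for m in set(cluster)}                  # Step 0
    sizes = {m: cluster.count(m) for m in b_hat}
    split, fixed = None, {}
    for _ in range(max_iter):
        # Step 1: re-estimate the tree, using the current random effects as offset.
        adjusted = [yi - b_hat[m] for yi, m in zip(y, cluster)]
        new_split, nodes = fit_stump(u, adjusted, treatment)
        keys = list(zip(nodes, treatment))
        # Step 2: with the nodes fixed, alternate between node-specific fixed
        # effects and shrunken cluster intercepts (known variance ratio assumed).
        for _sweep in range(sweeps):
            adjusted = [yi - b_hat[m] for yi, m in zip(y, cluster)]
            fixed = cell_means(adjusted, keys)
            totals = {m: 0.0 for m in b_hat}
            for yi, k, m in zip(y, keys, cluster):
                totals[m] += yi - fixed[k]
            b_hat = {m: totals[m] / (sizes[m] + ratio) for m in b_hat}
        if new_split == split:                               # Step 3
            break
        split = new_split
    return split, fixed, b_hat
```

The returned `fixed` dictionary is keyed by (node, treatment), so the node-specific treatment effect is the difference between the two entries of a node. As in the algorithm described above, the random intercepts are estimated globally and only the fixed effects are node-specific.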

In Fig. 4, the result of applying the GLMM tree algorithm to the motivating dataset is presented. In addition to the treatment indicator, treatment outcome and the potential partitioning variables, the GLMM tree algorithm has also taken a random intercept with respect to the cluster indicator into account. As a result, the dependence between observations is taken into account, the true treatment subgroups have been recovered and the spurious split involving the anxiety variable no longer appears in the tree.

Simulation-based evaluation

To assess the performance of GLMM trees, we carried out three simulation studies: In Study I we assessed and compared the accuracy of GLMM trees, linear-model based MOB (LM trees) and mixed-effects regression trees (MERTs) in datasets with treatment-subgroup interactions. In Study II, we assessed and compared the type I error of GLMM trees and linear-model based MOB in datasets without treatment-subgroup interactions. In Study III, we assessed and compared the performance of GLMM trees and linear mixed-effects models (LMMs) with pre-specified interactions in datasets with piecewise and continuous interactions. As the outcome variable was continuous in all simulated datasets, the GLMM tree algorithm and trees resulting from its application will be referred to as LMM tree(s).

General simulation design

In all simulation studies, the following data-generating parameters were varied:

1. Sample size: N = 200, N = 500, N = 1000.
2. Number of potential partitioning covariates $U_{1}$ through $U_{K}$: K = 5 and K = 15.
3. Intercorrelation between the potential partitioning covariates $U_{1}$ through $U_{K}$: $\rho _{U_{k},U_{k^{\prime }}}=0.0$, $\rho _{U_{k},U_{k^{\prime }}}=0.3$.
4. Number of clusters: M = 5, M = 10, M = 25.
5. Population standard deviation (SD) of the normal distribution from which the cluster-specific intercepts were drawn: $\sigma_{b} = 0$, $\sigma_{b} = 5$, $\sigma_{b} = 10$.
6. Intercorrelation between b and one of the $U_{k}$ variables: b and all $U_{k}$ covariates uncorrelated, b correlated with one of the $U_{k}$ covariates (r = .42).

Following the approach of Dusseldorp and Van Mechelen (2014), all partitioning covariates $U_{1}$ through $U_{K}$ were drawn from a multivariate normal distribution with means $\mu_{U_{1}} = 10$, $\mu_{U_{2}} = 30$, $\mu_{U_{4}} = -40$, and $\mu_{U_{5}} = 70$. The means of the other potential partitioning covariates ($U_{3}$ and, depending on the value of K, also $U_{6}$ through $U_{15}$) were drawn from a discrete uniform distribution on the interval [−70,70]. All covariates $U_{1}$ through $U_{15}$ had the same standard deviation: $\sigma_{U_{k}} = 10$.

To generate the cluster-specific intercepts, we partitioned the sample into M equally sized clusters, conditional on one of the variables $U_{1}$ through $U_{5}$, producing the correlations in the sixth facet of the simulation design. For each cluster, a single value $b_{m}$ was drawn from a normal distribution with mean 0 and the value of $\sigma_{b}$ given by the fifth facet of the simulation design. If b was correlated with one of the potential partitioning variables, the correlated variable was randomly selected.

For every observation, we generated a binomial variable (with probability 0.5) as an indicator for treatment type. Random errors $\epsilon$ were drawn from a normal distribution with $\mu_{\epsilon} = 0$ and $\sigma_{\epsilon} = 5$. The value of the outcome variable $y_{i}$ was calculated as the sum of the random intercept, (node-specific) fixed effects and the random error term.
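The way the outcome is assembled from its three components can be sketched as follows. This is a reduced illustration under stated assumptions, not the simulation code of the study: only one covariate is generated (with the mean and SD given above for the second covariate), clusters are assigned by position and therefore uncorrelated with the covariate, and `fixed_effect` is a hypothetical user-supplied function standing in for the node-specific fixed effects.

```python
import random


def simulate_outcomes(n, n_clusters, sigma_b, fixed_effect, sigma_eps=5.0, seed=1):
    """Generate y = random intercept + fixed effect + error for n observations."""
    rng = random.Random(seed)
    # One intercept per cluster, drawn from N(0, sigma_b).
    b = [rng.gauss(0.0, sigma_b) for _ in range(n_clusters)]
    rows = []
    for i in range(n):
        m = i % n_clusters                  # equally sized clusters if n_clusters divides n
        u = rng.gauss(30.0, 10.0)           # a single illustrative covariate
        t = int(rng.random() < 0.5)         # treatment indicator, probability 0.5
        eps = rng.gauss(0.0, sigma_eps)     # random error
        rows.append({"cluster": m, "u": u, "treatment": t,
                     "y": b[m] + fixed_effect(u, t) + eps})
    return rows, b


# Example: a treatment effect of 2.5 that does not depend on the covariate.
rows, b = simulate_outcomes(200, 5, sigma_b=5.0, fixed_effect=lambda u, t: 2.5 * t)
print(len(rows), len(b))
```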

Due to the large number of cells in the simulation design, the most important predictors of accuracy were determined by means of ANOVAs and/or GLMs. The most important predictors of accuracy were then assessed through graphical displays. The ANOVAs and GLMs included main effects of algorithm type and the parameters of the data-generating process, as well as first-order interactions between algorithm type and each of the data-generating parameters.

Software

R (R Core Team, 2016) was used for data generation and analyses. The partykit package (version 1.0-2; Hothorn & Zeileis, 2015, 2016) was employed for estimating LM trees, using the lmtree function. For estimation of LMM trees, the lmertree function of the glmertree package (version 0.1-0; Fokkema & Zeileis, 2016; available from R-Forge) was used. The significance level α for the parameter instability tests was set to 0.05 for all trees, with a Bonferroni correction applied for multiple testing. The latter adjusts the p values of the parameter stability tests by multiplying these by the number of potential partitioning variables. The minimum number of observations per node in trees was set to 20 and maximum tree depth was set to three, thus limiting the number of terminal nodes to eight in every tree.
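The Bonferroni adjustment described above amounts to a one-line computation, sketched here outside of R. The function name is hypothetical, and capping the adjusted values at 1 is a common convention assumed here, not something stated in the text.

```python
def bonferroni_adjust(p_values):
    """Multiply each p value by the number of partitioning variables tested."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]   # capping at 1 is an assumption


alpha = 0.05
adjusted = bonferroni_adjust([0.004, 0.03, 0.5, 0.2, 0.9])   # K = 5 variables
# A split is attempted only if the smallest adjusted p value is below alpha.
print(min(adjusted) < alpha)

# A binary tree of depth three has at most 2 ** 3 terminal nodes.
print(2 ** 3)
```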

The REEMtree package (version 0.9.3; Sela & Simonoff, 2011) was employed for estimating MERTs, using default settings. For estimating LMMs, the lmer function from the lme4 package (version 1.1-7; Bates, Mächler, Bolker, & Walker, 2015; Bates et al., 2017) was employed, using restricted maximum likelihood (REML) estimation. The lmerTest package (version 2.0-32; Kuznetsova, Brockhoff, & Christensen, 2016) was used to assess statistical significance of fixed-effects predictors in LMMs in Study III. The lmerTest package calculates effective degrees of freedom and p values based on Satterthwaite approximations.

Study I: Performance of LMM trees, LM trees, and MERTs in datasets with treatment-subgroup interactions

Method

Treatment-subgroup interaction design

For generating datasets with treatment-subgroup interactions, we used a design from Dusseldorp and Van Mechelen (2014), which is depicted in Fig. 5. Figure 5 shows four terminal subgroups, characterized by values of the partitioning variables $U_{2}$, and $U_{1}$ or $U_{5}$. Two of the subgroups have mean differences in treatment outcome, indicated by a non-zero value of $\beta_{j1}$, and two subgroups do not have mean differences in treatment outcome, indicated by a $\beta_{j1}$ value of 0.

In this simulation design, some of the potential partitioning covariates are true partitioning covariates, the others are noise variables. Therefore, in addition to the General simulation design, the following facet was added in this study:

6. Intercorrelation between b and one of the $U_{k}$ variables: b and all $U_{k}$ covariates uncorrelated, b correlated with one of the true partitioning covariates ($U_{1}$, $U_{2}$ or $U_{5}$), b correlated with one of the noise variables ($U_{3}$ or $U_{4}$).

To assess the effect of the magnitude of treatment-effect differences, the following facet was added in this study:

7. Two levels for the mean difference in treatment outcomes: The absolute value of the treatment-effect difference was varied to be $|\beta_{j1}| = 2.5$ (corresponding to a medium effect size, Cohen’s d = 0.5; Cohen, 1992) and $|\beta_{j1}| = 5.0$ (corresponding to a large effect size; Cohen’s d = 1.0).
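The stated effect sizes are consistent with standardizing the treatment-effect difference by the residual SD of the general simulation design. Assuming that $\sigma_{\epsilon} = 5$ is indeed the standardizer (the text does not say so explicitly), the arithmetic is

$$ d = \frac{|\beta_{j1}|}{\sigma_{\epsilon}} = \frac{2.5}{5} = 0.5 \qquad \text{and} \qquad d = \frac{5.0}{5} = 1.0. $$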

For each cell of the design, 50 datasets were generated. In every dataset, the outcome variable was calculated as $y_{i} = x_{i}^{\top}\beta_{j} + z_{i}^{\top}b_{m} + \epsilon_{i}$.

Assessment of performance

Performance of the algorithms was assessed by means of tree size, tree accuracy, and predictive accuracy. An accurately recovered tree was defined as a tree with (1) seven nodes in total, (2) the first split involving variable $U_{2}$ with a value of 30 ± 5, (3) the next split on the left involving variable $U_{1}$ with a value of 17 ± 5, and (4) the next split on the right involving variable $U_{5}$ with a value of 63 ± 5. The allowance of ± 5 equals plus or minus half the population SD of the partitioning variable ($\sigma _{U_{k}}$).
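The four-part definition of an accurately recovered tree translates directly into a check. The sketch below assumes a hypothetical representation in which each split is passed as a (variable name, split value) pair; how these would be extracted from a fitted tree object is not covered here.

```python
def is_accurate_tree(n_nodes, first, left, right, tol=5.0):
    """Apply the recovery criteria to a summary of a fitted tree.

    first, left and right are (variable_name, split_value) pairs, or None
    if the corresponding split is absent.
    """
    if n_nodes != 7:                                   # criterion (1)
        return False
    expected = [(first, "U2", 30.0),                   # criterion (2)
                (left, "U1", 17.0),                    # criterion (3)
                (right, "U5", 63.0)]                   # criterion (4)
    for split, name, value in expected:
        if split is None or split[0] != name or abs(split[1] - value) > tol:
            return False
    return True


print(is_accurate_tree(7, ("U2", 29.94), ("U1", 18.2), ("U5", 60.0)))
```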

For MERT, the number of nodes and tree accuracy were not assessed, as the treatment-subgroup interaction design in Fig. 5 corresponds to a large number of regression tree structures, which would all be different but also correct. Therefore, performance of MERTs was only assessed in terms of predictive accuracy.

Predictive accuracy of each method was assessed by calculating the correlation between true and predicted treatment-effect differences. To prevent overly optimistic estimates of predictive accuracy, predictive accuracy was assessed using test datasets. Test datasets were generated from the same population as training datasets, but test observations were drawn from ‘new’ clusters instead of the clusters of the training observations.
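The predictive accuracy measure is a correlation between two vectors of treatment-effect differences on the test data. A minimal sketch follows, assuming the Pearson correlation is meant (the text only says "correlation") and using made-up numbers.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Illustrative values only: true and predicted differences for four test cases.
true_diff = [2.5, 0.0, 0.0, -2.5]
pred_diff = [2.0, 0.5, -0.5, -2.0]
print(round(pearson_correlation(true_diff, pred_diff), 3))
```

Note that the measure is undefined when a method predicts the same difference for every test observation, for example when a tree makes no split at all.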

The best approach for including treatment effects in MERTs is not completely obvious. Firstly, a single MERT may be fitted, where treatment is included as one of the potential partitioning variables. Predictions of treatment-effect differences can then be obtained by dropping test observations down the tree twice, once for every level of the treatment indicator. Secondly, two MERTs may be fitted: one using observations in the first treatment condition and one using observations in the second treatment condition. Predictions of treatment-effect differences can then be obtained by dropping a test observation down each of the two trees. We tried both approaches: the second approach yielded higher predictive accuracy, as the first approach often did not pick up the treatment indicator as a predictor. Therefore, we have taken the second approach of fitting two MERTs to each dataset in our simulations.

Results

Tree size

The average size of LMM trees was 7.15 nodes (SD = 0.61), whereas the average size of LM trees was 8.15 nodes (SD = 2.05), indicating that LM trees tend to involve more spurious splits than LMM trees. The effects of the most important predictors of tree size are depicted in Fig. 6. The average size of LMM trees was close to the true tree size in all conditions. In the absence of random effects, this was also the case for LM trees. In the presence of random effects that are correlated with a (potential) partitioning variable, LM trees start to create spurious splits, especially with larger $\sigma_{b}$ values. In the presence of random effects that are uncorrelated with the other variables in the model, LM trees lack power to detect treatment-subgroup interactions if sample size is small (i.e., N = 200). With larger sample sizes, LM trees showed about the true tree size, on average. Tree size of MERTs was not assessed, as a single true tree size for MERTs could not be derived from the design in Fig. 5.

Accuracy of recovered trees

The estimated probability that a dataset was erroneously not partitioned (type II error) was 0 for both algorithms. For the first split, LMM trees selected the true partitioning variable ($U_{2}$) in all datasets, and LM trees in all but one dataset. The mean splitting value of the first split was 29.94 for LM as well as LMM trees, which is very close to the true splitting value of 30 (Fig. 5).

Further splits were more accurately recovered by LMM trees, yielding 90.40% accuracy for the full partition compared to only 61.44% for LM trees. The effects of the four most important predictors of tree accuracy are depicted in Fig. 7. In the absence of random effects, LM and LMM trees were about equally accurate. In the presence of random effects, LM trees were much less accurate than LMM trees when random effects were correlated with a partitioning covariate. When random intercepts were not correlated with one of the $U_{k}$ variables, LMM trees outperformed LM trees only when sample size was small (i.e., N = 200). Tree accuracy of MERTs was not assessed, as a single accurate tree structure for MERTs could not be derived from the design in Fig. 5.

Predictive accuracy

The predicted treatment-effect differences of LMM trees show an average correlation of 0.93 (SD = .13) with the true differences. LM trees and MERTs show lower accuracy, with average correlations of 0.88 (SD = .19) and 0.75 (SD = .21), respectively. The most important predictors of predictive accuracy are depicted in Fig. 8. Performance of all three algorithms improves with increasing sample size and treatment-effect differences. Furthermore, LMM trees and MERTs are not much affected by the presence and magnitude of random effects in the data. LMM trees perform most accurately in most conditions and are never outperformed by the other methods. MERTs perform the least accurately in most conditions and never outperform the other methods, but the differences in accuracy become less pronounced with larger sample and effect sizes.