Detecting treatment-subgroup interactions in clustered data with generalized linear mixed-effects model trees
Abstract
Identification of subgroups of patients for whom treatment A is more effective than treatment B, and vice versa, is of key importance to the development of personalized medicine. Tree-based algorithms are helpful tools for the detection of such interactions, but none of the available algorithms allow for taking into account clustered or nested dataset structures, which are particularly common in psychological research. Therefore, we propose the generalized linear mixed-effects model tree (GLMM tree) algorithm, which allows for the detection of treatment-subgroup interactions, while accounting for the clustered structure of a dataset. The algorithm uses model-based recursive partitioning to detect treatment-subgroup interactions, and a GLMM to estimate the random-effects parameters. In a simulation study, GLMM trees show higher accuracy in recovering treatment-subgroup interactions, higher predictive accuracy, and lower type II error rates than linear-model-based recursive partitioning and mixed-effects regression trees. Also, GLMM trees show somewhat higher predictive accuracy than linear mixed-effects models with prespecified interaction effects, on average. We illustrate the application of GLMM trees on an individual patient-level data meta-analysis on treatments for depression. We conclude that GLMM trees are a promising exploratory tool for the detection of treatment-subgroup interactions in clustered datasets.
Keywords
Model-based recursive partitioning · Treatment-subgroup interactions · Mixed-effects models · Classification and regression trees

Introduction
In research on the efficacy of treatments for somatic and psychological disorders, the one-size-fits-all paradigm is slowly losing ground, and personalized or stratified medicine is becoming increasingly important. Stratified medicine presents the challenge of discovering which patients respond best to which treatments. This can be referred to as the detection of treatment-subgroup interactions (e.g., Doove, Dusseldorp, Van Deun, & Van Mechelen, 2014). Often, treatment-subgroup interactions are studied using linear models, such as factorial analysis of variance techniques, in which potential moderators have to be specified a priori, have to be checked one at a time, and continuous moderator variables have to be discretized. This may hamper identification of which treatment works best for whom, especially when there are no a priori hypotheses about treatment-subgroup interactions. As noted by Kraemer, Frank, and Kupfer (2006), there is a need for methods that generate instead of test such hypotheses.
Tree-based methods are such hypothesis-generating methods. Tree-based methods, also known as recursive partitioning methods, split observations repeatedly into groups so that they become increasingly similar with respect to the outcome within each group. Several tree-based methods take the mean of a continuous dependent variable or the majority class of a categorical dependent variable as the outcome, one of the earliest and most well-known examples being the classification and regression tree (CART) approach of Breiman, Friedman, Olshen, and Stone (1984). Other tree-based methods take the estimated parameters of a more complex model as the outcome, of which the RECPAM (recursive partition and amalgamation) approach of Ciampi (1991) is the earliest example.
If the partition in Fig. 1 were used for prediction of a new observation, the new observation would be assigned to one of the terminal nodes according to its values on the splitting variables. The prediction is then based on the estimated distribution of the outcome variable within that terminal node. For example, the prediction may be the node-specific mean of a single continuous variable. In the current paper, we focus on trees whose terminal nodes consist of a linear (LM) or generalized linear model (GLM), in which case the predicted value for a new observation is determined by the node-specific parameter estimates of the (G)LM, while also adjusting for random effects.
Tree-based methods are particularly useful for exploratory purposes because they can handle many potential predictor variables at once and can automatically detect (higher-order) interactions between predictor variables (Strobl, Malley, & Tutz, 2009). As such, they are pre-eminently suited to the detection of treatment-subgroup interactions. Several tree-based algorithms for the detection of treatment-subgroup interactions have already been developed (Dusseldorp, Doove, & Van Mechelen, 2016; Dusseldorp & Meulman, 2004; Su, Tsai, Wang, Nickerson, & Li, 2009; Foster, Taylor, & Ruberg, 2011; Lipkovich, Dmitrienko, Denne, & Enas, 2011; Zeileis, Hothorn, & Hornik, 2008; Seibold, Zeileis, & Hothorn, 2016; Athey & Imbens, 2016). Also, Zhang, Tsiatis, Laber, and Davidian (2012b) and Zhang, Tsiatis, Davidian, Zhang, and Laber (2012a) have developed a flexible classification-based approach, allowing users to select from a range of statistical methods, including trees.
In many instances, researchers may want to detect treatment-subgroup interactions in clustered or nested datasets, for example in individual-level patient data meta-analyses, where datasets of multiple clinical trials on the same treatments are pooled. In such analyses, the nested or clustered structure of the dataset should be taken into account by including study-specific random effects in the model, prompting the need for a mixed-effects model (e.g., Cooper & Patall, 2009; Higgins, Whitehead, Turner, Omar, & Thompson, 2001). In linear models, ignoring the clustered structure may lead, for example, to biased inference due to underestimated standard errors (e.g., Bryk & Raudenbush, 1992). For tree-based methods, ignoring the clustered structure has been found to result in the detection of spurious subgroups and inaccurate predictor variable selection (e.g., Sela & Simonoff, 2012; Martin, 2015). However, none of the purely tree-based methods for treatment-subgroup interaction detection allow for taking into account the clustered structure of a dataset. Therefore, in the current paper, we present a tree-based algorithm that can be used for the detection of interactions and nonlinearities in GLMM-type models: generalized linear mixed-effects model trees, or GLMM trees.
The GLMM tree algorithm builds on model-based recursive partitioning (MOB; Zeileis et al., 2008), which offers a flexible framework for subgroup detection. For example, GLM-based MOB has been applied to detect treatment-subgroup interactions for the treatment of depression (Driessen et al., 2016) and amyotrophic lateral sclerosis (Seibold et al., 2016). In contrast to other purely tree-based methods (e.g., Zeileis et al., 2008; Su et al., 2009; Dusseldorp et al., 2016), GLMM trees allow for taking into account the clustered structure of datasets. In contrast to previously suggested regression trees with random effects (e.g., Hajjem, Bellavance, & Larocque, 2011; Sela & Simonoff, 2012), GLMM trees allow for treatment effect estimation, with continuous as well as non-continuous response variables.
The remainder of this paper is structured into four sections: In the first section, we introduce the GLMM tree algorithm using an artificial motivating dataset with treatment-subgroup interactions. In the second section, we compare the performance of GLMM trees with that of three other methods: MOB trees without random effects, mixed-effects regression trees (MERTs), and linear mixed-effects models with prespecified interactions. In the third section, we apply the GLMM tree algorithm to an existing dataset of a patient-level meta-analysis on the effects of psycho- and pharmacotherapy for depression. In the fourth and last section, we summarize the results and discuss limitations and directions for future research. In the Appendix, we provide a glossary explaining the abbreviations and mathematical notation used in the current paper. Finally, a tutorial on how to fit GLMM trees using the R package glmertree is included as supplementary material. The tutorial uses the artificial motivating dataset, allowing users to recreate the trees and models fitted in the next section.
GLMM tree algorithm
Artificial motivating dataset
Summary statistics for partitioning and outcome variables in the artificial motivating dataset
Variable      min   max    M      SD
Depression      3    16    9.12   2.66
Age            18    69   45.00   9.56
Anxiety         3    18   10.26   3.05
Duration        1    17    6.97   2.90
The outcome variable was generated such that there are three subgroups with differential treatment effectiveness, corresponding to the terminal nodes in Fig. 1: For the first subgroup of patients (node 3), with short duration (≤ 8) and low anxiety scores (≤ 10), Treatment 1 leads to lower post-treatment depression than Treatment 2 (true mean difference = 2). For the second subgroup of patients (node 4), with short duration but high anxiety scores (> 10), post-treatment depression is about equal in both treatment conditions (true mean difference = 0). For the third subgroup of patients (node 5), with long duration (> 8 months), Treatment 2 leads to lower post-treatment depression than Treatment 1 (true mean difference = −2.5). Thus, duration and anxiety are true partitioning or moderator variables, whereas age is not. Anticipating the final results of our analyses, the treatment-subgroup interactions are depicted in Fig. 4, which shows the GLMM tree that accurately recovered them.
Modelbased recursive partitioning
The rationale behind MOB is that a single global GLM (or other parametric model) may not describe the data well, and that when additional covariates are available, it may be possible to partition the dataset with respect to these covariates and find better-fitting models in each cell of the partition. For example, to assess the effect of treatment, we may first fit a global GLM in which the treatment indicator has the same effect/coefficient on the outcome for all observations. Subsequently, the data may be partitioned recursively with respect to other covariates, leading to separate models with different treatment effects/coefficients in each subsample.
The MOB algorithm can be used to partition the dataset using additional covariates and find better-fitting local models. To this end, the MOB algorithm tests for parameter stability with respect to each of a set of auxiliary covariates, also called partitioning variables, which we will denote by U. When the partitioning is based on a GLM, instabilities are differences in \(\hat {\beta }\) across partitions of the dataset, which are defined by one or more auxiliary covariates not included in the linear predictor. To find these partitions, the MOB algorithm cycles iteratively through the following steps (Zeileis et al., 2008): (1) fit the parametric model to the dataset; (2) statistically test for parameter instability with respect to each of the partitioning variables; (3) if there is some overall parameter instability, split the dataset with respect to the variable associated with the highest instability; (4) repeat the procedure in each of the resulting subgroups.
In step (2), a test statistic quantifying parameter instability is calculated for every potential partitioning variable. As the distribution of these test statistics under the null hypothesis of parameter stability is known, a p value can be calculated for every partitioning variable. A more in-depth discussion of the parameter stability tests is beyond the scope of this paper, but can be found in Zeileis and Hornik (2007) and Zeileis et al. (2008).
If at least one of the partitioning variables yields a p value below the prespecified significance level α, the dataset is partitioned into two subsets in step (3). This partition is created using \(U_{k^{*}}\), the partitioning variable with the minimal p value in step (2). The split point for \(U_{k^{*}}\) is selected as the value that minimizes the instability, as measured by the sum of two loss functions, one for each of the resulting subgroups. In other words, the loss function is minimized separately in the two subgroups resulting from every possible split point, and the split point yielding the minimum sum of the loss functions is selected. In step (4), steps (1) through (3) are repeated in each partition, until the null hypothesis of parameter stability can no longer be rejected (or the subsets become too small).
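To make step (3) concrete, the split-point search can be sketched in a few lines. This is an illustrative Python sketch, not the partykit implementation: the node model is reduced to arm-specific means of a y ~ treatment model, the loss is the residual sum of squares, and all function names are ours.

```python
# Sketch of MOB step (3): choose the split point of the selected
# partitioning variable that minimizes the summed loss over the two
# resulting subgroups. Illustrative only; the real node model is a
# (G)LM and the real loss its objective function.

def sse_of_node_model(y, treat):
    """Residual sum of squares after fitting arm-specific means."""
    sse = 0.0
    for arm in (0, 1):
        ys = [yi for yi, t in zip(y, treat) if t == arm]
        if ys:
            m = sum(ys) / len(ys)
            sse += sum((yi - m) ** 2 for yi in ys)
    return sse

def best_split(u, y, treat, min_size=20):
    """Exhaustive search over split points of one covariate u."""
    best = (None, float("inf"))
    for cut in sorted(set(u)):
        left = [i for i, ui in enumerate(u) if ui <= cut]
        right = [i for i, ui in enumerate(u) if ui > cut]
        if len(left) < min_size or len(right) < min_size:
            continue  # respect the minimum node size
        loss = (sse_of_node_model([y[i] for i in left], [treat[i] for i in left])
                + sse_of_node_model([y[i] for i in right], [treat[i] for i in right]))
        if loss < best[1]:
            best = (cut, loss)
    return best
```

With a dataset whose treatment effect flips sign at a known cut point, the search recovers that cut as the loss-minimizing split.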
Including random effects
Step 0: Initialize by setting r and all values of \(\hat {b}_{(r)}\) to 0.

Step 1: Set r = r + 1. Estimate a GLM tree using \(z_{i}^{\top }\hat {b}_{(r-1)}\) as an offset.

Step 2: Fit the mixed-effects model \(g(\mu_{ij}) = x_{i}^{\top}\beta_{j} + z_{i}^{\top}b\), with terminal node j(r) from the GLM tree estimated in Step 1. Extract the posterior predictions \(\hat {b}_{(r)}\) from the estimated model.

Step 3: Repeat Steps 1 and 2 until convergence.
The algorithm initializes by setting b to 0, since the random effects are initially unknown. In every iteration, the GLM tree is re-estimated in step (1), and the fixed- and random-effects parameters are re-estimated in step (2). Note that the random effects are not partitioned, but estimated globally; only the fixed effects are estimated locally, within the cells of the partition. Convergence of the algorithm is monitored by computing the log-likelihood criterion of the mixed-effects model in Eq. 5. Typically, convergence is reached once the tree no longer changes from one iteration to the next.
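The alternating scheme in Steps 0 through 3 can be sketched as follows. This Python sketch assumes two plug-in routines standing in for the real building blocks (a GLM tree for Step 1 and a GLMM for Step 2); the toy versions included here exist only to make the loop runnable and are not part of the glmertree implementation.

```python
# Sketch of the alternating GLMM tree estimation scheme (Steps 0-3).
# `fit_tree` and `fit_mixed_model` are placeholders; the toy versions
# below are deliberately simplistic stand-ins.

def fit_glmm_tree(y, u, cluster, fit_tree, fit_mixed_model, max_iter=100):
    b_hat = {c: 0.0 for c in set(cluster)}              # Step 0: b = 0
    partition = None
    for _ in range(max_iter):
        offset = [b_hat[c] for c in cluster]
        new_partition = fit_tree(y, u, offset)          # Step 1: tree with offset
        b_hat = fit_mixed_model(y, cluster, new_partition)  # Step 2: GLMM given tree
        if new_partition == partition:                  # converged: tree unchanged
            break
        partition = new_partition
    return partition, b_hat

def toy_tree(y, u, offset):
    """Toy 'tree': a single split of u at zero."""
    return tuple(0 if ui < 0 else 1 for ui in u)

def toy_mixed_model(y, cluster, partition):
    """Toy random intercepts: cluster means of within-node residuals."""
    nodes = set(partition)
    node_mean = {n: sum(yi for yi, p in zip(y, partition) if p == n)
                    / sum(1 for p in partition if p == n) for n in nodes}
    resid = [yi - node_mean[p] for yi, p in zip(y, partition)]
    return {c: sum(r for r, ci in zip(resid, cluster) if ci == c)
               / sum(1 for ci in cluster if ci == c) for c in set(cluster)}
```

With balanced toy data (two nodes with means 10 and 20, cluster intercepts +3 and −3), the loop recovers both the partition and the cluster intercepts.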
Simulation-based evaluation
To assess the performance of GLMM trees, we carried out three simulation studies: In Study I, we assessed and compared the accuracy of GLMM trees, linear-model-based MOB (LM trees), and mixed-effects regression trees (MERTs) in datasets with treatment-subgroup interactions. In Study II, we assessed and compared the type I error rates of GLMM trees and linear-model-based MOB in datasets without treatment-subgroup interactions. In Study III, we assessed and compared the performance of GLMM trees and linear mixed-effects models (LMMs) with prespecified interactions in datasets with piecewise and continuous interactions. As the outcome variable was continuous in all simulated datasets, the GLMM tree algorithm and the trees resulting from its application will be referred to as LMM tree(s).
General simulation design
1. Sample size: N = 200, N = 500, N = 1000.

2. Number of potential partitioning covariates U_1 through U_K: K = 5 and K = 15.

3. Intercorrelation between the potential partitioning covariates U_1 through U_K: \(\rho _{U_{k},U_{k^{\prime }}}=0.0\), \(\rho _{U_{k},U_{k^{\prime }}}=0.3\).

4. Number of clusters: M = 5, M = 10, M = 25.

5. Population standard deviation (SD) of the normal distribution from which the cluster-specific intercepts were drawn: σ_b = 0, σ_b = 5, σ_b = 10.

6. Intercorrelation between b and one of the U_k variables: b and all U_k covariates uncorrelated, or b correlated with one of the U_k covariates (r = .42).
Following the approach of Dusseldorp and Van Mechelen (2014), all partitioning covariates U_1 through U_K were drawn from a multivariate normal distribution, with means μ_{U_1} = 10, μ_{U_2} = 30, μ_{U_4} = −40, and μ_{U_5} = 70. The means of the other potential partitioning covariates (U_3 and, depending on the value of K, also U_6 through U_15) were drawn from a discrete uniform distribution on the interval [−70, 70]. All covariates U_1 through U_15 had the same standard deviation: σ_{U_k} = 10.
To generate the cluster-specific intercepts, we partitioned the sample into M equally sized clusters, conditional on one of the variables U_1 through U_5, producing the correlations in the sixth facet of the simulation design. For each cluster, a single value b_m was drawn from a normal distribution with mean 0 and the value of σ_b given by the fifth facet of the simulation design. If b was correlated with one of the potential partitioning variables, the correlated variable was randomly selected.
For every observation, we generated a binomial variable (with probability 0.5) as an indicator for treatment type. Random errors ε were drawn from a normal distribution with μ_ε = 0 and σ_ε = 5. The value of the outcome variable y_i was calculated as the sum of the random intercept, the (node-specific) fixed effects, and the random error term.
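The data-generating process just described can be sketched as follows. This is a simplified Python illustration: the moderator, the subgroup rule, and the default effect size are placeholders rather than the exact design of the simulation study.

```python
import random

# Illustrative sketch of the data-generating process: the outcome is
# the sum of a cluster-specific random intercept, a (node-specific)
# treatment effect, and a normal error term. The moderator `u` and the
# subgroup rule are toy placeholders, not the paper's exact design.

def generate_data(n, n_clusters, sigma_b, sigma_eps, effect=2.5, seed=0):
    rng = random.Random(seed)
    b = {m: rng.gauss(0, sigma_b) for m in range(n_clusters)}  # cluster intercepts
    data = []
    for i in range(n):
        cluster = i % n_clusters             # M equally sized clusters
        treat = int(rng.random() < 0.5)      # binomial treatment indicator
        u = rng.uniform(-1, 1)               # a toy moderator
        beta = effect if u > 0 else -effect  # subgroup-specific treatment effect
        eps = rng.gauss(0, sigma_eps)
        data.append((b[cluster] + beta * treat + eps, treat, u, cluster))
    return data
```

Setting both standard deviations to zero strips the outcome down to the pure subgroup-specific treatment effect, which makes the generating rule easy to verify.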
Due to the large number of cells in the simulation design, the most important predictors of accuracy were identified by means of ANOVAs and/or GLMs and then examined through graphical displays. The ANOVAs and GLMs included main effects of algorithm type and the parameters of the data-generating process, as well as first-order interactions between algorithm type and each of the data-generating parameters.
Software
R (R Core Team, 2016) was used for data generation and analyses. The partykit package (version 1.0-2; Hothorn & Zeileis, 2015, 2016) was employed for estimating LM trees, using the lmtree function. For estimation of LMM trees, the lmertree function of the glmertree package (version 0.1-0; Fokkema & Zeileis, 2016; available from R-Forge) was used. The significance level α for the parameter instability tests was set to 0.05 for all trees, with a Bonferroni correction applied for multiple testing. The latter adjusts the p values of the parameter stability tests by multiplying them by the number of potential partitioning variables. The minimum number of observations per node was set to 20, and the maximum tree depth was set to three, thus limiting the number of terminal nodes to eight in every tree.
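The Bonferroni correction described above amounts to a small computation, sketched here in Python for clarity (the actual correction is performed inside partykit):

```python
# Bonferroni correction for the parameter stability tests: each p value
# is multiplied by the number of potential partitioning variables and
# capped at 1; a split is made only if the smallest adjusted p value
# falls below alpha = 0.05.

def bonferroni(p_values):
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

def should_split(p_values, alpha=0.05):
    return min(bonferroni(p_values)) < alpha
```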
The REEMtree package (version 0.9.3; Sela & Simonoff, 2011) was employed for estimating MERTs, using default settings. For estimating LMMs, the lmer function from the lme4 package (version 1.1-7; Bates, Mächler, Bolker, & Walker, 2015; Bates et al., 2017) was employed, using restricted maximum likelihood (REML) estimation. The lmerTest package (version 2.0-32; Kuznetsova, Brockhoff, & Christensen, 2016) was used to assess the statistical significance of fixed-effects predictors in the LMMs in Study III. The lmerTest package calculates effective degrees of freedom and p values based on Satterthwaite approximations.
Study I: Performance of LMM trees, LM trees, and MERTs in datasets with treatment-subgroup interactions
Method
Treatment-subgroup interaction design
6. Intercorrelation between b and one of the U_k variables: b and all U_k covariates uncorrelated; b correlated with one of the true partitioning covariates (U_1, U_2, or U_5); b correlated with one of the noise variables (U_3 or U_4).

7. Two levels for the mean difference in treatment outcomes: the absolute value of the treatment-effect difference was varied to be β_{j1} = 2.5 (corresponding to a medium effect size, Cohen's d = 0.5; Cohen, 1992) and β_{j1} = 5.0 (corresponding to a large effect size, Cohen's d = 1.0).
For each cell of the design, 50 datasets were generated. In every dataset, the outcome variable was calculated as \(y_{i} = x_{i}^{\top}\beta_{j} + z_{i}^{\top}b_{m} + \epsilon_{i}\).
Assessment of performance
Performance of the algorithms was assessed by means of tree size, tree accuracy, and predictive accuracy. An accurately recovered tree was defined as a tree with (1) seven nodes in total, (2) the first split involving variable U_2 with a value of 30 ± 5, (3) the next split on the left involving variable U_1 with a value of 17 ± 5, and (4) the next split on the right involving variable U_5 with a value of 63 ± 5. The allowance of ± 5 equals plus or minus half the population SD of the partitioning variables (\(\sigma _{U_{k}}\)).
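The accuracy criterion above can be expressed as a simple check. In this Python sketch, a tree is summarized by its node count and a mapping from split position to the selected variable and split point; this representation is ours, not the simulation code's.

```python
# Tree-accuracy criterion: seven nodes, and each of the three splits
# uses the true variable with a split point within +/- 5 (half a
# population SD) of the true value.

TRUE_SPLITS = {"root": ("U2", 30), "left": ("U1", 17), "right": ("U5", 63)}

def is_accurate(n_nodes, splits, tol=5):
    if n_nodes != 7:
        return False
    for pos, (var, val) in TRUE_SPLITS.items():
        if pos not in splits:
            return False
        got_var, got_val = splits[pos]
        if got_var != var or abs(got_val - val) > tol:
            return False
    return True
```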
For MERTs, the number of nodes and tree accuracy were not assessed, as the treatment-subgroup interaction design in Fig. 5 corresponds to a large number of regression tree structures that would all be different, but equally correct. Therefore, the performance of MERTs was assessed only in terms of predictive accuracy.
Predictive accuracy of each method was assessed by calculating the correlation between true and predicted treatment-effect differences. To prevent overly optimistic estimates, predictive accuracy was assessed using test datasets. Test datasets were generated from the same population as the training datasets, but test observations were drawn from 'new' clusters rather than from the clusters used for training.
The best approach for including treatment effects in MERTs is not completely obvious. First, a single MERT may be fitted, in which treatment is included as one of the potential partitioning variables. Predictions of treatment-effect differences can then be obtained by dropping test observations down the tree twice, once for every level of the treatment indicator. Second, two MERTs may be fitted: one using observations in the first treatment condition and one using observations in the second. Predictions of treatment-effect differences can then be obtained by dropping a test observation down each of the two trees. We tried both approaches; the second yielded higher predictive accuracy, as the first often did not pick up the treatment indicator as a predictor. Therefore, we took the second approach of fitting two MERTs to each dataset in our simulations.
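The second approach, together with the correlation-based accuracy measure described above, can be sketched as follows. `model_arm1` and `model_arm2` stand for any two arm-specific fitted trees; the Python code is illustrative.

```python
from statistics import mean

def predicted_effect_differences(xs, model_arm1, model_arm2):
    """Second approach: drop each case down both arm-specific models."""
    return [model_arm1(x) - model_arm2(x) for x in xs]

def pearson(a, b):
    """Predictive accuracy: correlation of true vs. predicted differences."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den
```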
Results
Tree size
Accuracy of recovered trees
The estimated probability that a dataset was erroneously not partitioned (type II error) was 0 for both algorithms. For the first split, LMM trees selected the true partitioning variable (U_2) in all datasets, and LM trees in all but one dataset. The mean splitting value of the first split was 29.94 for LM as well as LMM trees, which is very close to the true splitting value of 30 (Fig. 5).
Predictive accuracy
Study II: Type I error of LM and LMM trees
Method
Design
In the second simulation study, we assessed the type I error rates of LM and LMM trees. In the datasets in this study, there was only a main effect of treatment in the population. Put differently, there was only a single global value of β_j = β in every dataset. The type I error rate was defined as the proportion of datasets without treatment-subgroup interactions that were erroneously partitioned by the algorithm.
7. Two levels for β, the global mean difference in treatment outcomes: β = 2.5 (corresponding to a medium effect size, Cohen's d = 0.5) and β = 5.0 (corresponding to a large effect size, Cohen's d = 1.0).
For each cell in the simulation design, 50 datasets were generated. In every dataset, the outcome variable was calculated as \(y_{i} = x_{i}^{\top}\beta + z_{i}^{\top}b_{m} + \epsilon_{i}\).
Assessment of performance
To assess the type I error rates of LM and LMM trees, tree sizes were calculated and trees of size > 1 were classified as type I errors. The nominal type I error rate for both LM and LMM trees equals 0.05, corresponding to the prespecified significance level α for the parameter instability tests.
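The classification rule above reduces to counting trees that made at least one split. A minimal sketch:

```python
def type_i_error_rate(tree_sizes):
    """Proportion of null datasets in which the fitted tree made any
    split (tree size > 1), i.e., a spurious subgroup was detected."""
    return sum(1 for s in tree_sizes if s > 1) / len(tree_sizes)
```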
Results
Study III: Recovery of piecewise and continuous interactions by LMM trees and LMMs with prespecified interactions
Method
Interaction design
7. Three levels for interaction type: continuous, piecewise, and combined piecewise-continuous interactions.
To generate datasets with purely piecewise interactions, the same partition as in Study I (Fig. 5) was used. In other words, the outcome variable in this design was calculated as \(y_{i} = x_{i}^{\top}\beta_{j} + z_{i}^{\top}b + \epsilon_{i}\), with the value of β_j depending on the values of U_2, U_1, and U_5.
Fixed-effects terms in simulations with continuous and combined continuous-and-piecewise interaction designs

Term                       β         β_3       β_4      β_6      β_7
intercept                  27        27        27       27       27
U_2                        0.100     0.100     0.100    0.100    0.100
U_2 · U_1                 −0.357    −0.357     0        0        0
U_2 · U_5                  0.357     0         0        0        0.357
U_2 · U_1 · treatment     −0.151    −0.151     0        0        0
U_2 · U_5 · treatment      0.151     0         0        0        0.151
In datasets with purely continuous interactions, β has a global value and no subscript, comprising only purely continuous main and interaction effects, as shown by the terms and the single column for β in Table 2. The outcome variable was calculated as \(y_{i} = x_{i}^{\top}\beta + z_{i}^{\top}b + \epsilon_{i}\).
Furthermore, in this simulation study, the number of cells in the design was reduced by limiting the fourth facet of the data-generating design to a single level (M = 25 clusters), as Studies I and II indicated no effects of the number of clusters. The fifth facet of the data-generating design was limited to two levels (σ_b = 2.5 and σ_b = 7.5). For every cell of the design, 50 datasets were generated.
LMMs with prespecified interactions
LMMs were estimated by specifying main effects for all covariates U_k and the treatment indicator, first-order interactions between all pairs of covariates U_k, and second-order interactions between all pairs of covariates U_k and treatment. Continuous predictor variables were centered by subtracting the mean value before calculating and including the interaction terms in the LMM.
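The fixed-effects design just described can be sketched as follows. This Python illustration forms centered main effects, pairwise covariate interactions, and covariate-pair-by-treatment terms; the column-naming scheme and the exact set of terms are our reading of the description above, not the simulation code.

```python
from statistics import mean

# Sketch of the prespecified-LMM fixed-effects design: covariates are
# mean-centered before the pairwise interactions and the
# pair-by-treatment (second-order) terms are formed.

def interaction_design(U, treatment):
    """U: dict mapping covariate name -> list of values; treatment: 0/1 list."""
    centered = {k: [v - mean(vs) for v in vs] for k, vs in U.items()}
    cols = {k: centered[k] for k in U}             # centered main effects
    cols["treatment"] = list(treatment)
    names = sorted(U)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pair = [x * y for x, y in zip(centered[a], centered[b])]
            cols[a + ":" + b] = pair               # first-order interaction
            cols[a + ":" + b + ":treatment"] = [p * t for p, t in zip(pair, treatment)]
    return cols
```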
Assessment of performance
Predictive accuracy was assessed in terms of the correlation between the true and predicted treatment-effect differences in test datasets. As full LMMs are likely to overfit, the LMMs were refitted on the training data using only the predictors with p values < 0.05 in the original LMM. Predictions for test observations were obtained using the refitted LMMs.
Results
Performance of both LMM trees and LMMs improves with increasing sample size. Furthermore, the performance of LMM trees is not affected by the number of covariates, whereas the predictive accuracy of LMMs deteriorates when the number of covariates increases, especially when the true interactions are not purely continuous. This indicates that LMM trees are especially useful for exploratory purposes, when there are many potential moderator variables. In addition, LMM trees may often provide simpler models: Whereas the LMMs included 12.30 significant terms on average, the LMM trees had 3.38 inner nodes on average, requiring only about three to four variables to be evaluated for making predictions.
Application: Individual patient-level meta-analysis on treatments for depression
Method
Dataset
To illustrate the use of GLMM trees in real-data applications, we employ a dataset from an individual-patient data meta-analysis of Cuijpers et al. (2014). This meta-analysis was based on patient-level observations from 14 RCTs comparing the effects of psychotherapy (cognitive behavioral therapy; CBT) and pharmacotherapy (PHA) in the treatment of depression. The study of Cuijpers et al. (2014) was aimed at establishing whether gender is a predictor or moderator of the outcomes of psychological and pharmacological treatments for depression. Treatment outcomes were assessed by means of the 17-item Hamilton Rating Scale for Depression (HAM-D; Hamilton, 1960). Cuijpers et al. (2014) found no indication that gender predicted or moderated treatment outcome.
In our analyses, post-treatment HAM-D score was the outcome variable, and the potential partitioning variables were age, gender, level of education, presence of a comorbid anxiety disorder at baseline, and pre-treatment HAM-D score. The predictor variable in the linear model was treatment type (0 = CBT, 1 = PHA). An indicator for study was used as the cluster indicator.
In RCTs, ANCOVAs are often employed to linearly control post-treatment values on the outcome measure for pre-treatment values. Therefore, post-treatment HAM-D scores, controlled for the linear effects of pre-treatment HAM-D scores, were taken as the outcome variable. All models were fitted using the data of the 694 patients from seven studies for whom complete data were available. The results of our analysis may therefore not be fully representative of the complete dataset of the meta-analysis by Cuijpers et al. (2014).
Models and comparisons
As the outcome variable is continuous, we employed an identity link and a Gaussian response distribution. The resulting GLMM trees will therefore be referred to as LMM trees. To compare the accuracy of LMM trees, we also fitted LM trees and LMMs with prespecified interactions to the data. In the LMMs, the outcome variable was regressed on a random intercept, main effects of treatment and the potential moderators (partitioning variables), and interactions between treatment and the potential moderators. As it is not known in advance which of the potential moderators interact with each other, higher-order interactions were not included.
Effect size
To provide a standardized estimate of the treatment-effect differences in the terminal nodes of the trees, we calculated node-specific Cohen's d values. Cohen's d was calculated by dividing the node-specific predicted treatment outcome difference by the node-specific pooled standard deviation.
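The node-specific effect size computation can be sketched as follows. Whether sample or population variances enter the pooled SD is not spelled out above; this Python sketch assumes sample variances.

```python
from statistics import mean

# Node-specific Cohen's d: the predicted treatment outcome difference
# in a terminal node divided by the node's pooled standard deviation.

def pooled_sd(y1, y2):
    n1, n2 = len(y1), len(y2)
    v1 = sum((y - mean(y1)) ** 2 for y in y1) / (n1 - 1)  # sample variances
    v2 = sum((y - mean(y2)) ** 2 for y in y2) / (n2 - 1)
    return (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5

def cohens_d(pred_diff, y1, y2):
    return pred_diff / pooled_sd(y1, y2)
```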
Predictive accuracy
Predictive accuracy of each method was assessed by calculating the average correlation between observed and predicted post-treatment HAM-D scores, based on 50-fold cross-validation.
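The cross-validated accuracy measure can be sketched as follows, with `fit` standing in for any of the compared methods; the fold construction here (interleaved indices) is an illustrative choice rather than the one used in the paper.

```python
from statistics import mean

# k-fold cross-validated accuracy: for each fold, fit on the remaining
# folds, compute the correlation between observed and predicted values
# in the held-out fold, and average the k correlations.

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def cv_correlation(xs, ys, fit, k=50):
    folds = [list(range(i, len(xs), k)) for i in range(k)]  # interleaved folds
    cors = []
    for test_idx in folds:
        held = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        cors.append(pearson([ys[i] for i in test_idx],
                            [model(xs[i]) for i in test_idx]))
    return mean(cors)
```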
Stability
The results of recursive partitioning techniques are known to be potentially unstable, in the sense that small changes in the dataset may substantially alter the variables or values selected for partitioning. Therefore, following Philipp, Zeileis, and Strobl (2016), we used subsampling to assess the stability of the selected splitting variables and values. More precisely, the variable selection frequencies of the trees were computed from 500 subsamples, each comprising 90% of the full dataset.
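The subsampling scheme just described can be sketched as follows; `fit_tree_vars` is a placeholder for refitting the tree on a subsample and reading off its splitting variables.

```python
import random
from collections import Counter

# Stability assessment: refit the tree on many subsamples (here 90% of
# the data, as described above) and tabulate how often each variable is
# selected for splitting.

def selection_frequencies(data, fit_tree_vars, n_rep=500, frac=0.9, seed=1):
    rng = random.Random(seed)
    m = int(round(frac * len(data)))
    counts = Counter()
    for _ in range(n_rep):
        subsample = rng.sample(data, m)
        counts.update(fit_tree_vars(subsample))  # set of selected variables
    return {v: c / n_rep for v, c in counts.items()}
```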
Results
The study-specific distributions of educational level and treatment outcome may explain why the LMM tree did not select level of education as a partitioning variable. Most (55) of the 74 observations with level of education ≤ 1 were part of a single study, which showed a markedly lower mean level of education (M = 2.57, SD = 1.02; 128 observations) compared to the other studies (M = 3.78, SD = 0.53; 566 observations), as well as markedly higher mean post-treatment HAM-D scores (M = 11.20, SD = 6.87) compared to the other studies (M = 7.78, SD = 5.95).
The LMM with prespecified treatment interactions yielded three significant predictors of treatment outcome: As in the LMM tree, an effect of the presence of a comorbid anxiety disorder was found (main effect: b = 2.29, p = 0.002; interaction with treatment: b = −2.10, p = 0.028). In addition, the LMM indicated an interaction between treatment and age (b = 0.10, p = 0.018).
Assessment of predictive accuracy by means of 50-fold cross-validation indicated better predictive accuracy for the LMM tree than for the LM tree and the LMM. The correlation between observed and predicted post-treatment HAM-D total scores, averaged over the 50 folds, was .272 (SD = .260) for the LMM tree, .233 (SD = .252) for the LMM with prespecified interactions, and .190 (SD = .290) for the LM tree.
Variable selection statistics for the LM and LMM trees in the IPD-MA dataset

                           Selection frequency
Variable                   LM tree    LMM tree
Education                  .956       .014
ComorbidAnxietyDisorder    .398       .528
HRSDt0                     .034       .002
Age                        .000       .022
Gender                     .002       .004
Discussion
Summary
We presented the GLMM tree algorithm, which allows for the estimation of a GLM-based recursive partition, as well as the estimation of global random-effects parameters. We hypothesized GLMM trees to be well suited for the detection of treatment-subgroup interactions in clustered datasets. We confirmed this through our simulation studies and by applying the algorithm to an existing dataset on the effects of depression treatments.
Our simulations focused on the performance of the GLMM tree algorithm in datasets with continuous response variables. The resulting LMM trees accurately recovered the subgroups in 90% of the simulated datasets with treatment-subgroup interactions, outperforming LM trees without random effects. In terms of predictive accuracy, LMM trees outperformed LM trees as well as MERTs, predicting treatment-effect differences in test data with an average correlation of .94. In datasets without treatment-subgroup effects, LMM trees showed a rather low type I error rate of 4%, compared with a type I error rate of 33% for LM trees.
The advantage of LMM trees was most pronounced when random effects were sizeable and correlated with potential partitioning variables; in those circumstances, LM trees are likely to introduce spurious splits and subgroups. LMM trees also outperformed the other tree methods especially with smaller effect and sample sizes (i.e., Cohen's d = .5 and/or N = 200). As such effect and sample sizes are common in multicenter clinical trials, the GLMM tree algorithm may provide a useful tool for subgroup detection in those settings.
In the absence of random effects, LM and LMM trees yielded very similar predictive accuracy. This finding is of practical importance: it indicates that applying the GLMM tree algorithm will not reduce accuracy when random effects are in fact absent from the data.
Compared with LMMs with prespecified interactions, LMM trees provided somewhat better accuracy, on average. LMM trees performed much better than LMMs when the interactions were at least partly piecewise, but their performance deteriorated when the interactions were purely continuous functions of the predictor variables, in which case LMMs with prespecified interactions performed very well. The performance of LMMs also deteriorated with a larger number of predictor variables, whereas the performance of LMM trees was unaffected, confirming our expectation that LMM trees are better suited for exploration than LMMs.
In the Application, we found that the LMM tree yielded somewhat higher predictive accuracy while using fewer variables than the LM tree and the LMM with prespecified interactions. The LMM trees obtained over repeated subsamples of the training data proved relatively stable.
Limitations and future directions
Recursive partitioning methods were originally developed as nonparametric tools for classification and regression, assuming the mechanism that generated the data to be unknown (e.g., Breiman, 2001). The GLMM tree algorithm takes MOB trees and GLMMs as building blocks and thereby inherits some of their sensitivity to model misspecification.
As with GLMMs, misspecification of the random effects may negatively affect the accuracy of GLMM trees. Previous research on GLMMs has shown that misspecifying the shape of the random-effects distribution reduces the accuracy of random-effects predictions, but has little effect on the fixed-effects parameter estimates (e.g., Neuhaus, McCulloch, & Boylan, 2013). This indicates that for GLMM trees, misspecification of the shape of the random-effects distribution will affect the random-effects predictions, but will have little effect on the estimated tree. Incorrectly specifying a predictor as having a fixed instead of a random effect may be more likely to result in spurious splits or poorer parameter estimates. To check for model misspecification, users are therefore advised to visually inspect residuals and predicted random effects. The tutorial included in the supplementary material shows how residuals and random-effects predictions can be obtained and plotted.
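Such checks can be carried out with standard extractor functions. The following is a minimal sketch, assuming a fitted lmertree object named lt with a hypothetical cluster variable named center; the full workflow is given in the supplementary tutorial:

```r
## Sketch of visual misspecification checks for a fitted LMM tree `lt`
## (`lt` and the grouping variable `center` are hypothetical examples).
resids <- residuals(lt)   # residuals from the combined tree + random-effects fit
preds  <- predict(lt)     # fitted values

## Residuals vs. fitted values: look for trends or heteroscedasticity
plot(preds, resids, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

## Rough normality check of the predicted random effects
re <- unlist(ranef(lt)$center)
qqnorm(re); qqline(re)
```

Clear patterns in the residuals, or strongly non-normal random-effects predictions, would suggest that the assumed random-effects structure should be reconsidered.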
As with MOB trees, GLMM trees will perform best when the true data structure is tree-shaped. Trees may need many splits to approximate structures of a different shape, and such trees may be unstable under small perturbations of the data. In the Application, we therefore showed how tree stability can be assessed using the methods developed by Philipp et al. (2016). Furthermore, when relevant partitioning variables are omitted, GLMM trees can only approximate the subgroups using the specified variables. Although our simulations did not involve scenarios in which relevant partitioning variables were omitted, the results of Frick, Strobl, and Zeileis (2014) indicate that MOB trees can detect parameter instability even when covariates are only loosely connected to the group structure.
GLMM trees rely more on user specification of predictor variables than methods like CART and MERT. On the one hand, this increases the danger of model misspecification; on the other hand, it increases power and accuracy if the fixed-effects predictor variable(s) are correctly specified. For example, the best approach for assessing treatment-effect differences with MERTs is not obvious. We tried two approaches in our simulations: first, fitting a single MERT with the treatment indicator as one of the potential partitioning variables and, second, fitting separate MERTs for each level of the treatment indicator. With the first approach, MERTs often failed to select the treatment indicator as a predictor variable in datasets with treatment-subgroup interactions. The second approach yielded better accuracy, indicating that MERTs profit from user specification of the relevant predictor variable(s). The GLMM tree algorithm provides a straightforward approach for including relevant fixed-effects predictor variables in the model, and our results suggest that this yields higher predictive accuracy if the model is correctly specified. However, our simulations did not assess the effect of misspecifying the fixed-effects predictor variable(s), and such misspecification may reduce predictive accuracy.
For recursive partitioning methods that fit local parametric models only, the danger of model misspecification may be limited (e.g., Ciampi, 1991; Siciliano, Aria, & D'Ambrosio, 2008; Su, Wang, & Fan, 2004). GLMM trees, however, fit a global random-effects model in addition to the local fixed-effects regression models and may therefore be less robust against model misspecification. For example, the global random-effects model assumes a single random-effects variance across terminal nodes; if the random-effects variance does in fact vary across terminal nodes, this may negatively affect the performance of the parameter stability tests. Further research on the effects of model misspecification on the performance of GLMM trees is therefore needed.
In the Introduction, we mentioned several existing tree-based methods for treatment-subgroup interaction detection. These methods have different objectives, and there is not yet an agreed-upon single best method. In a simulation study, Sies and Van Mechelen (2016) found the method of Zhang et al. (2012a) to perform best, followed by MOB; however, the method of Zhang et al. performed worst in terms of the type I error rate under some conditions of that study. Further research comparing tree-based methods for treatment-subgroup interaction detection is needed, especially for clustered datasets, as our simulations and comparisons mostly focused on LMM trees and LM trees.
Conclusions
Our results indicate that GLMM trees provide accurate recovery of treatment-subgroup interactions and accurate prediction of treatment effects, both in the presence and in the absence of random effects and interactions. GLMM trees therefore offer a promising method for detecting treatment-subgroup interactions in clustered datasets, for example in multicenter trials or individual patient-level data meta-analyses.
Acknowledgements
The authors would like to thank Prof. Pim Cuijpers, Prof. Jeanne Miranda, Dr. Boadie Dunlop, Prof. Rob DeRubeis, Prof. Zindel Segal, Dr. Sona Dimidjian, Prof. Steve Hollon and Dr. Erica Weitz for granting access to the dataset for the application. The work for this paper was partially done while MF, AZ and TH were visiting the Institute for Mathematical Sciences, National University of Singapore in 2014. The visit was supported by the Institute.
Supplementary material
References
 Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353–7360.
 Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
 Bates, D., Maechler, M., Bolker, B., Walker, S., Christensen, R., Singmann, H., & Green, P. (2017). lme4: Linear mixed-effects models using 'Eigen' and S4 [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=lme4 (R package version 1.17).
 Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231.
 Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. New York: Wadsworth.
 Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park: Sage.
 Ciampi, A. (1991). Generalized regression trees. Computational Statistics & Data Analysis, 12(1), 57–78.
 Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.
 Cooper, H., & Patall, E. A. (2009). The relative benefits of meta-analysis conducted with individual participant data versus aggregated data. Psychological Methods, 14(2), 165.
 Cuijpers, P., Weitz, E., Twisk, J., Kuehner, C., Cristea, I., David, D., & Hollon, S. D. (2014). Gender as predictor and moderator of outcome in cognitive behavior therapy and pharmacotherapy for adult depression: An individual patient data meta-analysis. Depression and Anxiety, 31(11), 941–951.
 Doove, L. L., Dusseldorp, E., Van Deun, K., & Van Mechelen, I. (2014). A comparison of five recursive partitioning methods to find person subgroups involved in meaningful treatment-subgroup interactions. Advances in Data Analysis and Classification, 8, 403–425.
 Driessen, E., Smits, N., Dekker, J., Peen, J., Don, F. J., Kool, S., & Van, H. L. (2016). Differential efficacy of cognitive behavioral therapy and psychodynamic therapy for major depression: A study of prescriptive factors. Psychological Medicine, 46(4), 731–744.
 Dusseldorp, E., & Meulman, J. J. (2004). The regression trunk approach to discover treatment covariate interaction. Psychometrika, 69(3), 355–374.
 Dusseldorp, E., & Van Mechelen, I. (2014). Qualitative interaction trees: A tool to identify qualitative treatment-subgroup interactions. Statistics in Medicine, 33(2), 219–237.
 Dusseldorp, E., Doove, L., & Van Mechelen, I. (2016). Quint: An R package for the identification of subgroups of clients who differ in which treatment alternative is best for them. Behavior Research Methods, 48, 650.
 Fokkema, M., & Zeileis, A. (2016). glmertree: Generalized linear mixed model trees. Retrieved from http://R-Forge.R-project.org/R/?group_id=261 (R package version 0.11).
 Foster, J. C., Taylor, J. M. G., & Ruberg, S. J. (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24), 2867–2880.
 Frick, H., Strobl, C., & Zeileis, A. (2014). To split or to mix? Tree vs. mixture models for detecting subgroups. In COMPSTAT 2014 – Proceedings in computational statistics (pp. 379–386).
 Hajjem, A., Bellavance, F., & Larocque, D. (2011). Mixed effects regression trees for clustered data. Statistics & Probability Letters, 81(4), 451–459.
 Hamilton, M. (1960). A rating scale for depression. Journal of Neurology, Neurosurgery and Psychiatry, 23(1), 56.
 Higgins, J., Whitehead, A., Turner, R. M., Omar, R. Z., & Thompson, S. G. (2001). Meta-analysis of continuous outcome data from individual patients. Statistics in Medicine, 20(15), 2219–2241.
 Hothorn, T., & Zeileis, A. (2015). partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16, 3905–3909. Retrieved from http://www.jmlr.org/papers/v16/hothorn15a.html
 Hothorn, T., & Zeileis, A. (2016). partykit: A toolkit for recursive partytioning [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=partykit (R package version 1.10).
 Kraemer, H. C., Frank, E., & Kupfer, D. J. (2006). Moderators of treatment outcomes: Clinical, research, and policy importance. Journal of the American Medical Association, 296(10), 1286–1289.
 Kuznetsova, A., Brockhoff, P., & Christensen, R. (2016). lmerTest: Tests in linear mixed effects models [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=lmerTest (R package version 2.032).
 Lipkovich, I., Dmitrienko, A., Denne, J., & Enas, G. (2011). Subgroup identification based on differential effect search – A recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in Medicine, 30(21), 2601–2621.
 Martin, D. (2015). Efficiently exploring multilevel data with recursive partitioning (Unpublished doctoral dissertation). University of Virginia.
 Neuhaus, J. M., McCulloch, C. E., & Boylan, R. (2013). Estimation of covariate effects in generalized linear mixed models with a misspecified distribution of random intercepts and slopes. Statistics in Medicine, 32(14), 2419–2429.
 Philipp, M., Zeileis, A., & Strobl, C. (2016). A toolkit for stability assessment of tree-based learners. In Colubi, A., Blanco, A., & Gatu, C. (Eds.), Proceedings of COMPSTAT 2016 – 22nd international conference on computational statistics (pp. 315–325). Oviedo: The International Statistical Institute/International Association for Statistical Computing.
 R Core Team (2016). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from https://www.R-project.org/
 Seibold, H., Zeileis, A., & Hothorn, T. (2016). Model-based recursive partitioning for subgroup analyses. International Journal of Biostatistics, 12(1), 45–63. https://doi.org/10.1515/ijb-2015-0032
 Sela, R. J., & Simonoff, J. S. (2011). REEMtree: Regression trees with random effects [Computer software manual]. (R package version 0.90.3).
 Sela, R. J., & Simonoff, J. S. (2012). RE-EM trees: A data mining approach for longitudinal and clustered data. Machine Learning, 86(2), 169–207.
 Siciliano, R., Aria, M., & D'Ambrosio, A. (2008). Posterior prediction modelling of optimal trees. In COMPSTAT 2008 (pp. 323–334). Heidelberg: Physica-Verlag HD.
 Sies, A., & Van Mechelen, I. (2016). Comparing four methods for estimating tree-based treatment regimes. (Submitted).
 Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14(4), 323.
 Su, X., Wang, M., & Fan, J. (2004). Maximum likelihood regression trees. Journal of Computational and Graphical Statistics, 13(3), 586–598.
 Su, X., Tsai, C.-L., Wang, H., Nickerson, D. M., & Li, B. (2009). Subgroup analysis via recursive partitioning. The Journal of Machine Learning Research, 10, 141–158.
 Zeileis, A., & Hornik, K. (2007). Generalized M-fluctuation tests for parameter instability. Statistica Neerlandica, 61(4), 488–508.
 Zeileis, A., Hothorn, T., & Hornik, K. (2008). Model-based recursive partitioning. Journal of Computational and Graphical Statistics, 17(2), 492–514.
 Zhang, B., Tsiatis, A. A., Davidian, M., Zhang, M., & Laber, E. (2012a). Estimating optimal treatment regimes from a classification perspective. Stat, 1(1), 103–114.
 Zhang, B., Tsiatis, A. A., Laber, E. B., & Davidian, M. (2012b). A robust method for estimating optimal treatment regimes. Biometrics, 68(4), 1010–1018.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.