Subgroup identification in individual participant data meta-analysis using model-based recursive partitioning

Model-based recursive partitioning (MOB) can be used to identify subgroups with differing treatment effects. The detection rate of treatment-by-covariate interactions and the accuracy of identified subgroups using MOB depend strongly on the sample size. Using data from multiple randomized controlled clinical trials can overcome the problem of too small samples. However, naively pooling data from multiple trials may result in the identification of spurious subgroups as differences in study design, subject selection and other sources of between-trial heterogeneity are ignored. In order to account for between-trial heterogeneity in individual participant data (IPD) meta-analysis random-effect models are frequently used. Commonly, heterogeneity in the treatment effect is modelled using random effects whereas heterogeneity in the baseline risks is modelled by either fixed effects or random effects. In this article, we propose metaMOB, a procedure using the generalized mixed-effects model tree (GLMM tree) algorithm for subgroup identification in IPD meta-analysis. Although the application of metaMOB is potentially wider, e.g. randomized experiments with participants in social sciences or preclinical experiments in life sciences, we focus on randomized controlled clinical trials. In a simulation study, metaMOB outperformed GLMM trees assuming a random intercept only and model-based recursive partitioning (MOB), whose algorithm is the basis for GLMM trees, with respect to the false discovery rates, accuracy of identified subgroups and accuracy of estimated treatment effect. The most robust and therefore most promising method is metaMOB with fixed effects for modelling the between-trial heterogeneity in the baseline risks.


Introduction
For precision medicine it is crucial to identify subgroups which benefit differently from a specific drug or are exposed to particular harm. When subgroups are defined by single biomarkers or combinations of biomarkers, different treatment effects in subgroups imply treatment-by-biomarker interactions. We use the term biomarker here for not just genetic biomarkers, but also by other baseline patient characteristics as e.g. demographic variables (Biomarker 2001). The biomarkers involved in treatment-byinteractions are called predictive biomarkers, whereas prognostic markers predict the course of a disease. Several statistical approaches for identifying subgroups with differential treatment effects were proposed, e.g. review papers of Lipkovich et al. (2017), Ondra et al. (2016). For subgroup identification methods, specifically tree-based methods, one of the important factors influencing the performance is sample size as has been shown in Sies and Van Mechelen (2017), Alemayehu et al. (2018), Huber et al. (2019). Data from one randomized controlled study are therefore often not sufficient for the identification of subgroups. The sample size can be increased by pooling data from multiple studies investigating the same treatment or intervention. Although we consider randomized controlled trials here, the interest in identifying interactions of biomarkers and interventions is potentially wider, e.g. randomized experiments with participants in social sciences or preclinical experiments in life sciences. For instance, Patel et al. (2016) developed a repository of individual participant data (IPD) consisting of 19 randomized controlled trials in order to investigate whether some patient populations suffering from low back pain benefit differently from treatment. Cuijpers et al. (2007) conducted an IPD meta-analysis consisting of sixteen studies to investigate the effect of active scheduling as behavioural treatment of depression. Analysing differential effects in subgroups was also part of their research. Another example of an IPD is the International Weight Management in Pregnancy (i-WIP) database. The i-WIP Collaborative Group collected data from 36 randomized controlled trials in order to investigate the overall and differential effects of interventions based on diet and physical activity during pregnancy, primarily on gestational weight gain and maternal and offspring composite outcomes (The International Weight Management in Pregnancy (i-WIP) Collaborative Group 2017). Differences in the study designs, study populations, quality of the studies, choice of comparator intervention or other study-specific influences can cause heterogeneity in baseline (control group) outcomes as well as in treatment effect sizes from one study to another. We discuss these two (very different) types of heterogeneity in turn. One approach to address the heterogeneity in the baseline is to use models with stratified intercepts which estimates a separate intercept for each trial. Alternatively, the study intercepts can be considered as random assuming a suitable distribution (e.g. normal). Whereas the approach with stratified intercept does not have to make any assumptions regarding the distribution of intercepts across studies, a (parametric) random intercept approach needs fewer parameters to be estimated (Legha et al. 2018). Incorporating between-study heterogeneity in the treatment effect into the modelling can be achieved by using random-effect models. Various types of random-effect models have been suggested for aggregated meta-analysis. Jackson et al. (2018) give an overview of random-effect models in meta-analysis for binary outcome data. Between-study variations of the treatment effect are addressed by assuming the treatment effects to be normally distributed in all of the models presented by Jackson et al. (2018). Variations of the models arise due to different assumptions regarding the heterogeneity in the baseline risks or the independence of the random treatment and random baseline effects. In IPD meta-analysis models one-stage approaches using generalised mixed effect models are increasingly used for the analysis (Simmonds et al. 2015). One-stage models do not need to aggregate the IPD in a first step in order to analyse them with an appropriate meta-analysis models in the second step. Legha et al. (2018) focus on different approaches regarding the baseline outcomes for modelling the between-trial heterogeneity in one-stage IPD meta-analysis models. Although, it is not the main objective of Kontopantelis, one-stage models modelling the heterogeneity in the baseline differently are also compared in Kontopantelis (2018).
For subgroup identification methods some approaches have been suggested in order to account for trial heterogeneity. However, these approaches are not sufficiently flexible and they use rather simplistic models in order to account for between-trial heterogeneity. Subpopulation treatment effect pattern plot for meta-analysis (Meta-STEPP) (Wang et al. 2016) for example is a method developed for investigating treatment heterogeneity across one continuous covariate only. Meta-STEPP accounts for variations of the treatment effect across trials by two different approaches: A fixed-effect (Wang et al. 2016) and a random effect (Wang et al. 2018) meta-analysis approach. Nevertheless the investigation of multiple covariates at a time and therefore, the detection of possible multi-way interactions with the treatment indicator are not possible in this framework. For tree-based subgroup identification methods some approaches to account for between-study heterogeneity are available. Accounting for between-study heterogeneity in tree-based methods is crucial as ignoring heterogeneity can lead to the identification of spurious splitting variables and therefore spurious subgroups as shown by Sela and Simonoff (2012).
For SIDES (Lipkovich et al. 2011) and model-based recursive partitioning (MOB) (Seibold et al. 2016;Zeileis et al. 2008) extensions have been proposed, which adapt the algorithm by Sela and Simonoff (2012), allowing to account for heterogeneity, see Mistry et al. (2018), Fokkema et al. (2018), respectively. However, both investigate a simpler model assuming heterogeneity in the baseline only. Systematic combinations of tree-based subgroup identification methods and meta-analysis models which are adequately complex and flexible have not been investigated yet.
In this work, we introduce metaMOB which combines commonly made assumptions in meta-analysis models with generalized linear mixed model-tree (GLMM-tree) algorithm by Fokkema et al. (2018). We investigate metaMOB's performance compared to the original MOB (Seibold et al. 2016;Zeileis et al. 2008) and to the GLMM-tree investigated in Fokkema et al. (2018) in an extensive Monte-Carlo simulation study using a structured approach (Benda et al. 2010). Furthermore, we evaluate the impact of using different assumptions on the baseline effect for metaMOB.
The remainder of this paper is organized as follows. In Sect. 2 we outline the subgroup identification method MOB and its extension GLMM-trees. Moreover, we present different mixed-models appropriate for IPD meta-analysis in Sect. 2.1. In Sect. 2.2 we introduce metaMOB. In the following section we compare the performance of two variations of metaMOB to MOB and the GLMM-tree investigated by Fokkema et al. (2018) in a simulation study. Finally, we discuss advantages and limitations of the investigated methods in Sect. 4.

Methods
We consider IPD data from k = 1, . . . K randomized controlled trials investigating the same experimental treatment against the same control on an outcome variable Y . We assume that each of these trials has n k participants. The outcome of participant i of trial k is denoted by y ik and t ik refers to the observed treatment group. Furthermore, the IPD data consists of p baseline covariates denoted by X 1 , . . . , X p ; These covariates are only used for partitioning. In addition we assume that each participant is included in one trial only.
The treatment variable T takes the value 1 for patients in the experimental treatment group and 0 for patients in the control group. Seibold et al. (2016) use the algorithm of MOB (Zeileis et al. 2008) for exploratory subgroup identifications. MOB assumes that in presence of subgroups there is no single global model fitting the data well. Therefore, MOB partitions the dataset with respect to some covariates X 1 , . . . , X p in order to improve the model fit. For MOB a model for modelling the outcome has to be defined first. Using generalized linear models (GLM), we model the expected outcome E(y ik ) of a patient given the treatment indicator t ik through a linear predictor and a suitable link function g(·) assuming that all observations are drawn from the same population.
Partitioning the data for improving the data fit in presence of subgroups leads to separate GLMs in each obtained subgroup j ( j = 1, . . . , J ): Since we assume that the subgroup structure is not known, it is not possible to simply estimate the parameters γ j , the intercept term of subgroup j, and the treatment effect θ j of subgroup j. MOB grows a tree in order to find local models (see M 0 ) which fit the data better. MOB partitions the data when the model parameters of M 0 are not stable over the considered covariates X. The instability of the model parameters are assessed using the M-fluctuation by Zeileis and Hornik (2007) or a permutation test when the assumption of normally distributed partial scores may not hold. More precisely the MOB tree is obtained using the following algorithm: 1 Start with all observations included in the root node 2 Fit model M 0 to all observations in the given node by estimating the model parameters via minimizing the objective function ((Y , X ), γ , θ ) (e.g. the negative log-likelihood) 3 Asses parameter instabilities with respect to X 1 , . . . , X p by using the M-fluctuation test proposed by Zeileis and Hornik (2007) or permutation test. The tested hypotheses are with ψ γ and ψ θ denoting the partial score functions of γ and θ , respectively. 4 If at least one of the 2 × p null hypotheses can be rejected at a pre-specified nominal level (using Bonferroni multiplicity adjustment), select covariate X j * with the lowest p-value as splitting variable. 5 Compute the split point for the chosen variable X j * by optimizing the sum of the objective functions of the conceivable subsets. 6 Repeat Steps 2 to 5 until none of the hypotheses in Step 3 can be rejected or some other stopping criteria (e.g. minimum number of observations in a node) is met.
Since MOB was developed to identify non-overlapping subgroups based on data from one single study, MOB ignores the clustered structure of the data. Fokkema et al. (2018) proposed an algorithm extending MOB in order to allow the assumption of heterogeneity between clusters. The extended algorithm called GLMM-trees uses mixed models in order to model data from K studies. In contrast to GLMs (as used in MOB) assuming independent and identically distributed observations, mixed models assume that observations between studies are independent but observations within each study are correlated. The algorithm of GLMM-trees allows to detect treatmentby-subgroup interactions and non-linearities in generalized linear mixed-effect models (GLMM). Fokkema et al. (2018) restricted their analyses to GLMMs with clusterspecific random intercepts fitting the following model in each subgroup j obtained by partitioning the data: The centred random intercepts b k are normally distributed b k with expectation 0 and variance τ 2 0 . GLMM-trees using model M 1 are referred to as MOB-RI in the following.
A random-intercept model allows only the baseline outcome to vary from study to study whereas the treatment effects are assumed to be the same across studies. This assumption corresponds to a common-effect meta-analysis model which is also known under the term fixed-effect meta-analysis model. The assumption of estimating the same true treatment effect in every trial is a strong assumption (Jackson et al. 2018). Therefore, the random-effects model is more plausible. Random-effects models for meta-analyses are outlined in Sect. 2.1.

GLMMs for meta-analysis
Heterogeneous treatment effects are usually expected in meta-analysis. However, this assumption has not been evaluated in the context of GLMM-trees so far. Several models for random-effect meta-analyses have been suggested. Jackson et al. (2018) examined seven models for random-effect meta-analyses with binary outcome using aggregated data. Legha et al. (2018) compare two approaches for one-stage IPD metaanalysis: random-effect models with either random or stratified intercepts. These two approaches for modelling the baseline are also considered in Jackson et al. (2018), Kontopantelis (2018). The random-effects model with stratified intercepts is defined as follows: The baseline for trial k is denoted by γ k and is assumed to be fixed. The model parameter θ k describes the treatment effect in trial k which is assumed to be normally distributed with mean θ and a between-trial variance τ 2 1 . The model for meta-analysis accounting for both heterogeneity in the treatment effect and in the baseline by using random effects is defined in Eq. 3: The centered random effects b 0k and b 1k are assumed to be normally distributed. As in most models considered in Jackson et al. (2018) we assume that the random effects b 0 and b 1 are independent.

metaMOB
As for MOB, metaMOB assumes that in the presence of subgroups the parameters θ and γ of Eqs. 3 and 2 may not describe the data well. Therefore, the data is partitioned in J disjoint subgroups by using a tree model. Using the model described in Eq. 3 which is analogous to the Simmonds and Higgins' model with random study-specific effects (Jackson et al. 2018), we fit the following model to each subgroup j identified by the algorithms described in Sect. 2.3: where b 1k ∼ N (0, τ 2 1 ) and all b 1k are independent. As for M 1 we assume that the fixed effects γ and θ are subgroup specific, also referred to as local parameters, whereas the random effect part is assumed to be global, meaning that the random effects are the same across the identified subgroups.
The model fitted in each partition j assuming fixed baseline effects in each study is defined as: where the fixed-effects γ jk and the mean treatment effect θ j are assumed to be subgroup specific. For model M 3 the number of parameters to be estimated increases with an increasing number of identified subgroups and increasing number of trials. The method assuming M 2 to be the underlying model for the tree growing procedure is called metaMOB-RI, as a random intercept is assumed for the baseline effects. The tree growing procedure using the model with a stratified intercept, M 3 , is referred to as metaMOB-SI in the following.
The subgroup definitions identified by metaMOB are assumed to apply to all trials. However, depending on the choice of the underlying model, different treatment and baseline effects are estimated within subgroups for different trials.

Algorithm
Models M 0 to M 3 can be represented by the model equation where b is a vector of random effects, vector z T i is the i-th row of the design matrix Z for the random effects, vector a i T is the i-th row of the design matrix A for the fixed effects. The coefficient vector of the fixed effects is denoted by Vector z i is a vector of length 2K , indicating to which study patient i belongs. Subject i enrolled in the active treatment arm of study k has the value 1 at position k and K + k. A subject i enrolled in the placebo arm of study k is a unit vector with value one at position k. And vector a i of Eq. 4 is of length 2K with value 1 at position k and the value indicating the treatment patient i was assigned to at position K + k.
The vectors b, a i and z i depend on the chosen model (M 0 -M 3 ). The algorithm for MOB-RI, metaMOB-RI and metaMOB-SI is outlined in the following.
1 Set r = 0 and all valuesb (r ) to 0. 2 Set r = r +1. Estimate a GLM tree using z T ib (r −1) as an offset. The random effects part is treated as offset because it is assumed to be equal across all subgroups and it is treated as known.
4 Repeat (2) and (3) until the log-likelihood of the model in the previous steps converges. This is usually the case, when the estimated tree does not change from previous iterations.

Simulation study
In Sect. 2 we outlined four models used for the subgroup identifications procedure MOB, GLMM-tree and metaMOB. Model M 0 corresponds to a simple GLM ignoring the clustered structure of the data. MOB as proposed by Seibold et al. (2016) aims at achieving a better model fit by fitting Model M 0 to each resulting partition. The tree-methods accounting for clustered data by assuming M 1 , M 2 and M 3 to be the underlying models for the algorithm in Sect. 2.3 are referred to as MOB-RI, metaMOB-RI and metaMOB-SI, respectively. In our simulation study using the Clinical Scenario Evaluation (CSE) framework (Benda et al. 2010) we compared the performance of MOB, MOB-RI, metaMOB-RI and metaMOB-SI in different IPD settings described in Sect. 3.1. CSE structures the simulation study into three parts, namely assumptions, options and metrics. Dmitrienko and Pulkstenis (2017) use a different terminology for the three components. The three components are named data models, analysis models and evaluation models. Assumptions define the data generation process, options describe the methods applied to the data and metrics specify the criteria for evaluating the methods applied to the data. The performance is assessed by different measures presented in Sect. 3.3. The computational details and the chosen tuning parameters for the algorithms are given in Sect. 3.2.

Assumptions: simulation settings
Each dataset generated for our simulation study consists of a continuous response Y , a binary treatment variable T and 15 covariates for each subject i = 1, . . . , n k of k = 1, . . . , K trials. For simplicity, we assume that each trial has the same number of observations n 1 = . . . = n K . The total number of observations is denoted by N . The treatment indicator T is either 0 or 1, each with a probability of 0.5. Adapted from Dusseldorp et al. (2010), Fokkema et al. (2018) the 15 covariates X 1 ,…,X 15 are drawn from a multivariate normal distribution with μ X 1 = 10, μ X 2 = 30, μ X 4 = −40 and μ X 5 = 70. The other means are drawn from a discrete uniform distribution on the interval [−70, 70]. The variance of all X p is set to σ 2 X p = 100. Furthermore, all 15 covariates are correlated with ρ = 0.3. The outcome Y is generated by The values for the varied parameters are presented in Table 1. For each setting resulting from combining these parameters 2000 datasets are generated.
We investigated different data-generating models by varying the functional relationship of the outcome and the covariates f (·) as defined in Table 2. The Null model is used for investigating the false discovery rate. Sim A or Sim B are based on the tree structure in Fig. 1 adapted from Dusseldorp et al. (2010) and Fokkema et al. (2018). Heterogeneity in baseline τ 0 0, 5, 10 Heterogeneity in treatment τ 1 0, 2.5, 5, 10 Correlation between b 0 or b 1 and one of the X p variables b 0 (or b 1 ) and all X p uncorrelated, b 0 (or b 1 ) correlated with one of the splitting variables (correlation r ≈ 0.42), b 0 (or b 1 ) correlated with one of the non-splitting variables (correlation r ≈ 0.42) is given by the tree structure presented in Fig. 1 with γ j1 = . . . = γ j K = μ γ j , for j = 1, . . . , 4 Sim B f (·) is given by the tree structure presented in Fig. 1 with intercepts drawn from normal distributions: The true data-generating model for settings with γ j1 = . . . = γ j K = μ γ j is either M 0 , M 1 or M 2 depending on the values of τ 2 0 and τ 2 1 . Drawing the overall mean for each study within a subgroup from a normal distribution corresponds to M 3 being the data generating model. Settings using Fig. 1 with equal intercept across studies within one subgroup are going to be referred to as Sim A and settings with studyspecific intercepts will be referred to as Sim B, respectively. Furthermore, it has to be noted that for evaluating the false discovery rate using the Null model, all X p are nonsplitting variables. Therefore, b 0 or b 1 are either uncorrelated with all of covariates, or correlated with a non-splitting covariate.

Options: methods and their tuning parameters
For the computation we used the software environment R (Core 2020). Version 1.2-3 of the R package partykit (Hothorn and Zeileis 2015) was used for growing MOB described in Sect. 2. For the MOB-RI and the two variations of metaMOB we used Version 0.1-2 of the R package glmertree. Since our simulations study focuses on a continuous outcome the functions used for the tree algorithms are called lmtree for MOB and lmertree for MOB-RI and metaMOB. The stopping criteria were set to 5% level of significance for the test in the splitting criterion and a minimum number of 20 observations in the terminal nodes. Furthermore, we set the tolerance in the argument check.conv.grad of lme4 (Version 1.1-21) to a five times higher value  2018) for settings Sim A and Sim B. X 1 ,X 2 and X 5 denote the covariates defining the subgroups. θ denotes the true fixed treatment effects and γ the true fixed intercepts. μ γ represents the mean of the intercepts which are drawn from a normal distribution in Setting Sim B than the default. The glmertree package uses lme4 for step 3 of its algorithm (see Sect. 2.3). The mixed models fitted for MOB-RI and metaMOB using lme4's default value have a poor convergence. Eager and Roy (2017) argue that convergence issues with the max|grad| tolerance arising from the preset tolerance in the check.conv.grad argument is the least problematic of convergence errors. Moreover, they state that in practice many researcher tend to ignore these convergence errors if no other convergence issues are present. For metaMOB and MOB-RI we use REML estimation for fitting the linear mixed effect models (Step 3 of the algorithm, see Sect. 2.3). The formulas used in the functions lmtree or lmertree for the four algorithms are given below: -MOB: y ∼ factor(trt) | x1+ ... +xp -MOB-RI: y ∼ factor(trt) | trial| x1+ ... +xp -metaMOB-RI: y ∼ factor(trt) | (1|trial)+(trt-1|trial)| x1+ ... +xp -metaMOB-SI: y ∼ factor(trial)+factor(trt) | (trt-1|trial)| x1+ ... +xp trt is the vector including the treatment indicator, trial is a vector indicating to which trial an observation belongs and x1, ...+xp are the vectors of the p covariates considered as potential predictive markers. The simulation study can be reproduced using the supplementary online material.

Metrics: performance criteria
The performance of the methods was assessed by different measures as false discovery rate, number of identified subgroups, tree accuracy and the correlation of the true and estimated treatment effects in the identified subgroups. Furthermore, we evalu-ate the computation time of the four algorithms and convergence problems. These performance criteria are explained in more detail below. The frequency of times the algorithm identified more than one final subgroup although none was present is referred to as false discovery rate. The false discovery rate is also often referred to as Type I error rate, although strictly speaking no test is performed. The null hypothesis, that no treatment-covariate interaction is present, is rejected when a tree with at least one split is identified. The number of identified subgroups corresponds to the number of terminal nodes. The tree accuracy takes the number of identified subgroups into account and additionally evaluates the selection of the splitting covariate and the selected cut-off value. A tree is considered to be accurate if the identified tree has the correct number of terminal nodes, all splitting variables are selected correctly and when the selected cut-off values for the split denoted by c are in the interval c ± 5 with 5 corresponding to the population standard deviation.

Convergence and computation time
The GLMM-algorithm described in Sect. 2.3 is an iterative algorithms which stops when the difference of the log-likelihood criterion of the corresponding mixed-effect model M 1 , M 2 orM 3 from two consecutive iterations is below a threshold. This is the case when trees of subsequent iterations do not change. MOB-RI, metaMOB-RI and metaMOB-SI always converged within 2 to 3 iterations steps within our simulation study. As threshold for the convergence criterion we used the default value of the glmertree R-package, namely abstol = 0.001.
Convergence problems of MOB-RI and metaMOB are due to fitting linear mixed-effects regression models (M 1 ,M 2 or M 3 ) in each iteration step. The lme4 R-package reports two different convergence warnings for MOB-RI, metaMOB-RI and metaMOB-SI. The most frequent of the convergence warnings, more than 80% of all the warnings in our simulations, is the warning Eager and Roy (2017) assume to be the least problematic. Another problem for fitting the models M 1 , M 2 and M 3 in the simulation study arises due to the evaluation of the scaled gradient. However, convergence problems were rare in the simulation study, see Table 3. The largest number of obtained convergence warnings across the three MOB algorithms accounting for clustered data was obtained by metaMOB-RI which uses model M 2 and therefore estimates the variance component of two random effects. In practise, researches often simplify their random effect structure in the presence of convergence problems. If metaMOB-RI's underlying mixed-model is the model of choice and researches are faced with convergence problems, metaMOB-SI should be used instead. This is also recommended by Kontopantelis (2018) for IPD meta-analysis one-stage models. Further operation characteristics as for example the tree accuracy are based on cases without convergence warnings.
All simulations are conducted using R 3.6.0 (64 bit) on the GWDG (Gesellschaft fuer wissenschftliche Datenverarbeitung mbH Goettingen) High Performance Cluster (Intel Cascade Lake-based systems 2 × 48 cores, 192 GB RAM), located in Goettingen, Germany. The computation time of MOB is the smallest, since the data partition is only performed once. The mean computation time for MOB is ∼ 2 s. MOB-RI and metaMOB-RI need ∼ 5 s on average for growing a tree adjusting for between-study heterogeneity and metaMOB-SI needs ∼ 7 s for growing a tree. The computation times for all methods seem reasonable. The maximum calculation time was 215 s using metaMOB-RI.

False discovery rate
The false discovery rate is assessed by using the Null model for data generation. When the random effects b 0 or b 1 are not correlated with one of the covariates the false discovery rates lie below 0.055 for all the considered settings and methods. False discovery rates slightly larger than 0.05 are mainly observable for settings including 2000 observations in total. Benefits of using methods accounting for the between trial heterogeneity as MOB-RI, metaMOB-RI and metaMOB-SI are observable in the false discovery rates for settings in which the random effects b 0 or b 1 are correlated with one of the splitting candidates. Figure 2 shows, that MOB's false discovery rate increases with increasing variance of the random intercept τ 2 0 , whereas MOB-RI, metaMOB-RI and metaMOB-SI's false discovery rates are smaller when no treatment heterogeneity between studies is present. When the variance of both random intercept and random slope are unequal to zero resulting in M 2 being the true underlying model, the use of MOB-RI results in higher false discovery rates (see Fig. 2) since it does not adjust for heterogeneity in the treatment effect. A correlation of the random slope b 1 with one of the covariates additionally adds to spurious detections of subgroups using MOB-RI. This is depicted in Fig. 3. metaMOB-RI's and metaMOB-SI's false discovery rates are not strongly influenced by the presence of random effects or a correlation of random effects with a splitting candidate. For the two variations of metaMOB we mainly observe false discovery rates close to 0.05.

Tree accuracy
In settings using Sim A as data generating model, the accuracy of MOB-RI and MOB deteriorates with increasing heterogeneity in the treatment effect if the heterogeneity of the treatment effect is correlated with one of the splitting candidates (see Fig. 4). The accuracy of the trees obtained by MOB or MOB-RI is slightly larger in settings in which the treatment heterogeneity is correlated with X 1 , X 2 or X 5 , one of the subgroup-defining covariates, compared to settings in which the treatment heterogeneity is correlated with a non-splitting variable. Without b 1 and X being correlated (see first column of Fig. 4), no difference between the four methods with regard to tree accuracies is observable. For uncorrelated b 1 and X, but X and b 0 correlated, accuracies much smaller than one are only observable for for MOB. MOB-RI adjusts for between trial heterogeneity in baseline characteristics using a random intercept. Therefore, smaller accuracies for MOB-RI are only observable in settings with τ 2 1 = 0 and X and b 0 correlated (results not shown). The deterioration of the accuracy with increasing variance of the treatment heterogeneity using MOB and MOB-RI is mainly due to estimated trees with too many splits. It seems that MOB and MOB-RI, which do not adjust for heterogeneity in the treatment effect, try to capture this heterogeneity which is linked to covariates by further splitting the data. However, the additional splits are not necessarily performed on covariates correlated to the between-trial heterogeneity.
Using M 3 as data generating model as is Sim B, we observe that metaMOB-SI which is using the correct model identifies trees closest to the true underlying tree (see Fig. 5). However, in settings without between-trial heterogeneity in the treatment effect or without correlation of the random treatment effect and one of the potential splitting covariates, the mean accuracies of MOB, MOB-RI and metaMOB-RI are not much smaller than the tree accuracy of metaMOB-SI. Larger variations in trial and subgroup specific intercepts as the presence of between-trial heterogeneity in the treatment effect result in smaller tree accuracies of MOB, MOB-RI and metaMOB-RI due to their model misspecification. Although metaMOB-RI accounts for heterogeneity in both, the baseline and the treatment effect, the assumption for the baseline is not flexible enough for this setting. The underlying model of metaMOB-RI assumes that the between-trail heterogeneity of the baseline effects is the same across all subgroups, which is not the case in Sim B.

Correlation of the estimated and true individual treatment effect
For the correlation of the estimated and true individual treatment effect the treatment effects are calculated on a test sample. For this calculation, the patients of the test data are assigned to a subgroup according to the grown tree. Based on this subgroup assignment model M 0 to model M 3 are fitted. The model is chosen according to the MOB algorithm used for growing the tree. The true treatment effect in the estimated subgroups is calculated by a weighted mean of the true treatment effect θ . The weights are based on the number correctly assigned patients to each of the true subgroups. Simulations runs which failed to grow a tree, runs with convergence problems of M 1 to M 3 on the test data and runs in which not all identified subgroups are present in the test data are omitted for the calculation of the mean correlations of the settings presented in Fig. 6. Figure 6 shows that MOB and MOB-RI's correlation of true and estimated treatment effect decreases when random effects are misspecified by MOB's underlying model. This is the case in settings with τ 2 1 = 0 for both MOB and MOB-RI and in setting with τ 2 0 = 0 for MOB only. For both metaMOB-RI and metaMOB-SI, which have more general models underlying the tree algorithm, correlations close to 1 are observable throughout the considered settings shown in Fig. 6. The results presented in Fig. 6 consider a larger absolute treatment effect difference, e.g. standardized treatment effects |θ 1 /σ | = |θ 3 /σ | = 1 with σ = 5. As the probability of identifying the correct treatment-by subgroup interaction decreases with decreasing treatment-by subgroup interaction effects Alemayehu et al. 2018;Huber et al. 2019, we can also expect smaller correlations of true and estimated treatment effects in the identified subgroups similar to the results in Fokkema et al. (2018).

Discussion
We proposed metaMOB, a tree based method, allowing to identify treatment-by subgroup interactions while accounting for between-trial heterogeneity in IPD-metaanalysis settings. The method metaMOB falls into the broad class of GLMM-trees introduced by Fokkema et al. (2018) and is applicable in IPD-meta-analyses with larger number of trials. GLMM-trees allow to estimate random effect parameters and to apply recursive partitioning to the data as proposed by Zeileis et al. (2008). The approach used in the algorithm for GLMM-trees was also adapted in PALM-trees Seibold et al. (2018) which allows to estimate further fixed effects which are constant over all groups instead of random effects. GLMM-trees were only investigated for a simpler mixed model, namely a random intercept model, underlying the partitioning. We referred to the investigated GLMM-tree as MOB-RI, model-based recursive partitioning with a random intercept. In a simulation study, we evaluated the performance of MOB, MOB-RI and metaMOB in IPD meta-analysis settings with continuous response. We considered two variations of metaMOB, namely metaMOB-RI and metaMOB-SI. Both methods model between-trial variations of the treatment effect using random effects as commonly done in random-effects meta-analysis. However, metaMOB-RI and metaMOB-SI differ in their assumptions regarding the baseline effects. One of the models assumes the between-trial heterogeneity in the baseline effect to be fixed. Therefore, the method uses a stratified intercept approach (metaMOB-SI), whereas the other model assumes the heterogeneity in the baseline effect to be random as well (metaMOB-RI). The simulation study showed that all four considered methods perform similarly well in terms of false discovery rate and identifying correct subgroups, when no between-trial heterogeneity is present or when between-trial heterogeneity is independent from potential splitting candidates and can therefore not be explained by any of the covariates. However, we believe it is reasonable to assume that heterogeneity between trials is linked to one or more patient-level covariates, since the composition of the trial populations in terms of patient characteristics contributes to the heterogeneity between trials. For those settings, the simulation study showed that misspecifying the structure of the between-trial heterogeneity results in less accurate trees and high false discovery rates. The misspecified between-trial heterogeneity also affects the estimated treatment effects in the identified subgroups. The correlation of the estimated and true treatment effect in the identified subgroups decreases when we model the data with simpler models compared to the more complex true model. As assumptions regarding treatment effect heterogeneity between trials and a correlation between heterogeneity and covariates seem reasonable in IPD meta-analysis, we can conclude that MOB and the MOB-RI are not the best option for subgroup identification in IPD meta-analysis settings. The method metaMOB-SI showed the best performance with regard to false discovery rate, accuracy of identified subgroups and estimated treatment effect throughout all considered simulation settings. Compared to metaMOB-RI, metaMOB with stratified intercepts has also less convergence problems as only one random effect has to be estimated. Commonly, one of the main interest in random-effects meta analysis is obtaining an unbiased summarized treatment effect based on several studies by accounting for between-trial heterogeneity, which is done by both metaMOB-RI and metaMOB-SI. However, metaMOB-SI's underlying model does not impose a constraint on the baseline effects as it is done by metaMOB-RI's underlying model. Therefore, metaMOB-SI is more flexible. It has to be kept in mind, however, that the number of parameters which have to be estimated for metaMOB-SI increases with increasing number of trials and also with increasing number of identified subgroups. This might lead to inconsistent estimators as described by the Neyman-Scott problem (Neyman and Scott 1948).
Therefore, the number of subgroups which can be identified by metaMOB-SI is strongly restricted for meta-analysis with smaller sample sizes and numerous included studies.
The analysis of data from few trials, however, may cause difficulties in estimating the between-trial variance. Especially higher fractions of estimates of the between-trial heterogeneity in the treatment effect (τ 1 ) equal to zero in meta-analyses with few trials (Friede et al. 2017) may negatively affect subgroup identification using MOB-RI or metaMOB. Further research on the performance of metaMOB for IPD-meta-analysis settings with few trials is needed. Therefore, investigators should aim to include larger numbers of trials for the application of metaMOB as suggested by the simulation results.
Although we considered continuous outcomes only in our simulation study, the proposed approach is applicable more widely including binary endpoints, since the implementation by Fokkema et al. (2018) is based on generalized mixed effect models. Therefore, metaMOB might also be applicable to discretely measured time-to-event outcomes as the log-likelihood of a discrete time-to-event model and a regression model for a binary outcome are equivalent (Schmid et al. 2016). However, the assessment of the properties of metaMOB with discrete time-to-event endpoints is subject to future research. SIDES for IPD meta-analysis (Mistry et al. 2018) is similar to MOB-RI regarding the assumptions of the between-trial heterogeneity. Both account for between-trtial heterogeneity in the baseline only. Both, SIDES for IPD meta-analysis and the GLMM-tree framework, however, do not account for between-trial heterogeneity in the treatment effect within identified subgroups. It is assumed that the between-trial heterogeneity accounted for by random effects is constant across identified subgroups. As between-trial variations of the treatment-by subgroup interactions are reasonable assumptions research for identifying subgroups in such settings using tree-based methods is needed.