Abstract
Background
In metaanalyses (MA), effect estimates that are pooled together will often be heterogeneous. Determining how substantial heterogeneity is is an important aspect of MA.
Method
We consider how best to quantify heterogeneity in the context of individual participant data metaanalysis (IPDMA) of binary data. Both two and onestage approaches are evaluated via simulation study. We consider conventional I ^{2} and R ^{2} statistics estimated via a twostage approach and R ^{2} estimated via a onestage approach. We propose a simulationbased intraclass correlation coefficient (ICC) adapted from Goldstein et al. to estimate the I ^{2}, from the onestage approach.
Results
Results show that when there is no effect modification, the estimated I ^{2} from the twostage model is underestimated, while in the onestage model, it is overestimated. In the presence of effect modification, the estimated I ^{2} from the onestage model has better performance than that from the twostage model when the prevalence of the outcome is high. The I ^{2} from the twostage model is less sensitive to the strength of effect modification when the number of studies is large and prevalence is low.
Conclusions
The simulationbased I ^{2} based on a onestage approach has better performance than the conventional I ^{2} based on a twostage approach when there is strong effect modification with high prevalence.
Similar content being viewed by others
Background
Metaanalysis (MA) is a statistical method used to draw an overall conclusion based on the total evidence by reviewing previous research work systematically and pooling effect estimates together [1]. MA is an important tool, widely used, and applied in evidencebased medicine [2].
Individual participant data metaanalyses (IPDMA), collect line by line participant data from each included study, rather than estimates of the parameter of interest. IPDMA offer several advantages over aggregate data MA (ADMA) and are considered the gold standard in metaanalytic techniques [3].
Heterogeneity of effect estimates is an important consideration in both ADMA and IPDMA. Heterogeneity exists if the true effects vary across studies more than would be expected by chance alone. The estimated interstudy variance (τ ^{2}) of the parameter of interest is the most direct measure of heterogeneity, but interpretation, particularly deciding what might be a problematic level of heterogeneity, is difficult, despite some practical suggestions [4]. The I ^{2}, originally proposed by Higgins and Thompson, meets three important criteria for any measure of heterogeneity: it monotonically increases with betweenstudy variance; it is not varied by changing the scale; and it is not affected by the number of studies [5]. Importantly, despite some limitations [4], the I ^{2} remains the most often reported measure of heterogeneity and is easily interpretable, appealing to clinicians.
There are two approaches to analyze the data from IPDMA: the twostage approach and the onestage approach. In the twostage approach, each study is analyzed separately, then standard metaanalytic techniques are applied, and heterogeneity may be quantified by usual methods. Alternatively, in the onestage approach, a mixed model is fit and the data is analyzed altogether, accounting for the correlation that may exist between subjects in the same study and allowing the estimated effect to vary across studies. A review of statistical methods used in IPDMA of binary outcomes found that most do not report any measure of heterogeneity [6]. While some measures of heterogeneity are easily obtained from a onestage model, the I ^{2} is not. Our objective in this work was to consider various approaches to quantifying heterogeneity in IPDMA of binary outcomes analyzed via the onestage approach. We propose a method to obtain an I ^{2} from a onestage model and evaluate it and other possible measures via simulation study.
Metrics of heterogeneity in IPDMA with binary outcomes
In this section, we describe various measures of heterogeneity that may be used. Here, we consider that the primary analysis is a onestage analysis of IPDMA of dichotomous outcome data. Below, we describe four possible measures of heterogeneity: (1) the conventional I ^{2} from the corresponding twostage analysis; (2) the R from the corresponding twostage analysis; (3) a new metric: the I ^{2} from the onestage approach; and (4) the R from the onestage approach.
Betweenstudy variance (τ ^{2})
The betweenstudy variance, τ ^{2}, quantifies the heterogeneity in IPDMA directly. A large value of \(\hat {\tau }^{2}\) indicates that heterogeneity exists among the studies. However, the τ ^{2} is not ideal, since interpretation is difficult: there is no standard criteria to determine the level of heterogeneity (low, moderate, substantial), because the range is from 0 to ∞ [5, 7]. All other approaches to quantify heterogeneity rely on τ ^{2}.
We might estimate the τ ^{2} via the twostage or the onestage approach. For the twostage approach, we estimate \(\tau ^{2}_{\mathrm {twostage}}\) via the method described by DerSimonian, Laird, and Whitehead [7–9]:
where N is the number studies, \(\hat {\omega _{i}}\) is the reciprocal of estimated withinstudy variance, and Q represents Cochran’s heterogeneity statistic [5, 10, 11].
A twostage analysis proceeds as follows. Consider a MA of a binary outcome in N studies. In the first stage, we fit the logistic regressions in each of the N studies:
where p _{ j } is the true response probability for the positive result of the jth individual in this study, β _{0} represents the intercept, and x _{ j } indicates their treatment status. This model could be expanded to include effect modifiers.
In the first stage, we obtain \(\hat {\beta }_{1i}\) the estimated log odds ratio in study i for i=1,2,..,N [12], and the variance of the estimated log odds ratio (\(\text {var}(\hat {\beta }_{1i})\)) for each one of the N studies.
In the second stage, we consider:
where \(\tau ^{2}_{\beta _{1}}\) (\(\tau ^{2}_{\mathrm {twostage}}\)) represents the respective degree of heterogeneity between studies [12]. Here, we assume the covariance between the parameter estimates (β _{0i } and β _{1i }) are equal to 0, which means that we pool the treatmentoutcome associations (β _{1i }) together [12]. This is similar to the classic DerSimonian and Laird randomeffects model [8, 13] and allows us to obtain an estimate of the betweenstudy variance \(\tau ^{2}_{\mathrm {twostage}}\) [12], as in Eq. 1.
For the onestage approach, with binary data, we may estimate the \(\tau ^{2}_{\mathrm {onestage}}\) from a generalized linear mixed model (GLMM) [14–16]. Under the onestage randomeffects model, for each study, a studyspecific intercept and treatment effect may be estimated. The studyspecific intercept and treatment effects are assumed to come from a bivariate normal distribution [3, 17, 18]. Considering again a MA of a binary outcome in N studies:
where p _{ ij } represents the true response probability for the positive result of the jth individual in the ith study and x _{ ij } indicates their treatment status. β _{1} is the parameter of interest, which represents the pooled log odds ratio, and \(\tau ^{2}_{1}\) is the betweenstudy variance (\(\tau ^{2}_{\mathrm {onestage}}\)).
I ^{2} statistic
Using a twostage approach, consider a MA of N studies for the parameter of interest, called θ. Under the assumption that the estimated sampling variances are known and equal for all studies (\(\sigma ^{2}_{1}=\sigma ^{2}_{2}=\sigma ^{2}_{3}=...=\sigma ^{2}_{N}=\sigma ^{2}=1/\omega _{i}\)), Higgins and Thompson [5] defined a measurement function I ^{2} for quantifying the unexplained heterogeneity, where \(E(\theta _{i})=\theta, V(\theta _{i})=\tau ^{2}, E(\hat {\theta }_{i}\theta _{i})=\theta _{i}\), and \(V(\hat {\theta }_{i}\theta _{i})=\sigma ^{2}\).
They proposed to estimate \(I^{2}_{\mathrm {twostage}}\) as [5]:
where
While the I ^{2} is usually presented as a percentage varying from 0 to 100%, we present it as a proportion varying from 0 to 1.
In clustered data analyses, the I ^{2} is very similar to the intraclass correlation coefficient (ICC) [5, 19]. The ICC is the ratio of the betweencluster variance to the total variance in the outcome [20]. It provides a quantitative measure of the amount of heterogeneity across clusters [14]. With binary data, estimating the ICC from a GLMM is possible, though more complicated [14–16]. Several measures have been proposed as estimators of the ICC for binary data [14, 21, 22], though none have been evaluated as measures of interstudy heterogeneity for IPDMA. Goldstein et al. proposed a simulationbased approach that relies on partitioning the variation in the multilevel model to estimate an ICC for binary outcomes [22]. We propose to adapt this ICC estimator to estimate the I ^{2} in a onestage IPDMA. The algorithm is as follows:

Step 1 Fit a random effects model to the data by using a GLMM, and adjust for possible effect modifiers if desired.

Step 2 Simulate a large number (e.g. m=5000) of values from a normal distribution (e.g. Eq. 5), using the estimated covariance matrix from the multilevel random effect logistic regression fitted in Step 1. We denote these as μ _{0,ij },μ _{1,ij }.

Step 3 Using the fitted model from Step 1, we estimate the log odds ratio for each subject (\(\nu _{1,ij}=\hat {\beta }_{1}+\mu _{1,ij}\)) in the dataset, and then estimate the variance as ν _{1}=V(ν _{1,ij }).

Step 4 Estimate p _{ ij } by using the fitted model (from Step 1) and simulated random effect values (from Step 2).

replacing (μ _{0i }, μ _{1i }) by (μ _{0,ij },μ _{1,ij }) for each subject

plugging in the fixed effect estimator from the fitted model and the covariates from the dataset

taking the inverse logit, we obtain the \(\hat {p}_{ij}\) for each individual


Step 5 Using the results from Step 4, it is easy to deduce the variance of the estimated log odds ratio via the Delta method, \(\nu _{2,ij}=\frac {1}{n\hat {p}_{ij}(1\hat {p}_{ij})}\), where n is the average number of subjects among all of the studies. Finally, we obtain ν _{2}=E(ν _{2,ij }).

Step 6 The \(I^{2}_{\text {onestage}}\) is now estimated as
R statistic
The R statistic is the square root of the ratio of the variance of the summary statistic from the randomeffects model divided by the variance of the summary statistic from the fixedeffects model. It quantifies the inflation of the confidence interval in the presence of interstudy heterogeneity [5]. If the estimated value of R is close to 1, then inference from the random and fixedeffects models are similar [5]. However, unexplained betweenstudy heterogeneity may exist when the estimate of R is greater than 1. Interpretation of R is difficult, for the same reasons as for τ ^{2}.
For the twostage approach, we may estimate R _{two−stage} as
where \(se(\hat {\beta }_{1F})\) is the standard error of the estimated pooled log odds ratio in the fixedeffects model and \(se(\hat {\beta }_{1R})\) is the standard error of the estimated pooled log odds ratio in the randomeffects model (Eq. 3).
For the onestage approach, we may fit a GLMM with random intercept and fixed slope. The standard error of the estimated β _{1} (pooled log odds ratio) from that model with fixed intercept may be denoted as \(se(\hat {\beta }_{1F})\). We denote the standard error of the estimated β _{1} from model 4 as \(se(\hat {\beta }_{1R})\). The estimated R _{one−stage} from a onestage model may then be estimated using Eq. 9.
Methods
We use simulations to investigate the performance of (i) conventional I ^{2} and R ^{2} based on a twostage approach and (ii) simulationbased I ^{2} and R ^{2} based on a onestage approach. We generated datasets that consisted of three variables: the binary treatment status, a binary effect modifier, and a binary outcome. Each combination of data generation parameters was used to generate 1000 datasets. We considered 84 distinct data generation scenarios (see Table 1).
Data generation details
The treatment variable is the covariate of primary interest in the IPDMA; the effect modifier changes the effect of treatment on the outcome when present.
The number of studies and subjects
The number of studies in each dataset was given by N and was set to 15 or 30. The number of the subjects within each study (n _{ i },i=1,...,N) followed a lognormal distribution, \(\text {LN}(\sigma ^{2}_{lg}=1.5^{2},\kappa =10)\), that was truncated at 20 and 2000 (to avoid very small and large studies), and rounded to the nearest integer value.
Treatment status
The prevalence of treatment for each study was generated from a uniform distribution, \(p_{x_{i}}\sim U(\theta _{\text {lower}}=0.4,\theta _{\text {upper}}=0.6)\). Using these study specific treatment prevalences, we generated n _{ i } random variables from a Bernoulli distribution for each subject, as \(\phantom {\dot {i}\!}x_{ij}\sim \text {Bernoulli}(p_{x_{i}}).\) These x _{ ij } represented the treatment status of subject j in study i.
Effect modifier
We generated a binary effect modifier. First, N studyspecific effect modifier prevalences were generated as \(p_{w_{i}}\sim \text {Uniform}(\theta _{w_{\text {lower}}}=0.1,\theta _{w_{\text {upper}}}=0.9)\). Then, using these probabilities, we obtained the effect modifiers from the Bernoulli distributions, \(\phantom {\dot {i}\!}w_{ij}\sim \text {Bernoulli}(p_{w_{i}})\).
Outcomes
We generated the outcome y _{ ij } based on the generated values of the treatment and effect modifier, as well as the regression coefficients that described the association of each of these with the binary outcome, using the following equation:
β _{0}, the fixed intercept, was set based on the given value of prevalence p _{pre}, where \(\beta _{0}=\text {log}\left (\frac {p_{\text {pre}}}{1p_{\text {pre}}}\right)\). The prevalence (p _{pre}) was set at 0.3 or 0.7. The random intercepts for individuals within each study were \(\mu _{0,i}\sim \text {Normal}(0,\sigma ^{2}_{\mu _{0}})\), where \(\sigma ^{2}_{\mu _{0}}\) was given. The true pooled treatment effect was β _{1}. Furthermore, μ _{1,i } was the studyspecific random effect for the slope, which followed a normal distribution with zero mean and variance τ ^{2}. β _{ w } and (β _{ w }+β _{ xw }) were the log odds ratio of the effect modifier in untreated and treated individuals, respectively. The parameter value used to generate the random intercepts (\(\sigma ^{2}_{\mu _{0}}\)) was given by 1 and the fixed interested β _{1} was given by log(1.3). The parameter used to generate the random slopes (τ ^{2}) was set to 0.5, 1, or 1.5.
Using Eq. 10, we obtained p _{ ij }. Participant level probabilities of outcome were calculated as \(\pi _{ij}=\frac {e^{p_{ij}}}{1e^{p_{ij}}}\), then y _{ ij } was generated from a Bernoulli(π _{ ij }) distribution.
Datasets
We contemplated two scenarios, including (i) no effect modification and (ii) effect modification, by varying the data generation parameters (β _{ w },β _{ xw }).
Our rationale was to evaluate each measure of heterogeneity according to the following:

(i) Did the measures of heterogeneity increase with increasing τ ^{2} in datasets that were generated such that there was no effect modification?

(ii) Did the measures of heterogeneity decrease when the effect modifier and an interaction term between treatment and the effect modifier were included in the model when effect modification was present?
Furthermore, we investigate whether the simulationbased I ^{2} satisfied the criteria proposed by Higgins et al.: (i) monotonically increasing with increasing betweenstudy variance; (ii) not varied by changing scale; and (iii) not affected by the number of studies [5].
IPDMA with no effect modification
To generate datasets with no effect modification, we set β _{ w } and β _{ xw } to zero.
IDPMA with effect modification
We varied β _{ w } and β _{ xw } to generate datasets with weak or strong effect modification, as presented in Table 1.
Data analysis
For each generated dataset, we considered both twostage and onestage approaches to quantifying heterogeneity.
Twostage approach
In this approach, each study is analyzed separately then pooled together using methods described in “Betweenstudy variance (τ ^{2})” section [12].
In the first stage, we considered two logistic regression models for each study in the dataset: (i) a crude model (logit(p _{ j })=β _{0}+β _{1} x _{ j }) and (ii) an effect modification model (logit(p _{ j })=β _{0}+β _{1} x _{ j }+β _{2} w _{ j }+β _{3} x _{ j } w _{ j }), where p _{ j } was the true response probability for the positive result of the jth individual in this study, β _{0} represented the intercept, x _{ j } indicated the treatment status, and w _{ j } was the effect modifier for the jth individual in this study.
When the IPD were generated without effect modification, we fitted the crude model to estimate the pooled treatment effect. When the IPD were generated with effect modification, we considered a crude model and a model that included the effect modifier, the treatment, and an interaction term between the effect modifier and the treatment to estimate the pooled treatment effect.
In the first stage, we estimated the log odds ratio \(\hat {\beta }_{1i}\) (i=1,...,N) from each study. In the second stage, we pooled these together via the DerSimonian and Laird method and, estimated the betweenstudy variance (\(\hat {\tau }^{2}\)) and the pooled treatment effect. We also applied the methods described in “ I ^{2} statistic” and “ R statistic” sections to estimate the \(I^{2}_{\mathrm {twostage}}\) and \(R^{2}_{\mathrm {twostage}}\) for quantifying the heterogeneity in a twostage IPDMA.
Onestage approach
For each generated dataset, we fitted a logistic regression with random intercept and slope for studies, estimated via adaptive GaussHermite quadrature. Similar to the twostage approach, we considered the following models: (i) a crude model (logit(p _{ ij })=(β _{0}+μ _{0i })+(β _{1}+μ _{1i })x _{ ij }) and (ii) an effect modification model (logit(p _{ ij })=(β _{0}+μ _{0i })+(β _{1}+μ _{1i })x _{ ij }+β _{2} w _{ ij }+β _{3} x _{ ij } w _{ ij }).
In all models, p _{ ij } was the true response probability of disease for the jth individual in the ith study, x _{ ij } indicated treatment status, and w _{ ij } represented the effect modifier. The random intercept and slope were μ _{0i },μ _{1i }, such that:
where \(\hat {\tau }_{1}\) was the estimated betweenstudy variance of the treatment effect. We estimated \(I^{2}_{\mathrm {onestage}}\) via the simulationbased method, and \(R^{2}_{\mathrm {onestage}}\) was computed by the ratio of the estimated variance of β _{1} under a random slopes model and a fixed slope model with random intercepts, as described in the “ I ^{2} statistic” and “ R statistic” sections.
Metrics and performance
We collected I ^{2}, R ^{2}, and τ ^{2} as estimated from both twostage and onestage approaches in each generated dataset for all combinations of data generation parameters. We estimated the median and interquartile range (IQR) from 1000 datasets. If the dataset was generated with effect modification, then the median and IQR of the ratio of the I ^{2} as estimated from a crude model to that estimated from a model that included the effect modifier and the interaction between the effect modifier and treatment status \(\left (\frac {I^{2}_{\text {emod}}}{I^{2}_{\text {crude}}}\right)\) was collected. Similar measures were reported for R ^{2} and τ ^{2}. We collected the ratios because we wanted to investigate the differences in \(\hat {I}^{2}\), \(\hat {R}^{2}\), and \(\hat {\tau }^{2}\) before and after taking effect modification into account. All statistical analysis were carried out in R, version 3.2.3 [23].
Results
With no effect modification
Figure 1 shows the estimated betweenstudy variance \(\hat {\tau }^{2}\) from both the twostage (dashed line) and the onestage (dotted line) approaches versus the true betweenstudy variance, τ ^{2} (solid line). As the true τ ^{2} increased, the estimated \(\hat {\tau }^{2}\) from both approaches also increased. Compared with the estimated \(\hat {\tau }^{2}\) from a twostage model, the \(\hat {\tau }^{2}\) from the onestage model increased more rapidly. The twostage model always underestimated \(\hat {\tau }^{2}\). On the other hand, the onestage approach very slightly underestimated \(\hat {\tau }^{2}\) when the true τ ^{2} was small, and it overestimated \(\hat {\tau }^{2}\) when the true τ ^{2} was larger than 1.3.
Figure 2 shows the conventional I ^{2} from the twostage model (dashed line) and the simulationbased I ^{2} from onestage model (dotted line) versus the true betweenstudy variance τ ^{2}. Both measures increased, then leveled off as the true betweenstudy variance increased.
Table 2 presents the median value and IQR of the \(\hat {I}^{2}\) and \(\hat {R}^{2}\) across 1000 datasets from the twostage and onestage models for different combinations of data generation parameter values. The median estimated I ^{2} and R ^{2} from both the twostage and onestage model increased as the true betweenstudy variance increased. \(\hat {I}^{2}_{\mathrm {twostage}}\) and \(\hat {I}^{2}_{\mathrm {onestage}}\) were very similar for N=15 and N=30. However, the R ^{2} statistic from both approaches slightly increased as the number of studies increased. Varying the prevalence from 30 to 70% did not affect the estimates of I ^{2} and R ^{2} via two and onestage models.
Furthermore, \(\hat {\tau }^{2}\) from the two and onestage approaches were similar for different prevalence and number of studies (Additional file 1: Table S1).
With effect modification
Table 3 presents the median value and IQR of the ratio of I ^{2} and R ^{2} from a model that ignored the effect modifier to one that included the effect modifier and an interaction term between it and the treatment status across 1000 datasets from the twostage and onestage approaches with prevalence =30%. Any measure of heterogeneity should be sensitive to changes in heterogeneity. If we did not account for effect modification when it existed, then heterogeneity might arise due to this effect modification [24]. Hence, if the ratio estimators reported in the Table 3 are less than 1, they indicate good sensitivity of the measure to changing heterogeneity.
When the strength of effect modification was weak, the ratio estimators for \(I^{2}_{\mathrm {twostage}}\) were well below 1, while the ratio estimators for \(I^{2}_{\mathrm {onestage}}\) were close to 1. When the strength of effect modification was moderate or strong, we found the ratio estimators for \(I^{2}_{\mathrm {onestage}}\) were below 1, suggesting the estimated \(I^{2}_{\mathrm {onestage}}\) was sensitive to changing heterogeneity. When the prevalence increased from 30 to 70%, in the twostage model, almost all values of the ratio estimator for τ ^{2} were equal to 1 (Additional file 1: Table S3), as were most values of the ratio estimator for \(I^{2}_{\mathrm {twostage}}\) (Additional file 1: Table S2). The estimated R ^{2} from the twostage model had similar performance. Conversely, most of the ratio estimators for \(I^{2}_{\mathrm {onestage}}\) were less than 1 when we fixed the prevalence to be 70% and varied other parameter values. However, the convergence rate for onestage approach decreased as the strength of effect modification became stronger (data not shown).
When the number of studies and prevalence were 30 and 30%, most of ratio estimators for \(I^{2}_{\mathrm {twostage}}\) were equal to 0.01. This occurred because the estimated \(\tau ^{2}_{\mathrm {twostage}}\) from the effect modification model was close to zero (Additional file 1: Table S3).
Furthermore, in Additional file 1: Table S3, the ratio estimators for τ ^{2} in the twostage model were all less than or equal to 1. However, most of ratio estimators for τ ^{2} in the onestage model were larger than 1.
Discussion
IPDMA are the gold standard of metaanalytic approaches. While the primary objective of most IPDMA is to estimate pooled treatment effects, quantifying interstudy heterogeneity of those effects is also an important goal. Most statisticians agree that a onestage approach is the best and most flexible approach to use when analyzing data from IPDMA. However, how best to quantify interstudy heterogeneity in that case is unclear [3, 5, 12], and most IPDMA of binary outcomes do not report any measure of heterogeneity [6].
In this work, we considered using usual measures of heterogeneity based on twostage approaches, as well as novel approaches based on a onestage model. We evaluated both twostage and onestage approaches via simulation studies. In the twostage approach, we used the usual I ^{2} and R ^{2} statistics proposed by Higgins et al. to measure heterogeneity [5]. In the onestage approach, we adapted a simulationbased ICC proposed by Goldstein et al. to estimate the I ^{2}, as well as considering the R ^{2} based on the onestage model.
Our results demonstrated that when there was no effect modification, the estimated τ ^{2} from the twostage model was always underestimated. When using a onestage approach, the estimated τ ^{2} was underestimated when the true τ ^{2} was small, but overestimated when the true τ ^{2} was large. Correspondingly, we may assume that the estimated I ^{2} from the twostage model was underestimated, whereas the simulationbased I ^{2} in the onestage model was underestimated when interstudy heterogeneity was small and overestimated when it was large. Both the twostage I ^{2} and onestage I ^{2} increased as the true τ ^{2} increased.
Including a variable and the interaction of that variable and the treatment of interest when effect modification is present should decrease the estimated betweenstudy heterogeneity. In the presence of weak effect modification, the estimated I ^{2} from the twostage model that accounted for the effect modification was less than that from a model that did not. Nevertheless, the estimated I ^{2} from the onestage approach that accounted for effect modification quantified heterogeneity well when the strength of effect modification was moderate or strong. The I ^{2} from the twostage model was less sensitive to reflect the strength of effect modification when the number of studies was large and prevalence was low. Overall, this suggests that using the simulationbased I ^{2} based on onestage model is preferable.
Differences between measures of heterogeneity in the twostage and onestage approaches might be due to real differences in the methods, or because slightly different models were used. In the onestage approach, we only considered models that fit a random intercept and slope, while the twostage approaches fit just a random slope. However, these are the approaches most commonly used [6].
Strengths of the work
We have proposed a simulationbased I ^{2} to use in onestage IPDMA of binary outcomes. We have shown that this I ^{2} satisfies the conditions proposed by Higgins et al., for any measures quantifying heterogeneity, i.e., (i) the measurement function should monotonically increase with increasing betweenstudy variance τ ^{2} and (ii) not be affected by the number of studies N [5]. Moreover, we have shown that the simnulationbased I ^{2} is sensitive to changes in heterogeneity.
When the outcome is binary, the withinstudy variance varies across the studies as betweenstudy variance increases [7]. As a result, the assumption of equal estimated sampling variances across all studies, as in Higgins and Thompson’s paper [5], does not hold, and Higgins’s I ^{2} may be biased. For that reason, we would expect the simulationbased I ^{2} based on the onestage approach to have better performance than the conventional I ^{2} based on the twostage approach.
Using a heterogeneity measure based on the onestage model is also advantageous, because the onestage approach allows investigation of patient and studylevel covariates, and the treatment effect can be adjusted for covariates at the participant or studylevel [18]. Moreover, the onestage model allows MA of doseresponse curves (e.g., allowing nonlinearity), improves power for interactions [25, 26], and allows better control of effect modification by patient and studylevel covariates than the twostage approach [3, 17, 27].
While we investigated its performance for binary outcomes, using the ICC as an I ^{2} for continuous outcomes in the context of a mixed model would be possible, though to our knowledge has not been used like this.
Limitations
There are several limitations in this work. We only considered the ICC estimator proposed by Goldstein to estimate the I ^{2} in onestage IPDMA of binary outcomes. However, there are several other measures that have been proposed as estimators of the ICC for binary data [14, 21, 22]. Wu et al. discussed five different methods to estimate the ICC with binary outcomes: an analysis of variance (ANOVA) estimator, the FleissCuzick estimator, the Pearson estimator, an estimator based on generalized estimating equations (GEE), and an estimator from the random intercept logistic model [20]. These could be adapted to estimate I ^{2} in onestage IPDMA. On the other hand, the measure we have proposed is easy to estimate.
Moreover, GLMMs estimated via adaptive quadrature sometimes do not converge in the onestage model [19]. Indeed, we observed a sometimes important rate of nonconvergence when the strength of effect modification was strong and the prevalence was high. Other estimation approaches such as penalized quasilikelihood (PQL) or Bayesian multilevel models might be interesting to explore [28, 29]. While convergence of PQL is more likely, estimates can be biased with few subjects per cluster, low event rates, or high intercluster variability [7, 29, 30].
For the twostage approaches, we estimated τ ^{2} via the method of moments estimator of DerSimonian and Laird, despite more recent evidence suggesting that the Paule and Mandel estimator is preferred [31].
Finally, we invsestigated a finite number of data generation parameters. In particular, we considered datasets of 15 or 30 studies, whereas it may have been interesting to consider fewer (e.g., 5).
Conclusion
When a onestage approach for IDPMA of binary outcomes is preferred, heterogeneity should be quantified via the model estimated. In that case, we have proposed a simulationbased I ^{2} that performs as well or better than the conventional I ^{2} based on a twostage approach. The R ^{2} based on the onestage model also performed adequately but is more difficult to interpret.
Abbreviations
 ADMA:

Aggregate data metaanalysis
 ANOVA:

Analysis of variance
 GEE:

Generalized estimating equation
 GLMM:

Generalized linear mixed model
 ICC:

Intraclass correlation coefficient
 IPDMA:

Individual participant data metaanalysis
 IQR:

Interquartile range
 MA:

Metaanalysis
 PQL:

Penalized quasilikelihood
References
Lyman GH, Kuderer NM. The strengths and limitations of metaanalyses based on aggregate data. BMC Med Res Methodol. 2005; 5(14):14–14.
What Is Metaanalysis? http://www.bandolier.org.uk/painres/download/whatis/MetaAn.pdf. Accessed 5 Mar 2016.
Riley RD, Lambert PC, AboZaid G. Metaanalysis of individual participant data: rationale, conduct, and reporting. BMJ. 2010; 340:521–5.
Rücker G, Schwarzer G, Carpenter JR, Schumacher M. Undue reliance on i2 in assessing heterogeneity may mislead. BMC Med Res Methodol. 2008; 8(1):79.
Higgins JPT, Thompson SG. Quantifying heterogeneity in a metaanalysis. Stat Med. 2002; 21(11):1539–58.
Thomas D, Radji S, Benedetti A. Systematic review of methods for individual patient data meta analysis with binary outcomes. BMC Med Res Methodol. 2014; 14(1):79.
Zhou Y, Dendukuri N. Statistics for quantifying heterogeneity in univariate and bivariate metaanalyses of binary data: the case of metaanalyses of diagnostic accuracy. Stat Med. 2014; 33(16):2701–17.
DerSimonian R, Laird N. Metaanalysis in clinical trials. Control Clin Trials. 1986; 7(3):177–88.
Whitehead A, Whitehead J. A general parametric approach to the metaanalysis of randomized clinical trials. Stat Med. 1991; 10(11):1665–77.
Biggerstaff BJ, Jackson D. The exact distribution of Cochran’s heterogeneity statistic in oneway random effects metaanalysis. Stat Med. 2008; 27(29):6093–110.
Jackson D, White IR, Riley RD. Quantifying the impact of betweenstudy heterogeneity in multivariate metaanalyses. Stat Med. 2012; 31(29):3805–20.
Debray TPA, Moons KGM, AboZaid GMA, Koffijberg H, Riley RD. Individual participant data metaanalysis for a binary outcome: onestage or twostage?pone. 2013; 8(4):60650.
Jackson D, Bowden R, Baker J. How does the DerSimonian and Laird procedure for random effects metaanalysis compare with its more efficient but harder to compute counterparts?J Stat Plan Infer. 2010; 48(4):961.
Ridout MS, Demetrio CGB, Firth D. Estimating intraclass correlation for binary data. Biometrics. 1999; 55(1):137–48.
Yelland LN, Salter AB, Ryan P, Laurence CO. Adjusted intraclass correlation coefficients for binary data: methods and estimates from a clusterrandomized trial in primary care. Clin Trials. 2011; 8(1):48–58.
Thomson A, Hayes R, Cousens S. Measures of betweencluster variability in cluster randomized trials with binary outcomes. Stat Med. 2009; 28(12):1739–51.
Riley RD, Lambert PC, Staessen JA, Wang J, Gueyffier F, Thijs L, Boutitie F. Metaanalysis of continuous outcomes combining individual patient data and aggregate data. Stat Med. 2008; 27(11):1870–93.
Higgins JPT, Whitehead A, Turner RM, Omar RZ, Thompson SG. Metaanalysis of continuous outcome data from individual patients. Stat Med. 2001; 20(15):2219–41.
Diggle P. Analysis of longitudinal data. Oxford: Oxford University Press; 2002.
Wu S, Crespi CM, Wong WK. Comparison of methods for estimating the intraclass correlation coefficient for binary responses in cancer prevention cluster randomized trials. Contemp Clin Trials. 2012; 33(5):869–80.
Browne WJ, Subramanian SV, Jones K, Goldstein H. Variance partitioning in multilevel logistic models that exhibit overdispersion. J R Stat Soc Ser A (Stat Soc). 2005; 168(3):599–613.
Goldstein H, Browne W, Rasbash J. Partitioning variation in multilevel models. Underst Stat. 2002; 1(4):223.
The R project for statistical computing. https://www.rproject.org/. Accessed 5 Jan 2016.
Gentles SJ, Stacey D, Bennett C, Alshurafa M, Walter SD. Factors explaining the heterogeneity of effects of patient decision aids on knowledge of outcome probabilities: a systematic review subanalysis. Syst Rev. 2013; 2:95.
Lambert PC, Sutton AJ, Abrams KR, Jones DR. A comparison of summary patientlevel covariates in metaregression with individual patient data metaanalysis. J Clin Epidemiol. 2002; 55(1):86–94.
Schmid CH, Stark PC, Berlin JA, Landais P, Lau J. Metaregression detected associations between heterogeneous treatment effects and studylevel, but not patientlevel, factors. J Clin Epidemiol. 2004; 57(7):683–97.
Riley RD. Commentary: like it and lump it? Metaanalysis using individual participant data. Int J Epidemiol. 2010; 39(5):1359–61.
Brewlow NE, Clayton DG. Approximate inference in generalized linear mixed models. J Am Stat Assoc. 1993; 88:421.
Jang W, Lim J. A numerical study of PQL estimation biases in generalized linear mixed models under heterogeneity of random effects. Commun Stat Simul Comput. 2009; 38:692–702.
Lin X, Breslow NE. Bias correction in generalized linear mixed models with multiple components of dispersion. J Am Stat Assoc. 1996; 91(435):1007.
Veroniki AA, Jackson D, Viechtbauer W, Bender R, Bowden J, Knapp G, Kuss O, Higgins JP, Langan D, Salanti G. Methods to estimate the betweenstudy variance and its uncertainty in metaanalysis. Res Synth Meth. 2016; 7(1):55–79.
Acknowledgements
We thanks the reviewers for their suggestions. A research grant from the Canadian Insitutes of Health Research (CIHR) supporterd this research. Andrea Benedetti is supported by the Fonds de recherche de Québec Santé.
Funding
This work was supported by the Canadian Institutes of Health Research (CIHR), and Andrea Benedetti is supported by the Fonds de recherche de Québec Santé.
Availability of data and materials
R code for dataset generation is avaliable by request.
Author information
Authors and Affiliations
Contributions
AB gave the original idea to conduct this research work. BC implemented this research work under the guidance of AB. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Table S1. The Median (IQR) of the estimated τ ^{2} when no effect modification was present. Table S2. Sensitivity of heterogeneity measures to accounting for effect modification when prevalence of the outcome was 70%. Median (IQR) was presented. Table S3. Sensitivity of heterogeneity measures \(\left(\frac{\tau_{\text{emod}}^{2}}{\tau^{2}_{\text{crude}}}\right)\) to accounting for effect modification. Median (IQR) was presented. (PDF 145 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Chen, B., Benedetti, A. Quantifying heterogeneity in individual participant data metaanalysis with binary outcomes. Syst Rev 6, 243 (2017). https://doi.org/10.1186/s1364301706304
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1364301706304