Introduction

Meta-analyses (MAs) are used to summarise the results of studies on a specific research question. If the number of studies is large and the sample sizes within these studies are not too small, standard inverse variance random effects models (IV-REMs) can provide valid estimates. However, if only a few (≤ 10) studies are included in the MA, the IV-REMs perform poorly [1,2,3]. The DerSimonian-Laird (DSL) method leads to too narrow 95% confidence intervals (CIs) with poor coverage probabilities below 95%, especially in the case of few studies. The Hartung-Knapp-Sidik-Jonkman (HKSJ) method generally holds the type I error, but frequently results in extremely wide 95% CIs in the case of very few (2 – 4) studies.

The worse performance in the case of few studies is a particular challenge, because such MAs are frequently performed in systematic reviews of interventions. For example, in an analysis of 14,886 MAs from the Cochrane Library, the median number of studies in MAs was 3 and the 3rd quartile was 6 [4].

In the case of binary data, alternatives to IV-REM methods have been proposed. Because outcome (success, failure) and study arm (treatment, control) for each patient can be reconstructed from studies’ four-fold tables, the generalised linear mixed model (GLMM) framework (and generally speaking all logistic regression models accounting for dependent data) can be applied to MAs [5]. The “common-rho” beta-binomial model (BBM) showed good results when pooling data of randomised controlled trials (RCTs), especially in the case of very few studies and/or rare events in the MA [6, 7]. However, there are some concerns about the model because it ignores the fact that each treatment arm is paired with a respective control arm, both originating from the same study (disrespecting the randomisation to a study arm of a specific study).

Therefore, we conducted a simulation study to compare existing BBMs and extensions with established models, such as GLMMs, HKSJ and DSL, especially in situations with very few (2 – 4) studies and for a wide range of risks, including rare events. Our focus was on BBM extensions that accounted for the pairing of a treatment arm with a control arm of the same study by implementing a random effect for the study or conditioning on the study in the maximum likelihood estimation.

The outline of the paper is as follows. In the 2nd chapter, we describe the statistical models for MAs that were included in the comparison (Models section) and explain how the simulation study was conducted (Simulation study section). In the 3rd chapter, we present the results of the simulation study. In the 4th chapter, we discuss the results, and in the 5th chapter, we conclude with final remarks and recommendations for practice.

Methods

We consider situations where K studies compare a binary outcome between two study arms (i = 1 [or T for treatment] and i = 0 [or C for control]). For each study k (k = 1, …, K), nkT and nkC denote the sample size for the treatment and control arm, ykT and ykC the number of events in the treatment and control arm, and θk the treatment effect with a specific within-study variance \({\sigma}_k^2\). We are interested in the estimation of the overall treatment effect θ and use the effect measures odds ratio (OR) and relative risk (RR) to quantify the effect between the treatment and control arm.

Models

We compared the following meta-analytic models and methods in our simulation study:

  • Beta-binomial models (BBMs)

    • ◦ Standard (“common-rho”) beta-binomial model (BBST)

    • ◦ Standard beta-binomial model with an additional random treatment effect (BBFR)

    • ◦ Two “common-beta” beta-binomial models (BBCB1 and BBCB2)

  • Generalised linear mixed models (GLMMs)

    • ◦ Generalised linear mixed model with a fixed intercept and random treatment effect (GLFR)

    • ◦ Generalised linear mixed model with a random intercept and random treatment effect (GLRRI)

  • Inverse variance random effects model (IV-REM)

    • ◦ DerSimonian-Laird (DSL) method

    • ◦ Hartung-Knapp-Sidik-Jonkman (HKSJ) method

  • Mantel-Haenszel (MH) method

  • Peto odds ratio (POR) method

  • Collapsed table (COLL)

and describe them in the following sections.

Beta-binomial models

Standard beta-binomial model

The standard beta-binomial model (BBST) [6, 7] assumes that the observed number of events in the control arm ykC (k = 1, …, K) follows a binomial distribution Bin(πkC, nkC), where the event probability πkC is not fixed, but beta distributed with parameters αC and βC. These parameters are assumed to be constant over all control arms of the studies. The individual binary event zkCj (j = 1, …, nkC; \({y}_{kC}=\sum_j^{n_{kC}}{z}_{kC j}\)) is sampled with a different πkC. The expected value and variance of πkC are:

$$E\left({\pi}_{kC}\right)={\mu}_C$$

and

$$Var\left({\pi}_{kC}\right)={\mu}_C\times \left(1-{\mu}_C\right)\times {\nu}_C/\left(1+{\nu}_C\right)$$

with

$${\mu}_C={\alpha}_C/\left({\alpha}_C+{\beta}_C\right)$$

and

$${\nu}_C=1/\left({\alpha}_C+{\beta}_C\right)$$

and ykC is beta-binomially distributed with the expected value

$$E\left({y}_{kC}\right)={n}_{kC}\times {\mu}_C$$

and variance

$$Var\left({y}_{kC}\right)={n}_{kC}\times {\mu}_C\times \left(1-{\mu}_C\right)\times \left[1+\left({n}_{kC}-1\right)\times {\nu}_C/\left(1+{\nu}_C\right)\right].$$

Because the probabilities for two individual binary events in the control arm are sampled from the same beta distribution, these events are correlated. The intraclass correlation \({\rho}_C= corr\left({z}_{{kC j}_1},{z}_{kC{j}_2}\right)\) between two individual binary events in the control arm k (k = 1, …, K; j1, j2 = 1, …, nkC; j1 ≠ j2) is

$${\rho}_C=1/\left({\alpha}_C+{\beta}_C+1\right)$$

and is assumed to be equal over all ykC (k = 1, …, K). Further, it is assumed that individual binary events from different control (and treatment) arms are uncorrelated, \(corr\left({z}_{k_1{Cj}_1},{z}_{k_2C{j}_2}\right)=0\) for k1 ≠ k2.

The log likelihood of the beta-binomial distribution of all control arms can be written in closed form as

$${\ell}_C\left({\alpha}_C,{\beta}_C\right)=\sum_{k=1}^K{\ell}_{kC}\left({\alpha}_C,{\beta}_C\right)$$

with

$${\ell}_{kC}\left({\alpha}_C,{\beta}_C\right)=\log \left(\Gamma \left({n}_{kC}+1\right)\right)+\log \left(\Gamma \left({y}_{kC}+{\alpha}_C\right)\right)+\log \left(\Gamma \left({n}_{kC}-{y}_{kC}+{\beta}_C\right)\right)+\log \left(\Gamma \left({\alpha}_C+{\beta}_C\right)\right)-\log \left(\Gamma \left({y}_{kC}+1\right)\right)-\log \left(\Gamma \left({n}_{kC}-{y}_{kC}+1\right)\right)-\log \left(\Gamma \left({n}_{kC}+{\alpha}_C+{\beta}_C\right)\right)-\log \left(\Gamma \left({\alpha}_C\right)\right)-\log \left(\Gamma \left({\beta}_C\right)\right)$$

where Γ denotes the gamma function,

$${\alpha}_C={\mu}_C\times \left(1-{\rho}_C\right)/{\rho}_C$$

and

$${\beta}_C=\left(1-{\mu}_C\right)\times \left(1-{\rho}_C\right)/{\rho}_C.$$

The same formulas hold true for the number of events in the treatment arm ykT (k = 1, …, K) with nkT, πT, αT, βT, μT, νT and ρT. The log likelihood for the treatment arms T(αT, βT) is given accordingly.

Importantly, in the BBST it is assumed that ρC = ρT = ρ, which is equivalent to αC + βC = αT + βT. In other words, all individual binary events within a study arm are correlated with the same ρ, regardless of therapy.

The treatment effect θ = bT = g−1(μT)/g−1(μC) is modelled via the link function

$$g\left({\mu}_i\right)={b}_0+{b}_T\times i$$

where b0 denotes the risk of an event in the control arm and i the study arm (1 = treatment; 0 = control). In our simulation study, the link functions are the logit and the natural log to measure the treatment effect as log OR and log RR.

Because g(μC) = b0, g(μT) = b0 + bT and αC + βC = αT + βT, one can write

$${b}_T=g\left({\mu}_{kT}\right)-g\left({\mu}_{kC}\right)=g\left({\alpha}_T/\left({\alpha}_T+{\beta}_T\right)\right)-g\left({\alpha}_C/\left({\alpha}_C+{\beta}_C\right)\right)=g\left({\alpha}_T/\left({\alpha}_C+{\beta}_C\right)\right)-g\left({\alpha}_C/\left({\alpha}_C+{\beta}_C\right)\right).$$

Therefore, only three parameters (αT, αC, βC) have to be estimated in this model.

One advantage of the BBST is that no continuity correction has to be used if there are studies with no events in one study arm (single-zero studies). Furthermore, studies without any events in both study arms (double-zero studies) are not ignored in the analysis and contribute to the overall effect estimation [6]. The only situations where the BBST cannot estimate the OR and RR (but can estimate the risk difference) are situations where no events occur in one study arm (e.g. the treatment arm) over all studies.

In the BBST, the event probability in the control arm πkC is random but the treatment effect is considered to be fixed across all studies. Thus, although the BBST is a true random effects model, from a meta-analytic point of view, it is a model with a fixed treatment effect.

Furthermore, the BBST estimates the treatment effect via μT and μC. Therefore, the fact that the treatment and control arm originate from the same study is ignored in the process of parameter estimation. Thus, the BBST disrespects the randomisation to a study arm of a specific study. According to Senn [8] and Piepho et al. [9] it is unlikely that this property is detrimental in situations where the same treatments are evaluated across trials, because the effects are comparable between the studies. This was indicated by recent simulation results, where the BBST performed well [6, 7].

Standard beta-binomial model with additional random effect

To deal with the aforementioned properties of BBST as a fixed effect model that disrespects randomisation, we implemented another BBM (BBFR) where the treatment effect θ = bT = g−1(μT)/g−1(μC) is modelled as

$$g\left({\mu}_i\right)={b}_0+\left({b}_T+{\gamma}_k\right)\times i$$

with \({\gamma}_k\sim N\left(0,{\sigma}_{BBFR}^2\right)\). By adding a random effect to the treatment effect, this model respects the randomisation to a study arm of a specific study.

Like the BBST, the BBFR takes the information of all studies into account and therefore needs no continuity correction when single- or double-zero studies are included in the MA.

When constructing the 95% CI for the treatment effect bT, Mathes and Kuss [7] showed that using the t-distribution rather than the normal distribution led to better performance of the BBST. Therefore, for both models (BBST, BBFR) we calculated 95% CIs for the treatment effect bT using the t-distribution

$${\hat{b}}_{T(BB)}\pm {t}_{df;0.975}\times {\hat{\sigma}}_{BB},$$

where \({\hat{b}}_{T(BB)}\) is the estimated treatment effect of BBST or BBFR, \({\hat{\sigma}}_{BB}\) the corresponding estimated standard error and tdf; 0.975 the 0.975 quantile of the t-distribution with df degrees of freedom. We chose two different numbers of degrees of freedom: df = K − 1, which are the degrees of freedom for the HKSJ method, and df = 2K − 2, which is a reasonable compromise between the 2K and 2K − 3 degrees of freedom that were used by Mathes and Kuss [7] and that led to too narrow and too wide intervals, respectively.

Common-beta BBM

In the BBST it is assumed that the intraclass correlation ρ is equal for all treatment and control arms implying that αC + βC = αT + βT holds true. Therefore, this model is sometimes called the “common-rho” model. Another possibility is to assume that β is equal in all groups (βC = βT = β). We call this model the “common-beta” BBM. Similar to the BBST, the likelihood function of the “common-beta” BBM can be written in closed form. Guimaraes [10], and more recently in the meta-analytic context Mathes and Kuss [11], were able to show that a negative binomial regression model for count panel data can be interpreted as this “common-beta” BBM.

We considered two versions of the “common-beta” BBM in our simulation. BBCB1, which disrespects the randomisation to a study arm of a specific study by conditioning on study group, while BBCB2 respects the randomisation to a study arm of a specific study by conditioning on the study.

As for BBST and BBFR, we constructed 95% CIs using the t-distribution for the treatment effect bT

$${\hat{b}}_{T(BBCB)}\pm {t}_{df;0.975}\times {\hat{\sigma}}_{BBCB},$$

where \({\hat{b}}_{T(BBCB)}\) is the estimated treatment effect of BBCB1 or BBCB2, \({\hat{\sigma}}_{BBCB}\) the corresponding estimated standard error, and tdf; 0.975 the 0.975 quantile of the t-distribution with df = K − 1 or 2K − 2 degrees of freedom.

Generalised linear mixed models

Generalised linear mixed models (GLMMs) [5] are probably the most common alternatives to standard MA methods (Inverse variance random effects models section) with binary data because of their flexibility. A GLMM with random treatment effect θk = θ + ϵk can be expressed as

$$g\left({\pi}_{ki}\right)={\gamma}_k+i\times \theta +i\times {\epsilon}_k$$

where g(∙) is the link function for the OR (logit) or RR (log), πki the probability of an event in study arm i (i = 1: treatment; i = 0: control) of study k (k = 1, …, K), γk the intercept (baseline risk of an event) of study k and ϵk~N(0, τ2).

We included two GLMMs in our simulation. The first model has a fixed intercept and a random treatment effect (GLFR) (similar to model 2 in Jackson et al. [5] originally suggested by Simmonds and Higgins [12]). The second GLMM included has a random intercept (\({\gamma}_k\sim N\left(\gamma, {\sigma}_{GLRRI}^2\right)\)) and an uncorrelated random treatment effect (GLRRI), similar to model 3 in Jackson et al. [5].

Like the BBST, the GLMM is a random effects model. But as the treatment effect is random it is more comparable to meta-analytic REMs than the BBST.

Like BBMs, GLMMs take the information of all studies into account and therefore do not need a continuity correction for single- or double-zero studies.

We calculated 95% CIs for log OR and log RR using normal approximation, therefore

$${\hat{\theta}}_{GLMM}\pm 1.96\times {\hat{\sigma}}_{GLMM},$$

where \({\hat{\theta}}_{GLMM}\) is the estimated overall effect (log OR or log RR) in the analysed model (GLFR or GLRRI) and \({\hat{\sigma}}_{GLMM}\) the corresponding standard error.

Inverse variance random effects models

The meta-analytic random effects model (REM) assumes no common effect for all studies but instead assumes that the mean of all study effects is the mean of the distribution of the true effect [13]. The study effects are usually assumed to follow a normal distribution. The treatment effect in study k can be expressed as \({\hat{\theta}}_k={\theta}_k+{\varepsilon}_k\) with θk = θREM + δk, δk~N(0, τ2) and \({\varepsilon}_k\sim N\left(0,{\sigma}_k^2\right)\).

The overall effect θREM of this REM can be estimated by the inverse variance approach

$${\hat{\theta}}_{REM}=\frac{\sum_{k=1}^K{w}_{k(REM)}\times {\hat{\theta}}_k}{\sum_{k=1}^K{w}_{k(REM)}},$$

where \({w}_{k(REM)}=1/\left({\sigma}_k^2+{\tau}^2\right)\) are the study-specific weights, \({\sigma}_k^2\) is the within-study variance, and τ2 is the between-study variance (heterogeneity).

In the case of binary data and for OR and RR as the effect sizes, \({\hat{\theta}}_k\) and \({\hat{\theta}}_{REM}\) are generally analysed on the logarithmic scale, thus representing the log OR and the log RR. A continuity correction is applied to single-zero studies by adding 0.5 to the number of events in both groups. Double-zero studies are ignored for parameter estimation.

DerSimonian-Laird method

The DerSimonian-Laird (DSL) method [14] was regarded as the gold standard for performing MAs until it was criticized in recent years for being anti-conservative (i.e., producing too narrow CIs) [15].

The DSL estimator \({\hat{\theta}}_{DSL}\) uses weights \({w}_{k(DSL)}=1/\left({\sigma}_k^2+{\tau}_{DSL}^2\right)\) where

$${\tau}_{DSL}^2=\max \left\{0,\frac{Q-\left(K-1\right)}{\sum_{k=1}^K{w}_{k(FEM)}-\frac{\sum_{k=1}^K{w}_{k(FEM)}^2}{\sum_{k=1}^K{w}_{k(FEM)}}}\right\}$$

is estimated using the method-of-moments principle [14, 16]. Q is the heterogeneity statistic \(Q=\sum_{k=1}^K{w}_{k(FEM)}\times {\left({\hat{\theta}}_k-{\hat{\theta}}_{FEM}\right)}^2\) and \({\hat{\theta}}_{FEM}\) is the pooled effect of a fixed effect model with weights \({w}_{k(FEM)}=1/{\sigma}_k^2\).

The 95% CI for θREM is given by

$${\hat{\theta}}_{DSL}\pm 1.96\times {\hat{\sigma}}_{DSL}.$$

The standard error is given by

$${\hat{\sigma}}_{DSL}=\sqrt{1/\left(\sum_{k=1}^K{w}_{k(DSL)}\right)}.$$

We included the DSL method in our simulation because it is still in use and is important, at least for sensitivity analyses [17].

Hartung-Knapp-Sidik-Jonkman method using the Paule-Mandel heterogeneity variance estimator

For the method of Hartung-Knapp-Sidik-Jonkman (HKSJ) [18, 19] different weights wk(HKSJ) can be applied to calculate the overall estimator \({\hat{\theta}}_{HKSJ}\), depending on what estimator for the between-study variance (heterogeneity) is used. For the analysis presented here, we used \({w}_{k(HKSJ)}={w}_{k(PM)}=1/\left({\sigma}_k^2+{\tau}_{PM}^2\right)\), where \({\tau}_{PM}^2\) is the Paule-Mandel (PM) estimator of τ2 [20,21,22]. The PM estimator of τ2 is derived from the generalised Q statistic

$$Q\left({\tau}_{PM}^2\right)=\sum_{k=1}^K{w}_{k(PM)}\times {\left({\hat{\theta}}_k-\hat{\theta}\left({\tau}_{PM}^2\right)\right)}^2$$

by setting \(Q\left({\tau}_{PM}^2\right)\) to its expected value K − 1 with \({w}_{k(PM)}=1/\left({\sigma}_k^2+{\tau}_{PM}^2\right)\) and \(\hat{\theta}\left({\tau}_{PM}^2\right)=\left(\sum_{k=1}^K{w}_{k(PM)}\times {\hat{\theta}}_k\right)/\left(\sum_{k=1}^K{w}_{k(PM)}\right)\). The equation is solved by iterating \({\tau}_{PM}^2\) until convergence is reached. If no solution exists, \({\hat{\tau}}_{PM}^2\) is set to 0.

The 95% CI for θREM is given by

$${\hat{\theta}}_{HKSJ}\pm {t}_{K-1;0.975}\times \sqrt{q}\times {\hat{\sigma}}_{HKSJ}$$

where tK − 1; 0.975 is the 0.975 quantile of the t-distribution with K − 1 degrees of freedom,

$$q=\frac{1}{K-1}\sum_{k=1}^K{w}_{k(HKSJ)}\times {\left({\hat{\theta}}_k-{\hat{\theta}}_{HKSJ}\right)}^2$$

and

$${\hat{\sigma}}_{HKSJ}=\sqrt{1/\left(\sum_{k=1}^K{w}_{k(HKSJ)}\right)}.$$

In general, this CI tends to be wider than the 95% CI of the DSL method. However, in very homogeneous data situations, this is not always the case. Therefore, Knapp and Hartung [23] suggested an ad hoc modification of q, q = max {1, q}. If the ad hoc modification is used, the 95% CI of the HKSJ method is always wider than the 95% CI of the DSL method. In the simulation study, we followed the recommendations of the literature to carry out sensitivity analyses using a fixed effect model or the DSL method to decide on whether the modification is needed [17, 24]. If the 95% CI of HKSJ was narrower than the 95% CI of DSL, the ad hoc modification was used.

We included the HKSJ method in our simulation because it is well-established that it performs better than the DSL method [17] and is recommended as the standard approach in the Cochrane Handbook in combination with the PM estimator for τ2 [25]. Furthermore, it is the IQWiG standard method for MAs with five or more studies [26]. It is well-known that for MAs with less than five studies, 95% CIs of the HKSJ method tend to be too wide but in general, the method does not lead to increased type I errors.

Other models

Mantel-Haenszel method

The Mantel-Haenszel (MH) method [27] assumes a fixed effect model which is based on the assumption that all studies in the MA have a common effect θFEM. The resulting estimate is a weighted average of the study-specific risk ratios or risk differences.

The MH odds ratio of the overall effect is given by

$${OR}_{MH}=\frac{\sum_{k=1}^K{w}_{k(MH)}\times {\hat{OR}}_{k(MH)}}{\sum_{k=1}^K{w}_{k(MH)}}$$

where \({\hat{OR}}_{k(MH)}=\frac{y_{kT}\times \left({n}_{kC}-{y}_{kC}\right)}{\left({n}_{kT}-{y}_{kT}\right)\times {y}_{kC}}\) is the estimated odds ratio, \({w}_{k(MH)}=\frac{\left({n}_{kT}-{y}_{kT}\right)\times {y}_{kC}}{n_k}\) the weight, and nk = nkT + nkC the sample size of study k (k = 1, …, K).

No continuity correction was applied. Therefore, single- and double-zero studies were ignored during the analysis. We estimated 95% CIs using normal approximation.

We included the MH method in our analysis because it is the standard fixed effect model for binary data in Cochrane Reviews and performs better than the standard (inverse variance) fixed effect model in the case of sparse data [25].

Peto odds ratio method

The Peto odds ratio (POR) method [28] was introduced as the effect estimator for the real underlying OR in the data situation of rare events. Later it was shown that POR is an independent effect measure and cannot be used as approximation of the true OR in all (rare event) data situations [29].

The pooled log POR is estimated as

$$\log (POR)=\frac{\sum_{k=1}^K\left({O}_k-{E}_k\right)}{\sum_{k=1}^K{V}_k},$$

where Ok is the observed number of events in the treatment arm, Ek the expected number of events in the treatment arm under the assumption of no treatment effect, and Vk the variance of their difference.

No continuity correction was applied. Single-zero studies are included in the analysis by definition, whereas double-zero studies are ignored. We estimated 95% CIs using the normal approximation.

Although this method has major limitations [29, 30], we considered the POR in our analysis because according to the Cochrane Handbook, it is the standard MA method for small intervention effects or very rare events [25].

Collapsed table

This method (COLL) ignores the fact that data were collected from different studies and aggregates them in one single four-fold table. The effect is estimated using standard methods for 2 × 2 tables [31]. The procedure assumes a constant underlying risk of an event across all studies, which is rarely given, and therefore the method is vulnerable to Simpson’s paradox [32, 33].

Because of its simplicity and because the differences in event rates across studies might be negligible in scenarios with few events and studies, we included the method in our analysis.

We applied a continuity correction if necessary and estimated 95% CIs using normal approximation.

Simulation study

We performed a simulation study using SAS/STAT® software Version 9.4 (SAS Institute Inc., Cary, NC, USA) for Microsoft Windows 10 to compare the statistical properties of the different meta-analytic methods. In an attempt to investigate the methods in realistic scenarios, the values of the design factors in the simulation study were chosen from MAs actually performed. For this purpose, we used the review by Turner et al. [4] The review included 1991 systematic reviews from the Cochrane Database of Systematic Reviews and analysed 14,886 MAs of dichotomous outcomes from 77,237 single studies.

Design of simulations

In our simulation study, we varied the following parameters: the number of studies in the MA, the sample size of a single study, the event probability in the control arm, the heterogeneity between studies in the MA, and the effect size in the MA.

We chose 2, 3, 4, 5 and 10 as the number of studies in the MA. Ten studies were chosen in addition to gain an impression of how the models perform in rather uncritical scenarios. For each number of studies in the MA, we simulated 10,000 data sets each under the null (H0) and under the alternative hypothesis (H1). In each data set, we sampled the values of the other parameters from distributional functions based on the data from Turner et al. [4] For example, the sample size nk for a single study in the MA was sampled using a log-normal distribution with μND = 4.615 and σND = 1.1, resulting in a mean overall sample size of 185.7 (median: 102.0; 1st quartile: 50.0; 3rd quartile: 213.0). Table 1 shows the distributional assumptions and the resulting data values. The data were simulated according to the REM from the Inverse variance random effects models section. The simulation process was as follows:

  1. 1.

    For each MA, we sampled the true risk πC, true in the control arm and the heterogeneity τ2 between the studies in the MA from the distributional functions given in Table 1. Under H1, we did the same for the effect size θ = log OR and log RR. Under H0, the effect size was set to zero (i.e. OR = 1 and RR = 1).

  2. 2.

    For the kth study in the MA, we

    1. a.

      sampled the study size nk and the size of the treatment (nkT) and control arm (nkC) using the distributional functions given in Table 1

    2. b.

      sampled the number of events in the control arm ykC using a binomial distribution with πC, true as event probability and nkC as number of experiments

    3. c.

      sampled an individual heterogeneity variance \({\tau}_k^2\) using the sampled true heterogeneity and assuming that it follows a normal distribution within the kth MA

    4. d.

      calculated the true risk in the treatment arm πkT, true using πC, true, θ, \({\tau}_k^2\) and the following formula: \({\pi}_{kT, true}=\exp \left(\textrm{logit}\left({\pi}_{C, true}\right)+\theta +{\tau}_k^2\right)/\left(1+\exp \left(\textrm{logit}\left({\pi}_{C, true}\right)+\theta +{\tau}_k^2\right)\right)\)

    5. e.

      sampled the number of events in the treatment arm ykT using a binomial distribution with πkT, true as event probability and nkT as number of experiments.

Table 1 Description of the simulation study

Overall, we simulated 5 (number of studies in the MA 2, 3, 4, 5, 10) × 2 (under H0 and under H1) × 10,000 data sets = 100,000 MAs each for the OR and for the RR.

We performed a sensitivity analysis to assess the robustness of the results regarding heterogeneity. For each MA, we calculated Cochran’s Q test [13] in order to gain an impression of whether the results depend on homogeneity of the data situations. Although we are aware that this dichotomisation is somewhat arbitrary, we used the Cochran’s Q test for the purpose of sensitivity analyses because in practical applications, MAs will frequently not be performed, at least when the test for heterogeneity is statistically significant.

Parameter estimation in the models

For parameter estimation, we used the SAS/STAT® software procedure NLMIXED for BBST and BBFR, COUNTREG for BBCB1 and BBCB2, GLIMMIX for GLFR and GLRRI, and FREQ for MH and COLL. For HKSJ, DSL and POR, we programmed our own syntax that was validated using R 3.3 [36] and the metafor package [37].

Because we used the COUNTREG procedure for parameter estimation, we were only able to estimate the OR but not the RR. In the GLIMMIX procedure, we used maximum likelihood estimation based on adaptive quadrature (METHOD = QUAD) with 1 quadrature point (QPOINTS = 1), which is equivalent to the Laplace approximation. We decided to use the Laplace approximation because we assumed that this would be most robust [38].

Performance measures

To assess the performance of the methods we calculated the following measures:

  • Number of converged simulation runs with estimated effect and standard error (R): Sometimes the procedures converged and an effect was estimated but no standard error was given (most notably when using the NLMIXED procedure). Because such results would cause problems of interpretation, we counted these runs as non-converged. All other measures were based on R, the number of converged simulation runs with an estimated effect and standard error.

  • (Absolute) bias \({\hat{\theta}}_r-{\theta}_r\): Difference between the estimated (\({\hat{\theta}}_r\)) and true effect (θr); r = 1, …, R.

  • Percentage bias under H1\(\left(100\times \left({\hat{\theta}}_r-{\theta}_r\right)\right)/{\theta}_r\): Ratio of the bias (\({\hat{\theta}}_r-{\theta}_r\)) and true effect (θr); r = 1, …, R.

  • Coverage probability: Proportion of converged simulation runs where the 95% CI included the true effect θr; r = 1, …, R.

  • Length of 95% CI: Difference (CIU, r − CIL, r) of upper (CIU, r) and lower (CIL, r) confidence limit of the 95% CI for θr; r = 1, …, R.

  • Power under H1: Proportion of converged simulation runs under H1 where the 95% CI excluded the null effect.

Bias, percentage bias and length of 95% CI were calculated on the corresponding log scale, i.e. log OR or log RR. For these measures, the median as well as the 1st and 3rd quartile are presented.

The simulation code containing the data generation, parameter estimation, and the calculation of the performances measures is available in the Supporting Information.

Results

In the following sections, we describe the results for the OR of all methods under the null hypothesis (Results for the odds ratio under the null hypothesis section), and alternative hypothesis (Results for the odds ratio under the alternative hypothesis section). In the  Direct comparison of results for beta-binomial models section, we compare the results of the BBMs, especially BBST and BBFR. The results of the RR are discussed in the Results for the relative risk section. The results of the sensitivity analysis are presented in the Sensitivity analysis section and the main results are summarized in the Summary of results section.

Although the results of all methods are described in this chapter, in tables and figures we focused on the BBST, BBCB1, BBCB2, GLFR, HKSJ and DSL. As BBST and BBFR yielded almost identical results (Direct comparison of results for beta-binomial models section and Discussion), we refrained from showing them both in tables and figures. The results of all methods can be found in the Supporting Information.

For the BBMs, the results for coverage probability, length of 95% CI and power, with CIs using 2K − 2 degrees of freedom, are presented and discussed. Results for the CIs with K − 1 degrees of freedom, which were generally worse, can be found in the Supporting Information.

Results for the odds ratio under the null hypothesis

Number of converged simulation runs

The methods HKSJ, DSL, COLL and POR were only marginally affected by convergence problems (< 0.5% for all scenarios) due to their construction. The same held true for GLFR (R ≥ 9988 for all scenarios). The BBMs (BBST, BBFR, BBCB1, BBCB2) converged in more than 95% of the simulation runs. The number of converged runs was lower for MH in scenarios with 2 studies (R = 9394) but increased up to 10,000 in scenarios with 10 studies. The GLRRI had the lowest number of converged runs, with 8469 (2 studies) to 6896 runs (10 studies) (see Fig. 1 and Table S1 in the Supporting Information).

Fig. 1
figure 1

Number of converged simulation runs (out of 10,000) for the OR under the null hypothesis

Bias

For all methods, the median bias was similar, mainly positive, increased with an increasing number of studies, and was low (< 0.04 on log OR scale). Because bias was calculated on the log OR scale (bias \(=\log \left(\hat{OR}\right)-\log (OR)=\log \left(\hat{OR}/ OR\right)\)), this could be interpreted as relative effect of the ORs. Therefore, the estimated OR increased up to 4% in median if compared to the true OR (see Table 2 and Table S2 in the Supporting Information).

Table 2 (Absolute) bias for the log OR under the null hypothesis

Coverage probability

Coverage probabilities were at or above 95% for the BBMs (BBST, BBFR, BBCB1, BBCB2) and HKSJ for all scenarios. Coverage probability for GLFR was at or above 95% for 2, 3 and 4 studies and fell below 95% for 5 (93.9%) and 10 studies (90.4%). All other methods had coverage probabilities below 95% for all scenarios (GLRRI: ≤ 85.7%; DSL: ≤ 93.3%; MH: ≤ 81.6%; POR: ≤ 80.6%; COLL: ≤ 82.2%) (see Fig. 2 and Table S3 in the Supporting Information).

Fig. 2
figure 2

Coverage probability (%) for the OR under the null hypothesis (95% CI with 2K − 2 degrees of freedom for BBST, BBCB1 and BBCB2)

BBMs (BBST, BBFR, BBCB1, BBCB2), HKSJ and GLFR had coverage probabilities ≥97.3% in scenarios with 2 studies. In scenarios with 3, 4, 5, and 10 studies, BBMs (BBST, BBFR, BBCB1, BBCB2) were closer to 95% than HKSJ (≥ 96.2% in all scenarios). The same applied to the GLFR in scenarios with 3 and 4 studies (see Fig. 2 and Table S3 in the Supporting Information).

Coverage probabilities of the BBMs were closer to 95%, i.e., less conservative, if confidence intervals with 2K − 2 degrees of freedom were used compared to the use of K – 1 degrees of freedom (see Table S3 in the Supporting Information).

Length of 95% CI

The length of the 95% CI for log OR was largest and approximately the same for HKSJ and GLFR in the scenario with 2 studies. The length was far shorter for BBMs. With an increasing number of studies in the MA, the length of the 95% CIs converged between the methods, but HKSJ always had the widest CIs (see Table 3 and Table S4 in the Supporting Information).

Table 3 Length of 95% CI for the log OR under the null hypothesis (95% CI with 2K − 2 degrees of freedom for BBST, BBCB1 and BBCB2)

Results for the odds ratio under the alternative hypothesis

The results under the alternative hypothesis were quite similar to the results under the null hypothesis for all performance measures investigated. Therefore, for the number of converged simulation runs, bias, squared error, coverage probability and length of 95% CI, only important differences between null and alternative hypothesis are mentioned.

Number of converged simulation runs

The number of converged runs for GLRRI increased by about 100 runs up to between 8590 (2 studies) and 7083 runs (10 studies). GLRRI still had the lowest convergence rate (see Table S5 in the Supporting Information).

Bias

The median absolute bias of log OR was higher than under the null hypothesis and mainly positive, but still small (< 0.05), for most methods. The most notable exception was BBCB2 with a median bias > 0.10 for all scenarios (see Table 4 and Table S6 in the Supporting Information).

Table 4 (Absolute) bias for the log OR under the alternative hypothesis

Percentage bias

The percentage bias of log OR was defined as \(\left(100\times \left(\log \left(\hat{OR}\right)-\log (OR)\right)\right)/\log (OR)\) and because log OR < 0, a negative percentage bias means an overestimation of the log OR. The median percentage bias was low (about − 6 and 0%) for the BBMs (BBST, BBFR, BBCB1), except for BBCB2, and the GLMMs (GLFR, GLRRI). The median values for HKSJ, DSL, MH, POR and COLL were higher, with values between − 14 and − 7%. BBCB2 had much worse median values, about − 30% (see Table 5 and Table S7 in the Supporting Information).

Table 5 Percentage bias (%) for the log OR under the alternative hypothesis

Coverage probability

Only HKSJ had coverage probabilities at or above 95% for all scenarios. BBCB2 had coverage probabilities < 95% for scenarios of 3 or more studies. The other BBMs (BBST, BBFR, BBCB1) had coverage probabilities at, above, or marginally below 95% for all scenarios. Coverage probabilities for GLFR were at or above 95% only for scenarios with 2, 3 or 4 studies. All other methods had coverage probabilities below 95% for all scenarios (GLRRI: ≤ 86.6%; DSL: ≤ 92.5%; MH: ≤ 82.2%; POR: ≤ 82.1%; COLL: ≤ 83.3%) (see Fig. 3 and Table S8 in the Supporting Information).

Fig. 3
figure 3

Coverage probability (%) for the OR under the alternative hypothesis (95% CI with 2K−2 degrees of freedom for BBST, BBCB1 and BBCB2)

Length of 95% CI

Results for the length of the 95% CI were similar to results under the null hypothesis, with HKSJ and GLFR having the broadest intervals in all scenarios (see Table S9 in the Supporting Information).

Power

Power for methods with coverage probability of ≥95% under the null hypothesis (BBST, BBFR, BBCB1, BBCB2, and HKSJ in all scenarios and GLFR in scenarios with 2, 3 and 4 studies) was quite low (still < 40% in scenarios with 10 studies). Power for BBMs (BBST, BBFR, BBCB1, BBCB2) was higher than for HKSJ in all scenarios. In scenarios with 2 studies, none of these methods showed a power > 5%, that is, no method yielded satisfactory results. In scenarios with 3 and 4 studies, power was still low (maximum 21.0% for BBCB1) but the differences in the methods became visible. Power was highest for BBST, BBFR and BBCB1, followed by BBCB2 and HKSJ with the lowest power (see Fig. 4 and Table S10 in the Supporting Information).

Fig. 4
figure 4

Power (%) for the OR under the alternative hypothesis (95% CI with 2K−2 degrees of freedom for BBST, BBCB1 and BBCB2)

Methods with coverage probabilities < 95% under the null hypothesis (GLRRI, DSL, MH, POR, COLL) had higher power up to 55% (see Table S10 in the Supporting Information).

The small power values were to be expected due to the fact that the true ORs were near 1 (> 0.83) for about 25% of the simulations, the moderate sample sizes (mean: 185.7; median: 102.0), and only few studies in the MAs.

Direct comparison of results for beta-binomial models

BBST and BBFR showed almost identical results. We assumed that one reason could be the maximum likelihood approximation method. Therefore, we tried to use another approximation method, the Gauss-Hermite quadrature with two quadratic points (QPOINTS = 2). However, in the case of a higher number of quadrature points, a floating point exception error occurred at one point, stopping the whole simulation. Therefore, we could not run the complete simulation using more quadrature points. We tried to reanalyse the simulated data using the Gauss-Hermite quadrature with 2 quadrature points for the BBFR. From the few existing results, it appears that BBST and BBFR vary the most if the ORs in the studies are very heterogeneous. Of notice, these were mainly data situations with strong heterogeneity (P value of Q test < 0.05).

The coverage probability of the BBCB2 was below the nominal level under the alternative hypothesis in scenarios with 3 or more studies. In contrast, BBCB1 performed well and very similarly to BBST considering all performance measures.

Results for the relative risk

For the RR, we only considered BBST, GLFR, HKSJ, DSL and COLL in our simulation study. Because the results of BBST and BBFR were almost identical, we focussed on BBST. As mentioned earlier, it was not possible to compute RRs for BBCB1 and BBCB2. GLRRI and MH performed quite poorly considering the OR. Thus, we saw no need to reconsider them again. DSL and COLL also performed quite poorly, but we considered them due to their simplicity and as “benchmark” methods.

Results for RR were comparable to OR, except for the fact that BBST and GLFR struggled with convergence problems. BBST converged only in about 82% of each scenario (about 98% for OR) and GLFR between 90 and 95% (> 99% for OR). Noticeably, coverage probability for GLFR was higher and nearer the 95% level than that for OR (see Tables S11 – S20 in the Supporting Information).

Sensitivity analysis

The sensitivity analysis considering only data situations with no statistically significant heterogeneity (P value of Q test > 0.05) did not fundamentally alter the simulation results of the performance measures. The biggest influence was seen in coverage probabilities. Coverage probabilities below 95% increased and approached 95%. Coverage probabilities above 95% remained more or less stable. The biggest improvement was seen in MH, POR and COLL, but their coverage probabilities were still far below 95% (e.g. OR under H0 in scenarios with 4 studies: MH: 88.9%, POR: 88.4%, COLL: 88.9%). The most relevant improvement was shown in the GLFR, where coverage probabilities were far more near 95% in scenarios of 5 and 10 studies than in data situations where the appropriateness of pooling was not considered (see Fig. 5 as an example).

Fig. 5
figure 5

Coverage probability (%) for the OR under the null hypothesis in scenarios with no statistically significant heterogeneity (P value of Q test > 5%; 95% CI with 2K−2 degrees of freedom for BBST, BBCB1 and BBCB2)

Summary of results

The beta-binomial models BBST and BBCB1 (BBCB1 only for OR) performed well in scenarios with 3, 4, 5 and 10 studies. HKSJ was the only standard MA method that had adequate performance measures in these scenarios, although the coverage probability was very high (> 98%) for scenarios < 5 studies. In scenarios with only 2 studies, no method showed coverage probabilities near 95%; especially HKSJ and GLFR were far too conservative. Power was very low in this scenario; therefore, for 2 studies no method appeared appropriate. In scenarios with 3 and 4 studies, BBST, BBCB1 (BBCB1 only for OR), GLFR and HKSJ performed best, with the first two methods having higher power than the last two. In scenarios with 5 and 10 studies, BBST, BBCB1 (BBCB1 only for OR) and HKSJ performed best, with the first two methods having higher power than the last one.

Discussion

We conducted a simulation study to compare BBMs with various standard meta-analytic methods in the case of very few studies. The BBST and the BBCB1 (BBCB1 only for OR) showed good results in the given data situations that were based on realistic data situations of Cochrane Reviews. The only standard MA method that showed acceptable results was HKSJ [39].

The attempt to extend the BBST by a random treatment effect term for the study, due to concerns about disrespecting the randomisation, failed. From the few results we obtained from the BBFR Gauss-Hermite quadrature, it could be seen that the 2 models varied the most in situations where a lot of heterogeneity between the study effects was present. As BBFR uses a random effect attached to the treatment effect in addition to the random intercept, this behaviour is to be expected, because only then “enough” additional heterogeneity remains to be estimated, i.e. not all heterogeneity goes into the random intercept. In practice, pooling will often not be appropriate in situations where even in the case of few studies there is such strong heterogeneity in the data. Thus, there is probably little benefit in using other approximation methods than Laplace in the case of sparse data, because either it has no impact on results, or it has only an impact in situations where there is a lot of heterogeneity. Our finding is in agreement with a recent study that assessed different approximations methods to perform meta-analyses using GLMMs in the case of rare events. This study found that the Gauss-Hermite quadrature is not superior to the Laplace approximation [38]. Thus, there seems to be no benefit in using BBFR in practical applications for meta-analyses of few studies. However, further research is necessary to obtain findings that are more conclusive.

Disrespecting the randomisation to a study arm of a specific study had no strong influence on the results of BBST and BBCB1. This was already pointed out more than 10 years ago [8] and is in line with a recent publication on arm-based (disrespecting randomisation) and contrast-based models (respecting randomisation) in network MA, where the authors conclude that both models are suitable for network MAs [40]. One reason for this is presumably that, especially in the case of few studies, heterogeneity cannot be estimated properly. As our simulation mirrors real MA situations, the problem may not be important in practice, as probably only a few data situations occur where this critical aspect of BBST and BBCB1 actually has negative consequences.

A very important point to keep in mind is the fact that we used adequate data situations based on RCT data. Our conclusions could be flawed when there is doubt about randomisation. Therefore, using another method respecting the randomisation to a study arm of a specific study as sensitivity analysis seems to be a reasonable approach. One could try a BBFR with more quadrature points than 1, but this might not work because of convergence problems. Thus, we recommend the use of standard procedures such as HKSJ in these situations. If the results and especially the point estimates are quite different, we would refrain from using BBST as a final method, because there probably is a problem with disrespecting the randomisation to a study arm of a specific study.

Surprisingly, BBCB2, which strictly respects randomisation, performed very badly regarding coverage probability under the alternative hypothesis. The narrow 95% CIs suggest that the standard errors are underestimated. This may be because only half of observations, concrete each study instead of each single group, contribute to the estimation of the parameters compared to BBCB1, where each arm of a study contributes to the estimation. Moreover, the assumption of homogeneity within one study may lead to overdispersion.

Although GLMMs are an intuitive alternative to the standard MA methods, GLFR and GLRRI performed quite poorly in many of our investigated scenarios. In Jackson et al. [5] both models (labelled as model 2 and 3) and their reparametrized versions were examined. In contrast to our results, the performance of GLFR was worse than that of GLRRI. Apart from this, the coverage probabilities for both models were below 95% for almost all 15 investigated scenarios, including 2 scenarios with 3 and 5 studies. The authors used maximum likelihood estimation based on adaptive quadrature with 7 (GLFR) and 1 quadrature point (GLRRI), whereas we used 1 quadrature point for each model. We do not think that the different number of quadrature points for GLFR can fully explain the differences in results. The noticeable difference in the simulation studies might be the difference in heterogeneity. Jackson et al. [5] simulated data with a median true τ for log OR of 0.024, whereas our median τ was \(\sqrt{0.273}=0.52\), that is, considerably larger. As seen in setting 15 in Jackson et al. [5], both models performed worse if a lot of heterogeneity was present (coverage probabilities below 90%). Therefore, in scenarios with higher heterogeneity, the GLFR with 1 quadrature point can be more robust than with 7 points, leading to better performance measures. The opposite is true for the GLRRI. The greater the heterogeneity, the lesser the assumption of one random effect for the intercept may be justified, resulting in worse performance measures. In agreement with this assumption, the GLRRI often did not converge in the case of large heterogeneity, i.e. large values of τ2. As heterogeneity in the data is better detectable with an increasing number of studies in the MAs, this could also explain why, counterintuitively, the convergence of GLRRI decreases with an increasing number of studies. The partly differing results of these models in different simulation studies indicate how important the parameter settings in simulation studies could be, and thus stress the necessity of (independent) replications of results from simulation studies [41].

Some problems exist when using the RR in MAs [42, 43]. The problems arise from using the log link for the RR while the event probability π is bound to [0, 1] and are more pronounced if the event probabilities are large. Apart from problems with convergence, BBST and GLFR showed quite satisfactory results. A post-hoc analysis revealed that the main reason for the worse convergence were large baseline probabilities, namely values of πkC > 0.5. In addition, in the case of πkC < 0.5, high heterogeneity between the studies and small RRs led to non-converged simulations. Therefore, the log link affects the usability of these methods and further research is needed to improve the performance of these models.

An exact method for combining the effect sizes of the studies has recently been proposed [44]. This method was originally proposed for continuous data but could be easily implemented for binary data when using the logit or log link. Unfortunately, this method does not solve the problem of studies with no events in one or both study arms, because the effect size estimates of the studies are used to construct the 95% CI of the overall effect. Therefore, the same drawbacks exist with a continuous correction as with standard MA methods (HKSJ, DSL, MH for OR and RR).

Günhan et al. [45] investigated a Bayesian BBST. In a simulation study this model showed inappropriate coverage probabilities for very low OR (log OR < − 2), suggesting that BBMs are not suitable in situations where very large effects are expected. Because we tried to investigate realistic scenarios, such extreme values occurred in only a few simulations (≤ 10 in 10,000 simulations), which might be an explanation why no problems regarding coverage probability of the BBST were observed in our simulation study.

Our study has some limitations. As for all simulation studies, our results are only valid for data constellations we investigated. Because we based our simulation data on MAs actually performed in Cochrane Reviews, i.e. rather realistic scenarios, this is probably not a big problem. Another limitation is that we only investigated studies with balanced (1:1) randomisation schemes in the MA. In the case of a mixture of different randomisation schemes (1:1, 2:1, 3:1) the result of the BBST could be affected by the fact that the model disrespects randomisation. This highlights the importance of a sensitivity analysis using a method that respects randomisation.

Concerns could exist about the interpretability of the results with varying effect sizes under the alternative hypothesis. Therefore, we re-analysed the data by classifying the true effect sizes into four groups (< 1st quartile, between 1st quartile and median, between median and 3rd quartile, > 3rd quartile; data not shown). As expected, this had no impact on coverage probability. Likewise, for all other performance measures the values were bigger (smaller) for high effect size categories and smaller (bigger) for lower effect size categories. Thus, the given results can be interpreted as the mean/median of the different simulated effect sizes.

By chance, there were no double-zero studies in our MAs, but only single-zero studies. The number of MAs with at least one single-zero study varied from about 36% (2 studies) to 58% (10 studies). In an additional analysis (data not shown), the exclusion of MAs with single-zero studies led to similar results. However, this could be different if the number of single- and double-zero studies increases, which would require further investigation.

Conclusion

In the case of very few (2 – 4) studies, the beta-binomial models BBST and BBCB1 (BBCB1 only for odds ratio) are valuable alternatives to standard random effects meta-analytic models, if the corresponding 95% confidence intervals for BBST and BBCB1 are constructed using the t distribution with 2K − 2 degrees of freedom.

For meta-analyses with 2 studies, no general recommendation for a specific model can be given due to very conservative coverage probabilities and very low power of all investigated methods. The application of a fixed effect model, if appropriate, or a qualitative summary of the study results could be a solution. For meta-analyses with 3 and 4 studies, the BBST and BBCB1 can be recommended in conjunction with a sensitivity analysis using HKSJ or another adequate method for a random effects model. For meta-analyses with 5 or more studies, the use of HKSJ is recommended. BBST and BBCB1 are useful methods for sensitivity analyses in this case.