Estimation of Effect Heterogeneity in Rare Events Meta-Analysis

The paper outlines several approaches for dealing with meta-analyses of count outcome data. These counts are the accumulation of occurred events, and these events might be rare, so a special feature of the meta-analysis is dealing with low counts including zero-count studies. Emphasis is put on approaches which are state of the art for count data modelling including mixed log-linear (Poisson) and mixed logistic (binomial) regression as well as nonparametric mixture models for count data of Poisson and binomial type. A simulation study investigates the performance and capability of discrete mixture models in estimating effect heterogeneity. The approaches are exemplified on a meta-analytic case study investigating the acceptance of bibliotherapy. Supplementary Information The online version contains supplementary material available at 10.1007/s11336-021-09835-5.

In this supplement, we provide additional material for the article "Estimation of effect heterogeneity in rare events meta-analysis". In the first section, we present tables of quantiles of the empirical distribution ofτ 2 from the simulation study which was described in the main article. In the second section, we describe and evaluate an additional simulation study where we assess the performance of the finite mixture models described in the main article, namely the log-linear and the logistic mixture model with and without effect heterogeneity, for a larger number of simulation conditions. Table S1 displays minimum, maximum and different quantiles of the empirical distribution ofτ 2 of the log-linear mixture models with heterogeneous effects and 2 or 3 components (i.e., S = 2 or S = 3), while Table S2 displays the same figures for the logistic mixture models with heterogeneous effects and 2 or 3 components. Table S1.  Table S2.

S2 EFFECT HETEROGENEITY IN RARE EVENTS META-ANALYSIS
Quantiles of the empirical distribution ofτ 2 of the logistic mixture model

S2. Additional simulation study
This section is structured as follows: First, we describe the design of the simulation study which was conducted and the performance measures which were used. Then, we evaluate the performance of the models investigated. Finally, we formulate a brief conclusion. The simulation study which we describe here was implemented in R (R Core Team, 2020) and run on the computing cluster PALMA II (https://www.uni-muenster.de/ZIV/Technik/Server/HPC.html) at the University of Münster.

S2.1 Simulation conditions
We investigated a total number of 216 simulation conditions. In 144 of these conditions, we generated the data from two components (i.e., S = 2). In the remaining 72 conditions, we generated the data from three components (i.e, S = 3). Different conditions were distinguished by the number of studies k, the control group sample size n 0 , the component weights q s (s ∈ {1, ..., S}), the component baseline probabilities p 0,s , and the true value of τ 2 .
The values of these parameters which were examined in the simulation study are displayed in Table S3. The values of k were inspired by a review of systematic reviews by Moher et al. (2007), and roughly reflect the 25% quantile, the median and the 75% quantile of their sample of Non-Cochrane reviews. The values of n 0 , 50 and 500, were selected since n 0 = 50 mirrors the median number of subjects per group from a review of meta-analyses from the Cochrane Database of Systematic Reviews by Turner et al. (2012), while n 0 = 500 mirros the median number of subjects per group of the Non-Cochrane sample from the above-cited review by Moher et al. (2007). The values of τ 2 reflect conditions without and with moderate heterogeneity. Specifically, the value τ 2 = 0.36 was obtained from a reanalysis of Cochrane reviews of adverse events conducted by Beisemann et al. (2020).
We only considered balanced groups, i.e., n 1 = n 0 . Furthermore, we setβ = 0 for all conditions. For the simulation conditions with S = 2, the component probability of the second component can be obtained as q 2 = 1 − q 1 . The values for β 1 and β 2 can then be obtained by the following equations: and S5

EFFECT HETEROGENEITY IN RARE EVENTS META-ANALYSIS
These are obtained as solutions to the equations and Note that the same equations with reversed signs (i.e.,β 1 =β − (q 2 /q 1 )τ and β 2 =β + (q 1 /q 2 )τ ) would also be solutions to these equations. However, these alternative solutions are incorporated via the simulation design by (i) including the permutation of each pair (p 0,1 , p 0,2 ) and (ii) includingq 1 = 1 − q 1 in the simulation design for each value of q 1 .

S2.2 Data generation
For each condition, 5500 replications were generated. In each replication, meta-analytic data were generated with the following procedure: First, the class s of each study was sampled: For conditions with S = 2, the class was sampled from a Bernoulli-distribution Bern(q 1 ). For conditions with S = 3, the class was sampled from a multinomial distribution with parameters n = 1 and (p 1 , p 2 , p 3 ) = (q 1 , q 2 , q 3 ).
Then, two separate data sets were generated, one for the simulation study of log-linear mixture models, and one for the simulation study of logistic mixture models. The generation of two separate data sets was necessary to ensure that for both models, true heterogeneity matched the value of τ 2 of the respective simulation condition. Thus, for the first data set, the event probability in the treatment group was calculated based on the relative risk (RR), such that p 1,s = exp(β s ) · p 0,s . For the second data set, p 1,s was calculated based on the odds ratio (OR), such that p 1,s = y/(1 + y), with y = (exp(β s ) p 0,s )/(1 − p 0,s ). Finally, for each data set, the observations for each group within each study were drawn from a Bin(n j , p j,s ) distribution, where j is the group index (with j = 0: control group, j = 1: treatment group).

S2.3 Model fitting
Just like for the simulation study described in the main article, we fitted log-linear and logistic mixture models with and without effect heterogeneity. For conditions with S = 2, models with S = 1, S = 2 and S = 3 components were fitted. Thus, in total five different models were fitted to each data set for these conditions (since models with and without effect heterogeneity with S = 1 are identical). For conditions with S = 3, models with S = 4 were fitted in addition to the models with up to three components, such that in total seven different models were fitted to each data set for these conditions. As described above, log-linear mixture models were fitted to the first data set, while logistic mixture models were fitted to the second data set. All models were fitted with the flexmix package (Grün and Leisch, 2008) using the stepFlexmix function with nrep = 10. Parameter estimation was evaluated in terms of mean bias, median bias, and the standard deviation ofβ andτ 2 . In addition, we visually inspected histograms of the estimates of β s for s = 1, ..., S for all simulation conditions. We will present a representative selection of these histograms in the final part of the results section. Further results on these within-component parameter estimates can be obtained from the corresponding author on request.

S2.5 Simulation results
Before the simulation results were calculated, we excluded simulation replications in which one of the following warnings had occurred: "glm.fit: fitted probabilities numerically 0 or 1 occurred", "glm.fit: algorithm did not converge". For the log-linear model, no simulation replications had to be excluded. For the logistic model, the number of simulation replications which were excluded ranged from 0 to 653 across all 216 conditions. S2.5.1. Model selection Figure S1 shows the percentage of simulation replications in which the correct model was selected for conditions with S = 2. Results are displayed separately for the log-linear (red line) and the logistic model (blue line), and for both information criteria (AIC: solid line, BIC: dashed line). In the following, we briefly outline the most important results: • In conditions with large sample sizes (n 0 = 500), both information criteria performed well in terms of selecting the correct model, irrespective of the number of studies, the values of p 0,j and the value of τ 2 .
model selection performance was best for those conditions in which (i) the components were more separated in terms of the difference between p 0,1 and p 0,2 or (ii) effects were truly heterogeneous (i.e., τ 2 = 0.36).
the BIC performed slightly better than the AIC in selecting the correct model.

S8 EFFECT HETEROGENEITY IN RARE EVENTS META-ANALYSIS
• In conditions with small sample sizes (n 0 = 50), model selection performance was severely impaired in many conditions, in particular for small numbers of studies, homogeneous effects (τ 2 = 0) and when components were less separated in terms of the difference between p 0,1 and p 0,2 .
model selection performance only reached a satisfactory level when components were more separated in terms of the difference between p 0,1 and p 0,2 .
the AIC yielded better results in terms of selecting the correct model.
• In conditions with large sample sizes and in conditions in which the components were more separated in terms of the difference between p 0,1 and p 0,2 , either a model with two components and heterogeneous effects or a model with three components and homogeneous effects was often chosen in the few trials in which the correct model was not selected (results not shown).
• In conditions with small sample sizes, small numbers of studies and when components were less separated in terms of the difference between p 0,1 and p 0,2 , a model with one component was often selected instead of the correct model (results not shown). • In conditions with large sample sizes (n 0 = 500), model selection performance was good for the majority of simulation conditions. The correct model was usually selected in about 75% of simulation replications, irrespective of the number of studies.
in simulation replications in which the correct model was not selected, a model with four components which was correctly specified in terms of effect heterogeneity was often selected erroneously. When effects were truly homogeneous (i.e., τ 2 = 0), the AIC also sometimes favoured a model with three components but with heterogeneous effects (results not shown).
-Usually, performance was better in conditions with true heterogeneity (τ 2 = 0.36), with the exception of conditions with p 0,1 = 0.2, p 0,2 = 0.05, and p 0,3 = 0.1, in which the correct model was only selected in around 50% of simulation replications. For these conditions, we found that a model with four components was often selected erroneously.
the BIC usually performed slightly better than the AIC in selecting the correct model.

S9 EFFECT HETEROGENEITY IN RARE EVENTS META-ANALYSIS
• In conditions with small sample sizes (n 0 = 50), model selection performance was usually unsatisfactory. Although performance improved with larger numbers of studies, the percentage of simulation replications in which the correct model was favoured was often below 50% even for conditions with k = 40.
often, a model with two components was favoured by the AIC and BIC instead of the correct model (results not shown).
the AIC usually performed notably better in selecting the correct model than the BIC, but still did not achieve a satisfactory selection performance in most conditions. Since results of log-linear and logistic mixture models were similar, we will describe them simultaneously. The most important results were the following: • In conditions with homogeneous effects (τ 2 = 0), bias was negligible for all models investigated.
standard deviations ofβ were small for all models without effect heterogeneity, and also for models with effect heterogeneity in conditions with large sample sizes (i.e., n 0 = 500).
standard deviations were larger for models with effect heterogeneity in conditions with small sample sizes (i.e., n 0 = 50), in particular in conditions with small numbers of studies or when components were less separated in terms of the difference between p 0,1 and p 0,2 . Standard deviations were notably larger for the model with three components than for the model with two components.
• In conditions with heterogeneous effects (τ 2 = 0.36), the correct model (i.e., a model with two components and with effect heterogeneity) usually performed well in terms of bias. In some conditions with small numbers of studies, the log-linear model yielded a slight negative mean bias (compare subfigures F and G of Figure S2).
models without effect heterogeneity sometimes yielded a pronounced and systematic bias (as can be seen in subfigures F, G and H of Figures S2, S3, S5 and S6).

S10 EFFECT HETEROGENEITY IN RARE EVENTS META-ANALYSIS
the log-linear model with effect heterogeneity and three components sometimes had a notable mean bias in particular for small numbers of studies (see for example subfigures F and H of Figure S2). However, its median bias was negligible.
results with regard to SD(β) were similar to the results which were described for conditions with homogeneous effects (see above).
Figures S9 to S14 illustrate the results with regard to the estimation ofβ for conditions with S = 3. Figures S9-S11 depict mean bias, median bias and SD(β) for the log-linear mixture models, respectively. Figures S12-S14 depict mean bias, median bias and SD(β) for the logistic mixture models, respectively. Again, results of log-linear and logistic mixture models were similar, and will thus be described simultaneously. These were the most important results: • In conditions with homogeneous effects (τ 2 = 0), bias was negligible for all models investigated.
-SD(β) was small for all models without effect heterogeneity across simulation conditions, and also for models with effect heterogeneity in conditions with n 0 = 500.
-SD(β) was larger for models with effect heterogeneity in conditions with n 0 = 50, in particular for the log-linear model, and increased for models with a larger number of components and simulation conditions with smaller numbers of studies.
• In conditions with heterogeneous effects (τ 2 = 0.36), models with effect heterogeneity usually yielded a bias close to zero, in particular for conditions with larger numbers of studies. In one condition, the model with S = 2 and with effect heterogeneity yielded a systematic mean and median bias (see subfigure L of Figures S9 and S10). However, the correct model always performed well in terms of both mean and median bias.
models without effect heterogeneity often showed a systematic bias which did not decline for larger numbers of studies.
the findings with regard to SD(β) were similar as for conditions with homogeneous effects.

S2.5.3. Estimation of τ 2
Figures S15 to S17 show the results with regard to mean bias ofτ 2 , median bias ofτ 2 , and SD(τ 2 ) for conditions with S = 2. Results are displayed separately for log-linear mixture models (red lines) and logistic mixture models (blue lines) with two components (solid lines) S11

EFFECT HETEROGENEITY IN RARE EVENTS META-ANALYSIS
and three components (dashed lines). Note that all models depicted in these figures are models with heterogeneous effects, sinceτ 2 = 0 for all models with homogeneous effects. The main results are the following: • In conditions with large sample sizes (n 0 = 500), bias ofτ 2 was negligible for all models.
• In conditions with small sample sizes (n 0 = 50), both mean and median bias were usually small for models with two components. In conditions in which the components were less separted in terms of the difference between p 0,1 and p 0,2 , there was sometimes a small positive bias for small numbers of studies, in particular when effects were truly homogeneous.
larger biases were typically observed for models with three components, and in particular for the log-linear model. Generally, bias was larger for smaller numbers of studies.
-SD(τ 2 ) was often notably larger than in conditions with large sample sizes. This affected in particular those conditions in which components were less separated in terms of the difference between p 0,1 and p 0,2 and conditions with small numbers of studies. SD(τ 2 ) was larger for models with three components than for models with two components, and was particularly large for the log-linear models.
Figures S18 to S20 show the results with regard to mean bias ofτ 2 , median bias ofτ 2 , and SD(τ 2 ) for conditions with S = 3. Results are displayed separately for log-linear mixture models (red lines) and logistic mixture models (blue lines) with two components (solid lines) and three components (dashed lines). These are the main results: • In conditions with large sample sizes (n 0 = 500), bias ofτ 2 was negligible for all models in conditions with homogeneous effects.
models with S = 2 sometimes yielded a pronounced and systematic negative bias in conditions with heterogeneous effects, irrespective of the number of studies.
• In conditions with small sample sizes (n 0 = 50), models with larger numbers of components (i.e., S = 3 and S = 4) -in particular the log-linear mixture models -often yielded a pronounced mean bias in conditions with S12 EFFECT HETEROGENEITY IN RARE EVENTS META-ANALYSIS smaller numbers of studies. Median bias of these models was notably smaller than mean bias, which raises the suspicion that mean bias was affected by outliers.
large values of SD(τ 2 ) were obtained for models with S = 3 and S = 4, in particular in conditions with smaller numbers of studies. Particularly large standard deviations were obtained from the log-linear mixture models, and similar to mean bias, these might have been affected by outliers. the variances were notably larger and estimates sometimes had a large mean bias, in particular when components were less well separated in terms of the difference between p 0,1 p 0,2 . Similar results were obtained for conditions with different numbers of studies, different orders of (p 0,1 , p 0,2 ) and for different values of q 1 as well as for the logistic mixture model. Figure S22 shows histograms ofβ 1 ,β 2 andβ 3 estimated by a log-linear model with three components and with effect heterogeneity for a selection of simulation conditions with three components. The parameter configuration of the respective simulation condition is denoted in the titles of the subfigures. The mapping of component estimates toβ 1 ,β 2 andβ 3 was obtained in an analogous way as for the results from the simulation with S = 2. Similar to the results for the simulation with S = 2, we see that component estimates were almost unbiased and had small variances for large sample sizes within studies. In contrast, variances were much larger for conditions with small sample sizes and estimates often had a notable mean bias. Similar results were obtained for conditions with different numbers of studies and different orders of (p 0,1 , p 0,2 , p 0,3 ) as well as for the logistic mixture model.

S2.7 Conclusion
The following limitations of our simulation study must be considered: • We only considered simulation conditions with a relatively large number of studies.

S13 EFFECT HETEROGENEITY IN RARE EVENTS META-ANALYSIS
Meta-analyses of rare events from the Cochrane Database of Systematic Reviews are typically based on a smaller number of studies. Note that our results might not be generalisable to smaller numbers of studies.
• We only considered simulation conditions with two or three components. Results might not be generalizable to larger numbers of components.
• We did not consider alternative models, such as generalized linear mixed models, as competitors in our simulation.
• We did not investigate simulation scenarios in which all models were misspecified.
We formulate the following conclusions of our simulation study: • For both log-linear and logistic mixture models, a good model selection performance can be achieved with both the AIC and the BIC as long a (i) sample sizes within studies are large or (ii) components are well separated in terms of their baseline event probabilities. If none of this is the case, a good selection performance might still be achieved if the number of studies is very large, at least when there are only two underlying components. When it is expected that there are more than two components, we strongly recommend to use the AIC and BIC for model selection only if sample sizes within studies are fairly large. By "good selection performance", we mean that the correct model was selected in a large percentage of simulation replications among (i) models which are misspecified in terms of the number of components, (ii) models which are misspecified in terms of their assumption regarding effect heterogeneity and (iii) models which are misspecified in both ways.
• Regarding the estimation of a pooled effect (β), both log-linear and logistic mixture models yield almost unbiased estimates with small standard deviations when there are two underlying components and (i) effects are truly homogeneous or (ii) the correct model is selected. For three components, we obtained similar results, but note that larger numbers of studies or large sample sizes within studies might be required to achieve almost unbiased estimates when effects are truly heterogeneous.
• Regarding the estimation of heterogeneity (τ 2 ), almost unbiased estimates with small standard deviations can be obtained from log-linear and logistic mixture models as long as the correct model is selected and either sample sizes within studies are large or the number of studies is large.
• As long as sample sizes within studies are large, one can expect to obtain estimates of component effect sizes which are almost unbiased.

References
Beisemann, M., Doebler, P. and Holling, H. (2020). Comparison of random-effects meta-analysis models for the relative risk in the case of rare events: A simulation study.              Figure S18. Mean Bias ofτ 2 for conditions with S = 3. Results for the log-linear model are depicted in red, results for the logistic model are depicted in blue. Models with 2, 3 and 4 components are depicted with solid and dashed and long-dashed lines, respectively. A-F: Simulation conditions with τ 2 = 0 (homogeneous effects), G-L: Simulation conditions with τ 2 = 0.36 (heterogeneous effects). Missing values were outside the range of the y-axis.  Figure S22. Histograms ofβ1,β2 andβ3 estimated by a log-linear model with three components and with effect heterogeneity for selected simulation conditions of the simulation with S = 3. Solid black lines indicate the true values of β1, β2 and β3. Means (dashed line) and medians (solid line) ofβ1,β2,β3 are shown in the same colour as the respective histogram.