INTRODUCTION

Network meta-analysis (NMA) is a statistical method often used to draw conclusions about multiple-treatment comparisons.1,2,3 It simultaneously synthesizes both direct and indirect evidence, where the direct evidence comes from head-to-head trials while the indirect evidence comes from indirect comparisons with common comparators.4,5 For example, the comparison between two active drugs A and B can be informed by indirect comparisons of A vs. C and B vs. C, where C may be placebo or standard care, or by direct comparison in clinical trials of A vs. B.
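The indirect comparison described above can be illustrated with the simplest adjusted indirect estimate through a common comparator: the log odds ratio of A vs. B is the difference of the A vs. C and B vs. C log odds ratios, and the variances add. The sketch below is only an illustration of that idea, not the model used in this study; the function name and all input values are hypothetical.

```python
import math


def indirect_log_or(log_or_ac, se_ac, log_or_bc, se_bc):
    """Indirect estimate of A vs. B through the common comparator C.

    Assumes the A vs. C and B vs. C estimates come from independent trials.
    """
    log_or_ab = log_or_ac - log_or_bc
    se_ab = math.sqrt(se_ac ** 2 + se_bc ** 2)
    return log_or_ab, se_ab


# Hypothetical inputs (not taken from any NMA in this article).
estimate, se = indirect_log_or(log_or_ac=-0.50, se_ac=0.20,
                               log_or_bc=-0.20, se_bc=0.15)
lower, upper = estimate - 1.96 * se, estimate + 1.96 * se
print(f"indirect log OR (A vs. B): {estimate:.2f}, SE {se:.2f}")
print(f"approximate 95% interval: ({lower:.2f}, {upper:.2f})")
```

Because the two variances add, an indirect estimate is less precise than either of its inputs, which is one reason combining it with direct evidence can narrow the interval.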

In addition to the advantage of combining direct and indirect evidence, NMAs improve the precision of estimates (i.e., make the confidence/credible intervals narrower).6,7 This precision, however, is affected by the amount of heterogeneity between studies, because heterogeneity contributes to the modeled uncertainty and therefore to the width of confidence/credible intervals.8,9 Currently, many NMAs are performed via a Bayesian framework that uses a prior distribution for the between-study heterogeneity.10 Some NMAs use traditional non-informative or weakly informative priors for heterogeneity.11,12,13 Recently, Turner et al.14 have suggested informative log-normal priors based on a large database of conventional pairwise meta-analyses in the Cochrane Library. These empirical priors have the potential to improve the precision of the treatment effect estimates, especially when the number of studies is small.

The choice of prior distribution is important and should be explicitly reported according to the PRISMA-NMA statement (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Network Meta-Analysis).15 It is critical that authors of NMAs describe the choices and assumptions made in selecting the prior distribution, both for transparency and to allow the work to be reproduced. Nevertheless, it has been found that the rapid growth of NMAs published in recent years was not accompanied by better methodological and reporting quality.16,17 Therefore, we conducted this empirical study to assess recent NMAs published in high-impact medical journals for the quality of reporting heterogeneity priors (in terms of distribution and rationale) and to evaluate how NMAs’ conclusions would differ based on applying various commonly used prior distributions.

METHODS

Data Collection

We designed and executed a literature search in July 2019 for research articles published in The BMJ, JAMA, and The Lancet between January 1, 2010, and December 31, 2018, using the terms “network meta-analysis,” “network meta-analyses,” “multiple-treatment comparison,” “multiple-treatment meta-analysis,” and “multiple-treatment meta-analyses.” If Bayesian models were used, we examined whether the original articles gave information about prior choices for heterogeneity and their rationales.

In our primary analyses, we excluded methodological reviews that did not present original data. The outcome type was restricted to binary outcomes, because the heterogeneity may depend on the measurement scale for other outcome types (e.g., for continuous outcomes) and would need to be modeled on a case-by-case basis. In addition, we focused on articles whose full NMA datasets were publicly available. Originally reported effect measures, outcomes, studies, treatments, event counts, sample sizes, statistical methods (frequentist or Bayesian) used for NMAs, and prior distributions (for Bayesian NMAs) were obtained from the published articles (and the corresponding supplemental files). If the authors used the Bayesian method but did not report their prior distributions, we contacted them for information about the priors. An article may report multiple NMAs with various outcomes; our analyses focused on the primary outcome. If no primary outcome was specified, we used the NMA with the largest number of studies. In addition, regardless of the original effect measures, we re-analyzed the collected data based on the odds ratio (OR) (on a logarithmic scale) for a consistent comparison across NMAs.
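To make the common effect measure concrete, the following sketch computes a log OR and its large-sample standard error from the event counts and sample sizes of a single two-arm trial. It only illustrates the scale on which results are compared; the re-analyses in this study are Bayesian and model arm-level counts directly. The function name, the 0.5 continuity correction for zero cells, and the example counts are assumptions made for illustration.

```python
import math


def log_odds_ratio(events_trt, n_trt, events_ctl, n_ctl, correction=0.5):
    """Log OR (treatment vs. control) and its large-sample standard error.

    If any cell of the 2x2 table is zero, `correction` is added to every
    cell; this convention is an assumption of the sketch.
    """
    a, b = events_trt, n_trt - events_trt
    c, d = events_ctl, n_ctl - events_ctl
    if 0 in (a, b, c, d):
        a, b, c, d = (x + correction for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se


# Hypothetical trial: 10/100 events on treatment, 20/100 on control.
log_or, se = log_odds_ratio(10, 100, 20, 100)
print(f"log OR = {log_or:.3f}, SE = {se:.3f}, OR = {math.exp(log_or):.3f}")
```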

Prior Distribution Choices for Heterogeneity

The Bayesian framework treats unknown parameters, such as overall treatment effects and heterogeneity variances, as random variables and estimates them by combining assigned prior distributions with the data. This is commonly implemented via the Markov chain Monte Carlo (MCMC) algorithm.18,19 The Supplementary Material presents the Bayesian contrast-based random-effects model to perform an NMA (Appendix A). Among the various parameters to be estimated, the heterogeneity variance, denoted by τ², plays a critical role, because researchers often hold diverse opinions about its prior and the choice can have an influential impact on the credible intervals (CrIs) of treatment comparisons. On the other hand, researchers generally agree about the prior for treatment effects (log ORs in this study), which is usually non-informative and follows a normal distribution with mean 0 and a very large variance (e.g., 100²).

To re-analyze the collected datasets, we considered three different non-informative (or arguably, weakly informative) prior distributions for the heterogeneity variance or standard deviation: the inverse-gamma, uniform, and half-normal distributions. Table 1 summarizes the multiple choices.

Table 1 Summary of Prior Distributions for the Heterogeneity Component (Variance τ² or Standard Deviation τ) for Odds Ratios

The inverse-gamma prior IG(α, β) is conjugate (that is, it produces a posterior also in the inverse-gamma family) and therefore may facilitate the computation in the MCMC algorithm. It also has the potential to improve both stability and convergence, and may be useful for sparse data.20 The hyper-parameters α and β (determining distribution shape and scale, respectively) are conventionally assigned values close to 0. As both hyper-parameters approach 0, the prior approaches a flat distribution for the heterogeneity variance on a logarithmic scale. We consider three choices of the hyper-parameters, setting both α and β to 0.1, 0.01, or 0.001.

The uniform prior U(0, c) is another commonly used prior for the heterogeneity standard deviation τ. Here, c denotes the upper bound of the uniform distribution, and is assigned to values 2, 5, and 10 in our analyses; these are common choices for log ORs in practice. We also considered the half-normal prior HN(0, σ²) for τ.21 This distribution is generated by taking the absolute value of a random variable that follows the normal distribution N(0, σ²). The hyper-parameter σ² determines the range of heterogeneity,22 and is assigned values 0.5, 1, and 2 in our analyses.
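The three prior families can be compared with a short numerical sketch. It evaluates the inverse-gamma density for the variance and draws τ from the uniform and half-normal priors with the hyper-parameters listed above, using the absolute-value construction for the half-normal. The function names, the random seed, and the number of draws are arbitrary assumptions; this is not the code in the Supplementary Material.

```python
import math
import random
import statistics


def inverse_gamma_pdf(x, alpha, beta):
    """Density of IG(alpha, beta) at x > 0, with shape alpha and scale beta."""
    log_pdf = (alpha * math.log(beta) - math.lgamma(alpha)
               - (alpha + 1.0) * math.log(x) - beta / x)
    return math.exp(log_pdf)


def draw_uniform_tau(c, rng):
    """One draw of tau from U(0, c)."""
    return rng.uniform(0.0, c)


def draw_half_normal_tau(sigma2, rng):
    """One draw of tau from HN(0, sigma2): absolute value of N(0, sigma2)."""
    return abs(rng.gauss(0.0, math.sqrt(sigma2)))


rng = random.Random(2019)  # arbitrary seed, for reproducibility only
n_draws = 20000

uniform_medians = {
    c: statistics.median([draw_uniform_tau(c, rng) for _ in range(n_draws)])
    for c in (2, 5, 10)
}
half_normal_medians = {
    s2: statistics.median([draw_half_normal_tau(s2, rng) for _ in range(n_draws)])
    for s2 in (0.5, 1, 2)
}

# On a log scale the density of the variance is x * pdf(x); for small
# hyper-parameters it is nearly constant over a wide range of x.
flatness_ratio = (10.0 * inverse_gamma_pdf(10.0, 0.001, 0.001)
                  / (1.0 * inverse_gamma_pdf(1.0, 0.001, 0.001)))

print("prior median of tau, uniform:", uniform_medians)
print("prior median of tau, half-normal:", half_normal_medians)
print("IG(0.001, 0.001) log-scale density ratio (x = 10 vs. x = 1):", flatness_ratio)
```

The prior medians show how much mass each choice places on large heterogeneity: a wider uniform bound or a larger σ² shifts the prior toward larger τ, which is consistent with wider CrIs when the data carry little information about τ.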

In addition to the non-informative priors above for heterogeneity, we considered the empirical informative priors derived by Turner et al.14 for log ORs, which were grouped based on outcome type and treatment comparison type. Specifically, the treatment comparisons were classified into three groups, i.e., pharmacological treatment vs. placebo/control, pharmacological treatment vs. pharmacological treatment, and comparisons involved with non-pharmacological treatments (e.g., medical devices, surgical procedures). The outcome types were classified as all-cause mortality, semi-objective outcomes (e.g., cause-specific mortality, major morbidity events, obstetric outcomes), and subjective outcomes (e.g., pain, mental health outcomes, general physical health). For NMAs containing multiple comparison types, the type of pharmacological vs. pharmacological treatment comparison was used for the primary analyses.

Statistical Analyses

For each dataset, we re-performed the random-effects NMAs with the non-informative priors and the informative priors in Table 1 for the heterogeneity parameter τ or τ². The non-informative prior N(0, 100²) was used for all treatment effects (log ORs), and we assumed consistency between direct and indirect evidence in each NMA. In addition, all treatment comparisons in each NMA were assumed to share a common heterogeneity variance.5 Of note, the NMA model used in this article was contrast-based, because it is currently the most widely used model and the informative priors were derived under the contrast-based framework. Many alternatives, such as the arm-based model (which focuses on estimating each treatment arm’s absolute effect), may also be used for NMAs.23
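For orientation, a contrast-based random-effects model with a common heterogeneity variance is often written as follows for two-arm trials. This is a generic textbook formulation given as a sketch; the exact model used here is the one in Appendix A of the Supplementary Material, and the symbols below (r, n, p, μ, δ, d, and the baseline arm b) are notation assumed for this illustration. The adjustment for correlated contrasts in multi-arm trials is omitted.

```latex
\begin{aligned}
r_{ik} &\sim \mathrm{Binomial}(n_{ik},\, p_{ik}), \\
\operatorname{logit}(p_{ib}) &= \mu_i, \qquad
\operatorname{logit}(p_{ik}) = \mu_i + \delta_{i,bk} \quad (k \neq b), \\
\delta_{i,bk} &\sim N\!\left(d_k - d_b,\ \tau^2\right), \\
d_1 &= 0, \qquad d_k \sim N\!\left(0,\ 100^2\right) \quad (k > 1).
\end{aligned}
```

Here r and n are the event count and sample size of arm k in study i, μ is the study-specific baseline log odds, δ is the study-specific log OR relative to the baseline arm b, and d is the overall log OR of each treatment relative to the reference treatment 1. The consistency assumption enters through the mean d_k − d_b, and the prior under study is the one placed on τ or τ².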

The Bayesian analyses were conducted with R (version 3.6.2) package “rjags” (version 4-9). The models were implemented via the MCMC algorithm with three chains;24,25,26,27 each chain contained a 50,000-run burn-in period for achieving stabilization and convergence. The samples generated during the burn-in period were discarded prior to the final analyses; the final posterior distributions for each NMA were based on a run of 200,000 updates after the burn-in period. We checked trace plots to assess MCMC convergence. Trace plots with long-term trends or drifts, instead of stable up-and-down variation, may indicate non-convergence; see Figs. S1–S4 in the Supplementary Material for illustrations. The MCMC may not converge well in cases such as extreme posterior samples of ORs (seemingly produced by many zero event counts) or improper priors. When the Markov chains converge well, the posterior medians and 95% equal-tailed CrIs can be reliably used as estimates of the parameters of interest. CrIs of log ORs not covering 0 indicated significant treatment comparisons. We obtained the posterior estimates for all treatment comparisons and the heterogeneity variance in each NMA. We also calculated the width of the 95% CrI of the log OR for each comparison, which reflects the estimate’s precision.
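The posterior summaries described above (median, 95% equal-tailed CrI, CrI width, and significance defined as a CrI excluding 0) can be sketched as follows. The draws here are simulated from a normal distribution purely as a stand-in for post-burn-in MCMC output; the function name, seed, and simulated values are assumptions, and the sketch is not the code in the Supplementary Material.

```python
import random
import statistics


def summarize_log_or(samples):
    """Posterior median, 95% equal-tailed CrI, its width, and significance."""
    # statistics.quantiles with n=40 returns cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(samples, n=40)
    lower, upper = cuts[0], cuts[-1]
    return {
        "median": statistics.median(samples),
        "lower": lower,
        "upper": upper,
        "width": upper - lower,
        "significant": lower > 0 or upper < 0,  # CrI does not cover 0
    }


# Simulated stand-in for posterior draws of one log OR (hypothetical values).
rng = random.Random(42)
draws = [rng.gauss(-0.5, 0.2) for _ in range(20000)]

summary = summarize_log_or(draws)
print(summary)
```

Applying the same function to the draws obtained under each heterogeneity prior gives the point estimates, CrI widths, and significance indicators that are compared across priors.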

Correlation coefficients between the non-informative priors and informative priors were calculated for both point estimates (posterior median log ORs) and CrI widths for each NMA and for all NMAs combined. Bland–Altman plots were used to evaluate the agreement between these results. The kappa statistic, κ, was also calculated to quantify the agreement of statistical significance between the treatment effects produced by the different priors. This statistic is upper bounded by 1; roughly, κ < 0 indicates no agreement, and κ within 0–0.4, 0.4–0.6, and 0.6–1 indicates weak, moderate, and strong agreement, respectively.28,29
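The three agreement measures can be sketched with standard formulas: the Pearson correlation, the Bland–Altman bias with 95% limits of agreement, and Cohen's kappa for two binary significance indicators. All input values below are hypothetical, the function names are assumptions, and returning None for kappa when the chance agreement equals 1 (for example, when every comparison is non-significant under both priors) is a choice made for this sketch.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bland_altman(x, y):
    """Mean difference (bias) and 95% limits of agreement."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def cohen_kappa(sig_a, sig_b):
    """Cohen's kappa for two lists of booleans; None if undefined."""
    n = len(sig_a)
    p_obs = sum(a == b for a, b in zip(sig_a, sig_b)) / n
    pa, pb = sum(sig_a) / n, sum(sig_b) / n
    p_exp = pa * pb + (1 - pa) * (1 - pb)
    if p_exp == 1.0:
        return None  # chance agreement is 1, so kappa is 0/0
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical 95% CrI widths under an informative and a non-informative prior.
width_inf = [0.80, 1.10, 1.50, 2.00]
width_non = [0.90, 1.30, 1.60, 2.40]
r = pearson_r(width_inf, width_non)
bias, loa_low, loa_high = bland_altman(width_inf, width_non)

# Hypothetical significance indicators for ten treatment comparisons.
sig_inf = [True, True, True] + [False] * 7
sig_non = [True, True] + [False] * 8
kappa = cohen_kappa(sig_inf, sig_non)

print(f"r = {r:.3f}, bias = {bias:.2f}, limits = ({loa_low:.2f}, {loa_high:.2f})")
print(f"kappa = {kappa:.2f}")
```

In this toy example, the negative bias indicates narrower CrIs under the informative prior, and kappa falls below 1 because one comparison is significant under the informative prior only.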

Secondary analyses were performed for NMAs which contained a placebo or control treatment. Among these NMAs, we additionally considered the informative prior of the comparison type of pharmacological treatments vs. placebo/control (Table 1).

RESULTS

Basic Characteristics

The literature search identified 67 research articles containing NMAs. Of the 44 NMAs that used the Bayesian framework, 52.3% of the NMAs did not explicitly provide the prior distributions of heterogeneity, and 84.1% did not provide rationales for the prior choices (Table S1 in the Supplementary Material). A total of 19 NMAs met inclusion criteria for our primary analyses (Fig. 1). Total sample sizes in the selected NMAs ranged from 792 to 111,282; numbers of treatments ranged from 3 to 23; and numbers of studies ranged from 7 to 473. We denoted each NMA by the first author’s surname with the publication year of the corresponding article. Table 2 presents summaries of these NMAs; the complete references of these NMAs are in the Supplementary Material (Appendix B). Of note, the NMA of Wu 2013 contained zero events in many treatment arms, causing poor MCMC convergence in our re-analyses; thus, we only present the results of the remaining 18 NMAs in the following.

Fig. 1 Flow diagram of network meta-analysis selection.

Table 2 Summaries of the 19 Network Meta-analyses

Overall Impact of Priors

Figure 2 compares the posterior median ORs and 95% CrI widths produced by non-informative priors with those by informative priors among all 18 NMAs. There was a nearly perfect correlation between posterior median (log) ORs by each type of non-informative priors and those by the informative priors; the correlation coefficients for each set of hyper-parameters were larger than r = 0.99. The correlations decreased in terms of 95% CrI widths. Specifically, 95% CrI widths produced by the informative prior were strongly correlated with those by inverse-gamma priors IG(0.1, 0.1), IG(0.01, 0.01), and IG(0.001, 0.001), all having r = 0.90. The half-normal priors, with r = 0.89 for HN(0, 0.5), r = 0.91 for HN(0, 1), and r = 0.94 for HN(0, 2), also displayed strong correlations. The correlation with those by the uniform priors experienced greater variability, with r = 0.87 for U(0, 2), r = 0.82 for U(0, 5), and r = 0.80 for U(0, 10). All P values of the above correlations were < 0.001.

Fig. 2 Distributions of posterior median odds ratios on a logarithmic scale (a–c) and 95% credible interval widths (d–f) by various prior distributions among 18 network meta-analyses.

Figure S5 in the Supplementary Material (Appendix C) presents the Bland–Altman plots among all 18 NMAs. It indicates strong agreement for both posterior median log ORs and 95% CrI widths by the various priors. Large differences in posterior median log ORs tended to be observed when the log ORs were close to 0, and large differences in 95% CrI widths tended to be observed when CrIs were very wide.

Table 3 shows the kappa statistics between significant treatment comparisons identified via informative priors and non-informative priors. A total of 942 treatment comparisons were assessed among all NMAs. The kappa statistic for each pair of informative and non-informative priors was positive, and mostly close to 1. A total of 236 treatment comparisons were found to be statistically significant via informative priors, more than the number found via any of the non-informative priors.

Table 3 Kappa Statistics for Assessing the Agreement Between Informative and Non-Informative Priors with Respect to Significant Treatment Comparisons

Compared with the results based on the informative priors, the greatest variability in kappa statistics due to differences in hyper-parameters was observed when using the inverse-gamma prior. Specifically, based on IG(0.1, 0.1), there were 194 significant treatment comparisons identified with κ = 0.87. This number increased to 218 (κ = 0.95) using IG(0.01, 0.01) and further increased to 232 (κ = 0.97) using IG(0.001, 0.001). While IG(0.001, 0.001) produced the strongest agreement with the informative priors, it also produced 5 treatment comparisons that were identified as significant by the non-informative priors but non-significant by the informative prior. All uniform priors had κ = 0.93 as did the half-normal priors HN(0, 1) and HN(0, 2); both uniform priors U(0, 2) and U(0, 5) produced 211 significant treatment comparisons, while this number slightly increased to 212 when using U(0, 10). The half-normal priors HN(0, 1) and HN(0, 2) both produced 213 significant treatment comparisons; HN(0, 0.5) yielded κ = 0.94 with 216 significant ones.

Impact of Priors Within Network Meta-analyses

Table 4 presents the correlation coefficients between the results and the kappa statistics within NMAs. The largest NMA (in terms of sample size) was Cipriani 2018, which included 111,282 samples, while the smallest NMA was Anothaisintawee 2011 with 792 samples. In the NMA of Cipriani 2018, the correlations of posterior median log ORs between the informative prior and each of the non-informative priors were nearly perfect with no discernible difference; they were all > 0.99 with P values < 0.001. An almost identical result was observed for the correlation between the 95% CrI widths.

Table 4 Correlation Coefficients Between the Results (Posterior Median Log Odds Ratios and 95% Credible Interval Widths) and Kappa Statistics for Assessing the Agreement Produced By Informative Priors and Non-Informative Priors Within Each Network Meta-analysis

The correlations between posterior median log ORs for each set of non-informative priors were consistently strong in the NMA of Anothaisintawee 2011; all non-informative priors for the median log ORs had correlation coefficients of at least 0.99. However, as this NMA had the smallest sample size, the correlations between the 95% CrI widths exhibited more variability across different prior types and hyper-parameters; the half-normal priors had the highest correlation coefficients with values of 1.00, 0.99, and 0.97 for HN(0, 0.5), HN(0, 1), and HN(0, 2), respectively. The least variability was observed for the inverse-gamma distribution; the correlation coefficients were 0.92 for IG(0.1, 0.1) and 0.91 for both IG(0.01, 0.01) and IG(0.001, 0.001). The greatest variability across hyper-parameters was observed for the uniform prior; the priors U(0, 2), U(0, 5), and U(0, 10) had correlation coefficients of 0.98, 0.91, and 0.89, respectively. All correlations had P values < 0.001.

A total of 231 treatment comparisons were produced to assess the agreement among priors in the NMA of Cipriani 2018. All hyper-parameters led to 100% agreement likely due to the large number of samples. On the other hand, the small NMA of Anothaisintawee 2011 contained a total of 28 treatment comparisons. The informative prior led to 2 significant comparisons, while all non-informative priors produced no significant comparison. This resulted in a kappa statistic that was incalculable.

Regardless of the size of NMAs, the informative and non-informative priors produced fairly similar point estimates of ORs, because all correlation coefficients were > 0.90. However, the correlation between the 95% CrI widths for the different priors was possibly smaller and exhibited greater variability than that between point estimates of ORs in some NMAs. The informative priors typically produced narrower 95% CrIs than the non-informative priors (Fig. 2d–f). The greatest variability in correlations between 95% CrI widths was observed in Palmerini 2015 and in the smallest NMA of Anothaisintawee 2011. While all non-informative priors in Castellucci 2013 (0.95 ≤ r ≤ 0.98), Giacoppo 2015 (0.88 ≤ r ≤ 0.97), and Palmerini 2015 (0.89 ≤ r ≤ 0.99) had strong correlations, the correlations were not as strong as those in most other NMAs, which had r > 0.99 for all non-informative priors. Chatterjee 2013, Daniels 2012, Dulai 2016, Hazlewood 2016, and Phung 2010 had r > 0.99 for all priors except IG(0.1, 0.1).

The Supplementary Material includes the scatterplots of posterior median ORs and 95% CrI widths by all priors (Figs. S6–S23 in Appendix C) and Bland–Altman plots (Figs. S24–S41 in Appendix C) for each NMA separately.

There was generally small variability in kappa statistics across informative priors within NMAs. The four NMAs of Castellucci 2014, Cipriani 2018, Giacoppo 2015, and Phung 2010 had κ = 1 for all priors. For the NMAs of Castellucci 2013, Palmerini 2015, and Zheng 2018, each treatment comparison was non-significant by both the informative and non-informative priors. Xu 2018 had the same kappa statistic for all uniform and half-normal priors (κ = 0.91) and for IG(0.1, 0.1) and IG(0.001, 0.001) (κ = 0.90), while IG(0.01, 0.01) had κ = 1. For Isayama 2016, IG(0.1, 0.1) had κ = 0, IG(0.001, 0.001) had κ = 1, and all other priors had identical agreement (κ = 0.63). Chatterjee 2013 (κ = 0.53), Dulai 2016 (κ = 0.94), and Price 2014 (κ = 0.67) all had identical agreements for the uniform and half-normal priors with different hyper-parameters. The greatest variability in agreements between hyper-parameters occurred when using the inverse-gamma prior; it was consistent with the observations in the overall assessment among all NMAs.

Secondary Analyses

A total of 11 NMAs contained a placebo or control treatment; the secondary analyses were performed for them by additionally considering alternative informative priors (i.e., for the type of comparisons with placebo/control as in Tables 1 and 2). Tables S2 and S3 and Figs. S42–S65 in the Supplementary Material (Appendix D) present the results. As in the primary analyses, the correlation coefficients of posterior median log ORs for each set of hyper-parameters were larger than r = 0.99. The correlations of 95% CrI widths for the secondary analyses were higher than their primary counterparts for U(0, 10) (r = 0.80) and HN(0, 2) (r = 0.94); all other correlations were lower in the secondary analyses than in the primary analyses. As in the primary analyses, the half-normal priors led to the highest correlation, and the greatest variability was observed for the uniform priors. The greatest variability in kappa statistics due to differences in hyper-parameters was observed when using the inverse-gamma prior, and the largest number of significant comparisons was observed using IG(0.001, 0.001). All uniform and half-normal priors led to κ = 0.95. Four NMAs had r > 0.99 for all hyper-parameters for both the 95% CrI widths and median log ORs. All NMAs had strong correlations for all hyper-parameters; the weakest correlation was present in Anothaisintawee 2011, as in the primary analyses.

DISCUSSION

Main Findings

In this empirical study of 19 NMAs, we found that posterior median ORs produced by different priors had a very strong association. Noticeable variability appeared in estimates by different priors for NMAs with relatively small sample sizes per treatment comparison; thus, these NMAs tended to be sensitive to the prior specification. For large NMAs, non-informative priors generally produced nearly identical point estimates and 95% CrI widths, thus leading to an almost perfect agreement with the results based on informative priors. For small NMAs, the point estimates by informative priors were approximately the same as those produced by non-informative priors, but the CrIs produced by non-informative priors were often substantially wider than those produced by the informative priors. As a result, overall, informative priors yielded more statistically significant treatment effects. The greatest variability in agreement was observed when using the inverse-gamma priors, while the uniform and half-normal priors yielded approximately similar results.

Strengths and Limitations

This study considered the most commonly used prior choices for modeling heterogeneity in the current practice of Bayesian NMAs. The results were based on recent NMAs published in high-impact medical journals, which were thus expected to be of high quality. All R code is provided in the Supplementary Material (Appendix E).

Nevertheless, this study had several limitations. First, we focused on assessing the impact of priors on the posterior medians and 95% CrIs of ORs, while the conclusions may not be directly generalized to other effect measures. Second, because the datasets involved in this study were relatively large, we did not examine the validity of several important assumptions (e.g., transitivity, consistency, reporting bias) in each NMA.30 These factors may also influence NMA estimates along with the choice of priors, and researchers should investigate them on a case-by-case basis. Third, the NMAs published in high-impact journals may contain more studies, treatments, and samples than those published in other journals. The results of much smaller NMAs were likely more sensitive to prior choices.

Implications

Contemporary Bayesian NMAs published in high-impact journals do not adequately report details on the choice of heterogeneity prior distributions and their rationales. If an NMA does not have a large sample size, sensitivity analyses are recommended to examine the impact of using different hyper-parameters or other types of prior distributions, especially for the inverse-gamma prior. When the number of studies included in an NMA is large, various non-informative priors produce similar conclusions. When the number of studies is small, conclusions become more sensitive to the prior type and hyper-parameters. In such cases, empirical informative priors may be used to produce more precise estimates.