Introduction

The combination of randomized controlled trials (RCT) and non-randomized studies (NRS [1, 2]) within a meta-analysis, that is, using “all” the available information [3,4,5], has been a problematic exercise both theoretically and practically [1, 2, 6]. On the theoretical side, the conventional frequentist approach to such meta-analyses would still appear to be either (i) combining RCT and NRS naively, without comment on the potential for NRS to bias the estimates, or (ii) sub-setting by study type, with or without reporting a pooled estimate, thus eliding the question of how best to deal with the inherent bias in NRS [7] and of adopting a principled method of combining these different classes of information [8]. Failure to incorporate a principled analysis yields suspect inferential synthesis [9]. Although sub-setting RCT and NRS has been recommended [7, 10], the presentation of subgroupings and/or an overall estimate may lead readers to extrapolate in a nontransparent manner by “…eyeballing…” the data and estimates [11]. The practical difficulties concern a lack of clarity with respect to appropriate search strategies for NRS within systematic reviews [12, 13].

The purpose of the current paper was, first, to explore the soundness of estimating a pooled intervention effect [2] from meta-analyses combining RCT and NRS within a focused discipline, that of critical care [14,15,16,17,18]. A principled Bayesian method of combining information [8] via model averaging, using the “bayesmeta” package [19, 20] as in previous studies [16, 21], was contrasted with conventional DerSimonian-Laird estimates (DSL [22]). A particular motivation was the suggestion, at least within the frequentist perspective, that the increase in sample size consequent upon the addition of NRS would increase effect-estimate precision [4, 5]. Second, the utility of Bayes Factors, the posterior odds of one hypothesis over another when the prior probabilities of the two hypotheses under consideration are equal (BF [23]), was elucidated as a specific model selection criterion for either pooled or separate estimate(s) of RCT and / or NRS within meta-analyses. By way of such exploration the meta-analyses were fully characterized in the spirit of other studies [4, 5, 7, 24, 25]; that is, the paper conformed to a meta-research perspective [26]. By definition, the choice of meta-analyses addressing a diverse set of outcomes in the critically ill excluded a formal comparative effectiveness research (CER) perspective (comparison of relative benefits and harms for a range of interventions for a given condition [12]), although such reviews may provide insight into the suitability of combining RCT and NRS within a single analysis.

Methods

Data acquisition

Published meta-analyses that combined RCT and NRS and reported a binary outcome, reflecting the importance of such outcomes in critical-care practice, were identified from the critical-care literature using the electronic search engine Web of Science™. No attempt was made to generate new meta-analyses by sourcing new individual RCT or NRS. The key words were: meta-analysis / randomized controlled trials / observational studies / critically ill, or critical care, or intensive care; and specific journal searches: Intensive Care Medicine, Critical Care Medicine, Critical Care, Journal of Critical Care, Journal of Intensive Care Medicine, Chest, Thorax, Anesthesiology, Anaesthesia, Annals of Surgery, Annals of Internal Medicine, JAMA, BMJ Open, PLoS ONE. Both adult and paediatric meta-analytic reports were included.

On the basis that, in the absence of strong informative priors, Bayesian analysis would be expected to generate wider parameter credible intervals than 95% frequentist confidence intervals, meta-analyses were retained in the final cohort if the reported (frequentist) P-value of the pooled estimate (odds ratio (OR) or risk ratio (RR)) was < 0.05 and / or the P-value of the pooled estimate for one of the study types (RCT or NRS) was < 0.05. All included non-RCT studies were classified, for analytic purposes, as NRS, on the expectation that the number of RCT and non-RCT studies per meta-analysis would be small [27] and not susceptible to meaningful stratification.

Statistical analysis

Bayesian approach

Although there are various methods to combine RCT and NRS [2, 6, 16], pooled meta-analytic estimates were established via the “bayesmeta” package (version 2.6 [19, 20]) within the R (version 4.3.1) statistical environment [28], as in previous studies [16, 21]; in particular, the R code in Appendix A.1 of Röver et al. [20]. Potential moderators of the pooled effects [18, 29] were not considered. This Bayesian approach was (i) based upon the normal-normal hierarchical model (NNHM) and (ii) used a two-component model with an informative heavy-tailed mixture prior allowing for adaptive information sharing, whereby such sharing was stronger when RCT and NRS evidence were in agreement and weaker when they were in conflict [8, 20, 30]. That is, the Bayesian posterior constituted a model average, a weighted mixture of the conditional posteriors based upon the prior structures; specific data models corresponded to subgroupings (components) of the data with common or unrelated effects [20]. It is in this sense that the notion of a principled approach to combining RCT and NRS is used. The priors for the heterogeneity parameter (\(\tau\)) were half-normal and half-Cauchy [31], each with scale 0.5, within the two-component model [20]. The prior for the pooled effect estimate (\(\mu\)) was normal, with mean 0 and standard deviation 2, after Röver et al. [20]. Default credible intervals (CrI) of “bayesmeta” were computed as the shortest interval, which for unimodal posteriors (the usual case) is equivalent to the highest posterior density region [19]. Bayesian pooled estimates used the author metric (RR or OR).
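To make the mechanics concrete, a minimal sketch of the building blocks of this approach is given below. It shows only plain “bayesmeta” fits to hypothetical RCT and NRS subsets and to the joint data set under the priors described above; it does not reproduce the full two-component model-averaging code of Appendix A.1 of Röver et al. [20]. All study labels, log-RR values and standard errors are invented for illustration.

```r
## Minimal sketch (hypothetical data): RCT-only, NRS-only and joint NNHM fits with
## the priors used in the current study (mu ~ N(0, 2^2), tau ~ half-normal(scale = 0.5)).
library(bayesmeta)

rct <- data.frame(label = c("RCT-1", "RCT-2", "RCT-3"),
                  logRR = c(-0.25, -0.10, -0.40),
                  se    = c(0.20, 0.25, 0.30))
nrs <- data.frame(label = c("NRS-1", "NRS-2"),
                  logRR = c(-0.55, -0.35),
                  se    = c(0.15, 0.18))

tau.prior <- function(t) dhalfnormal(t, scale = 0.5)   # or dhalfcauchy(t, scale = 0.5)

fit.rct   <- bayesmeta(y = rct$logRR, sigma = rct$se, labels = rct$label,
                       mu.prior.mean = 0, mu.prior.sd = 2, tau.prior = tau.prior)
fit.nrs   <- bayesmeta(y = nrs$logRR, sigma = nrs$se, labels = nrs$label,
                       mu.prior.mean = 0, mu.prior.sd = 2, tau.prior = tau.prior)
fit.joint <- bayesmeta(y = c(rct$logRR, nrs$logRR),
                       sigma = c(rct$se, nrs$se),
                       labels = c(rct$label, nrs$label),
                       mu.prior.mean = 0, mu.prior.sd = 2, tau.prior = tau.prior)

# Pooled effect: posterior median and shortest 95% CrI, back-transformed to the RR scale
exp(fit.joint$summary[c("median", "95% lower", "95% upper"), "mu"])
```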

Within the same Bayesian framework, model choice, in this case the preference for either a pooled estimate or separate estimates for both RCT and NRS, was addressed using Bayes Factors (BF [32, 33]). For probability model M fitted to data y, the marginal density of the data under model M is given as (we use the model syntax of Sinharay & Stern [34]):

\(p\left( y \mid \mathrm{M} \right) = \int p\left( y \mid \omega, \mathrm{M} \right)\, p\left( \omega \mid \mathrm{M} \right)\, d\omega\), where \(\omega\) is the parameter vector, \(p\left( y \mid \omega, \mathrm{M} \right)\) is the likelihood function and \(p\left( \omega \mid \mathrm{M} \right)\) is the prior distribution for \(\omega\). The BF for comparing two models \(\mathrm{M}_{1}\) and \(\mathrm{M}_{0}\) is defined as:

\(\mathrm{BF}_{10} = \frac{p\left( y \mid \mathrm{M}_{1} \right)}{p\left( y \mid \mathrm{M}_{0} \right)}\), the ratio of the marginal densities of the data y under the two models; thus the posterior odds equal the BF × the prior odds [34]. This being said, the determination of BF is a subject of some controversy [35]. BF were provided as part of the estimation routine (Appendix A.1 of [20]) for the two-component models under half-normal and half-Cauchy heterogeneity priors. The utilised R code generated three “bayesmeta” objects: “bma.obs”, “bma.rct” and “bma.joint”. Marginal likelihoods were then computed as “pooled” (bma.joint_marginal) and “separate” (bma.obs_marginal*bma.rct_marginal), and Bayes Factors were subsequently derived from these marginal likelihoods as both “pooled” and “separate”, the latter being the reciprocal of the former. Model preference was accepted for BF10 > 3, or BF10 < 0.333 for the converse [33]. Posterior probabilities for the pooled-estimate models were derived from the posterior odds (posterior probability = posterior odds / (posterior odds + 1)), with model prior probabilities set to 0.5. Note the difference between (i) the within-model prior distribution(s) \(p\left( \omega \mid \mathrm{M}_{i} \right)\), the specification of the probability or uncertainty about the parameters within the model \(\mathrm{M}_{i}\) before observing the data, and (ii) the model’s prior probability \(p\left( \mathrm{M}_{i} \right)\), the probability of the model holding as a whole; these two probabilities are independent. BF address the question of which model (strictly speaking, model class [36]) was more likely to have generated the data (y), whereas posterior model probabilities address the plausibility of a model in light of the data, \(p\left( \mathrm{M}_{i} \mid y \right)\) [37, 38].
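As a concrete illustration of this calculation (hypothetical marginal-likelihood values only; in practice these would be taken from the marginal.likelihood component of the corresponding “bayesmeta” fits, such as those sketched above), the BF and posterior model probability could be obtained as follows:

```r
## Sketch: Bayes Factor for "pooled" (RCT and NRS analysed together) versus "separate"
## (independent RCT and NRS analyses), from marginal likelihoods; values are hypothetical.
## In practice: ml.joint <- fit.joint$marginal.likelihood, and similarly for the subsets.
ml.joint <- 0.012   # marginal likelihood, joint (pooled) model
ml.rct   <- 0.150   # marginal likelihood, RCT-only model
ml.nrs   <- 0.090   # marginal likelihood, NRS-only model

BF.pooled   <- ml.joint / (ml.rct * ml.nrs)   # "pooled" over "separate"
BF.separate <- 1 / BF.pooled                  # the reciprocal

# Posterior probability of the pooled model, with model prior probabilities of 0.5:
post.odds        <- BF.pooled * (0.5 / 0.5)
post.prob.pooled <- post.odds / (post.odds + 1)

c(BF.pooled = BF.pooled, post.prob.pooled = post.prob.pooled)
# Preference accepted for BF > 3 ("pooled") or BF < 0.333 ("separate");
# intermediate values are treated as indeterminate.
```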

Frequentist approach

All meta-analytic frequentist pooled estimates were re-computed within Stata™ V17 [39] using the “metan” user-written module [40] (current version 4.07, 15th September 2023) with the DerSimonian & Laird random effects estimator (DSL [22]), reflecting conventional usage in meta-analytic statistical programs [16]. Variable distributions were compared with one-way analysis of variance, and the effect of RCT proportion on the probability of both the frequentist CI and the Bayesian CrI excluding the null was estimated using logistic regression (robust variance) and marginal analysis (the “margins” command [41]) within Stata™ V18. Frequentist statistical significance was ascribed at P < 0.05.
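For reference, with study-level log-OR or log-RR estimates \(\hat{\theta}_{i}\) and standard errors \(\sigma_{i}\) (\(i = 1, \ldots, k\)), the DSL estimator takes the standard form (summarized here in standard meta-analytic notation rather than reproduced from [22]):

\(\hat{\tau}^{2}_{\mathrm{DSL}} = \max\left( 0, \frac{Q - (k - 1)}{\sum_{i} w_{i} - \sum_{i} w_{i}^{2} / \sum_{i} w_{i}} \right)\), with \(Q = \sum_{i} w_{i} \left( \hat{\theta}_{i} - \hat{\theta}_{\mathrm{FE}} \right)^{2}\), \(w_{i} = 1 / \sigma_{i}^{2}\) and \(\hat{\theta}_{\mathrm{FE}}\) the fixed-effect (inverse-variance weighted) pooled estimate; the random-effects pooled estimate and its variance are then \(\hat{\mu}_{\mathrm{DSL}} = \sum_{i} w_{i}^{*} \hat{\theta}_{i} / \sum_{i} w_{i}^{*}\) and \(\widehat{\mathrm{Var}}\left( \hat{\mu}_{\mathrm{DSL}} \right) = 1 / \sum_{i} w_{i}^{*}\), with \(w_{i}^{*} = 1 / \left( \sigma_{i}^{2} + \hat{\tau}^{2}_{\mathrm{DSL}} \right)\) and 95% CI \(\hat{\mu}_{\mathrm{DSL}} \pm 1.96 \sqrt{\widehat{\mathrm{Var}}\left( \hat{\mu}_{\mathrm{DSL}} \right)}\).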

Results

Fifty meta-analyses [42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91] were identified over calendar years 2009–2021. Twenty-nine were pharmaceutical-therapeutic and 21 were non-pharmaceutical therapeutic; the author metric was OR in 23 and RR in 27. The median number of trials / studies, that is, RCT or NRS, was 9 (minimum 2, maximum 60; 25th percentile 5, 75th percentile 14). The median proportion of RCT was 0.33 (minimum 0.05, maximum 0.80; 25th percentile 0.20, 75th percentile 0.57). Mortality was the most frequently reported outcome (50%), the other outcomes being various states consistent with the critically ill: clinical cure, intubation, acute kidney injury and venous thrombo-embolism. The most frequently used statistical programs were RevMan (https://training.cochrane.org/online-learning/core-software-cochrane-reviews/revman, 62%), Stata™ (https://www.stata.com/, 16%), Comprehensive Meta-Analysis (https://www.meta-analysis.com/, 6%) and R (https://www.r-project.org/, 6%). All meta-analyses used a primary frequentist method of analysis: DerSimonian-Laird (DSL) random effects (RE) in 14; Mantel–Haenszel RE (M-H RE) in 18; M-H fixed effects (M-H FE) in 4 (with I2 values of 27, 32, 43 and 49%); RE not specified in 8; and model not specified in 6. Heterogeneity was variously reported as \(\tau^{2}\) and / or I2. For the 4 studies using M-H FE estimation, this decision was made on the criterion of heterogeneity (I2 < 50%) without further justification. Similar reasons (I2 > 50%) for choosing a RE approach were also given, as was the disparateness of individual RCT / NRS within a meta-analysis. Of note, no meta-analysis discussed the impact of small study numbers in meta-analyses [16] or utilized alternate frequentist variance estimators, such as the Hartung-Knapp-Sidik-Jonkman (HKSJ) method [16] for adjusting tests and intervals, as recommended by Bender et al. [92] for meta-analyses with small numbers of RCTs. Only one meta-analysis used a Bayesian method in a sensitivity analysis to test the “robustness” of frequentist results [53].

Author reasons [25] for combining RCT and NRS varied considerably: a brief statement that such a combination would be undertaken, the wish to use all or the best available evidence [93], and the small number of RCTs addressing the meta-analytic question(s) of interest. Three meta-analyses did not detail quality assessment: in Chiumello et al. [48], quality assessment was not mentioned; in Tagaki et al. [76], adjusted NRS estimates were provided; and in Wan et al. [81], the NRS were not the primary focus, although both adjusted and non-adjusted NRS estimates were given. The Cochrane Collaboration risk of bias tool for RCT was the most frequently used [94], along with the Jadad score [95] and the RoB2 instrument [96]; for NRS, the Newcastle–Ottawa Scale [97] predominated, together with the ROBINS-I [98] and MINORS [98] instruments.

An overall pooled estimate produced by combining RCT and NRS was reported in 42 of the meta-analyses considered. With respect to study type, reported P-values for effect estimates were < 0.05 in 11/26 (42%) of statistical analyses for RCT, 18/24 (75%) for NRS, and 37/42 (88%) for pooled estimates (RCT and NRS) within a single meta-analysis report. Recomputed frequentist DSL pooled estimates were significant in 78% (39/50) of meta-analyses, RCT-only estimates in 36% (18/50) and NRS-only estimates in 60% (30/50).

For Bayesian estimation using the half-normal heterogeneity prior (n = 49), significant effects (CrI excluding the null) were observed in 18/49 (37%); for the half-Cauchy prior (n = 49), in 15/49 (31%). For the meta-analytic reports where Bayesian CrI could be computed (see below), pooled (RCT and NRS) estimates demonstrated the following (within the same meta-analytic report): in eighteen reports (37.5%), seven in the OR and eleven in the RR metric, there was agreement between the frequentist CI and the Bayesian CrI in achieving statistical significance; in twenty-three reports (48%), the frequentist pooled CI achieved statistical significance but the Bayesian CrI did not; and in seven, neither the frequentist CI nor the Bayesian pooled CrI achieved statistical significance. Of interest, the RCT proportion, not the number of studies (both RCT and NRS), appeared determinant with respect to the probability of both the frequentist CI and the Bayesian CrI excluding the null (within the same meta-analysis), as seen in Fig. 1.

Fig. 1 Probability (with 95% CI) of both frequentist CI and Bayesian CrI excluding the null (within the same meta-analysis) as a function of RCT proportion

Two Bayesian estimates were computed, corresponding to the half-normal and half-Cauchy heterogeneity priors. For one meta-analysis, Barakakis et al. [44] (two RCT and one NRS), no Bayesian CrI could be computed. For the Sultan et al. meta-analysis [74] (one RCT and one NRS), Bayesian CrI could only be computed for the half-Cauchy heterogeneity prior model, and for Wang et al. [83] (three NRS and one RCT), Bayesian CrI could only be computed for the half-normal heterogeneity prior model. The study of Yao et al. [86] presented results in the risk difference (RD) metric, 0.099 (0.015, 0.184); as all other estimation results were in the OR or RR metric, RR was utilised.

Table 1 lists the author estimates and the Bayesian estimates from the two-component models under the half-normal and half-Cauchy heterogeneity priors. All Bayesian estimates for meta-analyses with an author overall-estimate P-value > 0.05 were consistent in terms of the span of the CrIs, that is, they encompassed unity.

Table 1 Author and Bayesian (half-normal heterogeneity prior) estimates

A graphical comparison of the author (frequentist) and Bayesian estimates as couplets for OR (Fig. 2) and RR (Fig. 3) was undertaken to further illustrate these differences. With regard to Fig. 2, in six of the meta-analyses both the frequentist CI and the corresponding Bayesian CrI excluded the null; all Bayesian CrI spans were greater than the corresponding frequentist CI spans.

Fig. 2 Author (frequentist) and Bayesian estimates as couplets for the OR metric, with the X-axis on the log scale. Significant (left panel) and nonsignificant (right panel) overall OR frequentist estimates compared with Bayesian estimates (half-normal heterogeneity parameter (\(\tau\))). Due to scaling requirements, estimates from the Mei et al. meta-analysis [62] were omitted (frequentist estimate: 3.06 (1.15, 8.15); Bayesian: 5.18 (1.14, 36.08))

Fig. 3 Author (frequentist) and Bayesian estimates as couplets for the RR metric, with the X-axis on the log scale. Significant (left panel) and nonsignificant (right panel) overall RR frequentist estimates compared with Bayesian estimates (half-Cauchy heterogeneity parameter (\(\tau\))). Due to scaling, estimates from the Sultan et al. meta-analysis [74] were omitted (frequentist estimate: 2.92 (0.481, 17.741); Bayesian: 1.418 (0.718, 2.605))

With regard to Fig. 3, in nine of the meta-analyses both the frequentist CI and the corresponding Bayesian CrI excluded the null; in two meta-analyses the author frequentist CI width was greater than the Bayesian CrI width: Zakhari et al. [90], RR 0.41 (0.26, 0.65) versus 0.472 (0.322, 0.657), and Sultan et al. [74], RR 2.92 (0.481, 17.741) versus 1.418 (0.718, 2.605).

Preference for either pooled or separate estimates within the two-component models using BF criteria is shown in Table 2. Note that the descriptor “Separate” in the legend to Table 2 refers to the generation of BF from a single marginal likelihood (derived from the product bma.obs_marginal*bma.rct_marginal; see Statistical analysis, Bayesian approach, above).

Table 2 BF and posterior model probabilities for model preference for half-normal and half-Cauchy heterogeneity priors. BF > 3, bolded

For the half-normal heterogeneity prior, 21 meta-analyses favored pooling (RR, 11 and OR, 10) and 4 favored separate analysis (RR, 1 and OR, 3) with BF > 3. For the half-Cauchy heterogeneity prior, 27 meta-analyses favored pooling (RR, 15 and OR, 12) and 4 favored separate analysis (RR, 2 and OR, 2) with BF > 3. Analysis of the table information did not yield convincing predictors of BF > 3 with respect to metric or meta-analytic study number(s).

Discussion

The current study demonstrated a substantial reduction in the nominal frequentist significance of meta-analytic estimates generated by the naïve pooling of RCT and NRS (using the DSL estimator) when compared with a principled Bayesian method of information combination. The latter, a model-averaging process, adjusted for the agreement or otherwise between the RCT and NRS studies, offsetting the increased frequency of statistically significant (frequentist) treatment effects in NRS compared with RCT within the same meta-analysis report. A plausible expectation, that a Bayesian approach would yield a frequency of statistically significant (CrI excluding the null) pooled meta-analyses comparable with that of significant RCTs within a frequentist DSL analysis, was also realized: Bayesian 37% (half-normal heterogeneity prior) and 31% (half-Cauchy) compared with DSL 36%.

Several studies have addressed potential conflict between RCT and NRS effect estimates when combined, with various purposes and results: an endorsement of such combination [25], a finding of consistent direction of overall effect [24] or of little difference between the effect estimates [7, 99, 100], and the promise of increased precision consequent upon larger sample size [4, 5]. The analytic framework behind these studies was frequentist. A larger CI span has also been suggested [10, 101] but, as noted above, a precision increase was not generally found in the current study, particularly with the application of Bayesian methods.

Proposals to incorporate randomized and non-randomized evidence within meta-analyses have a history of at least 30 years [102], as has the particular question of the bias or otherwise of NRS [103, 104]. The methodological issues involved in such exercises have been considered in some detail [10, 101, 105, 106]. A general statistical framework to combine multiple information sources, the Confidence Profile Method, was first introduced in 1989 [6, 16], and the recent (2021) paper by Nikolaidis et al. provides a more current review of information-sharing categories ([8], Fig. 3): functional, deterministic functions relating the model parameters of direct and indirect evidence; exchangeability, a common distribution imposed upon a parameter set; prior-based, a Bayesian method utilizing an informative prior to combine evidence, to wit, the “bayesmeta” approach [20]; and multivariate, whereby a multivariate distribution is imposed across parameters specifying outcomes, not populations or study designs [15]. A plethora of Bayesian models have been proposed to combine direct and indirect evidence; these have been usefully summarized in a number of papers [2, 6, 107,108,109] and briefly detailed elsewhere [16], and this theme is not pursued here.

The “bayesmeta” approach [20] seemed ideally suited to the task at hand: it is available through the R computing environment and syntax, it is computationally efficient, using numerical integration and analytical results rather than Markov chain Monte Carlo, and, with heavy-tailed priors for effect estimation, it yields a model-averaging technique. This approach has been pursued in recent studies [110, 111]. The described method was robust [30] in the sense that a potential prior-data conflict, that is, a discrepancy between source and target data, was explicitly allowed for. The “bayesmeta” program formulates a random-effects normal-normal hierarchical model [19, 20], and there has been some discussion, albeit indeterminate, regarding the impact of the normality assumption [112,113,114]. The experience of Davey et al. that the median number of studies per review in the Cochrane Database of Systematic Reviews was six (inter-quartile range (IQR) 3–12) was consistent with that of the current study (median 9, IQR 5–14). No marked effect of the heterogeneity prior was evident, in that point and CrI estimates under the different heterogeneity priors (half-normal and half-Cauchy) were comparable; convergence difficulties [115] were not a major issue, although (see Results, Tables 1 and 2) no CrI could be computed in one meta-analysis and CrI could be computed for only one of the two heterogeneity priors in two others.
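For clarity, the normality assumption in question operates at both levels of the NNHM which, in standard notation, may be written as:

\(y_{i} \mid \theta_{i} \sim \mathrm{N}\left( \theta_{i}, \sigma_{i}^{2} \right)\) and \(\theta_{i} \mid \mu, \tau \sim \mathrm{N}\left( \mu, \tau^{2} \right)\), for studies \(i = 1, \ldots, k\), where \(y_{i}\) is the observed log-OR or log-RR with (assumed known) standard error \(\sigma_{i}\), \(\theta_{i}\) is the study-specific true effect, \(\mu\) is the pooled effect and \(\tau\) the between-study standard deviation.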

Preference for the pooled analysis (RCT plus NRS) via BF was indicated in 42% and 54% of meta-analyses, depending upon the heterogeneity prior (Table 2). BF are known to be sensitive to the model parameter prior distribution, and the fact that different priors result in different BF should “… not come as a surprise” [116]. A kernel density plot (Fig. 4) of the posterior probabilities for the pooled model under both heterogeneity priors, for those meta-analyses where the BF for model choice was indeterminate (0.333 < BF < 1), revealed the highest posterior densities located close to 0.8, giving further support to the pooled model formulation for this subgroup of meta-analyses.

Fig. 4 Kernel density plots of posterior model probabilities of pooling for meta-analyses where 0.333 < BF < 1

Limitations

Different approaches to information combination were not explored, as in a previous study where, with respect to a single exemplar meta-analysis combining RCT and NRS, non-naïve methods, both frequentist and Bayesian, were consistently shown to generate CI and CrI widths embracing the null, as opposed to the simple DSL estimator ([16], Table 2, page 53). It was instructive to note that none of the currently considered meta-analyses reported using non-DSL estimators, despite concerns being raised nearly 10 years ago about biased estimates with falsely high precision from the DSL estimator [117]. As a reviewer pointed out, such an observation goes to the heart of the difference in the handling of heterogeneity between the two paradigms: frequentist, where the heterogeneity variance (\(\tau^{2}\)) is a fixed quantity, albeit one whose estimate may vary with different frequentist estimators ([118] and see below); and Bayesian, where prior distributions are specified for the heterogeneity parameter [119], in the current study half-normal and half-Cauchy. As noted by Röver et al., within Bayesian estimation the choice of a prior for \(\tau^{2}\) is a somewhat nuanced process [120]. Such considerations have been further explored by Röver et al., including the effect of the scaling of the prior, which was found to have more impact upon results than the shape of the prior distribution [121]; the current study used a scale of 0.5 for both heterogeneity priors. Röver et al. [120] also found that mortality endpoints in a cohort of meta-analyses from the Cochrane Database of Systematic Reviews had comparatively low heterogeneity compared with other outcomes. A similar review, by IntHout et al. [122], found that meta-analyses with a dichotomous outcome had τ values (the square root of \(\tau^{2}\), on the same scale as the effect-size metric) of 0 (0–0.41) (median, interquartile range (Q1–Q3)). If values of \(\tau\) in the range 0.1–0.5 are considered to reflect small to moderate heterogeneity [123], then under the half-Cauchy distribution a value less than τ = 0.4 has a probability of 43%, and under the half-normal distribution, 58%, suggesting weakly informative priors for such a scenario ([119]; computations performed in the R package “extraDistr”, version 1.10.0, https://cran.r-project.org/web/packages/extraDistr/index.html). For comparison with the current study, the overall \(\tau\) (median, interquartile range (Q1–Q3)) for the combined estimate of RCT and NRS (50 meta-analyses) using the DSL estimator (see Supplement, Table S1) was 0.25 (0.10–0.50).
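The prior probabilities quoted above (43% for the half-Cauchy and 58% for the half-normal prior, each with scale 0.5) are readily verified; a minimal check, here using the half-distribution functions supplied with “bayesmeta” (equivalent functions are available in “extraDistr”, as cited above), is:

```r
## Prior probability that tau < 0.4 under each heterogeneity prior (scale = 0.5)
library(bayesmeta)

phalfnormal(0.4, scale = 0.5)   # ~0.58; base-R check: 2 * pnorm(0.4 / 0.5) - 1
phalfcauchy(0.4, scale = 0.5)   # ~0.43; base-R check: (2 / pi) * atan(0.4 / 0.5)
```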

These observations have relevance to the present study regarding the “disagreements” between the DSL CI and the Bayesian model-averaging CrI with respect to the null. A large number of frequentist meta-analytic estimators are provided by the Stata “metan” user-written module [124], and some of these were used in the original published meta-analyses. The Mantel–Haenszel RE (M-H RE) estimator would appear to be available only in “RevMan” software, but with respect to any differences between the DSL and M-H RE estimators, the Cochrane Handbook (“Implementing random-effects meta-analyses”, 10.4.4) [125] notes that the difference between the DSL and M-H random-effects approaches would be “likely to be trivial”. The question of the appropriate estimator choice, fixed or random, is not canvassed in this paper; suffice it to say, the (qualified) comment of Borenstein et al. is noted: “in the vast majority of meta-analyses the random-effects model would be the more appropriate choice” [126].

As suggested by a reviewer, two alternate frequentist meta-analytic estimators were also compared with the Bayesian model in terms of the “disagreements” described above: (i) the Hartung-Knapp-Sidik-Jonkman (HKSJ) variance correction (applied to any standard tau-squared estimator, in this case the DSL estimator) [127,128,129] and (ii) the inverse-variance heterogeneity (IVhet) model of Doi and colleagues [130, 131]. As these comparisons were not the prime focus of the current paper, they are only summarized here and presented in detail for the reader in the Supplement.
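For orientation, both alternatives modify the uncertainty attached to the pooled estimate rather than introducing a new data model; in standard notation (a summary of the commonly presented forms, not a reproduction of the cited derivations), with \(w_{i} = 1 / \sigma_{i}^{2}\) and \(w_{i}^{*} = 1 / \left( \sigma_{i}^{2} + \hat{\tau}^{2} \right)\): the HKSJ approach retains the random-effects pooled estimate \(\hat{\mu}\) but replaces its variance by \(\widehat{\mathrm{Var}}_{\mathrm{HKSJ}}\left( \hat{\mu} \right) = \frac{\sum_{i} w_{i}^{*} \left( \hat{\theta}_{i} - \hat{\mu} \right)^{2}}{(k - 1) \sum_{i} w_{i}^{*}}\), with confidence intervals based on the \(t_{k-1}\) distribution; the IVhet model retains fixed-effect (inverse-variance) weights for the point estimate, \(\hat{\mu}_{\mathrm{IVhet}} = \sum_{i} w_{i} \hat{\theta}_{i} / \sum_{i} w_{i}\), while its variance, \(\sum_{i} w_{i}^{2} \left( \sigma_{i}^{2} + \hat{\tau}^{2} \right) / \left( \sum_{i} w_{i} \right)^{2}\), incorporates the heterogeneity estimate.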

For the HKSJ variance correction with the DSL estimator (HKSJ-DSL), 55% (27/49; no HKSJ-DSL estimates could be computed for the Sultan et al. meta-analysis [74]) were significant, compared with 78% using the conventional DSL estimator. In the OR metric, for significant HKSJ-DSL estimates (CI not spanning the null), 4 Bayesian CrI spanned the null; for non-significant HKSJ-DSL estimates (CI spanning the null), all Bayesian estimates were consistent (Figure S1). In the RR metric (Figure S2), for significant HKSJ-DSL estimates (CI not spanning the null), 9 Bayesian CrI spanned the null; for non-significant HKSJ-DSL estimates (CI spanning the null), 2 Bayesian estimates did not span the null.

For the Doi et al. IVhet model, 58% (29/50) were significant, compared with 78% using the conventional DSL estimator. In the OR metric (Figure S3), for significant IVhet estimates (CI not spanning the null), 6 Bayesian CrI spanned the null; for non-significant IVhet estimates (CI spanning the null), all Bayesian estimates were consistent. In the RR metric (Figure S4), for significant IVhet estimates (CI not spanning the null), 9 Bayesian CrI spanned the null; for non-significant IVhet estimates (CI spanning the null), 1 Bayesian estimate did not span the null.

Future possibilities

In 2009, Sutton et al. [132] suggested that evidence synthesis was “the key to more coherent and efficient research” and posed the question of whether “evidence from observational studies may exist which could augment that available from the RCTs”. A decade on, the answer would appear to be affirmative, at least from a Bayesian perspective. Any combination of RCT and NRS is predicated upon preceding robust study quality assessment; for instance, a checklist that may be applied to both RCT and NRS, such as that of Downs and Black [133] used by Sampath et al. [18], which has been described as “suitable for use in a systematic review” [104]. The question of combining RCT and NRS under conditions of “conflict” between conclusions can only be addressed by a principled approach, such as the Bayesian model averaging described above, complemented by BF computation. This being said, the umbrella term NRS, as used in the current study, elides a number of potentially important (non-randomised) study types: prospective and retrospective, cross-sectional and longitudinal, observational and interventional.

Future studies should seek to replicate, or otherwise test, the findings of the current study, including the utility of BF, model posterior probabilities and different non-randomised study designs. In any concurrent comparison with frequentist estimator(s), the choice of estimator should be justified; such comparisons are presented for the reader in the online Supplement.

Conclusions

Bayesian estimation of treatment efficacy via model averaging was more conservative than frequentist estimation in meta-analyses combining RCT and NRS. The calculation of BF provided additional evidence for the wisdom, or otherwise, of meta-analytic pooling of RCT and NRS, and model posterior probabilities also provided plausible evidence for the pooled-estimate model. If frequentist estimators are utilized, caution should attend estimator choice and the reporting of meta-analytic pooled estimates.