Background

Randomized controlled trials (RCTs) and cohort studies are the most common study design types used to assess the treatment effects of medical interventions [1, 2]. RCTs are considered the gold standard in medical research to assess the benefits and harms of treatments [13]. Randomization allows causal inference [4]. However, RCTs may not be available for certain research questions due to ethical reasons [5], or they may suffer from low external validity [6–9], follow-up durations too short to assess late adverse events [5], or low adherence [10]. In contrast to RCTs, large cohort studies often have higher external validity [6], e.g., when they include diverse populations [8, 9]. Cohort studies can complement information from RCTs or might even serve as a replacement [11], enlarging the available body of evidence (BoE: all studies available for a given research question, e.g., all RCTs/cohort studies investigating the impact of oral contraception on breast cancer), or they may be useful to identify relevant subgroups for subsequent RCTs [12]. However, there is an ongoing debate about the trustworthiness of results from cohort studies, fuelled mainly by their susceptibility to bias from confounding [8, 13]. For example, systematic reviews from the Cochrane Collaboration impose high thresholds on the inclusion of cohort studies [5]. Several studies have investigated whether the susceptibility to bias in different types of observational studies indeed leads to disagreement of effect estimates [14–17]; the largest study so far, a meta-methodological study comparing health care outcomes from RCTs with those from observational studies (including case-control and cohort studies), concluded that results were mainly concordant [18]. The authors suggested that, in the case of disagreement, factors other than study design alone should be investigated.
However, that study lacked an empirical investigation of factors such as PI/ECO (population, intervention/exposure, comparator, outcome) differences (for example, differences between the interventions tested in RCTs and cohort studies) that potentially account for disagreement of study results, and little is known about this topic so far. Therefore, in the present meta-epidemiological study, we not only evaluate the agreement of effect estimates between BoE from RCTs and cohort studies in the general medical field, but also investigate whether factors such as PI/ECO-differences between BoE are associated with disagreement. This also allows us to explore, and better understand, potential reasons for statistical heterogeneity. Factors associated with disagreement would require special attention in future health-care evidence syntheses integrating both BoE.

Methods

This meta-epidemiological study was planned, written, and reported in adherence to guidelines for reporting meta-epidemiological research [19]. The detailed inclusion criteria are described in Table 1.

Table 1 Detailed description of inclusion and exclusion criteria

Literature search

The search was conducted in MEDLINE (via PubMed.gov) on June 05, 2020, covering the period from January 01, 2010, to December 31, 2019, in the 13 medical journals with the highest impact factor (according to the Journal Citation Reports [JCR] 2018; category: general and internal medicine). This cut-off was chosen to cover a 10-year period in line with a recent meta-epidemiological study in nutrition research [20]. Initially, we planned to include the 10 highest impact factor journals, but three journals (New England Journal of Medicine, Nature Reviews Disease Primers, and Journal of Cachexia, Sarcopenia and Muscle) did not publish any systematic review with an eligible BoE-pair (see inclusion criteria in Table 1). We therefore included the subsequent three journals according to the JCR 2018 (Cochrane Database of Systematic Reviews, Mayo Clinic Proceedings, Canadian Medical Association Journal). The search strategy is given in Additional file 1 (Appendix S1). Title and abstract screening was conducted by one reviewer (NB), and potentially relevant full texts were screened by two reviewers independently (NB, LS). Any discrepancy was resolved by a third reviewer (JJM). Supplementary hand searches identified three additional systematic reviews [21–23]. For each included BoE from a systematic review, we included a maximum of three patient-relevant outcomes (e.g., mortality, cardiovascular disease (CVD)) and a maximum of three intermediate disease markers (e.g., blood lipids). If more than three outcomes were available for a given systematic review, we included the primary outcomes and thereafter used a top-down approach (outcomes mentioned first).

Evaluating similarity between BoE from RCTs and cohort studies

We evaluated the similarity of PI/ECO between BoE from RCTs and cohort studies. In accordance with a previous meta-epidemiological study [20], the acronym PI/ECO was used instead of PICO to better represent exposures in cohort studies (e.g., serum vitamin D status) and to distinguish them from interventions in RCTs (e.g., vitamin D supplementation). For each BoE-pair, the similarity of each PI/ECO-domain was rated as “more or less identical,” “similar but not identical,” or “broadly similar.” The overall similarity of each BoE-pair was then determined by the domain with the lowest degree of similarity. For example, when the domain “population” was rated as “broadly similar,” the overall similarity of this BoE-pair was also rated as “broadly similar.” The PI/ECO-similarity rating was conducted by two reviewers independently (NB, JB) using pre-specified criteria (Additional file 1: Table S1). Categorization of interventions and outcomes was conducted by two reviewers (NB, LH). Discrepancies in PI/ECO-similarity ratings or categorizations were resolved through discussion with experts.

Data extraction

Data extraction was performed by two reviewers independently (NB, LH). The following data were extracted for each BoE: effect estimates, type of effect measure, 95% confidence interval (CI), number of studies, number of participants, number of events, and certainty of the evidence. Further, we extracted information on study characteristics of primary studies for each BoE: description of the study population, intervention/exposure, comparator, design of the primary study, intervention duration, and follow-up and risk of bias/study quality.

If RCTs were pooled with other types of studies (e.g., quasi-experimental RCTs), we performed a meta-analysis excluding these other study types. The rationale for this approach was the suggestion in the new Cochrane handbook to classify quasi-experimental RCTs as non-randomized studies of interventions (NRSI) [5]. This was the case for three BoE from RCTs [24–26]. Accordingly, meta-analyses of cohort studies were recalculated if they included other study types (e.g., case-control studies); this was the case for 35 BoE from cohort studies [25, 27–42]. If RCTs and cohort studies were pooled without subgroup analysis by study type, we performed separate meta-analyses; this was the case for nine BoE-pairs [37, 40, 43–45]. Upon request, the authors of one systematic review [45] provided data to perform separate meta-analyses. In two BoE-pairs from one systematic review evaluating infection outcomes of influenza vaccines [46], RCTs with different populations (community-dwelling and institutionalized) were combined in a single meta-analysis; we pooled the respective cohort studies that were initially not combined. For ten BoE pairs [38, 42, 47, 48], we pooled different types of cohort studies (e.g., clinical cohorts, population-based cohorts) that were not pooled in the corresponding systematic review. If there was a meta-analysis for the BoE from one study type (e.g., RCTs) and a corresponding BoE from the other study type (e.g., cohort studies) was not pooled but relevant data were available, we pooled the respective primary studies: cohort studies for nine BoE pairs [49–55] and primary RCTs for one BoE pair [56].

Statistical analysis

If the summary effect measure for binary or continuous outcomes was not the same for BoE from RCTs and BoE from cohort studies, we used appropriate conversion formulas to express the two estimates in the same measure: risk ratio (RR), odds ratio (OR), or hazard ratio (HR) for binary outcomes and mean difference (MD) for continuous outcomes.

If effect measures (RR, OR, HR) for binary outcomes were not the same within a BoE pair, they were converted to an identical effect measure (RR) using an assumed control risk (ACR): \(\mathrm{RR}=\frac{\mathrm{OR}}{1-\mathrm{ACR}\times\left(1-\mathrm{OR}\right)}\) [13, 57]. If the same measure (RR, OR, or HR) was used for both BoE, we did not convert summary effect estimates. We converted effect measures for binary outcomes for 16 BoE pairs [22, 23, 44, 52–54, 56, 58–60] and for continuous outcomes for one BoE pair [61]. Detailed descriptions of the conversions can be found in Additional file 1 (Table S2 [62–66]). We standardized the direction of effect of the outcomes so that summary effect estimates (HR/OR/RR) <1 always express a beneficial effect. We reversed the direction of effect for three outcomes from the systematic reviews by Hüpfl et al. [67] (survival to all-cause mortality) and Alipanah et al. [24] (treatment success/completion to low treatment success, low treatment completion) (see Table 2). To quantify differences of effect estimates, we computed a ratio of ratios (RoR) [68] for each BoE pair with a binary outcome. For continuous outcomes, we computed a difference of mean differences (DMD). For the assessment of binary and continuous outcomes, cohort studies served as the reference group. We pooled the RoRs across BoE-pairs using a random-effects model [69] to assess whether, overall, effect estimates of BoE from RCTs are larger or smaller than those of BoE from cohort studies. The RoR does not indicate larger or smaller treatment effects in one of the BoE, but only differences between the two BoE. The direction of the difference depends on the direction of effect of the underlying BoE. For example, a risk ratio of 0.8 from RCTs and a risk ratio of 1.00 from cohort studies would yield a RoR of 0.8, whereas a risk ratio of 1.00 in RCTs compared with a risk ratio of 1.25 in cohort studies would also yield a RoR of 0.8.
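For illustration, the OR-to-RR conversion and the RoR computation can be sketched as follows (a minimal Python sketch with hypothetical function names; the study's actual analyses were performed in R with the meta package):

```python
def or_to_rr(odds_ratio: float, acr: float) -> float:
    """Convert an odds ratio (OR) to a risk ratio (RR) using an assumed
    control risk (ACR), per the formula above: RR = OR / (1 - ACR * (1 - OR))."""
    return odds_ratio / (1 - acr * (1 - odds_ratio))


def ratio_of_ratios(effect_rct: float, effect_cohort: float) -> float:
    """Ratio of ratios (RoR), with the cohort-study estimate as reference."""
    return effect_rct / effect_cohort


# Worked example from the text: both constellations yield a RoR of 0.8.
ror_a = ratio_of_ratios(0.8, 1.0)    # RCTs 0.80 vs. cohort studies 1.00
ror_b = ratio_of_ratios(1.0, 1.25)   # RCTs 1.00 vs. cohort studies 1.25
```

Note that when OR = 1 the conversion returns RR = 1 regardless of the ACR, consistent with a null effect being null on both scales.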
We pooled DMDs for the same continuous outcomes using a random-effects model [69]. We evaluated the statistical heterogeneity of effect estimates across all BoE-pairs with binary outcomes, and across BoE pairs using the same continuous outcomes, with the I2 and τ2 statistics [69, 70]. To estimate τ2, we used the Paule–Mandel method [71, 72]. We computed 95% prediction intervals (PIs) to estimate the extent of differences between results of BoE from RCTs and BoE from cohort studies likely to occur in future comparisons. Meta-analyses were performed with the R package meta [73] using random-effects models [69].
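The pooling steps above (random-effects model, Paule–Mandel τ², I², and a 95% prediction interval) can be sketched for log-transformed RoRs roughly as follows. This is a simplified illustration, not the actual analysis code (which used the R package meta), and it uses a normal quantile throughout, whereas a t-quantile with k−2 degrees of freedom is often preferred for prediction intervals:

```python
import math

def pool_random_effects(y, se, z=1.96):
    """Random-effects pooling of k log-scale estimates y with standard errors
    se. tau^2 is estimated with the iterative Paule-Mandel estimator; returns
    the pooled estimate, its 95% CI, I^2 (from Cochran's Q), tau^2, and a 95%
    prediction interval (normal approximation). Hypothetical helper."""
    k = len(y)
    v = [s ** 2 for s in se]
    tau2 = 0.0
    for _ in range(200):  # Paule-Mandel: solve generalized Q(tau2) = k - 1
        w = [1.0 / (vi + tau2) for vi in v]
        mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))
        dq = sum(wi ** 2 * (yi - mu) ** 2 for wi, yi in zip(w, y))
        if dq <= 0:
            break
        step = (q - (k - 1)) / dq  # Newton step; dQ/dtau2 = -dq
        tau2 = max(0.0, tau2 + step)
        if abs(step) < 1e-10 or (tau2 == 0.0 and step < 0):
            break
    w = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    se_mu = math.sqrt(1.0 / sum(w))
    w0 = [1.0 / vi for vi in v]  # fixed-effect weights for Cochran's Q
    mu0 = sum(wi * yi for wi, yi in zip(w0, y)) / sum(w0)
    q0 = sum(wi * (yi - mu0) ** 2 for wi, yi in zip(w0, y))
    i2 = max(0.0, (q0 - (k - 1)) / q0) if q0 > 0 else 0.0
    ci = (mu - z * se_mu, mu + z * se_mu)
    half = z * math.sqrt(tau2 + se_mu ** 2)  # prediction-interval half-width
    return mu, ci, i2, tau2, (mu - half, mu + half)
```

Exponentiating the pooled log-scale value and both intervals yields the pooled RoR, its confidence interval, and the 95% prediction interval.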

Table 2 Effect estimates and overall PI/ECO-similarity degree for each included body of evidence-pair

Subgroup and sensitivity analyses

We performed pre-specified and post hoc subgroup analyses to explore factors potentially related to the disagreement of effect estimates. The study protocol specified subgroup analysis by degree of PI/ECO-similarity and intervention type (drug, invasive procedure, nutrient, vaccine). Post hoc subgroup analyses were performed by the type of binary effect estimate (RR, OR, HR), type of intervention stratified by degree of PI/ECO-similarity, and type of outcome (e.g., CVD outcomes, cancer outcomes). We performed a post hoc multivariable meta-regression among “similar but not identical” BoE pairs with binary outcomes. For each PI/ECO-domain, the average effect on the pooled RoR of the category “similar but not identical” was evaluated as compared to the reference category “more or less identical.” We performed two post hoc sensitivity analyses: First, by including only the BoE pair from each systematic review with the highest number of RCTs (if the number of RCTs was equal, we primarily included the BoE with the highest number of participants, followed by the highest number of events, followed by the highest number of cohort studies) and second, by direction of cohort study summary effect estimate (HR, OR, RR <1 vs. HR, OR, RR ≥1).
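As an illustration of the meta-regression step, a univariable weighted least-squares sketch is shown below. This is our own simplification (one binary covariate and fixed-effect inverse-variance weights, whereas the study used a multivariable random-effects meta-regression), and the function name is hypothetical:

```python
def weighted_metareg(y, se, x):
    """Weighted least-squares regression of log-RoRs y on one binary
    covariate x (e.g., 1 = domain rated "similar but not identical",
    0 = "more or less identical"), with weights 1/se^2. exp(slope) is the
    factor by which the pooled RoR changes for x = 1 versus x = 0.
    x must not be constant (the normal equations would be singular)."""
    w = [1.0 / s ** 2 for s in se]
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx ** 2  # closed-form 2x2 normal equations
    intercept = (swxx * swy - swx * swxy) / det
    slope = (sw * swxy - swx * swy) / det
    return intercept, slope
```

In the multivariable case, one dummy covariate per PI/ECO-domain enters the model jointly, so each slope is adjusted for the other domains.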

Patient involvement

No patients were involved in setting the research question or the outcome measures, nor were they involved in developing plans for the design or implementation of the study. No patients were asked for advice on interpretation or writing up of results. There are no plans to disseminate the results of the research to study participants or the relevant patient community.

Results

The literature search identified 1362 records, of which 234 full texts were assessed for inclusion and 64 systematic reviews were included in this study (Additional file 1: Fig. S1 and Table S3). Overall, we included 129 BoE pairs [21–56, 58–61, 67, 74–96] (Table 2). Three journals contributed the majority of systematic reviews (n = 51; 80%): the BMJ (n = 22), Annals of Internal Medicine (n = 15), and the Cochrane Database of Systematic Reviews (n = 14). The number of studies ranged from 1 to 41 (median: 4) in BoE from RCTs and from 1 to 68 (median: 5) in BoE from cohort studies. The number of participants ranged from 99 to 437,600 (median: 3541) in BoE from RCTs and from 162 to 1,934,183 (median: 12,850) in BoE from cohort studies. We performed re-analyses for 70 BoE pairs from 38 systematic reviews [22–25, 27–56, 58–61].

Interventions in BoE pairs (n = 129) consisted of invasive procedures (n = 44), drugs (n = 40), nutrition (n = 32), vaccines (n = 9), birth assistance (n = 2), blood transfusions (n = 1), and cardiopulmonary resuscitation (n = 1). The outcomes of the 129 BoE pairs were categorised as follows: all-cause mortality (n = 28), CVD outcomes (n = 27), drug safety outcomes including adherence outcomes (n = 20), infection outcomes (n = 14), orthopedic outcomes (n = 13), obstetrical outcomes (n = 10), oncological outcomes (n = 9), metabolic outcomes (n = 3), urological outcomes (n = 3), and neurological outcomes (n = 2).

The most frequently used tools for risk of bias assessment were the Cochrane risk of bias tool [97] for 94 (73%) BoE from RCTs and the Newcastle Ottawa scale [98] for 61 (47%) BoE from cohort studies. Certainty of the evidence ratings using GRADE [99] or Agency for Healthcare Research and Quality criteria [100] were available for 38 BoE from RCTs and 31 BoE from cohort studies. Study characteristics for each BoE including effect estimates, detailed descriptions of PI/ECO, the certainty of the evidence ratings, and study quality/risk of bias ratings of primary studies are depicted in Additional file 1 (Tables S4-S7); Additional file 1 (Table S8) shows an overview of the instruments that were used for risk of bias assessment.

Similarity degree

Two (1.5%) BoE pairs were rated as “more or less identical,” 90 (69.8%) as “similar but not identical,” and 37 (28.7%) as “broadly similar.” The rating “broadly similar” was due to differences in study populations (n = 16), in interventions and comparators (n = 20), or in both population and outcome (n = 1) (Table 3, Additional file 1: Table S9).

Table 3 Ratings of PI/ECO-similarity degree for the included body of evidence-pairs by each PI/ECO-element

Statistical heterogeneity of included individual comparisons

Median I2 was 8% across meta-analyses of RCTs and 46% across meta-analyses of cohort studies. For binary outcomes, median I2 was 4% for meta-analyses of RCTs and 44% for meta-analyses of cohort studies. For continuous outcomes, median I2 was 9% across meta-analyses of RCTs and 69% across meta-analyses of cohort studies. Median I2 across meta-analyses with binary outcomes stratified by PI/ECO-similarity degree indicated higher statistical heterogeneity for “broadly similar” BoE: I2 was 23% for meta-analyses from RCTs and 62% for meta-analyses from cohort studies, whereas for “more or less identical” BoE, I2 was 0% for meta-analyses of RCTs and 34% for meta-analyses of cohort studies (Additional file 1: Table S10).

Meta-epidemiological analysis

Pooling RoRs across BoE pairs with binary outcomes resulted in a pooled RoR of 1.04 (95% CI 0.97 to 1.11; n = 120) with considerable statistical heterogeneity (I2 = 69%; τ2 = 0.061; 95% PI 0.63 to 1.71) (Fig. 1 and Table 4). Differences of MDs in continuous outcomes (n = 9) were mostly small, with the exception of operation duration for two types of knee prostheses where clear disagreement was shown [42] (Fig. 2).

Fig. 1
figure 1

Forest plot for binary outcomes, pooled ratio of ratios (RoR) for bodies of evidence from randomized controlled trials vs. cohort studies stratified by type of effect measure. CSs cohort studies, DPP-4 dipeptidyl peptidase 4, DHA docosahexaenoic acid, EPA eicosapentaenoic acid, HR hazard ratio, NSTE-ACS non-ST elevation acute coronary syndrome, OR odds ratio, RCTs randomized controlled trials, RHR ratio of hazard ratios, ROR ratio of odds ratios, RR risk ratio, RRR ratio of risk ratios, SGLT-2 sodium glucose transporter 2

Table 4 Overview of main results for binary outcomes (n=120)
Fig. 2
figure 2

Forest plot for continuous outcomes, pooled difference of mean differences (DMD) for bodies of evidence from randomized controlled trials vs. cohort studies. CSs cohort studies, DMD difference of mean differences, MD mean difference, RCTs randomized controlled trials

Subgroup analyses

For BoE pairs using RRs as the summary effect estimate, the pooled RoR was 1.02 (95% CI 0.94 to 1.11; I2=73%; τ2=0.072; 95% PI 0.60 to 1.75; n=85); for ORs and HRs, the pooled RoR was 1.11 (95% CI 0.98 to 1.25; I2=48%; τ2=0.039; 95% PI 0.72 to 1.70; n=30) and 1.01 (95% CI 0.78 to 1.30; I2=31%; τ2=0.026; 95% PI 0.52 to 1.95; n=5), respectively (Fig. 1 and Table 4).

Analysis by overall PI/ECO-similarity degree of BoE-pairs showed a pooled RoR of 1.17 (95% CI 0.90 to 1.51; I2=0%; τ2=0.00; n=2) across “more or less identical,” 1.06 (95% CI 0.99 to 1.14; I2=54%; τ2=0.034; 95% PI 0.73 to 1.54; n=81) across “similar but not identical,” and 0.99 (95% CI 0.85 to 1.16; I2=82%; τ2=0.149; 95% PI 0.45 to 2.21; n=37) across “broadly similar” BoE-pairs (Fig. 3 and Table 4). Results of analyses by similarity of each PI/ECO-domain are depicted in Additional file 1 (Fig. S2a-d); in BoE-pairs with a “broadly similar” intervention, the pooled RoR indicated the largest disagreement, and statistical heterogeneity was highest (RoR: 1.14, 95% CI 0.87 to 1.49; I2=86%; τ2=0.194; 95% PI 0.42 to 3.08; n=15) (Additional file 1: Fig. S2b). The multivariable meta-regression compared, for each PI/ECO-domain, the category “similar but not identical” with the reference category “more or less identical” among the 81 BoE-pairs with binary outcomes rated as “similar but not identical” overall: on average, the pooled RoR changed by a factor of 1.14 for populations, 0.89 for interventions, 1.12 for comparators, and 1.02 for outcomes. None of the meta-regression results was statistically significant (Table 5).

Fig. 3
figure 3

Forest plot for binary outcomes, pooled ratio of ratios (RoR) for bodies of evidence from randomized controlled trials vs. cohort studies stratified by overall PI/ECO*-similarity degree. *PI/ECO population, intervention/exposure, comparator, outcome, CSs cohort studies, DPP-4 dipeptidyl peptidase 4, DHA docosahexaenoic acid, EPA eicosapentaenoic acid, HR hazard ratio, NSTE-ACS non-ST elevation acute coronary syndrome, OR odds ratio, RCTs randomized controlled trials, RHR ratio of hazard ratios, ROR ratio of odds ratios, RR risk ratio, RRR ratio of risk ratios, SGLT-2 sodium glucose transporter 2

Table 5 Multivariable meta-regression for each PI/ECO-domain across body of evidence-pairs with binary outcomes within the category “similar but not identical”

Our analyses stratified by type of intervention showed the following: the pooled RoR was 1.04 (95% CI 0.89 to 1.21; I2=76%; τ2=0.139; 95% PI 0.48 to 2.24; n=40) for drugs, 1.00 (95% CI 0.91 to 1.10; I2=25%; τ2=0.011; 95% PI 0.79 to 1.26; n=39) for invasive procedures, 1.07 (95% CI 0.98 to 1.16; I2=71%; τ2=0.023; 95% PI 0.77 to 1.48; n=28) for nutrition interventions, 1.24 (95% CI 0.87 to 1.75; I2=80%; τ2=0.177; 95% PI 0.42 to 3.63; n=9) for vaccines, 0.97 (95% CI 0.62 to 1.52; I2=0%; τ2=0; n=2) for birth assistance, 0.38 (95% CI 0.18 to 0.77; n=1) for blood transfusion, and 0.79 (95% CI 0.62 to 1.00; n=1) for cardiopulmonary resuscitation (Table 4, Additional file 1: Fig. S3). Exploratory analyses stratified by PI/ECO-similarity degree within intervention subgroups (Additional file 1: Fig. S3a-e) showed disagreement between both BoE for drugs, with divergence between BoE-pairs rated as “broadly similar” (RoR: 0.79, 95% CI 0.56 to 1.11; I2=69%; τ2=0.290; 95% PI 0.23 to 2.71; n=14) and BoE-pairs rated as “similar but not identical” (RoR: 1.20, 95% CI 1.05 to 1.37; I2=67%; τ2=0.050; 95% PI 0.74 to 1.94; n=26) (Additional file 1: Fig. S3b). For “broadly similar” BoE pairs from nutrition research, differences in effect estimates between both BoE were observed (RoR: 1.17, 95% CI 1.03 to 1.33; n=11) (Additional file 1: Fig. S3c). An exploratory analysis excluding BoE-pairs evaluating effects of vitamin D or calcium (n=8) resulted in estimates that were more in agreement (RoR: 1.09, 95% CI 1.04 to 1.14; I2=0%; τ2=0.00; 95% PI 1.04 to 1.15; n=20), and statistical heterogeneity disappeared (Additional file 1: Fig. S4).
Analysis of BoE pairs evaluating vaccines indicated a higher extent of disagreement for “broadly similar” BoE-pairs (RoR: 1.37, 95% CI 0.86 to 2.17; I2=90%; τ2=0.177; 95% PI 0.17 to 10.88; n=4) compared with “similar but not identical” BoE-pairs (RoR: 1.09, 95% CI 0.62 to 1.92; I2=58%; τ2=0.177; 95% PI 0.19 to 6.45; n=5) (Additional file 1: Fig. S3d).

Stratified analyses by outcome-category are shown in Additional file 1 (Fig. S5) and Table 4. The pooled RoR was 0.94 (95% CI 0.82 to 1.09; I2=80%; τ2=0.075; 95% PI 0.53 to 1.69; n=28) for BoE pairs reporting all-cause mortality, 1.12 (95% CI 1.02 to 1.23; I2=43%; τ2=0.022; 95% PI 0.81 to 1.55; n=26) for CVD outcomes, and 1.06 (95% CI 0.89 to 1.26; I2=67%; τ2=0.068; 95% PI 0.60 to 1.90; n=20) for drug safety outcomes.

The results of the sensitivity analysis where only one outcome (with the largest number of RCTs) was chosen from each systematic review confirmed findings from the main analysis (RoR: 1.08, 95% CI 0.97 to 1.20; I2=76%; τ2=0.097; 95% PI 0.57 to 2.03; n=60) (Additional file 1: Fig. S6). Sensitivity analysis by direction of effect yielded a pooled RoR of 1.18 (95% CI 1.10 to 1.27; I2=61%; τ2=0.046; 95% PI 0.77 to 1.82; n=79) and 0.81 (95% CI 0.76 to 0.87; I2=16%; τ2=0.005; 95% PI 0.69 to 0.95; n=41) for BoE pairs where the cohort study effect estimate was <1 and ≥1, respectively (Additional file 1: Fig. S7).

Discussion

Summary of findings

This large meta-epidemiological study identified and compared empirical data investigating the same medical research question to determine the extent to which effect estimates of BoE from RCTs and cohort studies are in agreement. Overall, 129 BoE pairs derived from 64 systematic reviews were included in the analyses. Only two BoE pairs were rated as “more or less identical” according to PI/ECO-similarity. For binary outcomes, the pooled RoR showed that, on average, the extent of deviations towards larger and smaller effect estimates in BoE from RCTs versus cohort studies was almost identical. Differences of effect estimates between the two BoE for continuous outcomes were mostly small. Subgroup analyses by intervention type, type of effect measure, and outcome category showed that, on average, there was little indication of overall differences between both BoE (with the exception of the subgroups for ORs and CVD outcomes). Even though the pooled RoR showed that on average effect estimates did not differ, this does not preclude important differences in individual comparisons and/or studies.

Pooling RoRs from BoE-pairs with pharmacological interventions resulted in high statistical heterogeneity. The pooled RoR was similar to the main analysis in BoE pairs with a higher and lower degree of PI/ECO-similarity between both BoE. However, when pooling RoRs, statistical heterogeneity was highest across BoE pairs with the most dissimilar PI/ECO and PIs were substantially wider. Analysis of the pooled RoR by direction of effect in cohort studies indicated differences between both study types. Post hoc analyses revealed that statistical heterogeneity was higher across meta-analyses from “broadly similar” than “similar but not identical” BoE pairs, and higher across cohort studies compared to RCTs.

Comparison with other studies

General medical field

The Cochrane review by Anglemyer et al. [18] evaluated the agreement of effect estimates between RCTs and observational studies in a sample of methodological reviews. Across nine reviews with specific estimates for RCTs versus cohort studies, they computed a pooled RoR of 1.04 (95% CI 0.89 to 1.21), which was nearly identical to our pooled RoR of 1.04 (95% CI 0.97 to 1.11). In the RCT versus cohort analysis, the overall difference of effect estimates was small for seven of the nine studies; two studies [101, 102] showed discordance in different directions, with RoRs of 0.71 and 3.58, respectively. Anglemyer et al. [18] concluded that, on average, the difference of effect estimates between observational studies and RCTs is negligible and proposed that future work should explore factors other than study design alone that could explain occurring differences of effect estimates. In contrast to Anglemyer et al. [18], we performed more detailed data extraction, investigated PI/ECO-similarity degree, and calculated PIs. This allowed us to better understand potential differences. We evaluated statistical heterogeneity on different levels and showed that, across the included meta-analyses as well as within the pooled RoR, median statistical heterogeneity and PIs were highest across PI/ECO-dissimilar BoE-pairs, and higher across cohort studies compared with RCTs. Further, analysis by each PI/ECO-domain showed that differences of interventions were the main drivers of disagreement; within the category “similar but not identical,” meta-regression showed that the average effects on the pooled RoR resulting from differences in populations, interventions, and comparators were comparably large, albeit not statistically significant.

Other research fields

Hong et al. [103] conducted a meta-epidemiological study comparing 74 pairs of summary effect estimates from RCTs and observational studies in the field of pharmacology. On average, differences were small, albeit with considerable between-study variability, which is in line with our findings. Anglemyer et al. [18] showed differences between RCTs and all observational BoE for pharmacological studies (RoR: 1.17, 95% CI 0.95 to 1.43). In contrast, in our analysis, the pooled RoR for pharmacological BoE pairs was similar to the main analysis (RoR: 1.04, 95% CI 0.89 to 1.21). However, in stratified analyses, PI/ECO-similarity degree was an important driver of discordance across pharmacological BoE pairs: for “similar but not identical” BoE-pairs, the RoR was 1.20, and for “broadly similar” BoE-pairs, the RoR was 0.79, with considerable statistical heterogeneity (I2=67% and 69%, respectively). We found important differences of interventions in “broadly similar” BoE pairs. For example, early antiretroviral therapy initiated at high CD4-cell counts in RCTs may be more likely to prevent human immunodeficiency virus infection than interventions initiated at various disease stages in cohort studies [77]. Also, exposure to digoxin after myocardial infarction (MI) can increase mortality, whereas in chronic heart failure (CHF) with sinus rhythm the effect on mortality is known to be more neutral [104, 105]. Hence, RCTs can show lower mortality when including populations with CHF and sinus rhythm than cohort studies that include MI survivors [96]. Many BoE pairs rated as “similar but not identical” were from the cardiovascular field [40, 47, 48, 53, 96]. Both BoE from RCTs and cohort studies often included mixed populations with acute and non-acute CVD [40, 47, 48]; this drives PI/ECO-dissimilarity and may increase statistical heterogeneity.
A recent meta-epidemiological study has shown that differences in effect estimates between nutrition RCTs and cohort studies were mainly driven by dissimilarities in population, intervention or exposure, comparator, and outcome [20]. Franklin et al. [106] emulated ten selected pharmacological RCTs using observational data sets. For nine of the included RCT emulations, differences of effect estimates were within the range of random variation. Disagreement was largest in comparisons with active comparators in the observational data and placebo in the RCTs. The authors concluded that similar active comparators in RCTs and observational studies increase the probability of agreement and stressed that differences in methods have a substantial impact on the finding of agreement.

Potential implications

RCTs are considered the gold standard to evaluate causal inference for medical interventions [13]. For a variety of reasons, such as low external validity [7, 9] and limited availability of RCTs [5], health care professionals and other decision-makers increasingly rely on results from observational studies. However, results from RCTs and observational studies can differ [15, 18, 107], and efforts to understand under which circumstances this occurs are ongoing [106]. Our study provides valuable insights into the field of general and internal medicine, but also into other important research fields such as public health. We showed that BoE from RCTs and cohort studies included in systematic reviews from high-impact factor medical journals often differ in terms of study populations (e.g., different disease status), interventions and comparators (e.g., different intervention timing, different drugs of the same class), or outcomes (e.g., late-stage disease versus any disease). Our data highlight the importance of PI/ECO-differences, especially those of interventions, in explaining differences of effect estimates. As a perspective, evaluating differences in factors such as study size, follow-up time, or publication date may serve to further explore disagreement between the two study design types. However, other factors require equal attention. Appropriate adjustment for confounding is a necessary precondition for considering results from observational studies, and residual confounding remains a major concern [108]. To deal with these uncertainties, evaluating the risk of bias is of tremendous importance for assessing the trustworthiness of findings. In our sample, the Cochrane risk of bias tool [97] for RCTs and the Newcastle Ottawa scale (NOS) [98] for cohort studies were mainly used, along with a variety of other instruments to rate risk of bias/study quality.
We assume that increased use of the ROBINS-I tool [109] may facilitate integrating both BoE in evidence syntheses and enable analyses by risk of bias and certainty of the evidence in methodological studies. The ROBINS-I tool is based on the target trial approach [110] and permits a better comparison of evidence from RCTs and observational studies. This will be useful for investigating the influence of bias on differences between findings from RCTs and cohort studies. In general, cohort studies may serve as a source of complementary or sequential information, or even replace findings from RCTs [11]. In evidence synthesis, cohort studies are sometimes included as a complementary source of evidence to increase the precision and/or generalizability of findings [12]. However, caution is warranted when pooling both BoE since, as shown in our study, PI/ECO-differences between both BoE are common, and cohort studies showed higher statistical heterogeneity.

Strengths and limitations

Our study has several strengths. First, we included a large sample of BoE pairs (n = 129) derived from 64 systematic reviews with a high number of RCTs and cohort studies. The BoE pairs covered a broad range of medical topics from high-impact-factor medical journals. Second, extensive data extraction, including a detailed description of the population, intervention, comparator, outcome, risk of bias ratings, and length of follow-up, conducted by two reviewers independently, allowed us to rigorously explore the clinical and design features of the included BoE. Third, our analysis included an evaluation of the agreement of effect estimates across the included BoE pairs for both binary and continuous effect estimates. We stratified the analyses by type of binary effect measure, intervention type, and outcome category. For the first time in the general medical field, we implemented an approach that allowed us to explore the influence of PI/ECO-differences on the disagreement of effect estimates.

Several limitations should be considered as well: First, meta-epidemiological studies such as ours are based on observational analyses and can therefore show only non-causal associations [111, 112]. Factors such as publication date can act as meta-confounders. Further, we did not take the risk of bias/study quality and the certainty of the evidence into account in the quantitative analysis, since the tools used by the systematic review authors were highly heterogeneous and the corresponding information was often not reported sufficiently in the systematic reviews. However, we addressed the potential for bias in our sample as follows: we showed that, on average, the effect estimates were in agreement (as shown by the pooled RoR), making systematic bias towards smaller or larger effect estimates unlikely. Potential bias may also exist in individual BoE pairs and influence the RoRs in addition to PI/ECO-differences. However, we showed that PI/ECO-dissimilarities were important drivers of statistical heterogeneity and wide PIs. Further, bias may affect individual cohort studies, causing higher statistical heterogeneity in meta-analyses [13]. Accordingly, in our sample, statistical heterogeneity in meta-analyses of cohort studies (median I² = 46%) was higher than in meta-analyses of RCTs (median I² = 8%). We did not explore whether disagreement with RCTs differed between prospective and retrospective cohort studies, because the corresponding information was reported in a suboptimal manner and researchers may use inconsistent nomenclature [113, 114]. Second, we did not evaluate the methodological quality of the included systematic reviews; however, given that we focused on high-impact journals, we assumed that the published systematic reviews are of reasonably high methodological quality.
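The agreement metric mentioned above, the ratio of ratios (RoR), can be illustrated with a minimal sketch. The function below, using hypothetical pooled relative risks and 95% confidence intervals, recovers standard errors from the CIs and compares the two pooled estimates on the log scale; it illustrates the general meta-epidemiological approach, not the authors' actual analysis code.

```python
import math

def ratio_of_ratios(rr_rct, ci_rct, rr_cohort, ci_cohort, z=1.96):
    """Illustrative RoR between two pooled relative risks.

    Standard errors are recovered from the 95% CIs on the log scale;
    an RoR of 1 indicates agreement between the two bodies of evidence.
    """
    # SE = (ln(upper) - ln(lower)) / (2 * z) for a 95% CI
    se_rct = (math.log(ci_rct[1]) - math.log(ci_rct[0])) / (2 * z)
    se_cohort = (math.log(ci_cohort[1]) - math.log(ci_cohort[0])) / (2 * z)
    # RoR on the log scale, with SEs combined in quadrature
    log_ror = math.log(rr_rct) - math.log(rr_cohort)
    se_ror = math.sqrt(se_rct**2 + se_cohort**2)
    ror = math.exp(log_ror)
    ci = (math.exp(log_ror - z * se_ror), math.exp(log_ror + z * se_ror))
    return ror, ci

# Hypothetical BoE pair: identical pooled RRs yield RoR = 1 (agreement),
# with a CI that reflects the uncertainty of both pooled estimates.
ror, ci = ratio_of_ratios(0.80, (0.70, 0.92), 0.80, (0.68, 0.94))
```

An RoR below or above 1 would indicate a systematically smaller or larger pooled effect in one study design type; in the present sample, the pooled RoR was close to 1, which is what makes systematic bias in either direction unlikely.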
Third, even though the rating of the degree of PI/ECO-similarity was performed by two reviewers using predefined criteria, this process is still partly subjective, and the ratings may have been too strict, since only two BoE pairs were judged as “more or less identical.” Further, PI/ECO-dissimilarities in BoE pairs were usually present in more than one PI/ECO-domain; this complicates drawing conclusions about the difference in effect estimates that results from a PI/ECO-dissimilarity in a single domain (e.g., from a difference in interventions). Fourth, performing several subgroup analyses might increase the likelihood of chance findings. However, most of these analyses did not find any subgroup differences, thereby increasing our confidence in the findings of the main analysis. Further, with the exception of the analyses by PI/ECO-similarity degree and intervention type, the subgroup analyses were performed post hoc; however, the analyses by type of effect estimate and outcome category were planned before the main analysis was conducted. Fifth, some degree of overlap between BoE cannot be ruled out, since some primary studies contributed to more than one included BoE. This might have artificially increased the precision of our findings. However, a sensitivity analysis including only one outcome per systematic review showed findings similar to the main analysis. Sixth, with regard to the search strategy, choosing another time frame might yield different results; however, we chose the dates to cover a 10-year period (January 01, 2010, to December 31, 2019). Further, the restriction to BoE pairs from the same systematic review may limit the representativeness of the sample. However, the main alternative, i.e., the inclusion of BoE from matched systematic reviews of RCTs and cohort studies, may have other drawbacks, such as impaired comparability of systematic review methodology.

Conclusions

On average, the pooled effect estimates of RCTs and cohort studies did not differ. Statistical heterogeneity and wide PIs were mainly driven by PI/ECO-dissimilarities (i.e., clinical heterogeneity) and by cohort studies. Differences in interventions were the main drivers of disagreement; however, when focusing on “similar but not identical” BoE pairs (i.e., those with at least moderate similarity), the degree of similarity (“similar but not identical” versus “more or less identical”) in populations, interventions, or comparators affected the average effect more than the degree of similarity in outcomes, albeit not statistically significantly. The quantitative analysis did not assess how the risk of bias and the certainty of the evidence influenced disagreement in addition to PI/ECO-dissimilarities. Upcoming meta-epidemiological studies may further explore the impact of risk of bias, certainty of the evidence, and residual confounding on differences in effect estimates between RCTs and cohort studies.