Introduction

The growth of electronic medical records and other ‘real-world’ digitised sources of clinical data has led to a proliferation of observational studies of the effectiveness of clinical interventions. While randomised controlled trials (RCTs) remain the scientific standard for assessing intervention efficacy, well-designed observational studies have been used to elucidate potential harms and to expand the evidence base where existing RCTs have limited generalisability because of selective patient enrolment or outcome reporting, or where new RCTs are logistically very difficult to perform [1]. The main concern with observational studies is their vulnerability to bias, particularly confounding by indication [2], whereby patients receive a therapy based on patient or clinician characteristics which may not be explicitly stated or recorded, but which are prognostically important and influence the outcome of interest independently of the therapy [3]. In the past, influential observational studies helped entrench scores of clinical practices for decades before RCTs, in which randomisation eliminated selection bias in who received the experimental therapy, showed those practices to be ineffective or even harmful [4].

Nevertheless, reviews of observational studies suggest that they often report effects and generate inferences similar to those of RCTs studying the same therapy in similar populations with similar outcome measures [5,6,7]. Advances in study design, statistical methods and clinical informatics have the potential to lend greater rigour to observational studies [1]. Multiple guidelines detail methodological [8] and reporting [9] standards, and several instruments exist for assessing study quality [10,11,12,13]. Although systems for grading evidence quality, such as Grades of Recommendation, Assessment, Development and Evaluation (GRADE), rank observational studies as lower-quality evidence than RCTs, such studies can be regarded as sources of valid data if they are well designed, show large effect sizes and account for all plausible confounders [14]. Many systematic reviews include both RCTs and high-quality observational studies in deriving causal inferences [15].

However, the level of trustworthiness of observational studies remains controversial. We hypothesised that, due to advances in observational research, such studies are becoming more rigorous and valid. The aim of this meta-epidemiological study was to assess the methodological quality of a sample of recently reported non-randomised clinical studies of commonly used clinical interventions, and ascertain if quality is improving over time.

Methods

In reporting this study, we applied the guidelines for meta-epidemiological methodology research proposed by Murad and Young [16]. No a priori study protocol existed or was registered.

Study selection

A backward search from December 31, 2018 to January 1, 2012 was performed using PubMed, with no language filters, to identify observational studies of therapeutic interventions. Search terms comprised ‘association’, ‘observational’, ‘non-randomised’ or ‘comparative effectiveness’ within titles or abstracts. We included studies which: involved clinician-mediated therapeutic interventions administered directly to adult patients; compared two concurrent therapeutic interventions, which could include ‘usual care’; and reported outcomes that included patient-important sentinel events (ie mortality, serious morbid events, hospitalisations) rather than solely disease or symptom control measures.
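The exact query string was not reported; the sketch below is a hypothetical reconstruction of the described search using Biopython's Entrez client, with the terms and date window taken from the text and everything else (field tags, retmax, contact email) purely illustrative.

```python
from Bio import Entrez  # Biopython's wrapper around NCBI E-utilities

Entrez.email = "you@example.org"  # NCBI requires a contact address

# Hypothetical reconstruction: title/abstract terms, no language
# filter, publication dates spanning January 1, 2012 to
# December 31, 2018.
query = ('association[tiab] OR observational[tiab] OR '
         'non-randomised[tiab] OR "comparative effectiveness"[tiab]')

handle = Entrez.esearch(db="pubmed", term=query, datetype="pdat",
                        mindate="2012/01/01", maxdate="2018/12/31",
                        retmax=2000)
result = Entrez.read(handle)
print(result["Count"], "records matched;",
      len(result["IdList"]), "PMIDs retrieved")
```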

We excluded studies that: 1) featured case-control comparisons, historical controls only, single-arm cohorts, adjunct therapies, diagnostic tests (with no associated therapeutic intervention) or cost-effectiveness analyses (with no separate comparative effectiveness data); 2) compared a single intervention group with a placebo group; 3) comprised RCTs, or reviews and meta-analyses of either RCTs or observational studies; 4) involved paediatric, anaesthetic or psychiatric interventions or patients; 5) analysed therapies which were highly specialised or not in common use (eg genetically guided therapies, investigational agents in research settings); 6) assessed effects of system-related innovations rather than direct clinical care (eg funding or governance structures); 7) studied non-medical interventions (eg effects on cardiovascular outcomes of reducing salt consumption or increasing physical activity); 8) studied exposures, risk factors or prognostic factors that may influence therapeutic effectiveness but did not involve head-to-head comparisons of two interventions (eg effects of dose changes or co-interventions); or 9) were descriptive studies with no reporting of outcome measures. One author (SG) performed the search and initial study selection, with subsequent independent review by the second author (IAS).

Data collection

From each selected study we extracted the following data: study title, journal, and date of publication; rationale stated in the introduction for choosing an observational study design; existence of a pre-specified study protocol; patient selection criteria; methods of data collection from primary sources; reference to validation checks for coded administrative data or longitudinal data linkage processes; methods for minimising recording bias (in administrative data), recall bias, social desirability bias, and surveillance bias (in clinical registry data); methods for assessing clinical outcomes; choice of predictor variables; population characteristics and statistical methods used for balancing populations; imputation methods used for missing data; subgroup analyses and interaction testing for identifying independent prognostic variables; use of unplanned post-hoc analyses; statistical methods used for adjusting for multiple outcome analyses, clustering effects (in multicentre studies) and time-dependent bias; effect size and confidence intervals; sensitivity analyses for unmeasured confounders; stated intervention mechanism of action, temporal relation between intervention and outcomes, and dose-response relationships; any falsification tests performed; comparisons with results of other similar studies; and statements about study limitations and implications for clinical practice.

Application of quality criteria

Both authors independently read the full-text articles of included studies and applied to each study a list of 35 quality criteria which were based, with some modification, on those the authors have previously published (Table 1) [17] and which covered criteria listed in previously cited critical appraisal and reporting guidelines for observational studies [10,11,12,13]. These quality criteria were grouped into 6 categories: justification for observational design (n = 2); minimisation of bias in study design and data collection (n = 11); use of appropriate methods to create comparable groups (n = 6); appropriate adjustment of observed effects (n = 5); validation of observed effects (n = 9); and authors’ interpretations (n = 2).

Table 1 Quality criteria used for assessing observational studies of interventions

For each study, the extent to which each criterion was satisfied was categorised as fully satisfied (Y) – all elements met; partially satisfied (P) – some elements met; not satisfied (N) – no elements met; or not applicable (NA) if that criterion was not relevant to the study design, analytic methods or outcome measures. Inter-rater agreement between authors for criterion categorisation was initially 95.2%, and consensus was reached on all criteria after discussion.
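The agreement figure is simple percent agreement between the two raters. A minimal sketch of that calculation, using hypothetical ratings rather than the study's actual data:

```python
# Percent agreement between two raters over the same set of
# applicable criteria. Each entry is one rater's Y/P/N/NA
# categorisation of one criterion for one study (hypothetical data).
rater1 = ["Y", "P", "Y", "N", "NA", "Y", "P", "Y"]
rater2 = ["Y", "P", "Y", "P", "NA", "Y", "P", "Y"]

agreements = sum(a == b for a, b in zip(rater1, rater2))
percent_agreement = 100 * agreements / len(rater1)
print(f"{percent_agreement:.1f}% agreement")  # 87.5% in this toy example
```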

Summary measures and synthesis of results

For each study, we calculated the proportion of all applicable quality criteria categorised as fully, partially or not satisfied. For each criterion, we calculated the proportion of studies in which it was fully, partially or not satisfied, or not applicable. Across all studies combined, we calculated the proportion of applicable criteria that were fully, partially or not satisfied. Trend analysis assessed whether the proportion of applicable criteria that were fully, partially or not satisfied changed over time for studies published between 2012 and 2018. All analyses were performed using Excel functions or GraphPad software.
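The authors performed these calculations in Excel and GraphPad; the sketch below reproduces the same arithmetic in Python on hypothetical ratings, covering both the per-study proportions and the pooled year-by-year trend.

```python
from collections import defaultdict

# Hypothetical data: study ID -> (publication year, Y/P/N ratings
# over that study's applicable criteria; NA criteria already removed).
studies = {
    "study01": (2012, ["Y", "Y", "P", "N", "Y"]),
    "study02": (2015, ["Y", "P", "P", "Y", "Y", "N"]),
    "study03": (2018, ["Y", "Y", "Y", "P", "N"]),
}

# Per-study proportions of applicable criteria in each category.
for sid, (year, ratings) in studies.items():
    n = len(ratings)
    props = {cat: round(ratings.count(cat) / n, 2) for cat in "YPN"}
    print(sid, year, props)

# Trend: pool all ratings by publication year and track the
# proportion of applicable criteria fully satisfied over time.
by_year = defaultdict(list)
for year, ratings in studies.values():
    by_year[year].extend(ratings)
for year in sorted(by_year):
    pooled = by_year[year]
    print(year, round(pooled.count("Y") / len(pooled), 2))
```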

Results

Study characteristics

The literature search identified 1076 articles, from which 50 unique studies met selection criteria [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67]. Of these, 28 (56%) assessed non-procedural, mainly pharmacological, therapies [18,19,20,21,22,23,24,26,28,31,32,33,36,38,39,40,43,44,50,53,54,55,56,64,65,66,67], 15 (30%) assessed invasive procedures [27,29,30,34,35,37,41,42,51,52,58,59,60,61,62,63], 4 (8%) assessed investigational strategies [25,45,46,57] and 3 (6%) assessed models of care [47,48,49]. Studies most frequently involved interventions related to cardiology (18/50; 36%) [19,20,21,23,26,30,32,35,39,40,44,50,52,61,64,65,66,67], surgery (13/50; 26%) [27,29,31,34,37,38,42,53,58,59,60,62,63], neurology (4/50; 8%) [28,43,49,54] and oncology (4/50; 8%) [18,25,56,57]. Most studies (36/50, 72%) [18,19,20,21,22,24,25,26,27,28,29,30,31,32,34,35,36,37,40,42,43,44,45,48,50,52,54,55,57,58,59,60,63,64,65] were published in one journal (JAMA), with 13/50 (26%) [33,38,39,41,46,47,49,51,53,56,61,62,66,67] published in another (JAMA Internal Medicine). Sample sizes ranged from 464 participants [25] to 1,256,725 [56]. Study characteristics are summarised in the online supplement, and an example of the application of the quality criteria is presented in Table 2.

Table 2 An example of the application of quality criteria

Analyses of methodological quality and risk of bias

The proportions of applicable criteria which were fully, partially or not satisfied for each study are depicted in Fig. 1. No study fully satisfied all applicable criteria; the mean (±SD) proportion of applicable criteria fully satisfied across all studies was 72% (±10%). This proportion was similar for non-procedural (68% [±9%]) and procedural (70% [±10%]) interventions. The categories of quality criteria with the lowest proportions of fully satisfied criteria were measures used to adjust observed effects (criteria 20, 23, 24) and to validate observed effects (criteria 25, 27, 33).

Fig. 1 Concordance with all quality criteria for individual studies

At the level of individual studies, the proportion of applicable criteria fully or partially satisfied ranged from 60.7% (17 of 28 criteria) to 96.6% (28 of 29 criteria), and the proportion of applicable criteria not satisfied ranged from 3.4% (1 of 29 criteria) to 39.3% (11 of 28 criteria). Only two studies had more than 80% of applicable criteria fully satisfied (Chan et al. at 87% [50] and Friedman et al. at 81% [58]), while two studies fully satisfied only 50% of applicable criteria (Merie et al. [19]; Shirani et al. [28]).

At the level of individual criteria, the proportions of studies in which a specific criterion was fully, partially or not satisfied, or was not applicable, are depicted in Fig. 2. One criterion (recall bias) was not applicable to any study, as informal patient self-report was not used as a data source.

Fig. 2 Concordance with individual quality criteria for all studies combined

Across all studies, criteria associated with high levels (≥80%) of full satisfaction (where applicable) comprised: appropriate statistical methods (most commonly propensity-based methods) used to ensure balanced groups (100%); absence of social desirability bias, as all studies either used validated, externally administered questionnaires or did not rely on patient self-reported symptoms or function as their primary end-points (100%); coherence of results with other studies of similar interventions (100%); appropriate temporal cause-effect relationships (98%); prospective, validated data collection (97%); plausibility of results (96%); absence of surveillance bias in clinical registry data (95%); formulation of a pre-specified study protocol (94%); consistency of results with similar studies of the same interventions (88%); clear statements on how prognostic variables were selected and measured (86%); data from the majority of the population sample being used in analyses (86%); absence of overstatement of study conclusions (84%); independent blind assessment of outcomes (84%); and adequate matching of patient populations being compared (80%).

Criteria associated with intermediate (51 to 79%) levels of full satisfaction comprised: absence of recording bias in administrative datasets (79%); absence of unplanned post-hoc analyses (76%); statistical exclusion of a potentially beneficial effect in studies concluding no effect or harm (76%); presence of dose-response relationships (75%); adequate accounting for selection bias in patient recruitment (74%); and representativeness of the study population (74%).

Criteria associated with low (≤50%) levels of full satisfaction comprised: imputation or other processes to account for missing data or drop-outs (50%); justification for not performing an RCT (42%); interaction analyses for identifying independent prognostic factors that may have influenced intervention effects (42%); use of statistical correction methods to minimise type 1 error arising from multiple analyses of several different outcome measures (33%); clinically significant effect sizes (30%); residual bias analyses accounting for unmeasured or unknown confounders (14%); and falsification tests for residual confounding (8%).

Trend analysis

The proportions of all applicable criteria that were fully, partially or not satisfied showed no appreciable change over time (Fig. 3).

Fig. 3 Trend analysis of quality criteria concordance over time

Discussion

To our knowledge, this is one of only a few studies to apply a comprehensive list of criteria in assessing the methodological rigour of a cohort of contemporary observational studies of commonly used therapeutic interventions in adult patients, reported in high-impact general medicine journals. Overall, there was a high level of adherence to criteria related to study protocol pre-specification, sufficiently sized and representative population samples, prospective collection of validated and objective data with minimisation of various forms of ascertainment and measurement bias, appropriate statistical methods, avoidance of post-hoc analyses, testing for causality, and impartial interpretation of overall study results. These criteria are central to most critical appraisal guides and reporting guidelines for observational studies and are well known to researchers, which likely explains their high level of adherence.

However, there is room for improvement. On average, each study failed to fully satisfy at least one in four of the quality criteria applicable to it.

The most frequent omission was failure to conduct a falsification (or specificity of effect) test in studies reporting intervention benefits. Such a test checks that benefit is seen for outcomes plausibly attributable to the intervention (eg reduction in myocardial infarctions with coronary revascularisation) but not for a separate outcome most unlikely to be affected by it (eg, in this example, cancer incidence); if benefit appears for both outcomes, the causative factor is probably not the intervention but some confounder affecting both outcomes [68].

Second was the failure to eliminate, by undertaking residual (or quantitative) bias analyses or instrumental variable analyses, the possibility that positive effects could be annulled or attenuated by an unmeasured or unknown confounder. The recently articulated ‘E-value’ and its associated formula denote how prevalent and how sizeable in its effects such a confounder would have to be to negate the observed benefit [69, 70]; a worked sketch of the calculation follows this discussion. Understandably, as this is a recent innovation, studies published before 2017 could not have used the technique, although other methods have been used in the past [71], and this form of bias has been known for decades [72].

Third was the absence of large effect sizes, which, according to GRADE, lessens the likelihood that an observed benefit is real, as small effect sizes provide little buffer against residual confounding [73]. Exactly what constitutes a large enough effect size to counter such confounding remains controversial, with relative risks (RRs) > 2 (or < 0.5) [74], ≥ 5 (or ≤ 0.2) [73] and ≥ 10 (or ≤ 0.1) [75] all cited as reasonable thresholds. We chose the first, least stringent, of these three thresholds as the minimum necessary, cognisant that RRs between 0.5 and 2 are the ones most commonly reported.

Fourth was the absence of correction of statistical significance thresholds (using Bonferroni or other methods) for multiple outcome analyses, which guards against type 1 errors whereby significant but spurious benefits are generated simply by the play of chance [76].

Fifth was the omission of subgroup analyses and statistical interaction testing that could identify effect modifiers which differentially influence intervention effects [77]. Proper use of such analyses appears to be an ongoing challenge for RCTs as well [78].

Sixth was the lack of multiple imputation or similar processes to account for missing data or drop-outs, an omission frequently seen in clinical research [79]. Such analyses assess the potential for observed effects to have been attenuated by unascertained adverse events among those lost to follow-up, particularly when the outcome of interest, such as death, is infrequent.

Finally, many studies failed to provide a substantive reason why an RCT could not be performed in the absence of existing RCTs. While this may arguably not qualify as a quality criterion, we believe researchers are obliged to explain why a study design vulnerable to bias was preferred over a more robust randomised design when no substantive barriers to conducting an RCT existed.
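To make the E-value concrete, here is a minimal sketch based on VanderWeele and Ding's published formula [69]; the example risk ratio is illustrative only, not drawn from any included study.

```python
import math

def e_value(rr: float) -> float:
    """E-value for an observed risk ratio (VanderWeele & Ding, 2017):
    the minimum strength of association, on the risk-ratio scale, that
    an unmeasured confounder would need with both the treatment and
    the outcome to fully explain away the observed association."""
    if rr < 1:            # protective effects: take the inverse first
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))

# Illustrative example: an observed RR of 0.70 (a 30% relative risk
# reduction) gives an E-value of about 2.2, meaning a confounder would
# need risk ratios of at least 2.2 with both treatment and outcome to
# nullify the observed benefit.
print(round(e_value(0.70), 2))  # -> 2.21
```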

A further concern is that, despite the promulgation of reporting guidelines for non-randomised studies and the development of statistical methods for gauging the sensitivity of results to residual bias, our trend analysis indicates little improvement in the methodological quality of studies published between 2012 and 2018. Overall, deficits in statistical analytic methods featured more prominently than deficits in study design and conduct. In particular, the absence of falsification tests, E-value quantification, subgroup analyses with tests for interaction, and adjustment for missing data and multiple comparisons limited the ability of many studies to account for residual confounding in their results.

Limitations

Our study has several limitations. First, despite excellent agreement between authors in categorising levels of criterion satisfaction, this task involves subjective and potentially biased judgement. However, this problem is common to most quality assessment tools [80]. Second, our criteria have not been validated, although few tools have been, and, in any event, our criteria included those contained within other well-publicised instruments which have recognised limitations [81]. Third, some may argue that studies relying solely on multivariable regression models, rather than using propensity score methods to create matched cohorts for primary analyses, should be classed as more vulnerable to bias. However, research has not shown propensity-based approaches to be necessarily superior to regression-based ones [82]. Fourth, we made no attempt to rank or weight criteria according to the magnitude of their potential to bias study results, but, as far as we are aware, no validated weighting method has been reported [83]. Fifth, our chosen threshold for effect size (odds ratio ≤ 0.5 or relative risk reduction ≥ 50%) is arbitrary and may be regarded as too stringent, but it is the upper (most lenient) threshold quoted by other researchers [73,74,75]. Sixth, our small sample of 50 studies, with the majority taken from only 2 journals and identified from searching only one database, is arguably not representative of all observational studies of therapeutic interventions, although PubMed is the database most widely used by practising clinicians to find articles relevant to their practice. The inclusion of the terms ‘association’ and ‘observational’ in our search strategy likely biased study retrieval towards articles published in JAMA and JAMA Internal Medicine, as these journals use these words consistently in their titles and abstracts. However, it is also possible these journals have a greater propensity than other journals to publish observational studies. We would recommend that all journals ask authors to indicate clearly in study titles when a study is observational. While the sample is small, the included studies involved commonly used clinical interventions and, by being published in high-impact journals, have considerable potential to influence practice. Moreover, other investigators have found it difficult to find large numbers of observational studies in specific disciplines over extended periods of time [84].

Conclusion

Contemporary observational studies published in two high-impact journals show limitations that warrant remedial attention from researchers, journal editors and peer reviewers. Reporting guidelines for such studies should promulgate the need for falsification testing, quantification of E-values, effect sizes large enough to denote less vulnerability to residual confounding, appropriate statistical adjustment for multiple outcome analyses, statistical interaction tests for identifying important predictors of intervention effects, complete patient follow-up, and justification for choosing to undertake an observational study rather than an RCT.