Introduction

Evidence-based medicine is the established bedrock of good clinical care. Whilst historically there have been concerns over the strength of the evidence base behind orthopaedic interventions [1], the recent past has seen an increase in the number of randomised clinical trials (RCTs) published in major medical journals, particularly the so-called big five [2]. Understanding the strengths and limitations of these trials is vital to understanding their clinical applicability, and provides a key learning opportunity for future trial design and development.

Growth of the orthopaedic trial community has led to increasing interest in the concept of pragmatic trials, where the focus is on reflecting the real-world applicability of an intervention rather than providing causative explanations for trial outcomes. Concerns have, however, been raised about the risk of overgeneralisation with a pragmatic trial design, along with scepticism about its applicability to every circumstance [3, 4].

The adequacy of reporting [5], design [6] and robustness [7] of clinical trauma and orthopaedic trials has also previously been called into question. A recent analysis [8] of trials published within a specific mainstream orthopaedic journal identified general improvements in the quality of analyses over time, but with trends towards smaller, single-centre trial designs.

However, several larger trials are now reported in high-impact non-orthopaedic medical journals, and were thus excluded from this previous analysis. As a result, little is currently understood about the specifics of design, conduct and reporting of the large-scale trauma and orthopaedic trials published in major general medical journals with a high impact factor.

We therefore set out to examine the quality of evidence produced from RCTs published within this setting. Given the high impact and international influence of these journals, it is essential that the literature produced is of sound methodological quality with a low risk of bias in order to provide high-quality evidence for interventions.

Materials and methods

Study selection

A systematic search of five major high-impact general medical journals, colloquially known as “the big five” (British Medical Journal (BMJ), Journal of the American Medical Association (JAMA), New England Journal of Medicine (NEJM), the Lancet and Annals of Internal Medicine (Annals)), was performed for the period April 2010 to April 2020 using the online bibliographical archives on each journal website. These journals have a combined mean impact factor of 46.8 (https://academic-accelerator.com/Impact-Factor-IF/), far in excess of any trauma and orthopaedic speciality journal. They have previously been used to examine adequacy of trial design in other areas of healthcare and provide a gold standard reference for trial quality, given their exclusivity and publication standards [9]. Screening of full-text articles was performed by one author (VK) and verified by another (LF). All articles pertaining to any area of trauma and orthopaedics that described a randomised clinical trial of a surgical treatment-based intervention were included. Articles describing other surgical fields, or pertaining specifically to non-surgical interventions, were excluded.

Data collection

Data extraction was performed using a standardised proforma by three independent reviewers (VK, AA and TG). Any disagreement was mediated by a fourth individual (LF) until a consensus decision was reached. Given that the purpose of this study was to reflect the available literature, study authors were not contacted where data were missing. All published or freely accessible data sources for each study (for example, study protocols or trial monographs) were, however, utilised. The data fields extracted for each included trial and included in the analysis are displayed in Table 1.

Table 1 Data fields

Statistical analysis

Descriptive analyses of overall trial characteristics were performed. Counts and percentages, n (%), were calculated for categorical variables, with median values and ranges presented for continuous variables, given that these were all non-normally distributed.

For dichotomous outcomes, the predicted control group event rate (identified from the sample size calculation) was compared with the actual control group event rate. Study pragmatism was compared between trials with significant and non-significant results using an unpaired two-tailed t-test.
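As an illustration of the comparison described above, the following is a minimal sketch of an unpaired t statistic. It assumes the equal-variance (Student) form, which the text does not specify, and the function name and the scores are hypothetical placeholders rather than data from the included trials.

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t_statistic(group_a, group_b):
    """Return Student's unpaired t statistic (pooled variance) and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, df


# Hypothetical placeholder PRECIS-2 scores, NOT taken from the included trials.
significant = [3.4, 3.8, 3.9]
non_significant = [4.2, 4.4, 4.6, 4.5]
t_stat, df = unpaired_t_statistic(significant, non_significant)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The two-tailed p-value is then read from the t distribution with the returned degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind` performs both steps in one call.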

Assessment regarding risk of bias was performed for each trial using two measures:

  1. The Cochrane Risk of Bias 2 tool [10], with each study summarised as either low, medium or high risk of bias.

  2. The modified Delphi list [11], with a maximum score of 9 points. Only items assessed with a “yes” were given a score of 1 point. For the purposes of the study, scores 8–9 were considered high quality, scores 5–7 medium quality, scores 4–6 low quality and scores 1–3 very low quality.

Characterisation of individual article post-publication data was also performed. This included the study Altmetric attention score where available (www.altmetric.com/about-our-data/the-donut-and-score) and number of citations (https://scholar.google.co.uk/).

All statistical analyses were performed using R (R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria) and Microsoft Excel (Microsoft Corporation. (2018). Microsoft Excel, Washington, USA).

Results

Full details of the extracted information are located in supplementary tables 1–3. The summary results are displayed in Table 2. Overall, we identified 25 studies suitable for inclusion [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36]. Of these studies, 9 pertained to trauma, 3 to elective hip surgery, 7 to elective knee surgery, 4 to spinal surgery and 2 to elective shoulder surgery. A greater proportion of trials was identified in the latter half of the study period (2016–2020), 16/25 (64%), than in the early period, 9/25 (36%). The largest share of studies was led from the UK, 12/25 (48%), with 9/12 (75%) of these funded by the National Institute for Health Research. Patient recruitment was performed from a median of 9 centres (range 1–81). Blinding was present in approximately half of trials, 13/25 (52%); 8/25 (32%) were single (assessor) blinded, and 5/25 (20%) were double blinded (patient and assessor). Regarding outcome assessment, the vast majority, 22/25 (88%), utilised patient-reported outcome measures (PROMs) or functional scores as the primary outcome measure. Complication rate was utilised as the primary outcome measure in only 3/25 (12%) trials. No trials used mortality as the primary endpoint.

Table 2 Summary of results

Regarding specific journals, 7 articles were published in the BMJ, 7 in the NEJM, 6 in the JAMA, 5 in the Lancet and none in the Annals. The mean Altmetric attention score© was 242 (range 22–681), and the mean number of citations per article was 230 (range 22–743).

Study pragmatism

The PRECIS-2 (Pragmatic Explanatory Continuum Indicator Summary 2) tool [37] was utilised to assess the pragmatism of included studies. Scores for individual domains in each trial are displayed in supplementary table 4. Overall, a high degree of pragmatism was identified (mean aggregated score across all studies and domains 4.2/5). Studies with statistically significant results had a lower mean overall PRECIS-2 score than those with non-significant results (mean 3.71 vs 4.4, respectively; p < 0.001).

Sample size

All studies set the significance level at p < 0.05. For studies with power data available, 15/23 (65.2%) utilised a power value of 80%, 7/23 (30.4%) a power value of 90%, and 1/23 (4.4%) a power value of 81.5%. Twenty-three out of 25 studies (92%) reported use of the minimal clinically important difference (MCID) to perform their sample size calculation; however, only 14/25 studies (56%) had appropriate justification for the MCID or predicted effect size used, and only 15/23 (65.2%) reported the standard deviation of the outcome in the target population, which is required to perform an appropriate sample size calculation when using the MCID. Furthermore, only 14/25 studies (56%) achieved their target sample size for assessment of the primary outcome. Five out of 24 studies (20.8%) made amendments to the sample size calculation while the trial was ongoing.
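To illustrate why both the MCID and the standard deviation are needed, the following is a minimal sketch of the standard normal-approximation sample size formula for comparing two group means with equal allocation. It is not the method of any included trial, and the function name and input values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(mcid, sd, alpha=0.05, power=0.80):
    """Participants per arm needed to detect a mean difference of `mcid`.

    Normal-approximation formula for two equal-sized groups:
    n = 2 * (z_{1-alpha/2} + z_{power})^2 * (sd / mcid)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * (sd / mcid) ** 2)


# Hypothetical inputs: an MCID of 5 points on a PROM with a standard deviation of 10.
print(n_per_group(mcid=5, sd=10, power=0.80))  # 63 per group
print(n_per_group(mcid=5, sd=10, power=0.90))  # 85 per group
```

Because the required sample size scales with the square of sd / mcid, an unreported standard deviation or a poorly justified MCID leaves the calculation impossible to verify.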

Results

Seven out of 25 (28%) trials reported statistically significant results (p < 0.05) for the primary outcome. Seven out of 18 (38.9%) of the trials that reported non-significant results had an actual sample size for the primary outcome smaller than the predicted sample size, indicating a potential type II error. Only 3 trials reported dichotomous outcomes that allowed for assessment of the fragility index (FI) or reverse fragility index (RFI) and comparison of the predicted control group event rate with the actual control group event rate. Given that all three reported non-significant results, the RFI was utilised. For the HEALTH study [16], the RFI was 10, with loss to follow-up (LTFU) of 29 patients. For the FAITH study [29], the RFI was 8, with LTFU of 383 patients. For the WHIST study [21], the RFI was 7, with LTFU of 29 patients. The results of all three trials could therefore have been overturned depending on the outcomes of those lost to follow-up, as LTFU exceeded the RFI in each case. With regard to the comparison of the predicted control group event rate with the actual control group event rate, both the WHIST and HEALTH studies had differences > 50% (Table 2).
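For readers unfamiliar with the RFI, the following is a minimal sketch of one common formulation: the number of participants in the arm with the lower event proportion whose status must change from event to non-event before a two-sided Fisher exact test becomes significant. The function names and the example counts are hypothetical, and the exact procedure used in this study may differ.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row_1, row_2, col_1 = a + b, c + d, a + c
    total = comb(row_1 + row_2, col_1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row_1, x) * comb(row_2, col_1 - x) / total

    p_observed = prob(a)
    lowest, highest = max(0, col_1 - row_2), min(row_1, col_1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


def reverse_fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Event-to-non-event changes needed to reach p < alpha, or None if never reached."""
    if fisher_two_sided(events_1, n_1 - events_1, events_2, n_2 - events_2) < alpha:
        raise ValueError("result is already significant; the RFI does not apply")
    # Modify the arm with the lower event proportion, keeping arm sizes fixed.
    if events_1 / n_1 <= events_2 / n_2:
        low_e, low_n, high_e, high_n = events_1, n_1, events_2, n_2
    else:
        low_e, low_n, high_e, high_n = events_2, n_2, events_1, n_1
    changes = 0
    while low_e > 0:
        low_e -= 1
        changes += 1
        if fisher_two_sided(low_e, low_n - low_e, high_e, high_n - high_e) < alpha:
            return changes
    return None


# Hypothetical example: 5/10 events in each arm (not data from any included trial).
print(reverse_fragility_index(5, 10, 5, 10))  # 5
```

An RFI smaller than the number of participants lost to follow-up means the missing outcomes alone could have changed the conclusion, which is the comparison made above.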

Risk of bias

The summary results for the Cochrane risk of bias analysis, including the assessment for each included domain, are displayed in Table 3. Three out of 25 (12%) trials were adjudged to be at low risk of bias, 18/25 (72%) at some risk of bias, and 4/25 (16%) at high risk of bias. For three out of 4 (75%) of the studies judged to be at high risk of bias, this was due to deviations from the intended interventions. Sixteen out of 25 (64%) studies had at least some risk of bias in outcome measurement, attributable to the use of PROMs without patient blinding where the outcome may have been influenced by knowledge of the intervention received. A summary bar chart of the percentage of risk by category is displayed in Supplementary Fig. 1.

Table 3 Revised Cochrane risk-of-bias tool for randomised trials (RoB 2)

The summary results for the Delphi list assessment are displayed in Table 4. Five out of 25 (20%) trials were adjudged to be high quality, and 20/25 (80%) of trials were designated as medium quality. No trials were assessed to be of either low or very low quality. The main reason for designation of lower study quality was lack of blinding to patient, care provider or assessor.

Table 4 Delphi list assessment

Discussion

A key finding from this analysis was that only a very small proportion of trauma and orthopaedic RCTs published in high-impact general medical journals were judged to be of high quality or low risk of bias. Many published trials did not achieve their target sample size for assessment of the primary outcome, and several did not describe appropriate techniques to justify use of the MCID for the intended intervention. We identified a high degree of study pragmatism, with a lower likelihood of statistically significant results for more pragmatic designs.

Despite these concerns, these studies were highly cited in the literature, with evidence for widespread dissemination according to Altmetric attention scores©. Knowledge of deficiencies in the design and reporting of trauma and orthopaedic RCTs can assist in the planning of future trials to improve scientific rigour and ensure widespread clinical applicability.

Patient-reported outcome measures

The vast majority of studies utilised PROMs, with the MCID as the primary method of defining the delta within the sample size calculation. The MCID is an important concept; however, there are a few potential issues that need to be addressed when considering its use in this context. The first is that there is no universal definition of how best to evaluate the MCID. The MCID produced varies by the technique used and depends on the patient’s baseline status, as well as study context [38]. Age is perhaps the most prominent example of this, having previously been shown to influence baseline PROMs and response to surgical interventions for knee arthritis [39]. Use of PROMs with ceiling effects is also known to impart bias on outcomes in trauma and orthopaedic randomised clinical trials [40]. Given the identified widespread use of the MCID, it is imperative that future trials utilise appropriate techniques to ensure correct definition of the MCID in the population to be tested by the intervention.

It was notable that a substantial proportion of trials (44%) did not achieve their target sample size for calculation of the primary outcome at the prespecified time point. This suggests that current estimates regarding participant retention are often incorrect. Overall, the sample size eventually available for the primary outcome fell short of the target by a mean of approximately 5%, but this was as high as 28% in the studies by Frobell et al. [12] and Försth et al. [15], and above 10% in a number of others [14, 24, 25, 36]. Further careful consideration of factors that may influence ongoing involvement or crossover is required during sample size calculation to ensure sufficient recruitment. Guidelines for the conduct and reporting of RCT sample size calculations (DELTA 2 [41]) have previously been described and should be utilised to ensure a high probability of a study achieving its primary aim.
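The effect of a retention estimate on the recruitment target can be shown with a minimal sketch of the usual inflation step, in which the number required for analysis is divided by the expected proportion retained. The function name and the figures are hypothetical and are not drawn from the included trials.

```python
from math import ceil


def inflate_for_attrition(n_required, expected_attrition):
    """Recruitment target so that `n_required` participants remain after attrition."""
    if not 0 <= expected_attrition < 1:
        raise ValueError("expected_attrition must be in the range [0, 1)")
    return ceil(n_required / (1 - expected_attrition))


# Hypothetical example: 63 participants per group are needed for the analysis.
print(inflate_for_attrition(63, 0.15))  # 75 to recruit if 15% are expected to be lost
print(inflate_for_attrition(63, 0.28))  # 88 to recruit if 28% are expected to be lost
```

A trial planned around the lower attrition figure that then experiences the higher one would finish well short of its analysis target, which is the pattern described above.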

Sample size

Smaller-than-predicted sample sizes are a concern for those trials with negative results (38.9%), where there is an associated risk of type II error. It is therefore difficult to determine whether the results of these trials reflect absence of evidence or actual evidence of absence of effect. We also identified substantial differences between the predicted and actual event rates in the control group for included studies, which may additionally influence the ability to perform accurate outcome assessment. Future use of adaptive trial designs may help to eliminate some of these issues [42] and minimise research that does not achieve its desired intention [8].

Study pragmatism

Another notable finding from our results was the high degree of pragmatism (according to PRECIS-2) identified in the included trials and the fact that a greater degree of pragmatism in approach was associated with lower likelihood of a significant result. Other research has previously highlighted how questions over the routine use of pragmatic trials may have had a role to play in the lack of translation from some trials towards change in clinical practice [43] and issues with trial recruitment [4, 44]. Surgeon and patient preference have both been shown to be influential in recruitment and retention to trauma and orthopaedic pragmatic RCTs [45]. It is vital that equipoise within the surgical community is established prior to embarking on any pragmatic trial, as recruitment bias and crossover remain a major concern. Use of the readiness assessment for pragmatic trials (RAPT) model may provide one method of determining suitability of an intervention for testing in a pragmatic trial [46]. We could not find any evidence of this tool having been used in trauma and orthopaedic research to determine the suitability of previously conducted pragmatic trials.

Alternative approaches to trial design

Another option that requires further exploration regarding utility in the domain of clinical Trauma & Orthopaedic research is a Bayesian approach to trial design. This technique has been increasingly used in the wider trial community [47, 48] and may provide significant potential benefit in the heterogeneous populations seen across the breadth of Trauma and Orthopaedics [49].

Strengths and limitations

Strengths of the study include the in-depth assessment of study quality and design for gold-standard benchmarks regarding the current state of orthopaedic research. Potential limitations to our study include that it may be possible that higher-quality evidence and lower risk of bias are seen in studies contained within other journals, but this is contradictory to what has been previously reported [8]. Post hoc power calculations were not conducted due to known methodological issues with this approach [50].

Applicability

We provide a summary of the literature with note of areas for improvement, but it should be clear that these concerns are not applicable to all included studies, and the use of many of the methods discussed such as the use of PROMs, the MCID and pragmatic trials is supported when utilised appropriately.

Conclusions

The majority of trauma and orthopaedic RCTs published in high-impact major medical journals show evidence of significant knowledge dissemination, but there are some notable concerns related to study quality.

We suggest the following changes may assist in future publication of low-risk trials: International co-operation in the development and funding of large-scale multi-centre randomised trials, appropriate calculation of the relevant MCID for the study hypothesis with use of widely validated PROMs, measures to improve trial retention, blinding of participants to intervention allocation when utilising PROMs, prior assessment of community equipoise (with potential increased use of explanatory trials where appropriate), and potential use of Bayesian approaches to trial design.

Caution should be used in the interpretation of highly pragmatic trials as these appear less likely to be associated with statistically significant results, although the exact nature of this relationship is unclear.