The International Society of Medical Editors emphasises the importance of effective reporting in medical literature [1, 2]. However, previous studies have identified poor quality of reporting of study methodology in the orthopaedic literature [3, 4].

Since January 2003, all clinical scientific articles published in the American Volume of The Journal of Bone and Joint Surgery (JBJS-A) have included a level of evidence rating [5, 6]. The Levels of Evidence Rating System is a tool that classifies the quality and design of a study. Based on a review of several existing evidence rating systems [5, 6], JBJS-A has designed a scheme that uses five hierarchical levels for each of the four different study reporting types (therapeutic studies, prognostic studies, diagnostic studies, and economic and decision analyses). According to the Levels of Evidence Rating System hierarchy, randomised controlled trials (RCTs) occupy the top positions (Level I & Level II evidence) and expert opinion lies at the bottom (Level V evidence). Previous research has suggested that investigators with training in epidemiology can achieve nearly perfect agreement when applying the Levels of Evidence Rating System to a study [7]. This research suggests reliability; however, the system's validity remains debatable [7].

The Levels of Evidence Rating System causes readers to infer that Level I evidence RCTs are of better methodological quality than Level II evidence RCTs [8]. The Editorial Board Members of the JBJS-A reported that the Levels of Evidence Rating System would have important advantages such as enabling the journal "to monitor and to periodically report trends in the quality of orthopaedic clinical research" [5]. Furthermore, the editors wrote that "higher levels of evidence should be more convincing to surgeons attempting to resolve clinical dilemmas" [5].

The assessment of the true quality of published studies remains challenging [9–11]. One can judge the true study quality only if the reporting of the trial is done in a clear and comprehensive manner. For example, in some published articles within Internal Medicine literature, the authors failed to report important methodological safeguards that were in fact used during the conduct of the trial [12]. Therefore, high quality depends not only on the nature of the work, but also on the completeness of the reporting [2]. Most readers of medical literature will base their assessment of study quality solely on the information contained in the report of a trial, as they are unlikely to contact the author for additional information [12].

The most developed criteria for guiding clinicians in their assessment of study reporting quality have been proposed for RCTs, since RCTs are a study design that yields the lowest chance of bias [11, 13]. The Consolidated Standards for Reporting of Trials (CONSORT) statement was developed to help authors present their trial in a structured and complete manner. Assessors, on the other hand, use different tools to assess the quality of a trial. The Cochrane Collaboration, which is the largest database of systematic reviews (N = 4041, October 2005) and clinical trials (N = 454449, October 2005) in existence, has adopted one commonly utilized rating system to guide assessors in their assessment of study quality, as evaluated through the information contained in the report [9, 14].

Given the upcoming use of the Levels of Evidence Rating System in orthopaedic literature, we aimed to evaluate the reporting quality of RCTs published in the JBJS-A from 2003 to 2004 (Level I and Level II evidence ratings). We, therefore, extracted the level of evidence rating as published in each RCT and compared this rating with the well-established Cochrane Bone, Joint and Muscle Trauma Group's reporting quality assessment tool. We chose the JBJS-A because it was the most frequently cited general orthopaedic journal (ISI Web of Science), and the only journal that used this Levels of Evidence Rating System in the eligible time period.

Our hypotheses were twofold: 1) Level I evidence studies in a high impact general orthopaedic journal would not necessarily have high quality reporting and 2) the reporting quality of RCTs would not differ among trials labelled as Level I or Level II evidence.


Study design

We conducted a methodological study. We compared the level of evidence rating assigned to a series of RCTs against the Cochrane reporting quality score.

Eligibility criteria

Two assessors (RWP, MB) identified orthopaedic journals that reported a level of evidence rating in their abstracts from January 2003 to December 2004 by searching the instructions for authors of the highest impact general orthopaedic journals (JBJS-A, JBJS-British Volume, Clinical Orthopaedics and Related Research, and Acta Orthopaedica). Within the eligible journal, two assessors (RWP, RK) hand searched all issues from 2003–2004. The eligibility criteria were determined and set a priori. Eligible studies included those reported as RCTs involving a therapeutic intervention and using human subjects. We conducted searches in duplicate, and the consensus of three authors (RWP, RK, MB) resolved any disagreements.

Study demographic information

The relevant demographic information was extracted from each eligible study by one investigator (RWP) and rechecked for accuracy by a second investigator (PAAS). The extracted data included (1) first author (surgeon, non-surgeon, or epidemiologist), (2) cited statistical support or methodological support by a department of clinical epidemiology or public health, (3) year of publication, (4) total sample size, (5) number of centres, (6) name of intervention, (7) category of intervention (fracture treatment, treatment of degenerative disease of the spine and joints, drug trial, pain management, or other), (8) body region (upper extremity, long bones of lower extremity, spine, hip and knee, or foot and ankle, DVT, or other), (9) financial support (yes or no), (10) direction of results (positive [if the findings of the randomised trial were significant] or negative [if they were not significant]), and (11) trial reported according to the CONSORT statement (yes or no).

Levels of evidence

One of the authors (RWP) extracted the level of evidence from each abstract of the included RCTs. A second author (INS) double-checked the evidence rating to ensure that it was correctly extracted from the paper.

Quality of reporting assessment

Two authors (RWP, PAAS), blinded to study author and institution, graded the reporting quality of the included RCTs using the Cochrane reporting quality assessment tool, which was devised by the Cochrane Bone, Joint and Muscle Trauma Group, formerly known as the Musculoskeletal Injuries Group. This scoring scheme covers aspects of internal and external validity for the assessment of methodological quality [15]. We used this reporting quality assessment tool as our reference standard due to its widespread use [15] and association with the methodologically rigorous Cochrane reviews of RCTs [9, 16, 17]. The tool consists of twelve items important for the critical appraisal of an RCT report. A coding manual was available from the group's website [15]. The highest possible score for each item was 2 and the lowest was 0. Additional file 1 contains the scoring system that we used to identify the important aspects of reporting methodological quality [see Additional file 1]. We followed the recommendation found in the Cochrane Handbook, which stated that at least two authors should assess information that involves subjective interpretation and information that is critical to the interpretation of results (e.g., outcome data) [18].

Studies that randomly allocated patients (Item D), concealed randomisation (Item A), blinded participants (Items C, E, F) and documented study withdrawals (Item B) were reported to reflect higher quality [19, 20]. We scored all reported methodological safeguards separately for all identified RCTs. Different quality aspects can be weighted differently and thresholds are arbitrary [10]; therefore, we did not summarize the scores in totals, but reported the raw data.

Ensuring the accuracy of the quality rating

We used Intraclass Correlation Coefficients (ICC) to measure the agreement between the assessors' assessments of study reporting quality. We used Landis and Koch's suggested criteria for the interpretation of the agreement: 0 to 0.2 represented slight agreement, 0.21 to 0.40 fair agreement, 0.41 to 0.60 moderate agreement, and 0.61 to 0.80 substantial agreement. A value above 0.80 was considered almost perfect agreement [21]. Regardless, if two assessors disagreed even slightly, consensus was attempted after carefully reading the article a second time in a consensus meeting. In situations where discrepancies persisted despite a consensus meeting, a third assessor was asked for an opinion on the specific item to reach final consensus. This method of quality assessment with a final consensus meeting has been commonly used in Cochrane reviews. All assessors (RWP, PAAS, MB, and RK) were well trained in quality assessments, were clinically active in orthopaedic surgery, had completed a Cochrane Review course, and had co-authored Cochrane systematic reviews of RCTs.
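
The agreement step can be illustrated with a minimal sketch. The text does not state which form of the ICC was calculated, so the one-way random-effects form used here is an assumption, and the function names and example ratings are hypothetical. Only the agreement bands are taken from the text.

```python
# Illustrative sketch only. The ICC form (one-way random effects, single rater)
# is an ASSUMPTION; the source does not specify which form was used.

def icc_oneway(ratings):
    """ratings: one row per trial; each row holds every assessor's item score (0, 1 or 2)."""
    n = len(ratings)       # number of rated trials
    k = len(ratings[0])    # number of assessors per trial
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2 for row, m in zip(ratings, row_means) for x in row)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    # Raises ZeroDivisionError if every score in the table is identical.
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def interpret_agreement(icc):
    """Agreement bands as quoted in the text (Landis and Koch)."""
    if icc < 0:
        return "below the quoted bands"  # negative values are not covered by the text
    if icc <= 0.20:
        return "slight"
    if icc <= 0.40:
        return "fair"
    if icc <= 0.60:
        return "moderate"
    if icc <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical scores from two assessors for one item across five trials.
hypothetical_ratings = [[2, 2], [1, 1], [0, 1], [2, 2], [0, 0]]
value = icc_oneway(hypothetical_ratings)
print(round(value, 2), interpret_agreement(value))
```

Under these bands a value of exactly 0.80 falls in the substantial category, and only values above 0.80 count as almost perfect agreement.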

Statistical analysis

Data were analysed using the SPSS statistical software package (version 11.2; SPSS, Chicago, Illinois). We summarized all individual Cochrane reporting quality items with mean scores, which we then compared with Student's t-tests. We compared more than two means with single factor analysis of variance adjusted for post-hoc comparison testing. We then compared the total scores (0–2) for each item in the Cochrane reporting quality assessment tool with the level of evidence rating as published in JBJS-A. Prior to the analysis, we identified Cochrane Items A, C, E, F, and L to be most similar to the description of the levels of evidence. In a subgroup analysis, we compared the levels of evidence as described in the instructions for authors with the Cochrane reporting quality items that were deemed similar (Table 1). We used Spearman's correlation (non-parametric test, non-normally distributed data) to calculate the correlation between the JBJS-A level of evidence rating and the total Cochrane reporting quality score, and the correlation between the JBJS-A level of evidence rating and Items A, C, E, F, and L of the Cochrane reporting quality score. For correlations, we categorized the levels of evidence from 1 to 4 (1 = level 1A, 2 = level 1B, 3 = level 2-1, 4 = level 2-2) with 1 representing the highest level. We used p < 0.05 to represent statistical significance. All tests of significance were two-tailed.
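
The correlation step can be sketched as follows. The analysis itself was run in SPSS; this pure-Python version only illustrates the ordinal coding of the levels and a rank correlation with tied values. The function names and example data are hypothetical, and no p-value is calculated.

```python
import math

# Ordinal coding of the levels of evidence as described in the text (1 = highest).
LEVEL_CODE = {"1A": 1, "1B": 2, "2-1": 3, "2-2": 4}


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the tie-corrected ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# HYPOTHETICAL example data, not taken from the study.
levels = ["1A", "1A", "1B", "2-1", "2-2"]
item_scores = [2, 1, 2, 0, 1]
rho = spearman([LEVEL_CODE[lv] for lv in levels], item_scores)
print(round(rho, 2))
```

Because 1 codes the highest level, a negative rho in this sketch would mean that higher levels of evidence go with higher item scores.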

Table 1 Cochrane Items Closely Related to the Levels of Evidence

Sample size

Our study sample size included all RCTs published in the JBJS-A from January 2003 to December 2004. We required at least 30 eligible RCTs to provide sufficient correlation data on the level of evidence ratings and the Cochrane reporting quality scores (alpha = 0.05, beta = 0.20, null rho = 0.2, rho = 0.7).


Study demographic information

Of the four high impact orthopaedic journals, only JBJS-A used the level of evidence rating from 2003 to 2004. We identified 938 publications in the JBJS-A from January 2003 to December 2004. Of these publications, 32 (3.4%) were RCTs that fit the eligibility criteria. Thirty (94%) of the first authors were surgeons and 2 (6%) were non-surgeons. In 5 (16%) of the RCTs, at least one author had cited training in biostatistics (MSc or PhD) or was affiliated with a department of statistics, public health, or clinical epidemiology. The 32 RCTs included a total of 3543 patients, with sample sizes ranging from 17 to 514 patients. Six (19%) of the studies were performed in two or more centres, 11 (34%) focused on interventions related to the treatment of degenerative joint disease, 7 (22%) focused on fractures, and the remainder involved problems affecting the upper extremity [5 (16%)], the foot and ankle [6 (19%)], and the knee [9 (28%)]. Four (13%) RCTs were reported according to the CONSORT statement (Table 2). References to the included studies can be found in Additional file 2 [see Additional file 2].

Table 2 Characteristics of the Thirty-two Trials

Levels of evidence

Of the 32 included RCTs, 29 were reported as Level I studies and 3 were reported as Level II studies. Level I studies were further subgrouped into 22 Level 1A and 7 Level 1B (RCT-no significant difference, but narrow confidence intervals) studies. Level II studies were also subgrouped into 1 Level II-1 and 2 Level II-2 studies as extracted from the included papers' abstracts.

Limitations in quality of reporting (Hypothesis 1)

Only 12 (38%) of the 32 included RCTs clearly described allocation concealment (Item A). Seven (22%) clearly described an intention to treat analysis (Item B). Thirteen (41%) clearly described the blinding of outcome assessors (Item C). Twenty-three (72%) clearly described the comparability of the treatment and control group at entry (Item D). Six (19%) of the 32 RCTs clearly described the blinding of participants (Item E). Only 2 (6%) of the studies clearly described the blinding of treatment providers (Item F). Seventeen (53%) clearly described identical care programmes other than the trial options (Item G). Of the 32 RCTs, 25 (78%) adequately described the inclusion and exclusion criteria (Item H). Of all the items, Items I and J were described best: 31 (97%) clearly described the interventions and 31 (97%) clearly described the outcome measures used. Twenty-two (69%) clearly described a useful diagnostic test in the outcome assessment (Item K). Only 10 (31%) described an appropriate duration of follow-up (Item L). Table 3 shows all data for each RCT.
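
The per-item proportions above can be recomputed from the reported counts with a short sketch. The counts and the denominator of 32 come from the text. The variable and function names, and the round-half-up convention, are assumptions made for illustration.

```python
# Number of the 32 RCTs that clearly described each Cochrane item (counts from the text).
ITEM_COUNTS = {
    "A": 12, "B": 7, "C": 13, "D": 23, "E": 6, "F": 2,
    "G": 17, "H": 25, "I": 31, "J": 31, "K": 22, "L": 10,
}
TOTAL_RCTS = 32


def percent(count, total):
    """Whole-number percentage, rounding halves up (an assumed convention)."""
    return int(100 * count / total + 0.5)


# Each item is reported on its own, in line with the decision not to sum scores into totals.
for item, count in ITEM_COUNTS.items():
    print(f"Item {item}: {count}/{TOTAL_RCTS} ({percent(count, TOTAL_RCTS)}%)")
```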

Table 3 Cochrane Bone, Joint and Muscle Injury Group scores for all 32 RCTs.

Among items closely corresponding to the Levels of Evidence Rating System criteria (Items A, C, E, F, and L), assessors achieved substantial agreement (ICC = 0.80, 95%CI:0.60 to 0.90). Across each of the 12 items, however, agreement varied (Range of ICC = 0 to 0.80). In all cases, assessors achieved consensus, either alone or with a third, intervening reviewer.

Correlation between Cochrane reporting quality scores and reported levels of evidence (Hypothesis 2)

We compared the mean score in each item of the Cochrane reporting quality assessment tool separately (Items A through L) with each level of evidence (Table 4). Mean quality scores did not significantly differ across the 12 separate items of the Cochrane reporting quality assessment tool (Table 4). Correlations varied from 0.0 to 0.2 across the 12 items of the Cochrane reporting quality assessment tool (Table 4).

Table 4 Mean and median Cochrane score for all items compared with Levels of Evidence


Summary of key study findings

The results of our methodological study demonstrated two key findings: 1) Level I evidence studies revealed important limitations in their quality of reporting and 2) there was no significant difference in the quality of reporting between studies labelled as Level I or Level II evidence.

Strengths and weaknesses

Our study is strengthened by the use of a well-described and commonly used quality assessment tool from the Cochrane Collaboration that identifies the relevant methodological aspects of trials as reported and assesses these aspects individually. Furthermore, all assessors (RWP, PAAS, MB, RK) were well trained in quality assessments. Our decision to conduct assessments in duplicate (and triplicate when assessors disagreed) further strengthened the rigor of our assessments [18]. The paucity of Level II studies in our series limited inferences about the correlation data with level of evidence ratings. Our comparison of mean overall scores between Level I and Level II studies, which showed no significant difference, was likely underpowered. The sample size calculation was difficult since clinicians have made arguments against calculating totals in quality scores (see discussion below). However, to identify a difference in quality scores of 3.5 points, we required at least 12 Level II studies (80% study power, alpha = 0.05). The more relevant comparison of the abridged quality scores that reflect the level of evidence criteria suggested that we would require at least 22 Level II studies. Given that only 3 Level II therapy studies were published over the two-year period, it may require a decade to gain this additional information from the JBJS-A unless the Levels of Evidence Rating System is widely adopted by multiple orthopaedic journals. Therefore, our findings represent the current best estimate of association until more studies become available for comparison. Our study does, however, have a sufficient number of RCTs to observe variation in the study reporting quality scores. Since 2005, the JBJS-A has abandoned the use of Level I and II subgroups; therefore, the relevance of analysing differences between Level Ia and Ib studies is limited. Our study described RCTs in one journal dedicated to one surgical field. Although this journal's scope is general orthopaedics, our findings are not generalisable to other surgical fields and journals.

Previous literature

A previous review of published studies in The Journal of Bone and Joint Surgery 1988 through 2000 revealed a similar proportion (3%) of randomised trials compared with our current study (3.4%) [4]. The Cochrane Bone, Joint and Muscle Trauma Group's reporting quality assessment tool describes the following aspects of quality assessment which have previously been shown to be important in preventing bias [9]: allocation concealment, blinding, generation of allocation sequence, similarity of groups at baseline, description of outcomes, intention to treat analysis, and losses to follow-up. Currently, no consensus on the ideal checklist and scale for assessing methodological quality exists [9]. The number and variety of quality assessment scales that exist make it unclear as to how to achieve the best assessment [10, 11]. The Levels of Evidence Rating System used by the JBJS-A can be qualified as one of these quality assessment scales. Summary scores (totals) should not be calculated, although it may be tempting to do so. The use of thresholds skews the direction of results and may lead to false conclusions in a meta-analysis [10]. Furthermore, Juni et al. discouraged the use of individual scales as absolute and objective measures of trial quality and noted "relevant methodological aspects should be identified, ideally a priori, and assessed individually" [10, 18]. For example, the same criteria for blind assessment cannot be applied to drug and surgical trials, since, in the latter group, treatments are usually more difficult to conceal. Ideally, scales that are used to measure the quality of reporting of surgical trials should be tailored to the maximal possible quality, rather than to a unique gold-standard quality [10]. Therefore, the Cochrane Collaboration's handbook advises describing aspects of critical appraisal separately and avoiding summary results [18]. Our findings confirm the variability of scores across each item of the Cochrane reporting quality assessment tool.

Relevance of our findings

Despite the widely held belief that the Levels of Evidence Rating System categorizes studies by quality [5, 6], our study suggests that this system, while reliable [7], may not be a valid tool for determining the quality of a study, as determined through the study reporting. As with any system, whether it is the Levels of Evidence Rating or the Cochrane reporting quality tool, the quality of study reporting is critical. The CONSORT statement was developed to help authors improve the reporting quality of RCTs [22]. In principle, this standardized scheme would explicitly require reporting of all features critical to the validity of an RCT and would require the presentation of results in a standard manner to improve clarity [4]. Use of the CONSORT statement is associated with improvements in the reporting quality of RCTs [22]. However, the reporting quality of RCTs in fracture care did not improve following the introduction of the CONSORT statement because many authors have not adopted the statement to guide their reporting [3]. Our findings further identify a lack of incorporation of the CONSORT statement in orthopaedic trials; only four studies (13%) were adequately reported with CONSORT guidelines. Journal editorial boards and assessors must continue to enforce high quality reporting of RCTs to allow an accurate assessment of the level of evidence and other study reporting quality measures.

Implications for future research

This study was underpowered to explore the influence of reported statistical support, adherence to CONSORT guidelines, multi-centre studies, and sources of funding on the quality of reporting, direction of results, and magnitude of treatment effect size. Future studies are needed to explore any associations.


Our findings suggest that readers should not assume that 1) studies labelled as Level I have high quality of reporting and 2) Level I studies have better reporting quality than Level II studies. Methodological safeguards should be addressed individually.