Background

There is emerging debate about adherence to the standards and recommendations of regulatory bodies in the design, statistical analysis, description, and reporting of randomized clinical trials (RCTs) intended to show noninferiority or equivalence [1]. The recent extension of the Consolidated Standards of Reporting Trials (CONSORT) statement to noninferiority and equivalence trials [1], the U.S. Food and Drug Administration (FDA) guidelines [2], and statements from other regulatory authorities have updated the recommendations guiding the conduct and reporting of RCTs investigating noninferiority or equivalence [3].

Noninferiority and equivalence studies aim to demonstrate that a new experimental treatment is not worse than the active control intervention by more than a pre-specified amount (Δ), the noninferiority or equivalence margin. A large number of studies now intend to demonstrate noninferiority or equivalence, yet only very few have been designed, described, and reported as such [4]. In designs intending to show noninferiority (i.e., that a new treatment is at least as good as, or no worse than, an existing treatment) or equivalence (i.e., that a new treatment has equal or comparable efficacy), one area of debate has been the application of the intent-to-treat (ITT) principle versus per-protocol (PP) analysis [5]. There has been an expectation that ITT analyses are anticonservative in this setting, tending to dilute the observed treatment difference in noninferiority trials and thereby favor a conclusion of noninferiority; this concern has increased reliance on per-protocol analysis. The International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) E9 guideline on 'statistical principles for clinical trials' unequivocally states that the ITT analysis in equivalence and noninferiority trials is 'generally not conservative and its role should be considered very carefully' [6]. However, the Committee for Proprietary Medicinal Products (CPMP) considers both analyses to be equally important [7], and simulation studies have not demonstrated that ITT is anticonservative relative to PP [8]. The extension of the CONSORT statement suggests conducting and reporting PP analyses alongside ITT as a means of increasing confidence in study findings, particularly when both methods produce similar results.

To determine the methodological quality and reporting standards of trials designed to show noninferiority or therapeutic equivalence, we conducted a systematic survey of the literature to identify noninferiority and equivalence randomized clinical trials involving glaucoma drugs of prostaglandin origin. We applied the extended CONSORT statement for noninferiority and equivalence trials as the methodological criteria for minimum reporting standards [1].

Methods

Eligibility Criteria

Noninferiority and equivalence randomized clinical trials that involved the application of any of the four major prostaglandin analogues (latanoprost, travoprost, bimatoprost, and unoprostone) in the treatment of patients with primary open angle glaucoma (POAG) or ocular hypertension (OH) were eligible for the study. We defined noninferiority trials as 1-sided trials aiming to demonstrate effectiveness no more than Δ below that of the control, whereas equivalence studies were defined as 2-sided trials aiming to demonstrate that the difference between treatments lies between the -Δ and +Δ margins of equivalence; the confidence-interval decision rules implied by these definitions are sketched below. We further determined a trial to be eligible for inclusion if 'equivalence' or 'noninferiority' was mentioned as the intent anywhere in the methods section of the manuscript. We excluded bioequivalence (pharmacokinetic and pharmacodynamic) trials and trials designed to show superiority.
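For illustration only, the following minimal Python sketch encodes the confidence-interval decision rules that correspond to these definitions; it is not part of the review methods, and the function name, arguments, and orientation of the difference are our own assumptions.

```python
def classify_margin_result(ci_lower, ci_upper, delta):
    """Illustrative CI-based decision rules for noninferiority and equivalence.

    ci_lower, ci_upper: two-sided confidence interval for the treatment
        difference (experimental minus control), oriented so that larger
        values favor the experimental arm.
    delta: the pre-specified positive margin (the Δ defined above).
    """
    noninferior = ci_lower > -delta                      # 1-sided criterion
    equivalent = ci_lower > -delta and ci_upper < delta  # 2-sided criterion
    return {"noninferior": noninferior, "equivalent": equivalent}

# Example: an estimated difference of -0.3 with 95% CI (-0.9, 0.3) and delta = 1.5
print(classify_margin_result(-0.9, 0.3, 1.5))
# {'noninferior': True, 'equivalent': True}
```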

Search Strategy

Working independently and in duplicate, two reviewers (OE, EM) searched the following 6 databases from inception to March 2008: MEDLINE via PubMed, CINAHL, AMED, Toxnet, Cochrane CENTRAL, and E-Psyche. Our MEDLINE search strategy identified the exact MeSH terms for the following expressions: open angle glaucoma, ocular hypertension, and prostaglandin* (latanoprost, bimatoprost, travoprost, and unoprostone). We conducted 3 main searches. First, we collated all articles with the MeSH key terms "glaucoma, open angle OR ocular hypertension" [see Additional file 1]. We then extracted articles containing the MeSH terms "prostaglandin* OR latanoprost OR travoprost OR bimatoprost OR unoprostone." We then combined these two searches to produce a third list, representing all articles that had as their key terms "POAG OR OH OR prostaglandin* OR latanoprost OR travoprost OR bimatoprost OR unoprostone." We searched CINAHL, AMED, Toxnet, Cochrane CENTRAL, and E-Psyche using key search terms such as latanoprost, bimatoprost, travoprost, unoprostone, open angle glaucoma, ocular hypertension, and prostaglandin*; simultaneous review of these databases was possible through the Ovid interface, which permits searching multiple databases at once. Where full text was unavailable, we searched Google Scholar and used interlibrary loans to order full-text articles. We supplemented this search by reviewing the bibliographies of key papers.

Study selection

Working independently and in duplicate, OE and EM reviewed all abstracts and, where available, full articles identified by our searches. At this stage we excluded articles that were not randomized trials or that were reviews and/or comments. We sought the full text of all identified RCTs that examined treatment with prostaglandin drugs. Eligibility was determined by consensus. Only comparisons containing at least one study drug of interest (i.e., bimatoprost, latanoprost, travoprost, or unoprostone) were eligible for inclusion. When comparisons involved several different dosages (e.g., 0.005% or 0.03%), we included only those that met FDA-approved standards. Where we were unsure whether the intent was noninferiority/equivalence rather than superiority, we contacted one author via email.

Data Collection

We collected information about the type of interventions tested, study design, and purpose. We also obtained data on sample size estimation with margins, the type of statistical analysis conducted, the population analyzed (ITT and/or PP), efficacy summaries, and the use of hyperemia measures. Our evaluation included an assessment of the following general methodological quality features: allocation concealment, sequence generation, masking status, and the handling of losses to follow-up. We also recorded sample size (recruited, randomized, and completed) and the use of a combined test of noninferiority and superiority. Finally, we noted whether there was a clear objective/hypothesis relating to noninferiority or equivalence, a tabular presentation of baseline data (demographic and clinical), and a predetermined noninferiority/equivalence margin.

Data Analysis

To assess inter-rater reliability on inclusion of articles, we calculated the Phi statistic, which provides a measure of inter-observer agreement independent of chance [9]. To determine the proportion of trials reporting each specific item, with appropriate confidence intervals, we first stabilized the variances of the raw proportions (r/n) using a Freeman-Tukey type arcsine square root transformation, y = arcsine(√(r/(n + 1))) + arcsine(√((r + 1)/(n + 1))), with a variance of 1/(n + 1), where n is the overall denominator size [10]. We used StatsDirect (version 2.5.2, http://www.statsdirect.com) for all calculations.
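For readers who wish to reproduce the interval estimates, a minimal Python sketch of this transformation is given below. The analyses themselves were run in StatsDirect; the simple sin²(t/2) back-transformation used here is an approximation that may differ slightly from StatsDirect's inversion, and the function name is our own.

```python
import math

def freeman_tukey_ci(r, n, z=1.96):
    """Approximate CI (95% by default) for a proportion r/n via the
    Freeman-Tukey double arcsine transformation described in the Methods:
    y = arcsin(sqrt(r/(n+1))) + arcsin(sqrt((r+1)/(n+1))), var(y) = 1/(n+1).
    """
    y = math.asin(math.sqrt(r / (n + 1))) + math.asin(math.sqrt((r + 1) / (n + 1)))
    half_width = z * math.sqrt(1 / (n + 1))

    def back(t):
        # Simple approximate inversion of the double arcsine transform,
        # clamped to the transform's valid range [0, pi].
        t = max(0.0, min(t, math.pi))
        return math.sin(t / 2) ** 2

    return back(y - half_width), back(y + half_width)

# Example: 22 of 47 trials reported sequence generation.
print(freeman_tukey_ci(22, 47))  # about (0.33, 0.61), cf. the reported 33-61%
```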

Results

Results of literature search

Our first and second searches produced 35,207 and 97,523 abstracts, respectively. The third and final search (searches #1 AND #2 combined) produced 1144 abstracts. After thorough assessment, 215 abstracts were excluded as review articles and another 549 as not relevant to the present study. Overall, 380 full-text papers were retrieved for possible inclusion. Upon careful review of these 380 articles, we included 47 in our analysis (Phi = 0.88). Figure 1 presents details of the exclusions at the various stages of the study selection process.

Figure 1

Flow diagram of included studies

Trial interventions

Table 1 outlines the characteristics of all 47 included publications. Of these, 31 reported conducting comparisons of treatments (e.g., travoprost vs. latanoprost) that were duplicated in at least one other included study. More specifically, five trials compared bimatoprost to latanoprost [11–15]. Four trials compared bimatoprost and timolol [16–19]. Three trials compared bimatoprost to a fixed combination (FC) treatment that contained timolol [20–22].

Table 1 Characteristics of included publications

Six trials compared latanoprost to timolol [13, 23–27]. Three studies compared latanoprost to brimonidine (or a brimonidine-containing fixed combination) [28–30]. Two studies compared FC latanoprost/timolol vs FC dorzolamide/timolol [31, 32]. Two studies compared FC latanoprost/timolol versus its individual components (i.e., latanoprost and timolol) [33, 34]. Two studies compared FC latanoprost/timolol versus latanoprost and timolol administered concomitantly [35, 36].

Four studies compared travoprost and latanoprost [12, 13, 23, 37]. Two studies compared travoprost and timolol [23, 38]. Two studies compared travoprost or FC-containing travoprost versus latanoprost (or FC-containing latanoprost) [39, 40], and three studies compared travoprost or FC-containing travoprost versus timolol (or FC-containing timolol) [39–41]. Finally, 16 publications contained other comparisons of treatment types that were not duplicated in any other included study [42–57].

Characteristics of trials

Table 2 shows the characteristics of the trials as well as the quality of reporting. The average sample size was 311 (SD 33), although on average only 86% of patients were included in the final analyses. Seventeen of the 47 included trials (36%, 95% CI: 24–51) were crossover designs.

Table 2 Methodological quality and reporting

Of the 47 trials, 36 (77%, 95% CI: 63–86) were noninferiority trials and 11 (23%, 95% CI: 14–37) were designed for equivalence. In their final interpretations, 16/47 (34%, 95% CI: 22–48) trials claimed noninferiority, 14/47 (30%, 95% CI: 19–44) claimed equivalence, and 17/47 (36%, 95% CI: 24–51) claimed superiority. Of the trials claiming noninferiority, 33/36 (92%, 95% CI: 78–92) were accurate within their noninferiority margin. Of those claiming equivalence, 10/11 (90%, 95% CI: 62–98) were accurate within their a priori margins. Seventeen percent of trials (8/47, 95% CI: 9–30) employed a combined test of noninferiority and superiority.

Sequence generation was reported in 22/47 trials (47%, 95% CI: 33–61). Allocation concealment was reported in only 10/47 (21%, 95% CI: 12–35) of the trials. Thirty-five studies (74%, 95% CI: 60–85) employed masking of at least two groups, 4/47 (9%, 95% CI: 3–20) masked only patients, and 8/47 (17%, 95% CI: 9–30) were open-label studies.

Only 6 of the 47 (13%, 95% CI: 6–25) studies specified that the trial was a noninferiority or equivalence trial in the title or abstract. The background or introduction section of 5/47 (11%, 95% CI: 5–23) articles contained a rationale for using a noninferiority or equivalence trial design. Thirteen articles (28%, 95% CI: 17–42) had a clear objective or hypothesis pertaining to noninferiority or equivalence. Thirty-four trials (72%, 95% CI: 58–83) properly described the method applied in the sample size determination and set appropriate boundaries for noninferiority or equivalence. The pre-stated noninferiority/equivalence margins across all trials lay between -1 and 2.5. The primary outcome measure was mean intraocular pressure, mostly reported as a mean difference or as a mean reduction from baseline.

Only 3/47 (6%, 95% CI: 2–17) studies reported and presented results for both ITT and PP populations. Two studies (4%, 95% CI: 1–14) presented only PP results but mentioned that the ITT population had similar results. Twelve studies (26%, 95% CI: 15–39) presented only ITT results but mentioned that the PP population had similar results. Thirteen trials (28%, 95% CI: 17–42) presented only PP results with no mention of ITT results, while 17/47 (36%, 95% CI: 24–51) studies presented only ITT results with no mention of PP results.

Of the 47 trials reported in this study, only one (2%, 95% CI: 0.5–11) presented its results diagrammatically, using a figure showing the confidence intervals and pre-specified margins of equivalence or noninferiority. Handling of losses to follow-up was not addressed in 33 of the 47 trials (70%, 95% CI: 56–81). Losses to follow-up were mentioned but not addressed in 10 trials (21%, 95% CI: 12–35), were addressed in the statistical analysis in 3/47 (6%, 95% CI: 2–17), and 1/47 (2%, 95% CI: 0–11) reported no losses to follow-up.

Measurement of hyperemia

We found substantial heterogeneity in how hyperemia was measured. Two of the 47 (4%, 95% CI: 1–14) included studies used a severity scale of non-serious, mild, moderate, and serious. Six studies (13%, 95% CI: 6–25) used a severity scale of none, mild, moderate, and serious. Five (11%, 95% CI: 5–23) used a scale that recorded only mild, moderate, and severe. One (2%, 95% CI: 0.5–11) used a scale that included only non-serious and serious. One (2%, 95% CI: 0.5–11) used a four-point scale from 1–4, stating only the minimum and maximum as the scale endpoints. Six (13%, 95% CI: 6–25) studies used a 5-point scale with 0 = none, 0.5 = trace, 1 = mild, 2 = moderate, and 3 = severe. A further 26/47 (55%, 95% CI: 41–69) did not report their method of measuring hyperemia, of which 5/26 (19%, 95% CI: 9–38) did not report on hyperemia outcomes at all.

Discussion

To our knowledge, this is the first study to examine the methodological quality and reporting of ophthalmology RCTs designed to show noninferiority or equivalence for glaucoma treatments. Our findings reinforce the gaps that exist in the methodological quality and reporting of these trials in general. We additionally found substantial heterogeneity within ophthalmology trials in the measurement of clinical outcomes such as hyperemia. Further efforts to improve the reporting and analysis of clinical trials in this field are clearly needed. Recent work by Hopewell and colleagues [58] highlighted the need for journals to endorse the CONSORT statement and its extensions, and to make their application a condition of publication. Journals in the ophthalmology field should consider such an endorsement, as well as promoting strict adherence to the extended CONSORT statement, in order to improve manuscript reporting.

There are several strengths and limitations to consider in interpreting our study. Strengths include our extensive literature search, designed to identify a representative sample of a field that commonly uses noninferiority and equivalence trials, and duplicate data abstraction, which aimed to reduce investigator-driven bias. Limitations include our sole reliance on what was reported: we did not ask individual trial investigators about missing items. As outlined in the methods, study authors were contacted only when we were unclear whether a trial used a noninferiority or equivalence design. It is possible that some of the specific methodological items were conducted but not reported; indeed, this issue has been highlighted by other investigators [59]. However, our study aimed to determine reporting quality in order to make specific recommendations on transparent reporting. We did not assess changes pre- versus post-publication of the extended CONSORT statement in 2006; previous research has found that the quality of reports improved after publication of the original 1996 CONSORT statement [60]. Per-protocol analyses have traditionally been preferred over ITT approaches despite the current consensus that both should be presented [1, 4, 7]. We were unable to determine whether one is superior to the other, as we found too few examples in which both were reported.

The findings of our study are consistent with other studies investigating reporting quality in general medicine and specialist journals [61]. General medicine journals tend to report important methodological recommendations more frequently than specialist journals, which may reflect their active encouragement of the CONSORT recommendations through editorials and letters [62]. Providing editorials or guidance in the instructions to authors is one option for improving the reporting of these RCTs. Another option, already required by many journals, is a checklist submitted with the trial manuscript indicating where in the manuscript specific details are reported.

An important methodological issue raised by this review is the use of a combined test of noninferiority and superiority. The purpose of conducting noninferiority and equivalence trials is not to evaluate superiority, as that would require far more power than would be expected in this type of trial and would maintain the need for ITT analysis [63]. In our study, 8 trials employed the combined test of noninferiority and superiority. Such a test allows a trial to fall back on its original claim of noninferiority if superiority is not identified, but to claim superiority if the prespecified noninferiority threshold is surpassed [64].
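As a purely illustrative sketch, not drawn from any included trial, the hierarchy of such a combined test can be written as follows; the function name and the orientation of the treatment difference are our own assumptions.

```python
def combined_ni_superiority(ci_lower, delta):
    """Illustrative hierarchy for a combined noninferiority/superiority test.

    ci_lower: lower bound of the CI for the treatment difference
        (experimental minus control), oriented so that larger values
        favor the experimental arm.
    delta: pre-specified positive noninferiority margin.
    Testing superiority only after noninferiority succeeds is a closed
    (fixed-sequence) procedure, so no multiplicity adjustment is needed.
    """
    if ci_lower <= -delta:
        return "noninferiority not demonstrated"
    if ci_lower > 0:
        return "noninferior and superior"
    return "noninferior, but not superior"

print(combined_ni_superiority(ci_lower=0.2, delta=1.5))  # noninferior and superior
```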

Given the competitive market for prostaglandin analogues, we found that several (n = 17) noninferiority and equivalence trials claimed superiority of the test intervention over competitors' products, regardless of overall findings and the study intent. For example, in a randomized trial comparing travoprost with latanoprost in 408 patients, with daily intraocular pressure measured at 9 am, 11 am, and 4 pm, the authors found noninferiority but reported statistical superiority at the 9 am measurement [40]. Given the lack of ITT analysis, the multiple testing, and the initial intention of a noninferiority analysis, such findings provide misleading evaluations of superiority. Indeed, this phenomenon exists across a range of medical fields [65–70].

Pre-specifying the noninferiority or equivalence margin is necessary to provide strong inferences about the suitability of one treatment compared to another. If the margin employed is too wide, there is a risk of a type 1 error (wrongly accepting an inferior treatment). Similarly, although less likely, if the margin is set too narrow, a type 2 error is possible (failing to establish noninferiority of a treatment that is truly noninferior). Margins should be established on clinical justification and not statistical rule [71]; the sketch below illustrates the point.
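To make the effect of the margin concrete, here is a rough Monte Carlo sketch; every number in it (sample size, standard deviation, true difference) is an arbitrary illustrative assumption, not a value taken from the included trials.

```python
import math
import random

def ni_claim_rate(true_diff, delta, n=300, sd=3.5, sims=2000, z=1.645):
    """Estimate how often a one-sided noninferiority test 'succeeds' for a
    given true difference (experimental minus control, larger is better)
    and margin delta, assuming a known common standard deviation sd and
    n patients per arm.
    """
    se = sd * math.sqrt(2 / n)  # standard error of the difference in means
    hits = 0
    for _ in range(sims):
        est = random.gauss(true_diff, se)  # simulated estimated difference
        if est - z * se > -delta:          # one-sided NI criterion
            hits += 1
    return hits / sims

# For a truly inferior treatment (true_diff = -1.0), a wide margin (2.5)
# lets it 'pass' almost always, while a narrow margin (1.0) rarely does.
print(ni_claim_rate(-1.0, 2.5), ni_claim_rate(-1.0, 1.0))
```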

Conclusion

In conclusion, based on the study reports we examined, our findings demonstrate deficiencies in the design, planning, and reporting of noninferiority and equivalence trials in the ophthalmology literature. Many studies of clinical noninferiority and equivalence do not set boundaries for equivalence. Claims of "superiority" and "similarity" are often made despite the use of improper analysis. These methodological deficiencies appear to lead to false claims, unrepeatable experiments, and possible costs or harms to patients.