Background

Over 50 treatment modalities for osteoarthritis (OA) of the hip and knee have been evaluated by the Osteoarthritis Research Society International (OARSI) [1, 2]. Oral pharmacologic modalities included acetaminophen, non-steroidal anti-inflammatory drugs (NSAIDs), and both strong and weak opioids. Guidelines have recommended acetaminophen for first-line use, with NSAIDs and opioids as second and third lines of treatment [1, 3–5]. However, reservations have been expressed concerning the long-term safety and efficacy of NSAIDs and opioids [1, 2, 5, 6]. Some reviews have gone further and recommended against their long-term use [7, 8]. Recently published meta-analyses suggest that currently available oral treatments have only limited efficacy in the average patient with OA [6]. In addition, the efficacy seen in trials seems to be impacted by trial design and baseline factors and may be limited to the first few weeks of use [6].

Earlier meta-analyses have primarily focused on pain and have not assessed broader functioning. They have predominantly investigated single-substance classes, included both short- and long-term trials, and sometimes encompassed both OA and other chronic pain indications [7–25]. Also, these analyses could not include evidence for substances that were unavailable when they were performed, such as duloxetine, a newly available treatment option in the US.

Duloxetine is a selective serotonin and norepinephrine reuptake inhibitor (SNRI) that has demonstrated efficacy in OA in Phase III clinical trials as well as a favorable adverse event profile across indications [26–28]. Duloxetine is thought to inhibit pain through its enhancement of serotonergic and noradrenergic activity in the central nervous system. It is currently indicated in the US for the management of pain disorders, including diabetic peripheral neuropathic pain (DPNP), fibromyalgia, and chronic musculoskeletal pain due to OA and chronic low back pain [29].

We conducted a systematic literature review followed by a meta-analysis to assess the efficacy of duloxetine versus other commonly used post-first-line OA treatments, including NSAIDs and opioids. Our study reflected the chronic nature of OA by including only trials of 12 or more weeks' duration (the recommended duration for confirmatory trials) [30] and a more inclusive set of OA symptoms by using the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), which includes subscales for function and stiffness as well as pain [31]. We also sought to confirm the influence of design and baseline factors observed in a recent OA meta-analysis [6]. Both frequentist and Bayesian analyses were undertaken to assess the effect of duloxetine compared to the other available oral treatments.

Methods

Inclusion and exclusion criteria

Randomized controlled trials (RCTs) were included for OA treatment with duloxetine, NSAIDs or opioids at dosages consistent with United Kingdom prescribing information [32]. All included studies were of at least 12 weeks' duration and published in English. Articles were included if they evaluated clinical efficacy using WOMAC total scores. Studies were excluded if they did not report clinical efficacy in OA or did not have at least 2 arms of a treatment of interest, or 1 arm of a treatment of interest and a placebo arm.

When it was unclear from the title or abstract whether a study met the criteria, the full paper was acquired and read. Inclusion or exclusion was determined by 2 persons working independently. When their conclusions did not agree, they met and reached a consensus.

Literature search

The literature search was performed on all articles published between January 1985 and March 2013 in PUBMED, EMBASE, MEDLINE In-Process & Other Non-Indexed Citations, Cochrane Central Register of Controlled Trials, Cochrane Database of Systematic Reviews, and ClinicalTrials.gov. The search conducted in PUBMED used the following terms: (ibuprofen OR naproxen OR diclofenac OR meloxicam OR etoricoxib OR celecoxib OR mefenamic OR indometacin OR etodolac OR tramadol OR morphine OR codeine OR dihydrocodeine OR oxycodone OR diamorphine OR methadone OR hydromorphone OR duloxetine) AND (osteoarthritis) AND (English [lang]) AND (clinical trial [ptyp]). The search conducted in the other databases used the same search terms, but without the specific limitation of clinical trial publication type.

Data extraction

Data extraction was performed by 1 reviewer and checked by a second reviewer using a predefined data extraction form. Discrepancies were resolved by discussion between reviewers. For each study, reviewers extracted data that were deemed to potentially impact efficacy outcomes, such as study population (percent women, mean age, mean duration of OA), study design (duration, washout period, flare requirement, concomitant analgesic use, enriched enrollment, missing imputation technique), and outcomes (WOMAC score at baseline, endpoint, and change from baseline with measures of variance). Studies were categorized as having a washout period if the publication mentioned a period of washout or no treatment before randomization. A study was classified as requiring flare if the publication stated that after the washout/no treatment period patients were required to exhibit a flare of symptoms to continue in the study. Studies were classified as allowing concomitant analgesic use if patients could use analgesic medications in addition to their assigned treatment throughout the study; rescue medication was not considered concomitant use.

For studies that did not report sufficient data to be included in the analysis, 3 attempts were made to contact authors by email to obtain missing information. Studies were assessed for quality using the assessment tool from the National Institute for Health and Clinical Excellence (NICE) guidelines for Single Technology Appraisal submissions [33]. This 7-item questionnaire evaluates each trial based on randomization, adequate concealment of treatment allocation, similarities between treatment groups, degree of blinding, balance of withdrawals and dropouts between treatment groups, reporting of all outcomes measured, and use of intention to treat analyses. Studies were assessed by one reviewer and independently checked by a second reviewer. Positive responses were tallied for a total possible score of 7, with higher scores representing better quality.

Outcome measure

The outcome measure for the meta-analysis was the change from baseline total WOMAC score as reported at 12 or more weeks. The WOMAC instrument consists of 24 questions answered on a 0–4 Likert or 0–100 visual analogue scale (VAS). The WOMAC has 3 subscales: function (17 questions), pain (5 questions), and stiffness (2 questions). A lower WOMAC score indicates fewer symptoms, thus improvement is shown as a negative value; negative values of larger magnitude are indicative of greater efficacy. WOMAC total and subscale scores are reported inconsistently, with publications reporting scores on different scales, some subscale scores and not others, different measures of variance, or no measures of variance. Scores are commonly reported as: a) a total of the Likert scores, b) a total of the VAS scores, or c) normalized units with total and subscale scores reported on 0–100 scales [34]. To overcome this issue, WOMAC total scores were converted to a 0-100 normalized scale using a direct ratio. If change from baseline was not reported, it was calculated as the difference between baseline and endpoint or, if not possible, as the difference between baseline and a weighted average of multiple observations during treatment [35]. When subscale scores were reported without the total score, the total score and variance were calculated from the subscales. Missing stiffness subscale scores were imputed by substituting the mean of those reported for that treatment. Studies reporting neither the total score nor the pain and function subscale scores were omitted from the analysis.
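The direct-ratio conversion described above can be illustrated with a minimal sketch. It is not the code used in this analysis: the function names and the assumed maximum raw totals (96 for 24 items scored 0–4, 2400 for 24 items scored 0–100) are assumptions derived from the description of the instrument in this section.

```python
# Illustrative sketch only; names and raw maxima are assumptions.
# Maxima assume 24 items scored 0-4 (Likert) or 0-100 (VAS).
WOMAC_ITEMS = 24
RAW_MAX = {
    "likert": WOMAC_ITEMS * 4,    # 96
    "vas": WOMAC_ITEMS * 100,     # 2400
    "normalized": 100,            # already on a 0-100 scale
}


def normalize_womac_total(raw_total, scale):
    """Rescale a raw WOMAC total score to 0-100 by direct ratio."""
    return 100.0 * raw_total / RAW_MAX[scale]


def change_from_baseline(baseline, endpoint):
    """Change from baseline; negative values indicate improvement."""
    return endpoint - baseline


# Example: a Likert total of 48 at baseline and 36 at endpoint.
baseline = normalize_womac_total(48, "likert")
endpoint = normalize_womac_total(36, "likert")
print(baseline, endpoint, change_from_baseline(baseline, endpoint))
```

A standard deviation reported on the raw scale would be rescaled by the same ratio as the score itself.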

Statistical analysis

Frequentist and Bayesian methods were used to assess the effect of including the direct and indirect data in the analysis. The frequentist meta-analysis using Bucher indirect comparisons was chosen because it reports traditional statistical measures, whereas the Bayesian network meta-analysis allows for inclusion of both direct and indirect information in a single step. In both frequentist and Bayesian methods, if multiple arms for a treatment were present in a study at different doses, the arms used were consistent with the United Kingdom prescribing information. For tramadol, the 400-mg daily dose was not included as it is associated with higher rates of adverse events and similar efficacy to the 300-mg dose [36].

The frequentist meta-analysis used the difference between treatment and placebo of the change from baseline WOMAC score for each active treatment. Random effects models using the DerSimonian-Laird method were employed regardless of heterogeneity due to study design and population dissimilarities [37]. Estimated treatment effects compared to placebo and compared to duloxetine were calculated with their 95% confidence intervals using the Bucher method of indirect comparison [38–41]. Frequentist analyses were performed with Comprehensive Meta-Analysis software (CMA; Biostat, Englewood, NJ) [42]. Publication bias was assessed by funnel plot with Duval and Tweedie’s trim and fill [37].
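The two frequentist steps named above can be sketched in a few lines. This is a minimal illustration of the standard DerSimonian-Laird and Bucher formulas under stated assumptions, not the CMA implementation used here; the function names and the example numbers are hypothetical.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool study-level mean differences with a DerSimonian-Laird
    random effects model.

    Returns (pooled_effect, standard_error, tau_squared).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


def bucher_indirect(d_a, se_a, d_b, se_b, z=1.96):
    """Indirect comparison of A versus B through a common comparator
    (here, placebo). Returns the difference and its 95% CI."""
    diff = d_a - d_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)
    return diff, (diff - z * se, diff + z * se)


# Hypothetical inputs: placebo-adjusted changes from baseline and variances.
pooled_a, se_a, _ = dersimonian_laird([-6.0, -4.0, -5.0], [1.0, 1.5, 2.0])
pooled_b, se_b, _ = dersimonian_laird([-3.0, -4.5], [1.2, 1.8])
print(bucher_indirect(pooled_a, se_a, pooled_b, se_b))
```

A confidence interval from `bucher_indirect` that includes zero corresponds to a difference that is not statistically significant at the 5% level.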

Random effects Bayesian network meta-analyses were performed using the change from baseline score for all available studies. Bayesian methods described in NICE Decision Support Unit documents were modified to accommodate continuous data analysis [43, 44]. Each trial’s specific relative treatment effect was assumed to be drawn from a random effects normal distribution with a common random effects variance for all treatment comparisons. The best model was selected based on the deviance information criterion (DIC), described in Cooper et al. [45] and Dias et al. [46], and standard deviation (SD), which provide measures of model fit. The consistency between direct and indirect evidence was assessed using node splitting methods described by Dias et al. [46]. Estimated treatment effects compared to placebo and duloxetine were given with their associated 95% credible intervals as well as the probability of the treatment being superior to duloxetine. Sensitivity analyses were run on various scenarios, including adjustment for baseline scores, flare requirement, and analgesic use. The Bayesian analyses were conducted using WinBUGS version 1.4.3 (MRC Biostatistics Unit; Cambridge, UK) [47].
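The probability of one treatment being superior to another is commonly estimated from posterior simulation output as the share of draws that favor it. The sketch below illustrates that calculation under stated assumptions; it is not the WinBUGS code used in this analysis, and the function name and the draws are hypothetical.

```python
def prob_superior(diff_draws):
    """Share of posterior draws in which the treatment is better.

    Each draw is (treatment change from baseline) minus (comparator
    change from baseline). Because a more negative WOMAC change means
    greater improvement, the treatment is better when the draw is < 0.
    """
    draws = list(diff_draws)
    return sum(1 for d in draws if d < 0) / len(draws)


# Hypothetical posterior draws of the difference versus the comparator.
example_draws = [-2.0, -1.0, 0.5, 1.5]
print(prob_superior(example_draws))
```

With real simulation output, the list would hold thousands of draws retained after burn-in.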

Heterogeneity was assessed by calculating the I² statistic. Twelve population and study characteristics were assessed as possible confounding factors by visually inspecting forest plots for the magnitude and variability of study WOMAC scores. These characteristics included washout period [yes/no], enriched enrollment [yes/no], flare required [yes/no], chronic pain definition [<6 months/≥6 months], baseline pain level, concomitant analgesic use allowed [yes/no], missing imputation technique, quality assessment, study mean age, study mean duration of OA, site of OA, and the percent women. When forest plots suggested a possible relationship, both frequentist and Bayesian meta-regression were conducted to account for heterogeneity of treatment effect. Bayesian methods assumed the same covariate effect for all active treatments. Noninformative priors were used for all parameters; a uniform distribution was used for random effects variance and normal distributions with very large variance for all other parameters, including treatment effect and covariate effect.
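For reference, the I² statistic expresses the share of total variability in study estimates that is attributable to heterogeneity rather than chance, computed from Cochran's Q. The following is a minimal sketch of the standard formula, with a hypothetical function name and example values; it is not the software routine used in this analysis.

```python
def i_squared(effects, variances):
    """I-squared (in percent) from Cochran's Q with inverse-variance weights."""
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    # Truncated at zero when Q is smaller than its degrees of freedom.
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical study effects and variances.
print(i_squared([0.0, 2.0], [1.0, 1.0]))
```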

Results

Literature search

Figure 1 provides a flow diagram of the article selection process. Of the initial 1045 articles identified, 124 met the eligibility criteria for possible inclusion based on abstract review. Most excluded studies lacked a treatment of interest or the duration was too short. Thirty-two articles with 47 active treatment arms reported sufficient information to be included in the meta-analysis, for a total number of 17,442 patients (mean age 60.3 years, 64.9% women). Sixteen articles were found for celecoxib, 9 for naproxen, 5 each for tramadol and etoricoxib, 3 for duloxetine, and 2 each for ibuprofen, hydromorphone and oxycodone. Of the 20 other studies identified in the literature search, the most frequent reason for exclusion was incomplete reporting of WOMAC scores, especially the omission of a measure of variance. One full paper was unavailable [48].

Figure 1. Article selection flow chart. *Reporting 34 studies.

Table 1 presents the studies included in the meta-analysis with 5 extracted study characteristics as well as baseline and change from baseline WOMAC scores. The duration of nearly all studies was 12 to 13 weeks, with a range of 12 to 26 weeks. The size of treatment arms ranged from 51 patients in a placebo arm to 481 in a celecoxib arm. Seven studies did not report baseline WOMAC scores. Three studies were identified in which complete WOMAC scores were not reported in the publication, but were available on ClinicalTrials.gov. These studies are identified in the table with both the publication reference and the NCT number from ClinicalTrials.gov. Table 2 presents descriptive statistics of the included studies grouped by treatment. In Table 3 the quality assessments of the included studies are presented. Of the 32 included articles, 26 (81%) had a quality score of 6 or 7 (maximum score 7) and the other 6 studies had a quality score of 5, indicating that the included studies were of sufficiently high quality. A funnel plot assessing publication bias, run on all studies together because too few studies per compound were available, was roughly symmetrical, with slightly more studies on the left, indicating little effect of publication bias on the results of this analysis (Figure 2). Missing publications have been imputed using Duval and Tweedie’s trim and fill and appear as solid points among the actual publications depicted as circles [37]. This method suggests that possible missing studies would trend to non-significant differences in means.

Table 1 Characteristics of all included studies (Alphabetically ordered)
Table 2 Study descriptive statistics by treatment
Table 3 Quality assessment of included articles
Figure 2. Funnel plot of standard error by difference in mean. Note: o = actual publication; ● = hypothetical omitted study.

Statistical results

Results of both the frequentist and Bayesian analyses are shown in Table 4. The frequentist approach analyzed 32 of the 34 studies, excluding Sowers et al. [74] and Essex et al. [58] due to the lack of placebo arms. All active treatments, except hydromorphone and oxycodone, were found to significantly improve the WOMAC total score compared to placebo. Indirect comparisons to duloxetine using the Bucher method found that all confidence intervals except that for etoricoxib encompassed zero, indicating the differences between duloxetine and all treatments except etoricoxib were not statistically significant. Two compounds, ibuprofen and etoricoxib, had an I² of zero while naproxen, celecoxib, duloxetine, oxycodone, hydromorphone, and tramadol had I² values of 52%, 33%, 44%, 72%, 64%, and 58%, respectively, indicating substantial heterogeneity [78, 79]. However, the direction of the treatment effect was the same for all but one study; the magnitude of the treatment effect in these studies was the source of heterogeneity.

Table 4 Indirect comparison: results for WOMAC total score change from baseline

The Bayesian network meta-analysis included all 34 studies. Figure 3 depicts the network of direct and indirect evidence. As shown in Table 4, the results lead to similar conclusions as the frequentist results, as all 95% credible intervals of the difference between duloxetine and active treatments included zero.

Figure 3. Network of evidence including direct and indirect comparisons. Note: the numbers represent number of comparisons between treatments.

To explain heterogeneity/inconsistency, we graphically explored the association of relative effect of the active treatment versus placebo with study-level covariates. Forest plots were generated for each population and study characteristic showing the difference between placebo and treatment of the change from baseline, ordered by the value of the characteristic (see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11). Figure 4 is the forest plot for baseline WOMAC scores. A visual association was indicated between baseline and change from baseline scores, with a higher baseline score associated with a larger negative (improved) change from baseline. Figure 5 is a verifying scatter plot showing the trial-level baseline WOMAC scores between 45 and 70 and the relative treatment effect appearing to increase as the trial-level baseline increases. A frequentist meta-regression confirmed an association between the baseline and change from baseline scores (p < 0.0001) with an R² of 0.573, indicating much of the observed improvement in symptoms was associated with a higher baseline level of symptoms.

Figure 4. Forest plot by baseline WOMAC showing difference in change from baseline. Note: the lower limit in the Markenson study extends beyond the -20.00 scale of the plot.

Figure 5. Correlation between baseline WOMAC score and the relative effect of active treatments and placebo.

Bayesian meta-regression models including study-level covariates were used to evaluate the extent to which covariates accounted for heterogeneity of treatment effects. Three models including study-level covariates yielded lower, similar DICs (see Table 5). The model including the baseline score yielded both the lowest DIC and a substantially smaller SD of heterogeneity. Therefore, the model including the baseline score was preferred. Adjusted for baseline score, credible intervals of all treatments but tramadol and hydromorphone included zero, indicating no evidence of difference from duloxetine. In the cases of tramadol and hydromorphone, duloxetine demonstrated evidence of a clear advantage. When adjusted for baseline, the probability of duloxetine being superior increased for naproxen (19% to 57%), ibuprofen (28% to 82%), and etoricoxib (4% to 38%), but decreased for oxycodone (41% to 15%).

Table 5 Comparison of Bayesian models

Discussion

Our analysis employed the WOMAC, a common instrument in OA trials, with subscales for function, pain, and stiffness. It is, therefore, a broader measure of OA health than instruments that focus solely on pain. Randomized controlled trials and meta-analyses in OA commonly focus on the difference between the treatment and placebo arms of improvement from baseline to endpoint. Although a commonly reported measure in meta-analysis is the standardized mean difference (Cohen's d), we chose to report the unstandardized total WOMAC score, as it is a more meaningful outcome to clinicians. In the absence of consistent statistical significance, clinical relevance was not discussed. Because OA is a chronic condition, studies were included only with a treatment duration of at least 12 weeks, the current recommended minimum duration of confirmatory chronic pain trials [30]. This has not been universal practice in other meta-analyses of OA [8–11, 15–17].

With our choice of the WOMAC composite score as the outcome of interest, we chose a continuous endpoint (mean and standard deviation) rather than a dichotomous variable. It is recognized that others recommend the use of dichotomous variables (eg, 50% reduction in pain score) for evaluation of chronic pain trials. This recommendation is based on the benefits of treatment being frequently unequally distributed, typically presenting as a u-shaped distribution [81]. The WOMAC, however, is rarely reported in this manner, and our aim was to report the broader definition of health that the WOMAC encompasses, rather than pain alone.

Song et al. [41] suggest that judicious use of meta-analytical methodology can come to similar results as direct head-to-head evidence. It is frequently not possible, however, to fully account for differences in patient populations, the impact of different trial designs, and additional hidden confounders. For example, some of the trials applied flexible dose regimens (including 1 duloxetine trial) while others applied fixed dose regimens; this could impact comparative results. Enriched enrollment, a treatment run-in after screening to titrate patients up to optimal tolerability, is frequently used in opioid trials due to their well-known dosing requirements. NSAID trials, on the other hand, tend to exclude patients with a known bleeding risk or cardiovascular risk factors due to NSAIDs’ known safety profile. In the case of duloxetine, and in contrast to most other trials, a washout of previous NSAIDs was not enforced. Patients in duloxetine trials were allowed to continue (but not increase) treatment with NSAIDs, with a higher proportion of patients receiving NSAIDs in placebo arms. Because this design feature applied only to duloxetine trials, it could not be accounted for overall. Such aspects can limit the interpretation and generalizability of meta-analytic results.

Statistical analyses were performed using both frequentist and Bayesian methods. Frequentist methods have the advantage of using more familiar concepts and terminology. Bayesian network meta-analysis methods have the advantage of using all the data available, such as arms from active treatment controlled trials. In this study both methods produced similar results.

Our results mirror similar findings from previous studies. A 1997 study could not recommend a choice of NSAID therapy [21]. A more recent meta-analysis commissioned by NICE did not find a statistically significant difference among NSAIDs [82]; guidelines treat NSAIDs as a class differentiated primarily by adverse events [2, 3]. A meta-analysis of the short-term efficacy of treatments for OA of the knee found no statistical difference in pain relief between NSAIDs and opioids [6]. For duloxetine, our analysis repeats findings from previous studies in other pain indications. For both DPNP and fibromyalgia, duloxetine has been shown to be of similar efficacy to alternative treatment options [83, 84]. Our study found a significant relationship between baseline symptoms and the magnitude of treatment effect. The related issue of the influence of flare design in trials of NSAIDs has previously been noted [7, 85].

A limitation of this meta-analysis was the low number of studies available for analysis. Four or more studies were available for celecoxib, naproxen, tramadol, and etoricoxib. For all other treatments, 3 or fewer studies were found. Eight studies were omitted from the baseline-adjusted Bayesian analysis because baseline WOMAC scores were not reported in the study publications. These numbers were, however, similar to several other meta-analyses in OA [7, 8, 18, 21]. Limiting the literature search to English language publications may have led to missed RCTs. However, a study examining the effect of an English-language restriction in systematic reviews and meta-analyses found no evidence of bias as a result of the restriction [86]. The funnel plot suggests that publication bias, if any, was towards the exclusion of statistically non-significant studies, further supporting our findings of no difference among comparators. Another limitation of this study is the potential for ecological fallacy associated with patient level characteristics. For example, the mean baseline WOMAC score used in the regression analysis could represent a wide variety of patient level baseline scores. A study by Lange et al. [13] points out that imputed data may bias results, showing benefit of treatment where no benefit is seen in the non-imputed data. Thus, the imputation methods used in several of the included studies could have introduced bias in the results. However, its reported effect size seems to be in the range of alternative opioid treatment options such as tramadol or oxycodone [50, 87].

Conclusions

This meta-analysis found no consistent difference between duloxetine and other post-first-line oral treatments for OA in the total WOMAC score after approximately 12 weeks of treatment. Etoricoxib was more effective than duloxetine in the frequentist analysis and resulted in a 96% probability of being better than duloxetine in the nonadjusted Bayesian analysis. After adjustment for baseline pain score, however, duloxetine showed evidence of superiority to both tramadol and hydromorphone, but not to the other treatments, including etoricoxib.