Background

Effect sizes (ES) provide information about the magnitude of differences between groups in interventional studies [1, 2]. While treatment differences should be interpreted primarily on the original metric of the outcome (e.g., the difference in mean scores between two treatments), a standardised ES expressed in standard deviation units can aid interpretation of the magnitude of effect. Standardised ES are also used to calculate sample sizes for studies and to support comparisons of effects across studies [3, 4]. Comparing standardised ES across interventions or studies, however, must be done with caution, as ES may vary depending on study design, outcome measures, and the approach used to calculate the standard deviation (SD) [5].

The (standardised) ES metric for a parallel-group clinical trial is defined as the difference in mean scores between two treatments (numerator) divided by an SD of the outcome derived from the two treatment groups (denominator) [6]. However, there are different approaches to defining the SD used when computing the ES. It is therefore of interest to assess the impact of these different approaches on the ES using data from well-controlled clinical studies. Here, we report results from three phase 3 trials of tanezumab, an antibody to nerve growth factor, in participants with painful knee and hip osteoarthritis. We focus on the ES for the pain response, as it is the most commonly evaluated outcome.
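In notation (with $\bar{x}_1$ and $\bar{x}_2$ denoting the mean scores of the two treatment groups; the symbols are ours for illustration), the standardised ES takes the form

$$\mathrm{ES} = \frac{\bar{x}_1 - \bar{x}_2}{SD_{\mathrm{pooled}}},$$

and the methodological question examined in this report is which SD should enter the denominator.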

Methods

Data were from two phase 3, randomised, double-blind, multicentre, placebo-controlled, parallel-group trials (Study 1: NCT02697773, Study 2: NCT02709486) [7, 8] and one phase 3, randomised, double-blind, multicentre, active-controlled (nonsteroidal anti-inflammatory drugs [NSAIDs]), parallel-group trial (Study 3: NCT02528188) [9]. Trial details have been published previously.

Overall, study treatment (tanezumab, placebo, or NSAID) was received by 696 participants in Study 1, 849 participants in Study 2, and 2996 participants in Study 3 [7–9].

ES calculations of tanezumab versus placebo used Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC, ©1996 Nicholas Bellamy; WOMAC® is a registered trademark of Nicholas Bellamy [CDN, EU, USA]) Pain scores at Week 16 (Study 1) or Week 24 (Study 2). ES calculations of tanezumab versus NSAIDs used WOMAC Pain scores at Week 16 (Study 3). A mixed model for repeated measures (MMRM) was used to analyse change from baseline using observed data from each study [10]. The model included time (study week), treatment, the treatment-by-time interaction, and the randomisation stratification variables (index joint and highest Kellgren-Lawrence grade), all treated as fixed effects. Baseline WOMAC Pain score and baseline diary average pain score were treated as covariates.
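For illustration only, a simplified analogue of this model can be specified in Python with statsmodels. The sketch below uses a patient-level random intercept as a stand-in for the unstructured within-subject covariance typically used in an MMRM, and all column names (e.g., chg_womac_pain, subject_id) are hypothetical.

```python
# Illustrative sketch only: approximates the MMRM described above with a
# random-intercept mixed model (statsmodels does not fit an unstructured
# within-subject covariance directly). Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("womac_pain_long.csv")  # one row per patient per visit (hypothetical file)

model = smf.mixedlm(
    "chg_womac_pain ~ C(week) * C(treatment) + C(index_joint) + C(kl_grade)"
    " + baseline_womac_pain + baseline_diary_pain",  # fixed effects and covariates
    data=df,
    groups=df["subject_id"],  # repeated measures within patient
)
fit = model.fit()

# The least squares mean difference at the primary timepoint is then derived
# from the fitted treatment and treatment-by-time fixed effects.
print(fit.summary())
```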

ES were defined as the least squares mean difference (from the MMRM) in each score divided by a pooled SD of the outcome scores. Three approaches to computing the pooled SD (the denominator of the ES) were used: the pooled SD of WOMAC Pain scores at baseline (combined across treatments); the pooled SD of these scores at the time the primary endpoints were assessed (Week 16 for Studies 1 and 3, Week 24 for Study 2); and the median of the pooled SDs across all available timepoints (baseline, intermediate post-baseline timepoints, and the primary timepoint at the end of the trial), i.e., the median of the pooled SDs from baseline to Week 16 (Studies 1 and 3) or Week 24 (Study 2).
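A minimal sketch of the three SD definitions is given below, assuming a long-format data set with hypothetical columns subject_id, treatment, week, and womac_pain; it is illustrative only, not the analysis code used in the studies.

```python
import numpy as np
import pandas as pd

def pooled_sd(scores_by_group):
    """Pooled SD across treatment groups: sqrt of the weighted average of within-group variances."""
    ns = np.array([len(g) for g in scores_by_group])
    vs = np.array([np.var(g, ddof=1) for g in scores_by_group])
    return np.sqrt(np.sum((ns - 1) * vs) / np.sum(ns - 1))

def pooled_sd_at_week(df, week):
    """Pooled SD of WOMAC Pain at a single visit, pooled across treatment arms."""
    sub = df[df["week"] == week]
    groups = [g["womac_pain"].dropna().to_numpy() for _, g in sub.groupby("treatment")]
    return pooled_sd(groups)

def three_sd_definitions(df, primary_week):
    """Baseline, primary-timepoint, and median-over-visits pooled SDs."""
    weeks = sorted(df["week"].unique())
    sd_baseline = pooled_sd_at_week(df, weeks[0])
    sd_primary = pooled_sd_at_week(df, primary_week)
    sd_median = np.median([pooled_sd_at_week(df, w) for w in weeks if w <= primary_week])
    return sd_baseline, sd_primary, sd_median
```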

Given there is no convenient closed-form solution for standard errors and confidence intervals (CI) of ES statistics, a non-parametric bootstrap approach is recommended to compute a 95% CI for an ES and was applied to individual-patient WOMAC Pain data [11]. One thousand data sets were sampled from the individual-patient WOMAC Pain data. The bootstrap was done at the patient level: if a patient was selected, all WOMAC Pain data (at all visits) for that patient were selected. Sampling was with replacement, using the same number of patients as the original sample. Each bootstrap data set was used to compute the pooled SDs. For each study, each treatment comparison, and each approach to calculating the SD (baseline, endpoint, and median), the 95% CI (2.5th percentile, 97.5th percentile) of the ES was reported.
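The percentile bootstrap can be sketched as follows. For brevity, this sketch divides a simple mean difference at the primary timepoint by the chosen pooled SD rather than refitting the full MMRM in each replicate; the column names and treatment-arm labels follow the hypothetical layout above, and sd_fn can be one of the pooled-SD helpers sketched earlier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2023)  # arbitrary seed for reproducibility

def bootstrap_es_ci(df, primary_week, sd_fn, n_boot=1000):
    """Patient-level percentile bootstrap 95% CI for the effect size.

    sd_fn(df_boot) should return the chosen pooled SD (baseline, primary
    timepoint, or median over visits) from a resampled data set.
    """
    ids = df["subject_id"].unique()
    es_samples = []
    for _ in range(n_boot):
        # Resample patients with replacement; keep all visits of each selected patient.
        sampled = rng.choice(ids, size=len(ids), replace=True)
        boot = pd.concat([df[df["subject_id"] == pid] for pid in sampled], ignore_index=True)

        at_primary = boot[boot["week"] == primary_week]
        means = at_primary.groupby("treatment")["womac_pain"].mean()
        diff = means["tanezumab_2_5mg"] - means["comparator"]  # hypothetical arm labels

        es_samples.append(diff / sd_fn(boot))
    return np.percentile(es_samples, [2.5, 97.5])
```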

Results

Standard deviations

For the WOMAC Pain endpoint in all studies, the pooled SDs at baseline were the smallest and the pooled SDs at the time the primary endpoints were assessed were the largest. The median pooled SDs were similar to the SDs at the primary endpoint (Table 1). SDs were comparable across studies (Table 1).

Table 1 Standard deviations used to calculate the ES

Effect sizes

Based on the bootstrapping method, the mean (95% CI) ES of tanezumab 2.5 mg on pain versus placebo in Study 1 was −0.416 (−0.796, −0.060) when pooled baseline SDs were used; −0.195 (−0.371, −0.028) when pooled SDs at the time the primary endpoints were assessed were used; and −0.196 (−0.373, −0.028) when the median of the pooled SDs from baseline to the time the primary endpoints were assessed was used (Table 2). In Study 2, the corresponding ES of tanezumab 2.5 mg on pain versus placebo were −0.547 (−0.900, −0.208), −0.250 (−0.403, −0.095), and −0.256 (−0.414, −0.098), respectively (Table 2). In Study 3, the ES of tanezumab 2.5 mg on pain versus NSAID were −0.167 (−0.324, 0.001), −0.084 (−0.163, 0.001), and −0.085 (−0.165, 0.001), respectively (Table 2). Similar patterns of differences in ES (based on the SD calculation method) were observed for higher doses of tanezumab (Table 2).

Table 2 ES of tanezumab on pain as measured by WOMAC Pain score based on bootstrap samples

Discussion

Different approaches to calculating the pooled SD affect the magnitude of the ES, which in turn affects interpretation of the treatment effect and complicates comparisons across studies. Our results showed that ES derived from pooled SDs at the time the primary endpoints were assessed and from the median of the pooled SDs from baseline to that time were similar for all endpoint comparisons in all three studies. However, ES derived from pooled SDs at baseline were larger than ES derived from the other two SDs for all endpoint comparisons in all studies.

All three approaches to calculating the SD attempt to estimate the “true” variability of the measured outcome in the sample. Using only baseline data for the SD captures the natural variability in the sample, which is not affected by introduction of a treatment (assuming the outcome was not an entry criterion). SDs based on data at the primary endpoint are calculated by pooling within treatment groups and thus effectively exclude the treatment effect, because the pooled SD is a weighted average of each treatment group’s SD of scores rather than the SD of scores from both treatment groups lumped together as a single group (see Supplementary Text 1 for more detail). Using the median of the set of pooled SDs represents an attempt to use a representative value of the variability.
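In standard notation (with $n_i$ and $s_i$ denoting the sample size and SD of treatment group $i$), the pooled SD referred to here is the square root of the weighted average of the within-group variances,

$$SD_{\mathrm{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}.$$

Because only within-group variances enter this formula, a mean difference between the groups does not inflate $SD_{\mathrm{pooled}}$, whereas it would inflate an SD computed on the two groups combined as one.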

For patient-reported outcome studies, ES based on the baseline SD or the SD of individual changes are typically used for within-group pre- versus post-intervention comparisons. For ES comparisons between treatment groups, the pooled SD of treatment-group scores at baseline, the pooled SD of treatment-group scores at the post-treatment assessment, or the pooled SD of individual changes (when mean change from baseline is the outcome) have been applied [5, 12]. For a clinical trial in which the outcome measure also serves as an inclusion/exclusion criterion, the population studied at baseline will not represent an unbiased sample. Indeed, the goal of entry criteria is to define a homogeneous population, so baseline SD values are expected to be smaller. Furthermore, since response to treatment varies across individuals, SDs based on data collected after treatment initiation will likely be confounded by effects of treatment and time. The pooled SD at baseline and the pooled SD at the post-treatment assessment can therefore differ, which leads to the differences in ES presented here.

Different factors have been shown to influence the ES of scores in randomised controlled trials [13]. However, our analyses show that the method used to calculate the SD also directly affects the calculated ES: ES derived from the baseline SD tend to be more optimistic (i.e., larger) than ES derived from post-treatment SDs. It is noteworthy that the commonly used Cohen thresholds (ES < 0.20 indicates a trivial effect; ES of ≥ 0.20 and < 0.50, ≥ 0.50 and < 0.80, ≥ 0.80 and < 1.30, or ≥ 1.30 indicate small, moderate, large, or very large effects, respectively [14]) were developed for use in the social sciences and are based on Cohen’s d, in which the difference in means between treatment groups is divided by a pooled SD of scores (pooled across treatments) assessed at the same time as the means. In contrast, the Cochrane Handbook recommends using the SD from the pooled outcome data (known as Hedges’ g). Thus, when describing the magnitude of an ES, and particularly when comparing across studies and interventions, it is essential to describe how the SD was determined so that appropriate comparisons can be made. This is even more important when using ES estimates to determine sample sizes for clinical trials, as larger effect sizes lead to smaller sample sizes for equivalent power and may result in underpowered studies.
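As an illustration of how the choice of SD can move an effect across these descriptive boundaries, a minimal helper applying the thresholds above is sketched here (the function name is ours); the usage comments draw on the Study 1 tanezumab 2.5 mg results reported above.

```python
def cohen_label(es):
    """Map |ES| to the Cohen-style descriptive labels cited above (illustrative only)."""
    magnitude = abs(es)
    if magnitude < 0.20:
        return "trivial"
    if magnitude < 0.50:
        return "small"
    if magnitude < 0.80:
        return "moderate"
    if magnitude < 1.30:
        return "large"
    return "very large"

# Study 1, tanezumab 2.5 mg versus placebo:
cohen_label(-0.416)  # "small"   (pooled baseline SD)
cohen_label(-0.195)  # "trivial" (pooled SD at the primary timepoint)
```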

Generally, if an outcome scale was not used as part of a study’s entry criteria, we recommend using the baseline SD for ES calculations in longitudinal studies, since those SDs are not affected by treatment. If the outcome scale was part of the entry criteria, or is highly similar to a measurement used as part of the entry criteria, the baseline SD will be artificially attenuated. In this case, we recommend using the largest pooled post-baseline SD measured across the timepoints and across the two (or more) treatment arms, since it leads to the smallest (most conservative) ES. However, ES based on pooled SDs at the end of the study can also be reported in sensitivity analyses.

Conclusion

Standardisation of the method used to determine SD would allow researchers to more accurately compare the magnitude of treatment effects across studies, including when different measures are being used to assess the same concept of interest. In the absence of such standardisation, we advocate for reporting, in addition to ES, information about how the individual elements (e.g., means, SDs) were defined/calculated.