Background

Health-related quality of life (HRQoL) is an important endpoint in cancer trials for several reasons. First, where effect sizes are small, HRQoL can ‘add value’ to expensive cancer treatments. Secondly, considerable time is spent completing instruments for the purpose of estimating the impact of treatments on HRQoL. Therefore, such efforts should result in HRQoL effects that are meaningful and interpretable, especially where HRQoL is a primary or co-primary endpoint [1]. Thirdly, some anti-cancer treatments exhibit serious side-effects, despite improvements in overall survival (OS); HRQoL is also reported to be a predictor of survival in lung cancer patients [2], the leading cause of death among cancers [3]. It would be important to understand, for example, how survival differs between patients with ‘poor’ baseline HRQoL and those with ‘good’ HRQoL. Finally, HRQoL outcomes are often required for cost-effectiveness analyses and drug reimbursement [4, 5]. Therefore, understanding and interpreting HRQoL data is crucial in evaluating cancer treatments.

The EORTC-QLQ-C30 (QLQ-C30) is a widely used cancer specific instrument [6]. The instrument has 30 questions from which 15 domains (sub scales) are determined, consisting of 5 ‘function’ scales, 8 ‘symptom’ scales, a global quality of life (QL) scale and a finance scale (FI). For QL and function domains, high scores indicate better HRQoL. For symptom domains (and FI), low scores indicate better HRQoL.

Treatment effects from the QLQ-C30 are often reported as mean differences (MDs) [7], despite scores having heavily skewed distributions with ceiling effects (many patients with scores of 0 or 100) and censored data due to progressive disease, death or failure to complete questionnaires. The interpretation of HRQoL MDs can be more complicated than survival endpoints. Consequently, alternative measures of treatment effect have been proposed.

Maringwa suggests a minimally important ‘difference over time’ as a measure of effect [8]. The area under the curve (AUC) can be difficult to interpret, although it is useful for reducing multiple observations to a single value [9]. However, if HRQoL is measured at only a few time points (e.g. baseline and month 12), the AUC will have limited value. Moreover, the interpretation of the effect can become tricky (e.g. for HRQoL scores of 100 at each of 0, 1 and 2 months, the AUC score is 200 but the original HRQoL scale is 0 to 100).
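The scale problem with the AUC can be illustrated with a short sketch. It assumes the AUC is computed by the trapezoidal rule (the text does not state which rule is used), and the function name `auc_trapezoid` is hypothetical.

```python
def auc_trapezoid(times, scores):
    """Area under the HRQoL curve by the trapezoidal rule (assumed method)."""
    if len(times) != len(scores) or len(times) < 2:
        raise ValueError("need at least two matching time points and scores")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        mean_height = (scores[i] + scores[i - 1]) / 2
        area += width * mean_height
    return area


# Scores of 100 at months 0, 1 and 2, as in the example in the text.
months = [0, 1, 2]
scores = [100, 100, 100]

auc = auc_trapezoid(months, scores)
time_averaged = auc / (months[-1] - months[0])

print(auc)            # in score-months, no longer on the 0 to 100 scale
print(time_averaged)  # dividing by follow-up time returns to the 0 to 100 scale
```

Dividing the AUC by the length of follow-up is one common way to bring it back to the original scale, although it does not help when only a few time points are available.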

Categorizing scores has also been proposed: e.g. Langendijk [10] proposed improvements in symptoms from ‘moderate’ or ‘severe’ (67–100 points at baseline) to ‘none’ or ‘little’ (0 to 33 points). Reck and Norman [11, 12] suggested that ‘noted’ changes in HRQoL occur when a ‘shift’ of greater than half of the baseline standard deviation is observed. Time to HRQoL deterioration (TD) has been suggested (Anota) [13]. However, different definitions of ‘deterioration’ lead to different conclusions, and the median TD may not be estimable (e.g. with few events) and may be further complicated by non-proportional hazards (PH). Interpretation of effects with TD using HRs is, however, similar to ORs. Reporting a ‘trend’ is also a way of describing HRQoL over time (Schaake) [14], although it is difficult to interpret (e.g. how much ‘more trend’ is there for experimental vs. control?).

The above measures of HRQoL effects can be difficult to interpret for patients and clinicians. The mean is often the statistic of choice to define treatment effect sizes for HRQoL endpoints in most of these measures.

One commonly reported clinically relevant effect size proposed by Osoba and King [6, 15, 16] is ≥10 points MD (on any domain), a value used as a benchmark by researchers to determine whether HRQoL benefits exist [7]. Some researchers interpret a 10 point improvement as a difference between treatments, while others interpret it as a 10 point change (improvement) from baseline (Hirsh) [6, 17], which is not always possible. For example, if a patient scores 8 points (or 92 points) at baseline, a reduction (or increase) of 10 points is not possible. Moreover, ‘important’ treatment differences need not be the same for symptom scales as for functional scales. A worsening of 5 points in a symptom scale may be more important than a 10 point improvement in a functional scale.

For HRQoL endpoints, the magnitude of an effect size is often considered to be clinically relevant if a difference of 10 points is observed, regardless of whether HRQoL is a primary or secondary outcome. Such requirements are not expected of other secondary clinical endpoints in cancer trials (e.g. time to progression (TTP)). One reason may be that secondary endpoints are not powered, or that there is a clinical rationale that the secondary outcome cannot be expected to yield effects similar to primary endpoints. In a similar vein, effect sizes should not be expected to be uniform across HRQoL domains for demonstrating treatment benefit because some smaller effect sizes (e.g. <10 points) may be important. In this research we attempt to show that some small effect sizes on an MD scale might be dismissed as clinically irrelevant but remain important on a relative scale.

Little attention has been given to smaller HRQoL effects (MDs), which are often glossed over unless a ‘statistically significant’ p-value is reported alongside. Small MDs tend to be perceived as offering limited HRQoL benefit but can mask important improvements, particularly when data are analysed using an alternative scale (e.g. the OR scale). This presents a challenge for setting thresholds for defining clinically relevant HRQoL effect sizes. Moreover, ORs are interpreted in a similar way to hazard ratios (HRs), which are familiar to many oncologists.

Therefore, in this article, after presenting baseline characteristics, we offer effect size categories based on the OR and describe example situations of the relationship between ORs and MDs. We discuss aspects of statistical significance of small effects in the context of ORs and MDs and compare preferences between ORs and MDs from several clinicians. Finally, we compare ORs and MDs with a time to deterioration (TD) approach (TD ≥5 points) following Anota [13].

Methods

Data

HRQoL data from six randomized controlled trials (RCTs) conducted by the CRUK & UCL CTC were analysed [9, 18–22]. These were selected because they comprised all patient-level QLQ-C30 data available in the CTC database from published RCTs in lung cancer.

  (i) ‘TOPICAL’: A phase III trial in NSCLC patients unfit for chemotherapy comparing erlotinib with placebo [18]; N = 670.

  (ii) ‘SOCCAR’: A phase II trial comparing concurrent vs. sequential chemotherapy in NSCLC patients [19]; N = 130.

  (iii) ‘Study 10’: A phase II trial comparing Gemcitabine/Carboplatin versus Cisplatin/Etoposide in patients with small cell lung cancer (SCLC) [20]; N = 241.

  (iv) ‘Study 11’: A phase III trial comparing Gemcitabine/Carboplatin versus Mitomycin/Ifosfamide/Cisplatin in patients with stage IIIB or IV NSCLC [9]; N = 422.

  (v) ‘Study 12’: A phase III trial comparing Thalidomide combined with chemotherapy versus chemotherapy alone in SCLC patients [21]; N = 724.

  (vi) ‘Study 14’: A phase III trial comparing Thalidomide/Gemcitabine/Carboplatin versus Gemcitabine/Carboplatin alone in NSCLC patients [22]; N = 722.

Assessments

Data were collected during clinic visits and questionnaires returned by patients during follow up; QLQ-C30 was assessed at several time points including baseline, pre and post chemotherapy and at monthly intervals for at least 24 months or until disease progression.

Statistical analysis

Patient-level HRQoL scores for each of the 15 domains were analysed using a repeated measures analysis [21, 22] for reporting MDs and a more novel Beta Binomial (BB) model in a mixed model framework [23] for reporting ORs. For the BB model, responses were transformed to a (0,1) scale using the transformation [23] (Y − a)/(b − a), where a and b are the minimum and maximum possible scores and Y is the observed response. For example, a score of 80 is transformed as (80 − 0)/(100 − 0) = 80/100 = 0.8. Dichotomization is not required for a BB model to generate ORs.
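The rescaling step can be written as a small helper. This is a minimal sketch of the transformation only, not of the BB model itself; the function name `to_unit_scale` and the default bounds of 0 and 100 are assumptions made for illustration.

```python
def to_unit_scale(y, a=0.0, b=100.0):
    """Rescale an observed score y from [a, b] to [0, 1] via (y - a) / (b - a)."""
    if b <= a:
        raise ValueError("maximum score b must exceed minimum score a")
    if not a <= y <= b:
        raise ValueError("score lies outside the possible range")
    return (y - a) / (b - a)


# Worked example from the text: a QLQ-C30 score of 80 on the 0 to 100 scale.
print(to_unit_scale(80))

# Scores at the floor and ceiling map to exactly 0 and 1.
print(to_unit_scale(0), to_unit_scale(100))
```

Scores at the floor and ceiling map to exactly 0 and 1, which is where a zero–one inflated formulation becomes relevant.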

The BB model has been used in a variety of applications [23–25]. Its advantages over standard (linear) models in terms of statistical properties are widely reported [25, 26]. The BB is also flexible because it models scores at the extreme ends of the scale (e.g. many patients scoring 0 or 100), a common feature of QLQ-C30 scores, using a zero–one inflated model [25, 26]. MDs were classified similarly to those described by Cocks [7]: ‘Trivial’ (0–3 points), ‘Small’ (3–10 points), ‘Modest’/‘Medium’ (10–15 points) and ‘Large’ (>15 points). Similarly, ORs were classified as 1 ± 0.05 (‘Trivial’), 1 ± 0.1 (‘Small’), 1 ± 0.2 (‘Medium’) and <0.8 or >1.2 (‘Large’). Time to Deterioration (TD) was determined using the first time at which scores reduced/increased by ≥5 points. Patients without deterioration were censored. Kaplan-Meier and Cox proportional hazards (PH) analyses were carried out.
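The two classification schemes can be sketched as a pair of lookup functions. The category boundaries are those stated above; how values exactly on a boundary are assigned (here, to the lower category) is an assumption, since the ranges overlap at 3, 10 and 15 points. The function names are hypothetical.

```python
def classify_md(md):
    """Classify a mean difference (in points) by its absolute size."""
    size = abs(md)
    if size <= 3:
        return "Trivial"
    if size <= 10:
        return "Small"
    if size <= 15:
        return "Medium"
    return "Large"


def classify_or(odds_ratio):
    """Classify an odds ratio by its absolute distance from 1."""
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    # Rounding guards against floating-point noise at the boundaries.
    deviation = round(abs(odds_ratio - 1.0), 6)
    if deviation <= 0.05:
        return "Trivial"
    if deviation <= 0.10:
        return "Small"
    if deviation <= 0.20:
        return "Medium"
    return "Large"


# An MD of 15.1 points paired with an OR of 1.12.
print(classify_md(15.1), classify_or(1.12))
```

Note that the OR bands are symmetric around 1 on the additive scale (e.g. 0.8 and 1.2), following the text, rather than symmetric on the log scale.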

A pilot survey was carried out to determine preliminary evidence of whether clinicians and/or patients preferred ORs or MDs for expressing treatment effects. Three items, physical function (PF), Pain (PA) and cognitive function (CF), were randomly selected from the 15 domains and presented to each of five clinicians and their patients (where possible). Patients/clinicians were asked to state preferences for ORs or MDs (Additional file 1). Lower scores express preferences for ORs and higher scores preferences for MDs; scores close to 5 express indifference.

Results

Demographics and baseline characteristics

The median age was 64 years (range 27–86 years) with the oldest patients in the TOPICAL trial (median age 77); 61 % were male; 67 % were ECOG 0–1, 24 % ECOG 2 and 9 % ECOG 3 (Table 1); less than half were stage IIIa-IIIb (47 %) [9, 18–22]. Most QLQ-C30 responses were >90 % complete at baseline (Additional file 2: Table S1) with the exception of study 10 (about 60 % complete). More than 50 % of data were available for at least 5 time points.

Table 1 Summary of baseline characteristics for each trial

Distribution of QLQ-C30

Most (>85 %) QLQ-C30 responses were very skewed (Fig. 1 & Additional file 2: Figure S1). For TOPICAL, 14/15 (93 %) of scores had alpha or beta values (special values associated with a BB distribution relating to the mean and variance) <1; Kolmogorov-Smirnov tests rejected normality (p-value <0.001). Therefore, the mean, and consequently the MD, is not considered a suitable reporting metric for these HRQoL scores. Statistical analysis should be conducted according to the underlying (true) distribution of the data. QLQ-C30 scores from the six trials were not normally distributed in most (≥85 %) cases.

Fig. 1 Distribution of QLQ-C30 responses: TOPICAL (x-axis: QLQ-C30 score on a scale of 0 to 1; y-axis: relative frequency)

Relationship between MDs and ORs

Few (4/90; 4 %) HRQoL treatment effects (MDs) were ‘Large’ (>15 points) or ‘Medium’ (10–15 points); 27/90 (30 %) were ‘Small’ (3–10 points) and 59/90 (66 %) were ‘Trivial’ (0–3 points). For ORs, 22/90 (24 %) were ‘Large’ (effects >20 %) or ‘Medium’ (effects between 10 % and 20 %), with the rest being ‘Small’ or ‘Trivial’ (effects of up to 10 % and 5 % respectively). ORs were therefore more than seven times more likely than MDs to detect larger differences, which can correspond to improvements in HRQoL of up to 20 % ([0.24/0.76]/[0.04/0.96]) (Tables 2 and 3).
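The ‘seven times’ figure is itself an odds ratio comparing the two metrics, and can be reproduced as follows. This sketch uses the rounded proportions quoted in the text (0.24 and 0.04) and, for comparison, the unrounded counts (22/90 and 4/90); the function names are hypothetical.

```python
def odds(p):
    """Convert a proportion to odds."""
    if not 0 < p < 1:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return p / (1 - p)


def odds_ratio(p1, p2):
    """Odds ratio of proportion p1 relative to proportion p2."""
    return odds(p1) / odds(p2)


# Rounded proportions as quoted: 24 % of ORs vs 4 % of MDs were Medium or Large.
rounded = odds_ratio(0.24, 0.04)

# The same comparison using the unrounded counts 22/90 and 4/90.
unrounded = odds_ratio(22 / 90, 4 / 90)

print(round(rounded, 2), round(unrounded, 2))
```

The result is sensitive to rounding: the rounded proportions give a ratio of about 7.6, whereas the unrounded counts give a ratio of about 7.0.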

Table 2 Mean differences compared with odds ratios
Table 3 Magnitude of effect sizes

Additional file 2: Figure S2 shows the relationship between MDs and ORs and shows general agreement in terms of the direction of effects (i.e. observations in the upper right quadrant are ORs >1 and MDs >0; estimates in the lower left are ORs < 1 and MDs <0).

Four examples are provided to understand the relationship between ORs and MDs.

Example 1: when MDs are small but ORs are large

In the TOPICAL trial the MD for constipation (CO) symptoms was 2.6 points (p = 0.1314), while the corresponding OR was 1.17 (p < 0.0001). The choice of interpretation is ‘a worsening in CO by a mean difference of 2.6 points with erlotinib compared to placebo’ vs ‘patients are 17 % more likely to have worsening CO symptoms with erlotinib compared to placebo’. The MD scale gives the impression that CO symptoms worsen by a ‘Trivial’ amount of 2.6 points (Table 2). This tends to occur when responses are skewed (Fig. 1 and Additional file 2: Figures S1, S2 and S3). In the presence of heavily skewed data, the OR is a suitable choice for presenting HRQoL effects from the QLQ-C30.

Example 2: when MDs are ‘Large’ but ORs are ‘Medium’ or ‘Small’

In the TOPICAL trial, patients had worse diarrhoea (DI) with erlotinib: an MD of 15.1 points (‘Large’ effect; p < 0.001) with a corresponding OR of 1.12 (p = 0.0505). The DI scores were considerably skewed (Fig. 1), which might explain why the larger MD corresponded with only 12 % (‘Medium’ effect) higher odds of diarrhoea with erlotinib compared to placebo (OR = 1.12). The OR appears to have modified the ‘Large’ effect size to a smaller effect size that is borderline non-significant.

Example 3: when MDs are ‘Medium’ but ORs are ‘Large’

In study 10, role function (RF) improved by an MD of about 13 points (Table 2) with the experimental treatment, a ‘Medium’ effect. Using an OR, this was an improvement in RF of almost 30 % (OR = 1.29). On examination of Additional file 2: Figure S1, responses fell into only three distinct categories at 0, 50 and 100 and scores were not normally distributed, making use of the MD questionable. The OR approach has reclassified a ‘Medium’ effect as a ‘Large’ effect.

Example 4: when MDs and ORs agree on the direction of effects

In the TOPICAL trial, the MDs for PF and CF (3.2 and 3.6; p-values of 0.0017 and 0.0007 respectively) had corresponding ORs of 1.10 and 1.14 (p-values of 0.0168 and 0.0107). Both MDs and ORs are in agreement that PF and CF are improving with the experimental treatment. Hence, on average, patients had 10 % and 14 % higher odds of improved PF and CF respectively on erlotinib compared with placebo (Table 2).

The above are a limited number of examples reflecting the challenges associated with defining thresholds of HRQoL differences with the MD. Interpretation is further complicated when small effects that are difficult to interpret are justified through statistical significance. The statistical significance of small HRQoL effects is often reported, but the clinical relevance is not always discussed. Table 3 shows that 28/90 (31 %) of ‘Small’ or ‘Trivial’ effects based on MDs were statistically significant compared with 7/90 (8 %) for ORs.

Example 5: Potentially unreliable statistically significant conclusions using MD

In study 12, for diarrhoea (DI), the MD was −2.3 (p = 0.0017). The corresponding OR was 1.05 (p = 0.2909). The clinical relevance of the small improvement in DI symptoms with experimental treatment might be difficult to judge. On the OR scale, DI is actually shown to be worse: a 5 % higher likelihood of worsening diarrhoea (a common side effect with this chemotherapy) on the experimental treatment. Examination of Additional file 2: Figure S2 shows heavily skewed DI scores, with about 15 % of patients showing worsening DI symptoms. The choice of a mean statistic here is likely to lead to an unreliable or unexpected statistical conclusion. Further examples of differing statistical conclusions between ORs and MDs are shown in Additional file 2: Tables S2, S3.

Effect size classification for ORs and MDs

Estimates for OR effect size categories similar to those described earlier [7] were determined using cumulative frequency plots from MDs and ORs (Fig. 2 and Additional file 2: Tables S2, S3, S4). Effect sizes in terms of ORs were broadly classified as: ‘Trivial’ effects, ORs within ±5 % of 1 (i.e. ORs between 0.95 and 1.05); ‘Small’ effects (ORs 1.05–1.10 or 0.90–0.95); ‘Medium’ effects (ORs 1.10–1.20 or 0.80–0.90); and ‘Large’ effects (ORs either >1.20 or <0.80). Additional file 2: Table S4 shows that 12/59 (20 %) of ‘Trivial’ effects based on MDs might be clinically important because on an OR scale these were ‘Medium’ or ‘Large’. Consequently, some clinically important effects may be missed using MDs.

Fig. 2 Cumulative frequency plot of effect sizes for MDs and ORs. Horizontal reference lines are MD effect sizes of 3, 10 and 15 points; circles refer to ORs and squares refer to MDs

Figure 2 shows median HRQoL effect sizes are 2.5 points (half of effect sizes are ≤2.5), roughly equivalent to 7 % changes in HRQoL on the OR scale; similarly for the lower and upper quartiles, 25 % of effect sizes ≤1 point or 4 % changes on the OR scale; and 75 % of effect sizes are ≤3.6 points (ORs of about 1.10).

Secondly, for effect sizes of 1, 3, 5, 10 and >15 points, the equivalent ORs are about 1.02, 1.07, 1.13, 1.25 and 1.37 respectively. The threshold for a large effect size of >15 points is challenging: patients would be expected to improve/worsen by almost 40 %. This may be a difficult target for some cancer drugs to achieve when compared with each other.
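The approximate MD to OR correspondence can be turned into a rough lookup. The sketch below linearly interpolates between the five pairs quoted above; linear interpolation, the treatment of ‘>15 points’ as an anchor at exactly 15, and the function name `approx_or_from_md` are all assumptions for illustration, not part of the reported analysis.

```python
from bisect import bisect_right

# (MD in points, approximate OR) pairs quoted in the text.
MD_POINTS = [1.0, 3.0, 5.0, 10.0, 15.0]
OR_POINTS = [1.02, 1.07, 1.13, 1.25, 1.37]


def approx_or_from_md(md):
    """Interpolate an approximate OR for the magnitude of a mean difference."""
    size = abs(md)
    if not MD_POINTS[0] <= size <= MD_POINTS[-1]:
        raise ValueError("MD magnitude outside the quoted range of 1 to 15 points")
    # Locate the upper end of the segment that contains `size`.
    upper = min(bisect_right(MD_POINTS, size), len(MD_POINTS) - 1)
    x0, x1 = MD_POINTS[upper - 1], MD_POINTS[upper]
    y0, y1 = OR_POINTS[upper - 1], OR_POINTS[upper]
    return y0 + (size - x0) * (y1 - y0) / (x1 - x0)


# An MD of 7.5 points lies halfway between the 5 and 10 point anchors.
print(round(approx_or_from_md(7.5), 2))
```

Such a lookup is only a reading aid for the cumulative frequency plot; it does not replace fitting the model to the data.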

Summary of preference scores from survey

Five lung cancer clinicians completed a pilot (Additional file 1) survey (London UCH, Liverpool, Leeds, Chester and Imperial College London). At this time no patient responses were available. Hence a total of 15 scores from 5 clinicians who expressed preferences for either ORs or MDs for each of PF, Pain and CF were analysed. Stronger preferences were expressed for ORs over MDs: mean scores of 2.4, 3.1 and 2.8 for PF, Pain and CF respectively. Hence, initial evidence suggests clinician preference was greater for ORs than MDs. The results would need to be confirmed in a larger sample.

Comparison with time to deterioration

The time taken for a patient to deteriorate from baseline by ≥5 points could not be estimated for about 13 % of HRQoL domain scores because there were too few events (i.e. patients did not show a deterioration of ≥5 points). Moreover, a TD of ≥5 points was not always possible because scores were clustered at values such as 16.7, 33.3 and 66.7 (e.g. as in CF scores for TOPICAL; Fig. 1). No patient experienced (or could experience) a TD of exactly 5, 10 or 15 points (the possible values of the QLQ-C30 for CF were only 0, 16.7, 33.3, 50.0, 66.7, 83.3 and 100). The median TD (Additional file 2: Table S5) was not calculable for some symptom and function scores: for CF, a HR of 1.05 (p = 0.241) was reported: patients had a 5 % increased risk of deteriorating (≥5 point reduction) CF with erlotinib compared to placebo. The OR of 1.14 and MD of 3.2 in contrast show improvements in CF. The definition of deterioration is therefore critical for a valid estimate to be possible. When the TD for CF was changed to ≥16 points (‘Large’ effect), the medians became calculable as 77 vs 87 months for erlotinib vs placebo (HR = 0.92; p = 0.56): the risk of deterioration in CF was slightly worse (by 8 %) with erlotinib compared to placebo. The Kaplan-Meier curves cross and the PH assumption was violated, a complication the OR analysis avoids.
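The consequence of the coarse CF scale for TD definitions can be checked directly. The sketch below takes the seven possible CF values listed above and a hypothetical patient trajectory, and shows that thresholds of 5, 10 and 15 points all flag the same first deterioration; the function name and the trajectory are illustrative assumptions.

```python
# The seven possible CF values listed in the text.
CF_VALUES = [0.0, 16.7, 33.3, 50.0, 66.7, 83.3, 100.0]


def first_deterioration(scores, threshold):
    """Return the first visit index at which a function score has dropped
    from baseline by at least `threshold` points, or None if censored."""
    baseline = scores[0]
    for visit, score in enumerate(scores[1:], start=1):
        if baseline - score >= threshold:
            return visit
    return None


# Smallest possible non-zero change between adjacent CF values.
min_step = min(high - low for low, high in zip(CF_VALUES, CF_VALUES[1:]))

# Hypothetical trajectory: stable at visit 1, then a one-step decline.
trajectory = [83.3, 83.3, 66.7, 50.0]
events = {t: first_deterioration(trajectory, t) for t in (5, 10, 15)}

print(round(min_step, 1))
print(events)
```

Because the smallest possible change exceeds 15 points, any threshold up to that size defines the same events on this domain, so a ‘≥5 point’ deterioration in CF is in practice a deterioration of at least one full step.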

Conclusion

An alternative metric to the commonly reported MD was presented in the form of ORs. Skewness of QLQ-C30 scores might render statistical and clinical interpretation of MDs questionable. Alternative effect size categories for ORs were proposed. We have also shown a relationship between ORs and MDs for QLQ-C30 measures; ORs can on the one hand reveal important HRQoL effects which might otherwise be missed with MDs, particularly those perceived to be ‘Trivial’ or ‘Small’. Conversely, effect sizes based on MDs thought to be ‘Medium’ or ‘Large’ may appear less exaggerated with ORs. Treatment effects from TD-type analyses did not always result in estimates of effect sizes, and interpretations were complicated by violations of the PH assumption. Finally, we showed results from a pilot survey which suggest oncologists may prefer ORs over MDs for interpreting QLQ-C30 effects.

ORs have been used previously with HRQoL data. Feddern et al. (2015) [27] report them for assessment of pain; Chie et al. (2015) [28] use a propensity score (logistic regression) approach to report odds of HRQoL deterioration; Kurita et al. (2015) [29] use ORs with the QLQ-C30 in renally impaired patients. In these analyses scores were dichotomized in order to generate the ORs. In our analysis, no such dichotomization (and consequent loss of information) was required due to the flexibility of the Beta-Binomial regression approach.

Patient and clinician understanding of MDs has not previously been shown to be concordant [7], and this may in part be due to how HRQoL benefits are expressed to patients. Clinicians and patients may find it easier to agree on relative quantities than absolute differences. The pilot survey results may support relative quantities. The choice between interpretations such as: “your diarrhoea will be worse with the new treatment by 15 points, on average” instead of: “the likelihood of diarrhoea with the new treatment is significantly higher by about 11 % compared to placebo”, is a matter of preference, but the latter may be appealing for some. Aligning understanding of smaller effect sizes is increasingly important with the emergence of novel treatments for lung cancer being compared with each other (and not just placebo).

There are several advantages and disadvantages of both MDs and ORs. First, ORs evaluate relative (instead of absolute) treatment effects. For objective endpoints, absolute differences (e.g. 4 vs 3 months survival) may provide easier interpretations of treatment benefits (although the effects are median and not mean differences in cancer trials). However, HRQoL endpoints are self-reported, and even the most experienced clinician has difficulty interpreting them. For such endpoints, a relative scale may be more useful. If treatment effects from primary endpoints are judged by relative quantities (e.g. hazard ratios), there are no reasons why treatment effects from HRQoL endpoints should not also be assessed this way. Both survival time and HRQoL share some similar distributional properties (e.g. skewed or censored). There is some concern that effects near the boundaries (floor/ceiling) will be overvalued with ORs compared to effects around the middle. However, such concerns can be addressed through the use of zero–one inflated models (Khan, 2014) [25] which model the over/under dispersion.

Secondly, the OR model assumes a fixed odds ratio over time (i.e. the effect is constant over time), which may not hold in a longitudinal QoL setting. Reliable interpretation of MDs also depends on an absence of treatment by time interactions (i.e. ORs and MDs are not dependent on specific time points). Thirdly, statistical models for MDs will provide predicted patient-level HRQoL responses. For example, a patient taking experimental treatment with a certain demographic profile might yield a predicted PF score (e.g. 5 points). Similarly, a model for estimating ORs can be used to predict the probability of achieving a specific PF score for a given patient (or group of patients) on the experimental treatment (response curves are advocated by the FDA for patient reported outcomes) [30].

The suggested effect size of >10 units on the QLQ-C30 was proposed almost two decades ago when fewer treatment comparators were available [15]. Few (about 2 %) MDs were >10 points and this research confirms earlier conclusions that small changes in HRQoL can be important (Cella, 2002) [7, 31]. Importantly, the implications of skewed distributions were not factored in when the magnitude of effect sizes was defined in earlier research.

There are several strengths and limitations of this analysis. First, a large sample size is used from clinical trials in similar groups of patients. Secondly, established criteria for classifying effect sizes were used for MDs [7]. Third, the BB model is a robust approach to analysing skewed data with ceiling effects, without arbitrary dichotomisation of responses. Finally, interpreting ORs is similar to that of HRs which many oncologists are familiar with.

Although the BB approach offers an alternative way to analyse and interpret HRQoL effects, it is more complex. The complexity is outweighed by the benefits of reliable and potentially easier to interpret estimates of effect. A further limitation is that the analysis has been restricted to lung cancer patients, although the approach can be applied to other tumour types and disease areas. The classifications suggested for ORs in this analysis are arbitrary (even if based on the observed data) and different results can occur with alternative categories. Definition of effect sizes may require some threshold to be set, which may necessarily be subjective. However, a starting point in our view is that the most appropriate metric is used to present HRQoL effects in cancer patients, an area for further research. The initial survey results should also be confirmed in a larger sample.

Treatment effects for HRQoL from the QLQ-C30 should be reported using relative quantities such as ORs which appear to be clinically intuitive, easier to interpret and where analysis involves modelling the skewed distribution of responses.

Highlights

The highlights of this paper are:

  • Mean differences in HRQoL are difficult to interpret for clinicians and patients alike, especially when the difference is small.

  • An alternative measure to reporting and interpreting HRQoL treatment differences using a relative quantity such as an odds ratio can greatly facilitate patient–clinician understanding of a ‘relevant’ HRQoL improvement.

  • We offer a way in which mean differences in HRQoL can be interpreted as approximate odds ratios. Effect sizes are categorized as ‘Trivial’, ‘Small’, ‘Medium’ and ‘Large’ for odds ratios in a similar way to mean differences.

  • Although the BB approach offers an alternative approach to analyse and interpret HRQoL effects, it is more complex. The complexity is outweighed by the benefits of reliable and potentially easier to interpret estimates of effect.

  • Our approach will allow patients and clinicians to align their understanding of treatment benefits using HRQoL outcomes.