Background

A challenge in the interpretation of health-related quality of life (HRQOL) data in clinical research is that HRQOL is self-reported by the patient and may be influenced by psychological phenomena such as adaptation to illness. Patients who experience changes in health often accommodate and adapt to these changed conditions. When changes in HRQOL are measured with a pre-test (assessment prior to intervention)/post-test (assessment after intervention) design, as in a Randomized Controlled Trial (RCT), adaptation to an increased symptom level or impaired HRQOL can affect the results, a change referred to as response shift (RS) [1]. Sprangers and Schwartz defined RS in the field of HRQOL as a change in the meaning of an individual's self-reported HRQOL [2]. It can be divided into 1) Reconceptualization (i.e. a re-definition of HRQOL), 2) Reprioritization (i.e. a change in the importance attributed to the component domains constituting HRQOL) and 3) Recalibration (i.e. a change in a patient's internal standards of measurement). The most widely used approach for assessing changes in a patient's internal standards is the retrospective pre-test design (then-test) [1, 3]. At post-test, patients are retrospectively asked to provide a renewed judgment of their HRQOL at baseline (pre-test). The then-test is ideally completed simultaneously with or in close proximity to the post-test, on the assumption that patients rate their HRQOL on both tests using the same internal standards.

In recent years, several studies have found evidence for the occurrence of RS in HRQOL in cancer patients (e.g. [4-10]). RS may sometimes be the result of an adaptive response to a changed health status, and may then be viewed as a positive phenomenon for patients. However, the altered meaning of HRQOL over time poses a challenge to clinicians in the interpretation of changes in HRQOL. In a study by Visser et al, fatigue was assessed in 216 cancer patients before and after treatment with radiotherapy [4]. When the conventional pre-test was compared to the post-test, no differences in fatigue were found. This might lead to the conclusion that radiotherapy does not affect fatigue. However, when the then-test was used as the measure of fatigue at baseline, there appeared to be a statistically significant increase in fatigue after treatment.

The magnitude and importance of the RS phenomenon remain unresolved. A meta-analysis by Schwartz et al suggested that RS may play a significant role in HRQOL research and that the direction of this shift varies across studies [11]. In a previous report we attempted to determine the clinical significance of changes in quality-of-life scores in patients with multiple myeloma (MM) [12]. MM is an incurable malignant disease of the bone marrow with an expected median survival of five years [13]. At diagnosis, myeloma patients report a pronounced impairment of HRQOL, with reduced physical functioning, fatigue and pain as the major problems [14]. The aims of treatment are to control disease, maximize quality of life and prolong survival. Hence, HRQOL is an important outcome in clinical trials. We estimated the Minimal Important Difference (MID) in patients with MM for the HRQOL instrument, the EORTC QLQ-C30. MID is defined as "the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side effects and excessive cost, a change in the patients' management" [15]. Our results suggested that a change in the EORTC QLQ-C30 score in the range of approximately 6-17 (on a 0-100 scale) is considered important by patients with MM. Here, we evaluate whether patients experienced RS and, if so, its magnitude and direction. We also explore how RS affects the MID results and whether RS affects the interpretation of HRQOL results in clinical trials.

Methods

Patients

Patients with MM, irrespective of their disease status (newly diagnosed, plateau phase, relapsed) or treatment, were enrolled from January 2006 to April 2008. Eligibility criteria were an expected survival of more than three months and the ability to complete a self-report questionnaire in Norwegian. Consecutive patients admitted to 17 hospitals in the South-Eastern Norway Regional Health Authority, a region representing about 50% of the Norwegian population, were recruited. Written informed consent was obtained from all participants. The Helsinki Declaration guidelines were followed. The Regional Committee for Medical Research Ethics, Health region I, Norway, approved the study.

Questionnaire

HRQOL was measured using the EORTC QLQ-C30, a cancer-specific questionnaire with 30 items [16]. The questionnaire is composed of five functional scales, three symptom scales, a global health/quality of life scale, and six single items. All scores were calculated and transformed to a 0-100 scale according to EORTC methods [17]. For the functional scales and global health status, higher scores represent a higher level of functioning. In the symptom scales and single items, higher scores represent more symptoms or difficulties. The questionnaire is reliable and valid for MM patients [18].
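
As an illustration of the 0-100 transformation, the following minimal Python sketch applies the linear transformation commonly described for EORTC scales: the raw score is the mean of the items, rescaled by the item range and reversed for functional scales. The function name, argument names and example responses are hypothetical and are not taken from the study data.

```python
from statistics import mean

def scale_score(items, item_range, functional):
    """Linearly transform raw item responses to a 0-100 scale score."""
    raw = mean(items)                  # raw score = mean of the item responses
    fraction = (raw - 1) / item_range  # 0 at the lowest response, 1 at the highest
    if functional:
        fraction = 1 - fraction        # reverse so that higher = better functioning
    return fraction * 100

# Hypothetical responses on 4-point items (1 = not at all ... 4 = very much)
fatigue = scale_score([2, 3, 2], item_range=3, functional=False)
physical = scale_score([1, 2, 1, 1, 2], item_range=3, functional=True)
# Global health/quality of life items use a 7-point scale, so the range is 6
global_qol = scale_score([5, 6], item_range=6, functional=False)
```

With this convention a higher fatigue score means more fatigue, whereas higher physical functioning and global quality of life scores mean a better state, as described above.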

Interview and Then-test approach

Patients completed the EORTC QLQ-C30 at inclusion (T1) and after three months (± 2 weeks) (T2). At T2, a structured interview was performed and the patients were asked: "Compared with the last time you filled in the questionnaire (T1, date mentioned to the patients), has your quality-of-life improved, stayed the same or deteriorated?" The response choices were on a seven-point scale ranging from 1 = much better to 7 = much worse. This global rating of change (GRC) question was asked for the four domains physical functioning, fatigue, pain and global quality of life. Because of small sample sizes in some of the GRC categories, we pooled the data into three categories (improved, unchanged, deteriorated) to yield sufficient numbers of cases in each category. "Improved" included much better, moderately better and a little better, and "deteriorated" included a little worse, moderately worse and much worse for the four domains. MIDs for improvement and deterioration were defined as the mean score changes in these domains for patients declaring improvement or deterioration. Throughout the article we use "improved" as shorthand for patients who reported themselves as improved, and similarly for deteriorated and unchanged patients.
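
The pooling of GRC ratings and the resulting MID estimate can be sketched as follows. This is a minimal Python illustration, assuming that the midpoint of the seven-point scale (4) corresponds to "unchanged"; the function names and the example records are hypothetical and do not reproduce study data.

```python
from statistics import mean

def pool_grc(rating):
    """Collapse a seven-point GRC rating (1 = much better ... 7 = much worse)."""
    if rating <= 3:
        return "improved"      # much, moderately or a little better
    if rating == 4:
        return "unchanged"     # assumed midpoint of the scale
    return "deteriorated"      # a little, moderately or much worse

def mid_by_group(records):
    """Mean observed score change (post-test minus pre-test) per pooled group."""
    changes = {}
    for rating, pre, post in records:
        changes.setdefault(pool_grc(rating), []).append(post - pre)
    return {group: mean(values) for group, values in changes.items()}

# Made-up (GRC rating, pre-test score, post-test score) records for a symptom scale
records = [(2, 50, 33), (3, 67, 50), (4, 33, 33), (6, 17, 33), (7, 33, 50)]
mids = mid_by_group(records)
```

For a symptom scale, a negative mean change in the improved group and a positive mean change in the deteriorated group are the quantities interpreted as MIDs.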

After the GRC questions, the patients were asked to provide a renewed judgment of their baseline ratings of the EORTC QLQ-C30 for the four domains (Then-test). The questions were asked in past tense for each of the 12 items included in these domains. We emphasized that the purpose of the then-test was not to recall their previous answers but to provide a renewed judgment of their HRQOL at baseline.

The mean difference between the pre-test and then-test scores was used to provide an estimate of the direction and magnitude of the RS effect. Observed changes were calculated by the difference between the mean post-test and pre-test scores while adjusted changes were measured as the difference between mean post-test and then-test scores.
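
The three quantities defined above can be written out as a short Python sketch. The function name and the example scores are hypothetical; the scores are invented symptom-scale values chosen only to make the arithmetic easy to follow.

```python
from statistics import mean

def response_shift_summary(pre, post, then):
    """Response shift, observed change and adjusted change from mean scores."""
    response_shift = mean(pre) - mean(then)   # pre-test minus then-test
    observed = mean(post) - mean(pre)         # post-test minus pre-test
    adjusted = mean(post) - mean(then)        # post-test minus then-test
    return {"response_shift": response_shift,
            "observed": observed,
            "adjusted": adjusted}

# Invented fatigue scores (0-100, higher = more fatigue) for three patients
pre = [30, 40, 50]
post = [45, 55, 65]
then = [20, 30, 40]
summary = response_shift_summary(pre, post, then)
```

By construction, the adjusted change equals the observed change plus the response shift, so in this invented example a retrospectively lower baseline fatigue rating enlarges the estimated deterioration.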

Statistical methods

Wilcoxon signed-rank tests for paired differences were used to assess the significance of differences between pre-test, post-test and then-test scores. We divided the patients into groups according to whether they thought they were improved, unchanged or deteriorated for the four domains.

To examine the magnitude of recalibration RS, effect sizes (ES) were calculated by dividing the mean score changes by the standard deviation at baseline (T1). We used Cohen's generally accepted criteria for interpreting the magnitude of an ES: > 0.20 is a small change, > 0.50 a moderate change, and > 0.80 a large change [19].
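
A minimal Python sketch of this effect size calculation and of the thresholds quoted above is given below. It assumes the sample standard deviation of the baseline scores is used as the denominator; the function names and example values are hypothetical.

```python
from statistics import mean, stdev

def effect_size(changes, baseline_scores):
    """Mean score change divided by the standard deviation at baseline (T1)."""
    return mean(changes) / stdev(baseline_scores)  # sample SD is an assumption

def cohen_label(es):
    """Classify the magnitude of an effect size using the thresholds in the text."""
    size = abs(es)
    if size > 0.80:
        return "large"
    if size > 0.50:
        return "moderate"
    if size > 0.20:
        return "small"
    return "trivial"

# Invented score changes and baseline scores for four patients
es = effect_size([10, 5, 0, 5], [20, 40, 60, 80])
label = cohen_label(es)
```

The sign of the effect size reflects the direction of the change; only its absolute value is used for the classification.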

The GRC results and the observed and adjusted changes all appeared to approximately reflect underlying normal distributions. Analysis of variance (ANOVA) was performed and F-statistics values were calculated to see which approach (the seven-point GRC scale, observed changes or adjusted changes) was most efficient at detecting changes in phases of the disease (newly diagnosed, relapse/progression or stable disease). Newly diagnosed patients were expected to improve, relapsed patients to deteriorate and patients with stable disease to stay unchanged. The relative efficiency of a test is measured by the ratio of the F-statistics values [20].
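
The comparison of approaches by relative efficiency can be illustrated with a small Python sketch that computes a one-way ANOVA F-statistic by hand and then takes the ratio of two F values. The function names and the grouped example values are hypothetical and are not study data.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def relative_efficiency(f_a, f_b):
    """Ratio of two F-statistics; values above 1 favour approach A."""
    return f_a / f_b

# Invented change scores grouped by phase of disease
# (newly diagnosed, stable disease, relapse/progression)
observed = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
adjusted = [[0, 1, 2], [2, 3, 4], [4, 5, 6]]
f_observed = f_statistic(observed)
f_adjusted = f_statistic(adjusted)
efficiency = relative_efficiency(f_adjusted, f_observed)
```

In this invented example the adjusted changes separate the three phases more clearly than the observed changes, which shows up as a larger F value and a relative efficiency above 1.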

Missing data

If any item was missing in the first questionnaire (T1), we accepted the data as missing. For the second questionnaire (T2), the forms were checked and if any item was missing the patients were asked to fill it in before the interview. Still, if any of the constituent items in a scale were missing, the scale score for that patient was excluded from the statistical analyses.
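
The exclusion rule described above can be sketched in Python as follows. Missing items are represented as None, which is an assumption of this illustration, and the function names and example responses are hypothetical.

```python
from statistics import mean

def raw_scale_score(items):
    """Mean of the item responses, or None if any constituent item is missing."""
    if any(item is None for item in items):
        return None            # scale score excluded for this patient
    return mean(items)

def analysable(scores):
    """Keep only the scale scores that can enter the statistical analyses."""
    return [score for score in scores if score is not None]

# Invented pain responses (two items) for three patients; one item is missing
pain_items = [[2, 3], [1, None], [4, 4]]
pain_scores = [raw_scale_score(items) for items in pain_items]
usable = analysable(pain_scores)
```

Under this rule a single missing item removes the whole scale score for that patient, rather than the score being computed from the remaining items.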

Sample size calculation

The study primarily aimed to estimate the MID and sample size calculation was based on being able to detect a MID of 0.50 × SD, yielding a sample size of 260 patients. The response shift evaluation is descriptive and so the impact of sample size is indicated by confidence intervals around the estimates.

The statistical analysis was performed using The Statistical Package for the Social Sciences (SPSS), version 16 (SPSS Inc., Chicago, IL, USA).

Results

Study sample

A total of 260 patients were recruited, and 239 (92%) who filled in both questionnaires were interviewed. Of the 21 patients lost to follow-up, seven had died and nine were too ill to complete the questionnaire at T2. For the remaining five cases, the reason for lack of follow-up was administrative problems. Table 1 shows patients' characteristics. Fifty-seven percent of the patients completed the post-test and the then-test within the same or next day while 99% completed them within a week (range 0-22 days). At baseline, 0.6% of the items were missing from the EORTC QLQ-C30 questionnaires, which decreased to 0.3% at follow-up. Missing items appeared randomly distributed across domains.

Table 1 Patient characteristics

EORTC QLQ-C30 mean scores

Overall, for patients who improved, the EORTC QLQ-C30 at post-test showed statistically significantly (p < 0.01) better scores (fewer symptoms and higher functioning) than at pre-test and then-test (Table 2). For patients who deteriorated, the EORTC QLQ-C30 at post-test showed consistently worse scores (more symptoms and lower functioning) than at pre-test and then-test (p = 0.01). There were no significant changes in the EORTC QLQ-C30 score from pre-test to post-test for the unchanged patients. However, for pain and fatigue, patients reporting no change had statistically significantly more symptoms at post-test than at then-test (p < 0.01).

Table 2 EORTC QLQ-C30 scores for the pre-, post- and then-test

Magnitude and direction of RS

Table 3 summarizes the magnitude and direction of RS (pre-test - then-test scores). Overall, there were differences in both the magnitude and direction of RS between patients who improved and those who deteriorated. Patients improving from T1 to T2 retrospectively underestimated their baseline values on all four dimensions. However, a statistically significant score difference (p < 0.01) emerged only for the global quality of life dimension. In contrast, patients who deteriorated retrospectively overestimated their baseline values on the four dimensions. In this group, there was a statistically significant RS for the domains pain, fatigue and physical function (p < 0.01). An illustration of the RS effect for MM patients who deteriorated in fatigue is presented in Fig. 1. In the unchanged group, there was a statistically significant RS for the domains pain and fatigue (p < 0.01). In these domains, the unchanged patients retrospectively overestimated their baseline values.

Table 3 Magnitude and direction of response shift for the entire sample
Figure 1 Observed and adjusted scores of the EORTC QLQ-C30 for MM patients who deteriorated in fatigue (n = 58). The patients evaluated their fatigue in retrospect as better (less fatigue) than they did at T1. The difference between the pre-test and then-test score is the response shift effect.

ESs were largest for those who deteriorated in the domains fatigue, pain and physical function (ESs were 0.49, 0.35 and 0.33, respectively, Table 3). Using Cohen's criteria, all of these ESs could be considered small. There were trivial ESs for the domains fatigue, pain and physical function for patients who improved.

Effect on MID estimates

Table 4 shows the observed (post-test - pre-test) and adjusted (post-test - then-test) mean score changes in the EORTC QLQ-C30 for the four domains. The observed changes are defined as MIDs because patients regard these changes as a definite improvement or deterioration. MIDs (absolute values) for patients rating themselves as improved ranged from 6.2 (physical function) to 14.7 (pain). Patients reporting deterioration had MIDs (absolute values) in the range of 8.6 (fatigue) to 17.3 (pain). However, there was considerable variation in the observed scores, as shown by the wide confidence intervals. When the adjusted mean changes were used as MIDs, the EORTC QLQ-C30 scores varied from 9.3 to 17.5 for improved patients. Patients who deteriorated had adjusted mean change scores in the range of 12.2 to 27.

Table 4 Minimal important differences calculated by observed- and adjusted mean score changes

Efficiency in detecting changes

Phase of disease was classified as newly diagnosed, stable or relapse/progression. Impact of phase of disease on GRC, observed and adjusted mean scores was explored using F-statistics values from ANOVA. There were statistically significant differences at the p < 0.05 level with the largest F-statistics value for GRC, closely followed by adjusted changes (Table 5). For pain, fatigue and physical function the GRC and adjusted changes have a relative efficiency (ratio between F statistics) of approximately three compared to the observed changes, and for global quality of life the relative efficiency is two.

Table 5 Results of the F-statistics from ANOVA for the domains pain, fatigue, physical function and global quality of life, by phase of disease

Discussion

The results of the present study indicate that RS exists in MM patients, mainly in those who deteriorated over the 3-month observation period. We found that patients who deteriorated in the domains pain, fatigue and physical function retrospectively minimized their troubles at baseline. These changes in internalized standards could be a desirable adaptation mechanism that allows patients with cancer to maintain equilibrium in HRQOL in the face of loss.

Our findings are generally consistent with those of previous studies among other categories of cancer patients with deteriorating health conditions [4, 5, 21]. Jansen et al assessed RS in 46 patients with breast cancer undergoing radiotherapy. They found that patients who had deteriorated retrospectively reported fewer symptoms at baseline. They concluded that RS measured by the then-test was stronger for deterioration in HRQOL than for improvement in HRQOL.

For patients who improved, there was no statistically significant evidence of RS except for the domain global quality of life. In RCTs in newly diagnosed patients with MM or cancer in general, patients are usually followed from the start of treatment and the majority of patients are expected to improve [22, 23]. Thus, the RS phenomenon may arguably be disregarded in the interpretation of the HRQOL results from such trials. Our results are in contrast to findings in studies regarding patients with non-fatal disorders, where improved patients retrospectively have reported significantly higher disability [24, 25]. Razmjou et al discussed this issue in a study of patients with total knee arthroplasty and concluded that "it appears that patients who wish to maintain a stable HRQOL would consciously or unconsciously magnify their treatment effect by endorsing a higher disability level retrospectively" [25].

We found some evidence for RS even in patients who were unchanged from T1 to T2, mainly for the domains pain and fatigue. On average, these patients retrospectively underestimated their symptoms. A meta-analysis by Hagedoorn et al [7] concluded that RS is a common and significant phenomenon in HRQOL measurement, and that in cancer studies, patients with a declining HRQOL may report no decrease in their HRQOL due to positive adaptation. This could be an explanation for the findings for the unchanged group in our study.

ESs can be calculated to evaluate the importance of the observed RS. In our study, we found that the ESs of the RS were small according to Cohen's criteria with the largest ES detected for fatigue. Fatigue has been identified by patients with cancer as a major obstacle to normal functioning and a good quality of life [26]. Previous studies have suggested that fatigue is a symptom that is especially RS prone [4, 21].

It is important to know the clinical significance of changes in HRQOL scores for the interpretation of the results from clinical trials. We have previously reported that a difference of 6-17 points (scale range 0-100) in the EORTC QLQ-C30 score represents a clinically meaningful change in patients with MM. In the present study, we found that when RS was controlled for in patients who improved, the same interval for MIDs could be used. However, if we adjust for RS in patients who deteriorated, larger MIDs (12-27 points) are obtained. The question remains: does adjusting for RS provide more reliable estimates of MIDs?

The F-statistics values from ANOVA indicate that the GRC is the most effective method for detecting differences in phase of disease, with RS-adjusted changes being second best. The GRC method accords most with actual clinical practice, in which health-care providers usually rely on patients' judgment of whether they are better, the same or worse. However, the question remains: which is the most meaningful and least biased outcome? The most sensitive outcome could be the most biased. If patients are aware that the phase of their disease is deteriorating, they may be more prone to assuming that their HRQOL must as a consequence be similarly declining, resulting in biased reports of GRC and possibly RS-adjusted changes.

A possible explanation for the discrepancy between the pre-test and then-test assessment is the potential for recall bias. In HRQOL research, recall bias refers to memory distortion; that is, patients incorrectly recalling their health condition at T1 [27, 28]. However, in a study by Visser et al comparing different approaches to detect RS, recall bias did not invalidate then-test results [29]. A factor such as the length of the period between measurements may affect the influence of recall bias. Like Visser and others [9, 29], we used a relatively short interval between assessments (3 months). If we had chosen a shorter interval between pre-test and then-test, the patients could have remembered what they actually answered on the pre-test. A longer period between the initial measurement and the retrospective then-test would pose a considerable challenge to memory. The choice of 3 months in the present study was a compromise between these considerations. Another possible explanation for the observed results could be the "implicit theory of change". This theory suggests that patients begin with their presumed present state (post-test) and work backwards to infer their pre-test state, rather than basing their judgment on their perception of their health at a specific time point [27]. A consequence could be that patients view the decline in their HRQOL as bigger than it actually is because they believe their disease is progressing and that consequently their HRQOL must be deteriorating.

Although RS could be a challenge for the measurement and interpretation of self-reported HRQOL, adaptation to illness could serve as a form of psychological buffer that helps reduce the stressful impact of a deteriorating health status. For most patients, life after being diagnosed with cancer is not the same as before. An important part of every cancer treatment is helping patients to adapt to their illness. Thus, the positive adaptation we found in our study in patients saying that they deteriorated is actually a desired effect for the patients.

We chose to study MM patients because we anticipated large differences in HRQOL score between those who improved and those who deteriorated. A comparison with the results obtained with the EORTC QLQ-C30 in patients with other haematological diseases [30] and in solid cancers [31, 32] indicates that patients' HRQOL is lower in MM than in several other malignant diseases.

The evaluation of external validity is important to enhance the transfer of results into the clinical routine. The strength of our study is that we included an almost representative sample of patients with MM within South-Eastern Norway, although the median age was somewhat lower (66 years) than in a newly published population-based study from Sweden (72 years) [33]. However, the mean EORTC QLQ-C30 scores for the whole sample in our study are comparable to those of a nationally representative study among MM patients in Denmark [30]. Given the representativeness of the patients included, we can expect the results to be relevant to other MM patients. We would also expect these findings to apply to other cancers or other illnesses, and we encourage confirmatory studies to investigate this.

Conclusions

In our study, MIDs estimated from pre-test/post-test data appeared to be robust against RS in patients who improved over 3 months. This could indicate that RS has a minimal impact on the results in patients who respond to treatment, and that RS may not have an important impact on the interpretation of changes reported in clinical trials where an improvement occurs. Although the ESs of the RSs were small, RS in deteriorated patients may augment MID estimates by up to 12 points and may have an important impact on the interpretation of changes reported in clinical trials.