There is an increasing demand for economic evaluations of health care which compare the costs and benefits of interventions in order to identify which provide the greatest health gain per unit of investment. Assessments based on quality-adjusted life years (QALYs) gained are recommended for economic evaluation of new interventions [1] and have been adopted by decision-making organisations in many countries including the United Kingdom [2], Canada [3] and the USA [4]. QALYs are the product of the time spent in a health state multiplied by a utility value, representing quality of life, for that particular health state. Utility is the preference for a health state (rated in the presence of choice) relative to full health (scored 1) and death (scored 0).
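As a minimal sketch of the QALY arithmetic described above, the following Python snippet multiplies the time spent in each health state by its utility and sums the products. The function name and the example values are illustrative assumptions, not data from this study.

```python
def qalys(states):
    """Sum of (years in state x utility of state) over a sequence of health states.

    `states` is an iterable of (years, utility) pairs, with utility on the
    0 (death) to 1 (full health) scale described in the text.
    """
    return sum(years * utility for years, utility in states)


# Hypothetical patient: 2 years at utility 0.5, then 3 years at utility 0.7.
example_qalys = qalys([(2.0, 0.5), (3.0, 0.7)])
```

In this hypothetical case, five calendar years yield roughly 3.1 QALYs.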

Decision makers such as the National Institute for Health and Clinical Excellence (NICE) in the United Kingdom aim to maximise the health of the population. They, therefore, require preference data from a representative sample of the public based on ratings of health states described using standardised validated generic instruments using a choice-based method [5]. This allows interventions for a range of different diseases and specialties to be assessed on a standard scale. Generic instruments developed for this purpose include the EQ-5D and SF-6D. The EQ-5D has a number of country-specific choice-based preference weights including the United Kingdom and the USA, whilst the SF-6D to date has only UK preference weights.

In the United Kingdom, NICE currently suggests that the most appropriate measure is the EQ-5D but recognises that the EQ-5D may not be appropriate in all circumstances. The SF-6D has considerable potential as it can be calculated from both SF-36 [6] and SF-12 [7], which have been routinely collected in numerous studies. The choice of preference-based measure, however, depends on the validity of the measure in that setting. The EQ-5D is one of the most extensively validated measures for use in patients with rheumatoid arthritis (RA) [8]. The SF-6D has been less extensively studied in this setting, but evidence to date suggests the measure has potential [8].

One important test of validity is the ability of a measure to reflect the change in patients over time. The EQ-5D and SF-6D have been shown to be capable of detecting some degree of change in RA patients [9–12]. The responsiveness of the EQ-5D and SF-6D has been compared head-to-head in North American populations, but not to date in UK or European populations. In two studies of North American populations, the SF-6D appeared more responsive than the EQ-5D to improvement in patients’ health [10, 11]. However, other results have been conflicting. In patients with one of a number of rheumatological conditions (51% RA), the EQ-5D was more responsive than the SF-6D to improvement [12]. A recent review of the use of generic utility measures in RA recommended more head-to-head comparisons of the measures in longitudinal studies across the spectrum of RA disease severity [8].

We aimed to compare the responsiveness to change of the EQ-5D and SF-6D in UK patients from a range of studies covering early inflammatory arthritis through to severe RA.


Data were taken from four cohorts of patients:

  1. The Steroids in Very Early Arthritis (STIVEA) randomised controlled trial (RCT) of intramuscular steroid treatment versus placebo in patients with very early inflammatory arthritis (4–11 weeks duration). The trial follow-up finished in late 2007 [13]. At the time of this analysis, the STIVEA trial remained blinded. Therefore, the patients studied comprised patients receiving either active or placebo treatment, but the proportion receiving each allocation is unknown.

  2. British Rheumatoid Outcome Study Group (BROSG) RCT of aggressive versus symptomatic control of inflammation in patients with established (>5 years duration) stable, symptomatic rheumatoid arthritis (RA) followed for 3 years. The BROSG trial was conducted between 1998 and 2001 [14].

  3. A subsample from the British Society for Rheumatology Biologics Register (BSRBR) of RA patients treated with anti-TNF therapy and followed for 6 months. The BSRBR was established in October 2001, and the methods of this study have been described in detail previously [15]. As part of the current study, from 1st August 2006 to 31st December 2007, newly enrolled patients were also asked to complete the EQ-5D at baseline and the 6-month assessment.

  4. A subsample of patients in the control arm of BSRBR, who also received the EQ-5D at baseline and 6-month assessment in the same time period as the anti-TNF treatment cohort. These patients were biologic-naive with active RA (guideline DAS28 >4.2) currently treated with Disease Modifying Anti-Rheumatic Drugs (DMARDs) and were recruited in parallel within the BSRBR and followed up with identical methodology.

Baseline data for all cohorts included age, sex and disease duration. All patients completed the EQ-5D [16] and the SF-36 [17] (used to calculate the SF-6D utility measure [6]) and the Health Assessment Questionnaire (HAQ), a measure of functional disability. A patient global assessment, the 28 tender and swollen joint counts and the erythrocyte sedimentation rate (ESR) were also collected, which enabled the Disease Activity Score (DAS28) [18] to be calculated. In the BSRBR control arm, the composite DAS28 score was frequently reported in isolation without separate 28 tender and swollen joint counts and the ESR. Higher HAQ (range 0–3), DAS28 (range 0–10), tender and swollen joint counts (range 0–28) and ESR denote more severe disease (Table 1). Lower EQ-5D and SF-6D scores denote poorer HRQoL.

Table 1 Summary of outcome measures used in this study

Expectations of improvement/deterioration

These four cohorts reflect a range of arthritis states/severity found in routine practice from first presentation with undifferentiated inflammatory arthritis through long-standing established RA to patients with severe, active disease. The health-related quality of life (HRQoL) of patients with early disease (STIVEA trial) and of patients with active disease (BSRBR anti-TNF treatment arm) was expected to improve. Patients from STIVEA may improve in response to steroid treatment, by natural remission, which may be expected to occur in up to 25% of patients [19, 20], or by adaptation, as improvements in functional disability are often seen in the early stages of RA [21, 22]. Patients in the BSRBR receiving treatments which inhibit the action of TNFα were expected to have dramatically improved outcome [23–25]. Improvement in response to treatment was assessed according to the European League Against Rheumatism (EULAR) response criteria definition based on the DAS28 [26]. Responders were patients achieving a good or moderate EULAR response. Good responders improve by >1.2 units on the DAS28 score and achieve an absolute DAS28 score <3.2 at 6 months. Non-responders improve <0.6 and have a 6-month DAS28 score >5.1. Moderate responders fall in between these definitions.
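The response classification described above can be sketched as a small Python function. This is an illustration only: it encodes the thresholds exactly as worded in this paragraph, the function names are our own, and the published EULAR criteria [26] should be consulted for boundary cases.

```python
def eular_response(das28_baseline, das28_6m):
    """Classify response using the thresholds as worded in the text.

    Assumption: "improvement" is baseline DAS28 minus 6-month DAS28, so a
    positive value means lower disease activity at follow-up.
    """
    improvement = das28_baseline - das28_6m
    if improvement > 1.2 and das28_6m < 3.2:
        return "good"
    if improvement < 0.6 and das28_6m > 5.1:
        return "none"
    # Everything between the two definitions above is treated as moderate.
    return "moderate"


def is_responder(das28_baseline, das28_6m):
    """Responders are patients with a good or moderate response."""
    return eular_response(das28_baseline, das28_6m) in ("good", "moderate")
```

For example, a hypothetical patient moving from a DAS28 of 6.0 to 3.0 would be classed as a good responder, and one moving from 6.0 to 5.8 as a non-responder.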

Patients with long-standing established (BROSG) and severe disease (BSRBR control) receiving DMARD treatment were expected to experience disease progression which would be reflected by deteriorating HRQoL. Successful DMARD treatment is expected to slow disease progression [27], but patients with long disease duration are less likely to respond to DMARD treatment [28, 29].

Change in EQ-5D and SF-6D over the first year of the BROSG trial was assessed in relation to the EuroQol ‘feelings thermometer’ visual analogue scale (EQ-VAS). The EQ-VAS asks the respondent to indicate “how good or bad is your health today, in your opinion” on a vertical 0–100 scale. Change between baseline and 1-year follow-up was calculated as a percentage change using the formula ((EQ-VAS2 − EQ-VAS1)/EQ-VAS1) × 100. The percentage change, whether improvement or deterioration, was then defined as small if it was between 20 and 50%, similar to the methods of Marra et al. [10].
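A minimal Python sketch of this anchor, assuming strict inequalities at the 20% and 50% bounds, is given below. The function names and the "unclassified" label are hypothetical, and a baseline EQ-VAS of zero is not handled because the percentage change would be undefined.

```python
def vas_percent_change(vas_baseline, vas_followup):
    """Percentage change in EQ-VAS: ((EQ-VAS2 - EQ-VAS1) / EQ-VAS1) x 100."""
    return (vas_followup - vas_baseline) / vas_baseline * 100.0


def classify_vas_change(vas_baseline, vas_followup, lower=20.0, upper=50.0):
    """Label a patient by small improvement or deterioration on the EQ-VAS."""
    pct = vas_percent_change(vas_baseline, vas_followup)
    if lower < pct < upper:
        return "improved"
    if -upper < pct < -lower:
        return "deteriorated"
    # Changes of 20% or less, or of 50% or more, fall outside both groups.
    return "unclassified"
```

A hypothetical patient scoring 50 at baseline and 65 at follow-up has a 30% increase and would be labelled as improved.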

Hypotheses were formulated on the basis of expected improvement or deterioration, and the magnitude of expected change within these groups was estimated using ‘benchmark’ criteria for effect sizes of small (ES = 0.2), moderate (ES = 0.5) and large (ES = 0.8) [30].

We expected

  1. moderate EQ-5D and SF-6D improvements (ES ~ 0.5) in STIVEA patients based on the known improvement in symptoms of patients in early arthritis (for example improvements in HAQ, pain and SF-36 HRQoL), steroid treatment which reduces inflammation and may slow the progression of disease, and the possibility of disease remission [22, 31, 32]. Only half of the STIVEA patients will have been randomised to steroid treatment as part of the trial.

  2. moderate to large improvements in EQ-5D and SF-6D (ES > 0.5) in BSRBR patients. In a US study of arthritis patients receiving infliximab, an anti-TNFα treatment, patients were shown to have moderate (EQ-5D 0.6) to large (SF-6D 1.4) ES [11].

  3. small (ES ~ 0.2) improvement or deterioration in patients reporting changes in health over 12 months of the BROSG trial. In a study using similar VAS defined and self-reported improvement and deterioration, ES for deterioration ranged from −0.24 (EQ-5D) to −0.55 (SF-6D) and improvement ranged from 0.36 (EQ-5D) to 0.54 (SF-6D) [10]. However, these estimates were based on groups with no limit for deterioration or improvement. Our definition of 20–50% improvement/deterioration was restrictive; therefore, it is likely the ES will be lower.

  4. small deterioration (ES ~ 0.2) in the BROSG and BSRBR control groups. The progression and duration of arthritis is associated with small gradual increases in functional disability (approximately 0.033 per annum) and accumulated joint destruction [31]. The reduction in EQ-5D and SF-6D scores would be expected to mirror the gradual increase in burden of disease.

Statistical analysis

We treated responsiveness as a part of the validation process of an outcome measure which requires longitudinal data and methods distinct from other techniques used to assess other types of validity [33]. We defined responsiveness using the effect size (ES) and standardised response mean (SRM) [34]. Both provide a ratio of signal (mean change) to noise (standard deviation). ES for this study was calculated using a formula based on Cohen’s d, \( d = (\bar{x}_{1} - \bar{x}_{2})/s \). The source of the standard deviation (s) in Cohen’s formula is not specified, as the true standard deviation is assumed to remain the same regardless of the mean of the population. The ES in this study used mean change between baseline and a follow-up assessment and the standard deviation of the group at baseline: \( \mathrm{ES} = (\mu_{1} - \mu_{2})/\sigma_{\text{baseline}} \), where \( \mu_{1} \) = mean at follow-up, \( \mu_{2} \) = mean at baseline and \( \sigma_{\text{baseline}} \) = standard deviation of the group at baseline. The SRM is calculated in the same way as the ES, although the standard deviation (s) is the standard deviation of the change between baseline and follow-up instead of the baseline standard deviation: \( \mathrm{SRM} = (\mu_{1} - \mu_{2})/\sigma_{\text{change}} \), where \( \mu_{1} \) = mean at follow-up, \( \mu_{2} \) = mean at baseline, and \( \sigma_{\text{change}} \) = standard deviation of the group change between baseline and follow-up.
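The two statistics defined above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the analysis code used in the study: it assumes paired scores listed in the same patient order, uses the sample (n − 1) standard deviation, and the example utilities are hypothetical.

```python
from statistics import mean, stdev


def effect_size(baseline, followup):
    """ES: mean change divided by the standard deviation at baseline."""
    return (mean(followup) - mean(baseline)) / stdev(baseline)


def standardised_response_mean(baseline, followup):
    """SRM: mean change divided by the standard deviation of the change."""
    changes = [f - b for b, f in zip(baseline, followup)]
    return mean(changes) / stdev(changes)


# Hypothetical utility scores for four patients (not study data).
baseline_scores = [0.4, 0.5, 0.6, 0.7]
followup_scores = [0.5, 0.7, 0.6, 0.8]

es = effect_size(baseline_scores, followup_scores)
srm = standardised_response_mean(baseline_scores, followup_scores)
```

With these hypothetical scores, the same mean change of 0.1 gives a larger SRM than ES, because the changes vary less between patients than the baseline scores do.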

The Pearson product-moment correlation was used to calculate the correlation of change between measures. The comparative strength of correlations between a disease-specific outcome measure and the EQ-5D and SF-6D utility measures was compared using Steiger’s Z test for two correlated correlation coefficients [35, 36].
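For readers unfamiliar with the comparison of two correlated correlations, the sketch below shows one commonly cited form of Steiger’s Z for two correlations that share a variable (for example, change in HAQ correlated with change in EQ-5D and with change in SF-6D). The formula is written from the general literature and should be checked against the cited sources [35, 36] before use; the function and argument names are our own.

```python
import math


def steiger_z(r_xy, r_xz, r_yz, n):
    """Z statistic for H0: corr(x, y) == corr(x, z) in the same n patients.

    r_yz is the correlation between the two measures being compared.
    """
    z_xy = math.atanh(r_xy)  # Fisher r-to-z transformation
    z_xz = math.atanh(r_xz)
    r_bar_sq = ((r_xy + r_xz) / 2.0) ** 2
    # Term reflecting the dependence between the two correlations.
    psi = r_yz * (1 - 2 * r_bar_sq) - 0.5 * r_bar_sq * (1 - 2 * r_bar_sq - r_yz ** 2)
    s = psi / (1 - r_bar_sq) ** 2
    return (z_xy - z_xz) * math.sqrt(n - 3) / math.sqrt(2 - 2 * s)


def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))
```

A positive Z indicates that the first correlation is the stronger of the two; equal correlations give Z = 0 and a p-value of 1.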

Floor and ceiling effects (the percentage of patients occupying the worst/best health states) were calculated and considered small if ≤15% of patients occupy the worst and best health states, respectively, and serious if >15% of patients occupy these states. These criteria have been used previously in reviews of outcome measures in musculoskeletal disease [8, 37].
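The floor and ceiling criteria above can be illustrated with the following Python sketch. The function name, the returned dictionary keys and the example domain levels are hypothetical; the 15% threshold is the one stated in the text.

```python
def floor_ceiling_effects(scores, worst, best, threshold_pct=15.0):
    """Percentage of patients at the worst and best states, with labels.

    An effect is labelled "serious" if more than `threshold_pct` percent of
    patients occupy the state, and "small" otherwise.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("no scores supplied")
    floor_pct = 100.0 * sum(1 for s in scores if s == worst) / n
    ceiling_pct = 100.0 * sum(1 for s in scores if s == best) / n

    def label(pct):
        return "serious" if pct > threshold_pct else "small"

    return {
        "floor_pct": floor_pct,
        "floor": label(floor_pct),
        "ceiling_pct": ceiling_pct,
        "ceiling": label(ceiling_pct),
    }


# Hypothetical domain levels for ten patients (1 = no problems, 3 = worst).
example_levels = [3, 3, 2, 1, 2, 2, 2, 2, 2, 2]
example_effects = floor_ceiling_effects(example_levels, worst=3, best=1)
```

In this hypothetical domain, 20% of patients are at the floor (serious) and 10% are at the ceiling (small).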


Baseline characteristics

The study population consisted of 466 patients from the BROSG trial, 182 patients from STIVEA, 223 patients from the BSRBR register and 188 from the BSRBR comparison cohort. One hundred and eighty-eight (84%) of the BSRBR patients received adalimumab and 35 (16%) received infliximab. The disease duration of patients ranged from 7.8 weeks (s.d. 2.6) in the STIVEA trial to 13.4 years (s.d. 11.5) in the BSRBR. There were differences in demographic and clinical characteristics between the four groups of patients (Table 2).

Table 2 Baseline characteristics of patients from the four cohorts, ordered by mean utility score

Patients in the BROSG study had the highest mean EQ-5D scores (mean 0.59, s.d. 0.22) followed by the BSRBR control arm (mean 0.55, s.d. 0.27), STIVEA (mean 0.46, s.d. 0.31) and the BSRBR group (mean 0.34, s.d. 0.33). The pattern was the same for SF-6D scores, but these scores were consistently higher than EQ-5D scores; mean scores ranged from 0.64 (s.d. 0.13) in BROSG to 0.50 (s.d. 0.09) in the BSRBR treatment arm.

Change over time

A EULAR response could be calculated for 161 BSRBR patients (72%) and 171 STIVEA patients (94%). One hundred and thirty-two (82%) of the BSRBR patients and 135 (79%) of the STIVEA patients were responders. Over half of the STIVEA responders (55%) and about one-third of the BSRBR responders (35%) were good responders.

A total of 436 out of 466 patients in the BROSG trial attended a 12-month follow-up and completed the EQ-VAS. Eighty-one patients (19%) had an EQ-VAS score worse (>20 & <50%) than baseline at the 1-year follow-up, and 62 patients (14%) had an EQ-VAS score better (>20 & <50%) than at baseline. The improvers were called BROSG(I) and the deteriorators BROSG(D). The proportion of patients from each treatment arm of the BROSG trial was similar in each group. Four hundred and six patients completed 3 years of follow-up in the BROSG trial.

Patients in the BROSG trial and BROSG(D) and the BSRBR control arm deteriorated over the period of follow-up; mean change in EQ-5D ranged from −0.05 to −0.13, and SF-6D ranged from −0.01 to −0.04 (Table 3). Deterioration in SF-6D in the BSRBR control arm was minimal (mean −0.01, s.d. 0.09). The HAQ scores for all these groups deteriorated (mean 0.09–0.16). The DAS28 scores did not show consistent direction of change in these groups, worsening only in the BROSG(D) reported group.

Table 3 Mean change over time (s.d.) in each of the groups of patients

In BROSG(I) patients and those in the BSRBR and STIVEA trial, the EQ-5D (mean 0.06–0.20) and SF-6D (mean 0.03–0.13) indicated improvement. All other outcome measures reflected this improvement over the follow-up period apart from HAQ, which deteriorated in the BROSG(I) patients (0.07). The improvement in patients in the BSRBR and STIVEA studies was considerable for all outcome measures.

ES and SRM

The hypothesised magnitude of change, based on the ES, for each of the 6 groups of patients defined by direction of expected change, was equalled or exceeded on 5 occasions by the EQ-5D, and on 4 occasions by the SF-6D (Table 4). The ES for the EQ-5D in patients in the BSRBR group (ES = 0.46) was slightly smaller than the hypothesised moderate response (ES ~ 0.5). The ES for the SF-6D was smaller than expected for patients in the BROSG group (ES = 0.15) and the BSRBR control group (ES = 0.08), where a small effect size was expected (ES ~ 0.20); the latter group had a very small effect size. The ES for patients reporting a deterioration over 1 year of follow-up [BROSG(D)] was larger than anticipated for both the EQ-5D (ES = 0.62) and SF-6D (ES = 0.35). The ES for improvement in patients from STIVEA (EQ-5D ES = 0.64, SF-6D ES = 0.97) unexpectedly exceeded those for BSRBR (EQ-5D ES = 0.46, SF-6D ES = 0.82).

Table 4 Responsiveness of the EQ-5D and SF-6D to change in each of the groups, ordered by increasing magnitude of change (EQ-5D)

Responsiveness, whether assessed by the ES or SRM, yielded largely similar results (Table 4). However, the comparative responsiveness of the EQ-5D and SF-6D differed according to the direction of change. When health deteriorated over the follow-up period, the EQ-5D was consistently more responsive than the SF-6D. The EQ-5D was most notably more responsive than the SF-6D in the BSRBR control group (ES ratio 3.0); the SF-6D failed to respond to deterioration [mean change −0.01 (0.09)] in this group. All ES ratios indicated that the EQ-5D was more than 1.5 times more responsive to deterioration. In contrast, when patients improved over follow-up, the SF-6D was more responsive than the EQ-5D, particularly in the STIVEA (ES ratio 1.5) and BSRBR groups (ES ratio 1.8) where a large improvement was detected.
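The ES ratios quoted above can be reproduced from the effect sizes reported in the text. The short Python sketch below assumes that the ratio is simply the larger ES divided by the smaller; the function name and the dictionary layout are our own.

```python
# Effect sizes (magnitudes) as reported in the text for three of the groups.
reported_es = {
    "STIVEA": {"EQ-5D": 0.64, "SF-6D": 0.97},
    "BSRBR": {"EQ-5D": 0.46, "SF-6D": 0.82},
    "BROSG(D)": {"EQ-5D": 0.62, "SF-6D": 0.35},
}


def es_ratio(es_by_measure):
    """Return (more responsive measure, larger ES / smaller ES rounded to 1 dp)."""
    larger = max(es_by_measure, key=es_by_measure.get)
    smaller = min(es_by_measure, key=es_by_measure.get)
    return larger, round(es_by_measure[larger] / es_by_measure[smaller], 1)


ratios = {group: es_ratio(values) for group, values in reported_es.items()}
```

This gives ratios of 1.5 (STIVEA) and 1.8 (BSRBR) in favour of the SF-6D, and 1.8 in favour of the EQ-5D for BROSG(D).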


The correlation of change in EQ-5D and SF-6D ranged from 0.25 in the BROSG 12-month deterioration group to 0.48 in the STIVEA patients (Table 5). The change in SF-6D correlated more strongly than the change in EQ-5D with change in patient EQ-VAS rated health in all of the cohorts apart from the BSRBR control cohort, where correlations were equal. Change in DAS28 and its components (tender and swollen joint counts, ESR) was generally more strongly correlated with change in SF-6D score than change in EQ-5D score. Similarly, change in HAQ in STIVEA and BSRBR was significantly more strongly correlated with the change in SF-6D score than change in EQ-5D score. The SF-6D was significantly more strongly correlated than the EQ-5D with the DAS28 in the BROSG 12-month improvement group.

Table 5 Correlation of change between outcome measures

Floor and ceiling effects

Overall floor and ceiling effects of the EQ-5D and SF-6D were small in this study (Table 6). No patient scored at the floor of the EQ-5D, and 2% or fewer scored at the floor of the SF-6D. However, floor effects existed for individual domains. Floor effects for EQ-5D pain/discomfort were small in the BROSG (7%) and BSRBR control groups (9%), but serious in STIVEA (26%) and BSRBR (39%). Serious floor effects were also evident in the role limitation subscale of the SF-6D (32–65%), the vitality subscale (19–36%) and in the physical functioning scale (18–24%) in the BSRBR, BSRBR control and STIVEA groups.

Table 6 Floor/ceiling effects for the EQ-5D and SF-6D

There were no serious ceiling effects for the EQ-5D (<1–8%) and no patient scored at the ceiling of the SF-6D. However, serious ceiling effects existed in the self-care (22–49%) and anxiety/depression (44–66%) EQ-5D domains for all groups, in the usual activities domain for STIVEA (15%), BROSG (22%) and the BSRBR control (17%) groups, and mobility in BROSG (22%), BSRBR control (23%) and STIVEA (36%). For the SF-6D, ceiling effects were serious in all groups for the social functioning subscale (18–45%) and in all groups but STIVEA (12%) for the mental health subscale (16–27%). In addition, there were small to serious (7–27%) ceiling effects in the mobility subscale.


This study is the first to compare the responsiveness of the EQ-5D and SF-6D to longitudinal changes in UK RA patients with different expected disease trajectories. These ranged from patients with early disease expected to improve through to patients with severe long-standing disease expected to deteriorate. Our results have highlighted key differences in the ability of the EQ-5D and SF-6D to measure change. Of note, the EQ-5D was more responsive to deterioration in health than the SF-6D, whereas the SF-6D was more responsive to improvement. The SF-6D was unable to detect further deterioration in a group of patients with already severe disease.

The finding that the EQ-5D is more responsive to deterioration than the SF-6D is in keeping with all previous reports in the literature [10, 12]. Similarly, the greater responsiveness of the SF-6D to improvement supports the majority of previous findings [10, 11]. All previous studies have suggested that both measures are generally responsive to change in the RA patient [8]. However, the SF-6D has a clear limitation in severe RA patients, which has not previously been demonstrated.

The ability of the SF-6D to detect change is thought to be inhibited by the high floor of the measure. However, in this study, few patients scored at the floor of the SF-6D in any of the cohorts, although within-domain floor effects were considerable for the role limitation, vitality and physical functioning domains. These domains relate to key aspects of the limitation caused by RA, and the floor effects may explain the lack of response when patients deteriorated further in the BSRBR control cohort. These floor effects were severe but smaller in the groups where the SF-6D was responsive to deterioration. However, the correlations of change in the SF-6D with change in the HAQ and DAS28 scores were stronger than corresponding correlations with the EQ-5D. This suggests that the superior responsiveness of the EQ-5D to deterioration in health in this group is related to some aspect other than functional disability or disease activity. The EQ-5D had no floor effects in the overall utility score or any domains in the BSRBR control group, providing scope for detecting extra deterioration in all aspects of disease measured by this instrument in these patients. The domains of the SF-6D with floor effects (role limitation, physical functioning and vitality) are likely to be captured by the self-care and usual activities domains of the EQ-5D, which have no floor effects.

The mean change of the EQ-5D exceeded the mean change of the SF-6D in all of the cohorts used in this study. This has implications for the use of responsiveness statistics and for using the measure in cost-effectiveness analyses. The EQ-5D was less responsive than the SF-6D to improvement despite the larger mean change using the EQ-5D, highlighting the variance around the measure. The SF-6D has smaller increments between scoring levels than the EQ-5D, which allows patients to report smaller improvements. This may explain why the SF-6D is more responsive to small but important improvements. The SF-6D shows relatively small absolute change but has a small standard deviation [11, 12], which leads to good responsiveness statistics. Changes in a single domain of the EQ-5D can result in changes of 0.036–0.655 in the overall utility score. Therefore, change in a small number of patients can lead to a large group mean change effect. The impact on the overall EQ-5D scores is largest when a domain is scored at the most severe level for the first time. This attracts both a reduction in utility associated with the change in domain (range 0.094–0.386) and a reduction of 0.269 for the first domain scored as severe, known as the N3 term.
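The arithmetic behind these figures can be sketched as follows, using only the values quoted above. We assume that the upper end of the 0.036–0.655 range arises from the largest domain reduction (0.386) plus the N3 term (0.269); the constant names, the helper function and the group of 100 patients are hypothetical.

```python
# Values quoted in the text.
N3_TERM = 0.269                   # applied the first time any domain is scored as severe
LARGEST_DOMAIN_REDUCTION = 0.386  # upper end of the quoted 0.094-0.386 range
SMALLEST_DOMAIN_CHANGE = 0.036    # lower end of the quoted 0.036-0.655 range

# Assumed composition of the largest single-domain change.
largest_single_domain_change = LARGEST_DOMAIN_REDUCTION + N3_TERM


def group_mean_change(n_patients, n_changed, individual_change):
    """Mean change in a group where only `n_changed` patients change at all."""
    return n_changed * individual_change / n_patients


# Hypothetical: 5 of 100 patients move to the severe level in one domain.
example_mean_change = group_mean_change(100, 5, -largest_single_domain_change)
```

Under these assumptions, a change in only 5% of patients shifts the group mean by about −0.03, which is of the same order as several of the mean SF-6D changes reported in this study.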

The larger mean change in improving and deteriorating patients suggests that an intervention will be more likely to be seen as cost-effective if assessed using the EQ-5D rather than the SF-6D. In cost-effectiveness analysis, the incremental cost of an intervention is divided by its incremental effectiveness, assessed using a measure such as the EQ-5D or SF-6D. The larger the effect estimate, the lower the cost per unit of effect. A recent study in RA reported that change estimated using the EQ-5D resulted in a cost per QALY over 50% lower than the cost per QALY calculated using the SF-6D [38]. However, as the mean effect is only an estimate, the uncertainty around the estimate must be presented. The smaller variance of the SF-6D should result in less uncertainty concerning the relative cost-effectiveness of two treatments.
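The dependence of the cost per QALY on the size of the effect estimate can be illustrated with a short Python sketch. The incremental cost and the two QALY gains are hypothetical numbers chosen for illustration; they are not estimates from this study or from [38].

```python
def cost_per_qaly(incremental_cost, incremental_qalys):
    """Incremental cost divided by incremental effectiveness (QALYs gained)."""
    if incremental_qalys == 0:
        raise ValueError("incremental effect is zero; the ratio is undefined")
    return incremental_cost / incremental_qalys


# Hypothetical: the same incremental cost, with a larger QALY gain estimated
# by one utility measure than by another.
hypothetical_cost = 3000.0
ratio_larger_effect = cost_per_qaly(hypothetical_cost, 0.10)
ratio_smaller_effect = cost_per_qaly(hypothetical_cost, 0.05)
```

Here, doubling the estimated QALY gain halves the cost per QALY, from about 60,000 to about 30,000 in whatever currency the costs are expressed.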

There are weaknesses in the use of responsiveness statistics. There is an array of such statistics, and different methods may lead to different conclusions [33, 39, 40]. To date, no measure has been proven conclusively to be superior to another. Furthermore, responsiveness statistics are limited in the information they convey. They give an indication of whether a measure can detect a statistically significant difference between two groups. However, statistical significance is dependent on factors external to the measure under study such as sample size and does not indicate whether the change detected is meaningful or useful [33, 41]. The ES and SRM express change in terms of the standard deviation and provide a useful indication of the relative sample sizes required to detect statistically significant difference between groups [39]. The basis of ES and SRM is that relevant change should exceed random noise or the variability in unchanged patients. These measures used without an anchor of important change give no information about the ability of the instrument to measure change in the underlying construct [33] and essentially are measures of sensitivity. We used the responsiveness measures alongside some external reference of change, for example the change in EQ-VAS score, or response to treatment. This was not possible for the overall BROSG data or for the BSRBR control cohort, and we, therefore, cannot assume that all patients in these cohorts deteriorated; the HAQ score for these cohorts suggested deterioration of functional disability approaching the minimally important difference for clinical practice [42]; however, the DAS28 scores in the BSRBR control cohort suggested some improvement in disease activity from the high baseline level.
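The link between the ES and relative sample size mentioned above can be illustrated with a rough Python sketch. It relies on the standard approximation that, for fixed significance level and power, the required sample size is inversely proportional to the square of the standardised effect; the function name is our own, and the result is indicative only.

```python
def relative_sample_size(es_reference, es_comparator):
    """Approximate ratio of sample sizes, comparator relative to reference.

    Assumes required n is proportional to 1 / ES**2 at fixed alpha and power.
    """
    return (es_reference / es_comparator) ** 2


# ES for improvement in STIVEA as reported in this study:
# SF-6D 0.97 (reference) and EQ-5D 0.64 (comparator).
stivea_ratio = relative_sample_size(es_reference=0.97, es_comparator=0.64)
```

On this approximation, detecting the improvement seen in STIVEA with the EQ-5D would need roughly 2.3 times as many patients as with the SF-6D.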

Comparison of change in different cohorts was limited by the different follow-up periods used. The analysis relied on data collected concurrently within each study and was therefore limited by the design of each cohort. Patients in the BSRBR treatment and control studies were only followed for 6 months. This may be sufficient to capture the large expected improvements in patients treated with anti-TNF therapy, but may be insufficient to capture clinically meaningful deterioration in patients in the control arm continuing with traditional treatment of RA. However, the change in EQ-5D for this latter group of patients was in excess of estimates of the minimum important difference for this measure [8].

A further limitation of the ES is that in highly selected groups of patients, the ES may be artificially inflated. The BROSG and STIVEA trials by definition were selected groups of patients. It was therefore important to use the SRM to verify the results based on the ES. In only one instance were the conclusions based on the ES and SRM conflicting, and where this occurred, the difference between responsiveness of the EQ-5D and SF-6D was marginal. Finally, the methods of the ES and SRM assume that all patients change in the same direction [39]. It is likely that there is some misclassification of change between the anchors and some of the outcome measures. This may be particularly true for the EQ-VAS scale which frames a patient’s health on the day in question; therefore, change in EQ-VAS assesses the difference between a person’s health on 2 days, 1 year apart. The framing of the question gives considerable potential for transient and possibly trivial factors to influence the rating of change in health. However, the design of this study aimed to classify patients on the basis of important change and only compared responsiveness of the EQ-5D and SF-6D in patients changing in the same direction by more than a certain amount.

The ability to measure change in the RA patient is indicative of longitudinal construct validity [43]. Concerns have been voiced about the ability of the EQ-5D to measure change due to its bimodal distribution, crude scoring and possible ceiling effects within domains [11, 12, 44, 45]. These issues were evident in the data used in this study, but the EQ-5D appeared to respond to both improvement and deterioration. The EQ-5D was more responsive to deterioration, and the SF-6D more responsive to improvement in patients with inflammatory arthritis. The SF-6D does not appear appropriate for use in patients with established severe RA, who are expected to experience disease progression. Responsiveness of a measure affects the power of a given sample size to detect a statistically significant difference. As an outcome measure in an epidemiological setting, the SF-6D requires a smaller sample size than the EQ-5D to detect improvement in patients whose health is getting better. The opposite is true in worsening patients. In economic analysis, however, the approach is different. The incremental cost-effectiveness ratio (ICER) is the primary estimate of cost-effectiveness of an intervention, and if this value is less than a decision-maker’s willingness to pay, then the intervention should be adopted [46, 47]. The EQ-5D consistently provided larger mean change estimates than the SF-6D, even when less responsive than the SF-6D (due to the greater variance around the EQ-5D), which would result in a more optimistic incremental cost-effectiveness ratio.


The results from the four cohorts of patients used in this study demonstrate that the comparative responsiveness of the EQ-5D and SF-6D differs according to the direction of change; the EQ-5D was more responsive to deterioration in health than the SF-6D, whereas the SF-6D was more responsive to improvement. The level of mean change of the EQ-5D, which is consistently larger than that of the SF-6D, has potentially serious implications for decision-making on the basis of cost-effectiveness analysis; the EQ-5D is likely to provide more optimistic cost-effectiveness ratios than the SF-6D. Our results support the responsiveness of the EQ-5D to improvement and deterioration across a range of arthritis states/severity. The SF-6D was responsive to improvement in cohorts of patients with a range of arthritis severity and to deterioration in patients with established stable disease; however, use of the SF-6D in patients with severe progressive disease may be inappropriate.