Key Points for Decision Makers

EQ-5D-3L systematically overestimates health problems when compared with EQ-5D-5L, leading to biased utilities.

This may impact quality-adjusted life-year (QALY) estimations when using EQ-5D-3L, leading to over- or underestimations of QALYs gained, depending on the condition and condition severity.

QALY calculations based on EQ-5D-5L data will result in more accurate estimates.

1 Introduction

The EQ-5D is a widely used preference-accompanied measure of health, developed to inform and improve decision making in health care [1]. EQ-5D value sets are a major feature of the EQ-5D instrument, facilitating the calculation of quality-adjusted life-years (QALYs) for use in economic evaluation of health care interventions. The original EQ-5D with three response levels (‘3L’) was revised and refined into the EQ-5D-5L (‘5L’) including five response levels [2]. Most studies comparing 3L with 5L demonstrated improved measurement properties with 5L, using cross-sectional data, assessing feasibility, distributional characteristics, inconsistencies, informativity and validity [3]. However, head-to-head comparisons of sensitivity to change (i.e., responsiveness) are rare [4,5,6,7] and use a variety of methods, leading to inconclusive results. Responsiveness to health changes is especially relevant in intervention and prevention studies where capturing improvement in health (or the absence of deterioration) is at the forefront.

The inconclusive evidence suggested that 5L does not necessarily result in better responsiveness than 3L. In a sample of stroke patients, Golicki et al. found 5L values to be slightly less responsive than 3L values [4], while Jia et al. found 5L and 3L to be equally responsive in a sample of hepatitis B patients [5]. Rather than using values, Buchholz et al. compared the responsiveness of 3L and 5L focusing on the descriptive systems, and found that 5L performed better in all comparisons [6]. The inconclusive findings when comparing values could be due to the use of interim (‘crosswalk’) values in both studies, based on a mapping function that converts 5L descriptive system scores into values based on 3L value sets [4, 5]. Empirical evidence showed that values based on mapping functions are distributed differently from values derived from valuation studies [8]. The only study available using value sets derived from standard 5L valuation studies demonstrated 5L to be more responsive than 3L in cataract surgery patients [7].

EQ-5D consists of five one-item dimensions of health: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression, with 3L dimensions describing levels of problems as ‘no’, ‘some/moderate’, and ‘unable to/extreme’ problems (‘confined to bed’ for mobility), and 5L as ‘no’, ‘slight’, ‘moderate’, ‘severe’, and ‘unable to/extreme’ problems. Value sets, providing values for each health profile, are available for many countries for both 3L and 5L, reflecting the societal preferences for the corresponding country [9].
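As an illustration of the two descriptive systems, the following minimal Python sketch enumerates every possible health profile for a five-dimension instrument with three or five levels. The names `DIMENSIONS` and `all_profiles` are ours and purely illustrative; they are not part of any EQ-5D software.

```python
from itertools import product

# The five EQ-5D dimensions as listed in the text.
DIMENSIONS = (
    "mobility",
    "self-care",
    "usual activities",
    "pain/discomfort",
    "anxiety/depression",
)


def all_profiles(n_levels):
    """Return every possible profile as a tuple of levels (1 = no problems)."""
    levels = range(1, n_levels + 1)
    return list(product(levels, repeat=len(DIMENSIONS)))


profiles_3l = all_profiles(3)  # 3 ** 5 = 243 profiles
profiles_5l = all_profiles(5)  # 5 ** 5 = 3125 profiles
print(len(profiles_3l), len(profiles_5l))
```

A value set assigns one value to each of these profiles, which is why 5L value sets cover far more health states than 3L value sets.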

Our study aim was to perform a more robust and comprehensive comparison of 3L versus 5L responsiveness, by comparing 3L with 5L using two independent patient samples, applying nine country-specific value set pairs. We built on a recently developed framework [10], separating the potential contribution of descriptive and value components in responsiveness. There, a bias in 3L was detected, with 3L systematically overreporting health problems when compared with 5L in nine study samples across six countries. Under the assumption that more levels lead to less measurement bias, while also taking the validity of 5L level descriptors into account [2], we concluded that 5L leads to more accurate estimations of health on the individual and group level. For example, we found that most respondents scoring the middle-level descriptor on 3L (‘some/moderate’) did not select the middle level on 5L (‘moderate’) but the milder level (‘slight’). For aggregate results, this led to an overestimation of self-reported health problems on 3L in comparison with 5L, translating into an underestimation of aggregate values (utilities).

2 Methods

2.1 Paired 3L–5L Descriptive Data

Two longitudinal datasets were used: a German sample of inpatient rehabilitation patients (n = 225) and a Polish sample of stroke patients (n = 112) [4, 6]. The rehabilitation sample was tested at baseline and at the end of rehabilitation (follow-up), while the stroke sample was tested 1 week (baseline) and 4 months (follow-up) post stroke. Respondents were asked to complete both 3L and 5L as part of a larger paper survey (Table 1). In rehabilitation patients, the order of 3L and 5L was randomized, and that sequence was maintained across time points. In the stroke sample, the order was fixed, always starting with 5L. Only data of patients who fully completed both 3L and 5L at both time points were included.

Table 1 Patient characteristics of the two study samples

The two patient samples represent different patterns of health and potential health change. This may affect responsiveness if a large share of observations lies at the ‘tipping point’ between two levels in 3L, but not in 5L.

2.2 Pairs of 3L–5L Value Sets

Nine pairs of 3L and 5L country-specific value sets were included: Canada, China, England/United Kingdom, Germany, Japan, The Netherlands, Korea, Poland, and Spain [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28]. The same value sets were selected as for our previous study [10], with the addition of Germany and Poland, appropriate for the study samples. Most 3L valuation studies followed similar protocols, although there were differences in the sampling of respondents (affecting representation), sample size and health state selection [29, 30]. With the introduction of 5L, the EuroQol Valuation Technology Platform (EQ-VT) was developed, a standardized valuation protocol for uniform data procurement [31]. In addition to standardization of a computer-assisted personal interview mode of administration, health state selection, and valuation methodology, a protocol of interviewer training and quality control during data collection was implemented [32]. For the United States (US), instead of using the recommended (separately developed) national value sets, 3L and 5L value sets were included that were derived for methodological purposes via EQ-VT from a single common sample [33], eliminating any potential effects induced by different protocols, study sample, valuation technique, or interviewers. The US values allowed for further assessment of the separate impact of descriptive results and values on responsiveness.

2.3 Analysis

2.3.1 Descriptive Cross-Sectional Analysis

Descriptive 3L and 5L statistics were calculated on the cross-sectional data (baseline and follow-up separately). The number of unique health profiles was determined for 3L and 5L in both patient samples. Next, we compared level sum scores (LSS) between 3L and 5L, by dimension. Recoding was applied to arrive at commensurability across levels: no problems = 0 (3L/5L), slight problems = 1 (5L), some/moderate problems = 2 (middle level 3L/5L), severe problems = 3 (5L), and extreme problems/unable to = 4 (most severe level 3L/5L). Dimension-specific LSS differences were ‘standardized’ by dividing absolute differences between 3L and 5L dimensions by sample size and the maximum possible level value (i.e., 4). The overall difference was calculated by summing the differences across dimensions and additionally dividing by the number of dimensions (i.e., 5). The resulting values (for both dimension-specific and overall standardized differences) range from − 1 to 1, with 0 meaning no difference and − 1 (or 1) meaning maximum difference of reported health problems between 3L and 5L. All 3L–5L dimension differences were statistically compared using Wilcoxon signed-rank tests.
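The recoding and standardization described above can be sketched in Python as follows. This is a minimal illustration, not the analysis code used in the study. It assumes that the difference is taken as 3L minus 5L without dropping the sign (so that positive values indicate more reported problems on 3L, consistent with the stated range of − 1 to 1), and the function names and toy responses are hypothetical.

```python
# Recoding onto a common 0-4 scale, as described in the text.
RECODE_3L = {1: 0, 2: 2, 3: 4}
RECODE_5L = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
MAX_LEVEL = 4  # maximum possible recoded level value


def level_sum_score(levels, recode):
    """Sum the recoded levels of one dimension across all respondents."""
    return sum(recode[level] for level in levels)


def standardized_difference(levels_3l, levels_5l):
    """(LSS_3L - LSS_5L) / (n * 4) for one dimension, using paired responses.

    Assumption: the sign is kept, so positive means more problems on 3L.
    """
    if len(levels_3l) != len(levels_5l):
        raise ValueError("3L and 5L responses must be paired")
    n = len(levels_3l)
    lss_3l = level_sum_score(levels_3l, RECODE_3L)
    lss_5l = level_sum_score(levels_5l, RECODE_5L)
    return (lss_3l - lss_5l) / (n * MAX_LEVEL)


def overall_standardized_difference(dimension_differences):
    """Sum the dimension-specific differences and divide by their number."""
    return sum(dimension_differences) / len(dimension_differences)


# Hypothetical responses of four respondents on one dimension (not study data).
toy_3l = [1, 2, 2, 3]
toy_5l = [1, 2, 3, 5]
print(standardized_difference(toy_3l, toy_5l))
```

In the toy data, the second respondent reports ‘some/moderate’ on 3L but ‘slight’ on 5L, which is the kind of response pattern that produces a positive standardized difference.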

2.3.2 Descriptive Responsiveness

First, inconsistencies in change between 3L and 5L were calculated within patients; an inconsistency exists if a dimension in 3L improves, while the same dimension in 5L deteriorates, or vice versa. Second, the absolute and average number of reported level changes from baseline to follow-up by respondent (here, ‘moves’) were calculated as a key descriptive indicator of responsiveness (e.g., moving from level 4 to level 2 involves two moves). Third, the percentage of improved, stable, and deteriorated patients by dimension, and the percentage of improved patients according to the Paretian Classification of Health Change (PCHC) [34] were calculated and compared for 3L and 5L. According to PCHC, a health profile is considered to be ‘better’ if it is better on at least one dimension and not worse on any other dimension, and vice versa for ‘worse’. Health profiles are considered ‘the same’ if there is no change on any dimension, and ‘mixed’ if a health profile is better in at least one dimension and worse in at least one dimension. Finally, a non-parametric effect size measure (probability of superiority [PS]) was calculated [6, 35] by dividing for each dimension the number of patients with positive changes by the total sample size. Ties (persons with no changes) were accounted for by adding half the number of ties in the numerator. The percentage of improved patients by dimension, the PCHC and the PS were interpreted as effect measures of descriptive responsiveness.
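The descriptive responsiveness indicators defined above (moves, PCHC and PS) can be illustrated with a short Python sketch. It assumes that profiles are stored as tuples of levels in which a lower level means fewer problems; the function names and example profiles are hypothetical and serve only to show the definitions at work.

```python
def moves(baseline, follow_up):
    """Total number of level changes across dimensions for one respondent."""
    return sum(abs(b - f) for b, f in zip(baseline, follow_up))


def pchc(baseline, follow_up):
    """Paretian Classification of Health Change for one respondent."""
    better = any(f < b for b, f in zip(baseline, follow_up))
    worse = any(f > b for b, f in zip(baseline, follow_up))
    if better and worse:
        return "mixed"
    if better:
        return "better"
    if worse:
        return "worse"
    return "same"


def probability_of_superiority(baseline_levels, follow_up_levels):
    """PS for one dimension: (number improved + half the ties) / sample size."""
    pairs = list(zip(baseline_levels, follow_up_levels))
    improved = sum(1 for b, f in pairs if f < b)
    ties = sum(1 for b, f in pairs if f == b)
    return (improved + 0.5 * ties) / len(pairs)


# Hypothetical 5L profiles for one respondent (not study data).
baseline_profile = (4, 1, 3, 2, 1)
follow_up_profile = (2, 1, 3, 3, 1)
print(moves(baseline_profile, follow_up_profile))
print(pchc(baseline_profile, follow_up_profile))

# Hypothetical levels of four respondents on one dimension.
print(probability_of_superiority([3, 2, 2, 1], [2, 2, 3, 1]))
```

Under this definition a PS of 0.5 means that improvement and deterioration balance out, and values above 0.5 indicate net improvement on that dimension.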

2.3.3 Value Responsiveness

Responsiveness of values was assessed using anchor-based approaches based on the standardized response mean (SRM) and standardized effect size (SES), which are commonly used responsiveness statistics in patient-reported outcomes and the most commonly used in studies focused on EQ-5D [36, 37]. SRM was calculated as the ratio of the mean change to the standard deviation (SD) of that change. SES was calculated by dividing the mean change by the SD of the baseline measurement (originally introduced as Glass’s Delta [38]). External anchors that classified patients into change categories (improved, stable and deteriorated) were based on the five-level self-rated general health (SRH) question (item 1 of the SF-36: poor, fair, good, very good, excellent) for the rehabilitation sample, and on the modified Rankin Scale (mRS) and the 10-item version of the Barthel Index (BI) for the stroke sample (Table 1). The mRS and BI are widely used validated outcome measures in stroke with good psychometric properties [39]. Change categories (improved, stable and deteriorated, respectively) were defined as follows. For mRS: improvement of at least one level; no change; worsening of at least one level. For BI (based on earlier published minimal clinically important differences [40]): a change of 9.25 points or more; a change of less than 9.25 points and more than − 9.25 points; a change of − 9.25 points or less. For SRH: better response at follow-up; no change; worse response at follow-up. Resulting SES and SRM statistics were interpreted using general benchmarks for effect size: 0.2–0.49 was interpreted as a small magnitude of effect; 0.5–0.79 was interpreted as a medium effect; and ≥0.8 was interpreted as a large effect [41].
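A minimal Python sketch of these statistics is given below. It assumes the sample SD (denominator n − 1), which the text does not specify, and it applies the benchmarks to the absolute value of the statistic. The function names and the toy utility values are hypothetical.

```python
from statistics import mean, stdev


def srm(baseline, follow_up):
    """Standardized response mean: mean change / SD of the change."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)  # assumption: sample SD (n - 1)


def ses(baseline, follow_up):
    """Standardized effect size: mean change / SD of the baseline values."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)  # assumption: sample SD (n - 1)


def bi_change_category(bi_change, threshold=9.25):
    """Anchor category from a Barthel Index change score, per the text."""
    if bi_change >= threshold:
        return "improved"
    if bi_change <= -threshold:
        return "deteriorated"
    return "stable"


def magnitude(effect):
    """Benchmarks from the text: 0.2 small, 0.5 medium, 0.8 large."""
    size = abs(effect)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"


# Hypothetical utility values for four improved patients (not study data).
toy_baseline = [0.50, 0.60, 0.70, 0.40]
toy_follow_up = [0.70, 0.70, 0.90, 0.50]
print(round(srm(toy_baseline, toy_follow_up), 2), magnitude(srm(toy_baseline, toy_follow_up)))
print(round(ses(toy_baseline, toy_follow_up), 2), magnitude(ses(toy_baseline, toy_follow_up)))
```

The two statistics share the same numerator, so any difference between them stems from the denominators: the spread of the change scores for SRM and the spread of the baseline values for SES.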

Finally, to compare responsiveness for the nine value sets between 5L and 3L directly, we computed the 5L/3L ratio of the SRM and SES statistics as a measure of relative efficiency, so that a ratio higher than 1.0 indicated that 5L was more responsive than 3L [42]. For all comparisons, 95% confidence intervals (CIs) of SES, SRM and ratios were calculated using 1000 bootstrap samples.

Statistical significance was achieved when the values were different from 0 for SRM and SES, and different from 1.0 for the ratios.
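The relative efficiency ratio and its bootstrap CI can be sketched as follows for the SRM. This is an illustration under stated assumptions rather than the procedure used in the study: we assume a percentile bootstrap in which patients are resampled with replacement and each patient's 3L and 5L change scores are kept together. The text specifies only that 1000 bootstrap samples were used. All function names and the demo data are hypothetical.

```python
import random
from statistics import mean, stdev


def srm(changes):
    """Standardized response mean of a list of change scores."""
    return mean(changes) / stdev(changes)


def relative_efficiency(changes_5l, changes_3l):
    """5L/3L ratio of SRM; above 1.0 means 5L is more responsive."""
    return srm(changes_5l) / srm(changes_3l)


def bootstrap_ci(changes_5l, changes_3l, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the ratio (assumed method, paired resampling)."""
    rng = random.Random(seed)
    n = len(changes_5l)
    ratios = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_5l = [changes_5l[i] for i in idx]
        resampled_3l = [changes_3l[i] for i in idx]
        try:
            ratios.append(relative_efficiency(resampled_5l, resampled_3l))
        except ZeroDivisionError:
            continue  # skip degenerate resamples (zero SD or zero 3L SRM)
    if not ratios:
        raise ValueError("no usable bootstrap samples")
    ratios.sort()
    lower_idx = int((alpha / 2) * len(ratios))
    upper_idx = len(ratios) - 1 - lower_idx
    return ratios[lower_idx], ratios[upper_idx]


def differs_from_one(lower, upper):
    """Significance rule from the text: the CI of the ratio excludes 1.0."""
    return not (lower <= 1.0 <= upper)


# Hypothetical paired change scores (follow-up minus baseline), not study data.
demo_5l = [0.20, 0.05, 0.30, 0.10, 0.25, 0.15, 0.00, 0.35]
demo_3l = [0.15, 0.00, 0.30, 0.00, 0.30, 0.15, 0.00, 0.30]
low, high = bootstrap_ci(demo_5l, demo_3l)
print(f"RE = {relative_efficiency(demo_5l, demo_3l):.2f}, 95% CI {low:.2f} to {high:.2f}")
print("differs from 1.0:", differs_from_one(low, high))
```

Resampling patients rather than 3L and 5L scores separately preserves the within-patient pairing of the two instruments, which is the design feature that makes the head-to-head comparison possible.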

As additional analysis, we investigated descriptive results for the improved subsamples by calculating LSS changes and 3L–5L differences to assess which dimensions account for the largest impact on responsiveness.

2.4 Hypotheses

In line with previous evidence, we expected 5L descriptive cross-sectional results to reflect a higher number of different profiles and to show an overestimation of reported 3L health problems compared with 5L, with a possible exception of mobility (due to the ‘confined to bed’ level descriptor that is rarely scored). For a detailed analysis and description of the 3L bias, and our related claim on superior accuracy of 5L, we refer to our earlier study [10]. Overestimation of 3L was expected to be highest at the mild part of the severity spectrum [10]. In terms of descriptive responsiveness, we hypothesized that the number of moves will increase substantially with 5L, and PS will increase slightly to moderately (note that PS was previously reported for the rehabilitation sample [6]).

For the rehabilitation patients, we expected better value responsiveness for 5L. As, on average, rehabilitation patients moved from moderately impaired health states to mildly impaired health states [6], 3L overestimation might increase from baseline to follow-up (as we previously observed overestimation to be higher in mild conditions), leading to a reduced mean 3L difference from baseline to follow-up, and hence reduced responsiveness. The stroke patients generally moved from severe/moderate to moderately impaired health. Here, it is difficult to predict what to expect due to the mixed evidence of 3L overestimation in the moderate to severe spectrum [10].

3 Results

3.1 Descriptive Cross-Sectional Results

At baseline, 37 unique 3L profiles were observed versus 99 for 5L for rehabilitation, and 37 3L profiles versus 84 5L profiles for the stroke sample (results from follow-up were comparable). Level distributions by dimension are available elsewhere [4, 6]. Overestimation of 3L was observed on most dimensions (average 14% increase of reported problems in terms of LSS by dimension when compared with 5L), although not for mobility and self-care for the rehabilitation sample, and for baseline mobility for stroke, where underestimation is observed (Table 2). Overestimation of 3L health problems compared with 5L across the two patient samples was largest in pain/discomfort, followed by anxiety/depression and usual activities. Standardized differences varied from − 0.029 (rehabilitation baseline mobility) to 0.087 (rehabilitation follow-up pain/discomfort). For all dimensions combined, the standardized differences varied from 0.022 for rehabilitation baseline to 0.040 for stroke follow-up, generally showing larger 3L–5L differences for the stroke sample.

Table 2 Baseline and follow-up 3L versus 5L LSS by dimension and patient sample, and standardized differences

3.2 Descriptive Responsiveness

There were very few inconsistencies between 3L and 5L regarding the change data, with an average of 1.0% over dimensions for the rehabilitation sample and 1.1% for stroke. A considerable increase of reported changes (moves) was observed in 5L over 3L, with an average number of moves across dimensions of 47 (3L) versus 100 (5L) for rehabilitation, and 38 (3L) versus 65 (5L) for stroke. Average moves per respondent were 0.22 (3L) versus 0.46 (5L) in rehabilitation, and 0.34 (3L) versus 0.58 (5L) in stroke. The change data further show that 5L results in both more improvement and deterioration, which implies that more change is captured by 5L in both directions (Fig. 1). According to the PCHC, 50% (3L) versus 60% (5L) of patients improved in the rehabilitation sample, while 51% (3L) versus 44% (5L) improved in stroke. The PS index shows a clear but modest improvement of 5L over 3L in both samples, with PS3L = 0.57 and PS5L = 0.59 for rehabilitation, and PS3L = 0.59 and PS5L = 0.63 for stroke.

Fig. 1

Sensitivity to change (t1–t2) for rehabilitation (n = 219) and stroke (n = 112) samples. Deteriorated, stable and improved percentages were calculated according to individual-level responses that were worse at follow-up (more reported problems), stayed the same, or were better (less reported problems), respectively. 3L EQ-5D-3L, 5L EQ-5D-5L

3.3 Responsiveness of Values

Mean values, SDs and differences for all value sets at baseline and follow-up are available in Appendix 1. Due to small sample sizes, SRM and SES statistics for the deteriorated subsamples were not statistically significant (Appendix 2); we will only report on the improved subsamples from here. All SRM and SES statistics for the improved subsamples were statistically significant and, with a few exceptions, were considered to be of medium to large magnitude. SRM values ranged from 0.51 (3L Germany) to 1.30 (3L China), while SES statistics ranged from 0.46 (3L Germany) to 1.01 (5L China). Both SRM and SES results were generally highest for the BI-based stroke subsample. Overall, 5L SRM and SES statistics were higher for rehabilitation and lower for the stroke subsamples. Overall, there was little variation in SRM and SES results between value sets in the three improved subsamples.

Differences in responsiveness demonstrated a clear and strong trend in favor of 5L for the rehabilitation subsample for both SRM and SES (Fig. 2). For the BI-based stroke subsample, results were in favor of 3L. For the mRS-based subsample, there was a clear trend in favor of 3L for SES, whereas SRM differences were mixed and close to 0.

Fig. 2

SRM and SES differences in responsiveness for the improved subsamples. US value sets were derived using an identical methodological approach within the same sample of respondents. ∆ = 5L – 3L; 5L shows better responsiveness over 3L if ∆improvements are positive. CA Canada, CN China, ENGL/UK England/United Kingdom, ES Spain, GER Germany, JP Japan, KR Korea, Republic (South Korea), NL The Netherlands, PL Poland, US United States, SRM standardized response mean, SES standardized effect size, 3L EQ-5D-3L, 5L EQ-5D-5L, SRH self-rated health, mRS modified Rankin scale, rehab rehabilitation

Relative efficiency results demonstrated the strongest results in favor of 5L for the rehabilitation subsample for SRM, with significant results for all value sets except Canada (Table 3). SES results for rehabilitation were significant for South Korea and Spain. The mRS-based subsample does not result in significant relative efficiency for SRM, while SES shows significant results for five value sets in favor of 3L. For the BI subsample, significant SRM results in favor of 3L were obtained for two value sets (China and The Netherlands), while SES results indicated significance for five value sets. Note that all non-significant SES results for the mRS stroke subsample were borderline significant, while many BI subsample results (both SRM and SES) were also close to significance. Although there were differences between value sets in terms of significance, all statistics were in the same direction (except for mRS stroke), and differences in CIs were relatively small across value sets, including the US value sets.

Table 3 Observed relative efficiency of 5L over 3L values using the SRM and SES statistic ratio (improved subsamples only)

LSS change and difference scores between 3L and 5L by dimension are shown in Table 4. For rehabilitation, the largest effect between 3L and 5L was in mobility (an LSS increase of reported problems of 18 for 5L over 3L), which was due to the lack of level 3 scores for 3L, as only one respondent scored ‘confined to bed’. For both stroke subsamples, only mobility and pain/discomfort showed increased differences for 3L over 5L, while the other dimensions showed similar or increased responsiveness for 5L. Many stroke patients were evidently confined to bed at baseline but improved during follow-up, contributing to larger differences in LSS for 3L over 5L. Overestimation of 3L reported problems for the improved subsamples increased over time for rehabilitation, while 3L overestimation slightly decreased for stroke (Fig. 3).

Table 4 Descriptive LSS change scores (baseline minus follow-up) and 3L–5L differences for the improved subsamples (Δ = 5L–3L)

Fig. 3

Reported 3L versus 5L health problems over time (LSS*) for the improved subsamples. *Recoded: no problems = 0; 3L and 5L on the same scale. For 3L, level 2 = 2 and level 3 = 4; and for 5L, level 2 = 1, level 3 = 2, level 4 = 3 and level 5 = 4, summed across dimensions. 3L EQ-5D-3L, 5L EQ-5D-5L, LSS level sum score, SRH self-rated health, BI Barthel Index, mRS modified Rankin scale

4 Discussion

Although responsiveness results between 3L and 5L varied across patient samples, general conclusions could be formulated across study results. Cross-sectional descriptive results confirmed findings from our earlier study [10], concluding that the increased sensitivity of 5L over 3L leads to a more accurate measurement of health at individual and group levels. Again, we found that descriptive system results are the largest driver in outcomes, which confirms findings from Richardson et al. [43], showing that differences between utilities of different preference-accompanied instruments are mainly attributable to the descriptive data. Results for the US value sets further confirmed this finding, where US results were largely comparable with results of the other value sets.

Encouragingly, descriptive responsiveness results showed very few inconsistencies between 3L and 5L patient-level data (confirming the robustness of the data), but 5L showed a large increase of moves across dimensions. As previously demonstrated, 3L systematically overestimated health problems when compared with 5L, leading to biased (underestimated) utility values. The extent to which overestimation changed over time influenced responsiveness differences between 3L and 5L in both patient samples. On average, rehabilitation patients moved from moderately to mildly impaired health, and overestimation of 3L increased over time, therefore leading to worse responsiveness results for 3L. Apparently, the tipping point in moving to milder levels is more easily reached for 5L than for 3L. For stroke, opposite results were found, with 3L overestimation decreasing over time, although the effect was small for the mRS subsample.

The 3L ‘confined to bed’ level descriptor played a crucial role in differences between 3L and 5L in both patient samples. For rehabilitation, this response option was only endorsed once, leading to a less efficient use of 3L mobility compared with 5L, and, consequently, to lower 3L responsiveness. In stroke, however, 21% of respondents scored ‘confined to bed’ at baseline, versus 7% at follow-up, ultimately leading to an increase in value responsiveness. We did not expect there to be an underestimation of reported problems for 3L on self-care for the rehabilitation sample, although the effect is small. This could partly be explained by inconsistencies between 3L and 5L of the cross-sectional data [6]. For the German 3L value set, it must be noted that several utility decrements are lacking (usual activities levels 2 and 3, and anxiety/depression level 2), which might have affected the comparative results with the 5L value set.

It was striking that relative efficiency results for rehabilitation only showed two significant value set comparisons using SES, despite relatively large 3L–5L absolute differences in SES values (Fig. 2), which was due to large baseline SDs within the subsample.

SRM and SES statistics led to different results in relative efficiency, especially for the mRS stroke sample. As the denominator in SRM is the SD of the difference score, and in SES the baseline SD, apparently there were significant differences between SDs of the difference scores and baseline SDs, with the former being larger for 5L when compared with 3L, leading to similar SRM results. It seems that 5L shows more dispersion for assessing change scores for this particular subsample. Arguably, SRM could be considered to be the preferred method, making optimal use of the change data.

Contrary to our earlier study that assessed discriminatory power between patient groups, we judge the differences in responsiveness between value sets to be rather small, which is likely due to the nature of the data. When comparing the same group of patients over time, individual differences are cancelled out to a large extent, leading to less heterogeneity in values. This allowed for general conclusions with regard to 3L and 5L across value sets.

Some limitations must be acknowledged. There may have been an order effect for stroke, as 5L was always tested first. For the rehabilitation sample, only one anchor was used. Ideally, responsiveness would be tested with multiple anchors, preferably including a clinical anchor. Small sample sizes for the stroke improvement subsamples could have limited the number of significant results, especially for SES where many results were borderline significant. Proxy assessments were used in 18.7% of the stroke sample, which could have affected responsiveness results. However, Pickard et al. showed that although patient versus proxy agreement in a sample of 124 stroke patients was suboptimal, no systematic bias was observed [44]. Moreover, proxy assessments were only used in a relatively small proportion of our study sample. Responsiveness was assessed for only two patient samples, which, although very different in nature and disease severity, do not allow for generalization to other disease areas, especially regarding how 3L bias will manifest itself. For example, dimension-specific problems that frequently occur in certain conditions might play a deciding role, as ‘confined to bed’ did in our study samples. Furthermore, because patient samples from only two different countries were used, there could be country-specific effects related to translations of EQ-5D or to response behavior. Bias may also be present in 5L when compared with more refined classifications, although these effects will be substantially smaller. Moreover, an extended classification system beyond 5L will increasingly complicate valuation studies in terms of health state selection, valuation and modeling.

Based on the results from these two patient samples, it is expected that for moderate conditions where an improvement towards mild health states might be expected (such as in our rehabilitation sample), 3L underestimation of QALY gains may be observed. In severe conditions where overall improvements might be expected towards mild to moderate health (such as in our stroke sample), 3L might overestimate QALY gains, although this will be dependent on which health aspects are affected in any given condition.

5 Conclusion

This study shows that 5L leads to more accurate estimations of self-reported health and utility values over time, which will result in more accurate QALY estimations. Future 3L–5L comparative studies will have to show how the 3L bias will affect responsiveness when compared with 5L in other patient groups.