There are two sources of differences between the 3L and the 5L: (1) the way they describe patient health via the health state classifier; and (2) the way they value health using preferences obtained from the general public. It is the combination of these two key elements that determines estimates of QALYs. Therefore, an assessment of the merits of the two instruments needs to consider both.
While the 3L and 5L contain the same five dimensions, there are other, important differences between them. Most obviously, the 5L has increased the number of levels from 3 to 5 and the total number of health states described from 243 to 3125. There are also differences in the descriptors, most notably for the worst level of mobility: ‘confined to bed’ in the 3L has been replaced with ‘unable to walk about’ in the 5L.
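The state counts quoted above follow directly from the structure of the classifier: each of the five dimensions is rated independently, so the number of describable states is the number of levels raised to the power of five. The minimal Python sketch below illustrates this; the function names are hypothetical and chosen only for this example, and the digit-string encoding of a profile (e.g. "11111" for no problems on any dimension) is assumed here for convenience.

```python
from itertools import product

# Both instruments share the same five dimensions.
N_DIMENSIONS = 5


def count_states(levels: int, dimensions: int = N_DIMENSIONS) -> int:
    """Number of distinct health states: levels ** dimensions."""
    return levels ** dimensions


def enumerate_states(levels: int, dimensions: int = N_DIMENSIONS) -> list[str]:
    """List every profile as a digit string, one digit per dimension.

    Level 1 is the best level on a dimension; the highest digit is the worst.
    """
    return [
        "".join(str(level) for level in profile)
        for profile in product(range(1, levels + 1), repeat=dimensions)
    ]


states_3l = enumerate_states(3)
states_5l = enumerate_states(5)
print(count_states(3), len(states_3l))  # 3L state count, computed two ways
print(count_states(5), len(states_5l))  # 5L state count, computed two ways
```

Enumerating the profiles explicitly is unnecessary for counting, but it makes clear that the roughly thirteen-fold increase in states comes entirely from the two extra levels per dimension.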
Because of its expanded-level structure, the 5L has the potential to capture the health of subjects more accurately than the 3L. However, offering more response options increases cognitive burden, which may lower response rates and perhaps increase measurement error when respondents are unsure which level to choose. Ultimately, any measurement benefits from the expanded descriptive system must be empirically demonstrated. Papers in this issue, as well as others recently published, suggest these advantages are being realised. Advantages of the 5L over the 3L include:
(a) A reduction in the ceiling effect: The 3L suffers from a ceiling effect, i.e. respondents report no problems on any dimension even though problems (e.g. slight ones) are present. The effect is reinforced by the large gap, in most 3L value sets, between full health and the next best state (valued at 0.88 in the 3L UK value set). In many 3L studies, more than 40% of subjects self-report full health, a proportion that dropped by 10% using the 5L [10, 11, 12]. Larger and smaller reductions in ceiling effects have been reported elsewhere, reflecting differences in the study samples, e.g. [13, 14, 15].
(b) Reduced clustering on just a few states: The lack of granularity in the 3L descriptive system imposes constraints on the self-report of health. Observations tend to cluster on a few health states [15, 16]. The 5L consistently produces considerably more unique health states than the 3L, as shown by Buchholz et al. [17]. For example, Feng et al. [18] reported that just three health states accounted for almost 75% of respondents on the 3L, while a similar proportion of respondents on the 5L were accounted for by 12 health states.
The clustering of descriptive data on the 3L is also reflected in the characteristics of utility-weighted 3L data. 3L health states are relatively far apart on the value scale; for example, the presence or absence of extreme problems in practice predicts almost perfectly whether utility is above or below 0.5. The distribution of utility-weighted 5L data is less prone to this sort of artefactual clustering [16].
(c) Improved ability to discriminate between patient groups/subgroups: The 5L has better discriminative ability, as demonstrated by improved ability to detect differences between subgroups defined by severity at a given sample size [13, 19, 20]. 5L users thus benefit from lower sample size requirements within samples of patients [21]. Although the 3L seemingly has better ability to detect differences between patients and a general population group, this is an artefact [13, 17]. The 5L measures health more accurately at the top of the scale and therefore distinguishes more finely between mild ill-health states and full health, whereas the 3L has much larger steps between levels 2 and 1. As a result, the 3L can overestimate health gains and produce biased ICERs.
(d) Improvements in the 5L with respect to problems with mobility: Abandoning the 3L level 3 descriptor ‘confined to bed’ constitutes an important improvement in the 5L. Level 3 problems on mobility are rarely observed in 3L data. For example, among patients about to receive hip replacement surgery in the National Health Service, none reported a level 3 problem [22]. In effect, in most settings, the 3L has only two levels on mobility: no problems and some problems. Consequently, the 3L will underestimate the benefits of treatments that improve severe problems with mobility [13].
Overall, this evidence suggests that the 5L retains the benefits of the 3L (its brevity and validity in a wide range of conditions) and produces a more accurate measurement of patient health than the 3L. At the same time, there is no evidence for lower completion rates, and the increase in the number of levels has reduced the amount of variability.
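The ceiling effect in (a) and the clustering in (b) are both simple summaries of the distribution of self-reported profiles, and it may help to see how they could be computed. The following is a minimal Python sketch, not the method used in the cited studies: the function names are hypothetical, profiles are assumed to be coded as five-digit strings with "11111" denoting no problems on any dimension, and the toy sample is invented purely for illustration.

```python
from collections import Counter


def ceiling_share(profiles: list[str], full_health: str = "11111") -> float:
    """Proportion of respondents reporting no problems on any dimension."""
    if not profiles:
        raise ValueError("profiles must not be empty")
    return sum(p == full_health for p in profiles) / len(profiles)


def states_needed_for_share(profiles: list[str], target: float = 0.75) -> int:
    """Smallest number of most-frequent states covering `target` of the sample."""
    if not profiles:
        raise ValueError("profiles must not be empty")
    covered = 0
    ranked = Counter(profiles).most_common()  # states sorted by frequency
    for k, (_, n) in enumerate(ranked, start=1):
        covered += n
        if covered >= target * len(profiles):
            return k
    return len(ranked)


# Hypothetical toy sample of 20 respondents; NOT data from any cited study.
toy_sample = (
    ["11111"] * 9 + ["11121"] * 5 + ["11122"] * 3 + ["21221"] * 2 + ["22222"]
)

print(ceiling_share(toy_sample))            # share at the ceiling
print(states_needed_for_share(toy_sample))  # states covering 75% of the sample
print(len(set(toy_sample)))                 # number of unique states observed
```

Applied to paired 3L and 5L responses from the same sample, a lower ceiling share and a larger number of states needed to reach a given cumulative share would correspond to the patterns reported in (a) and (b).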