Paired EQ-5D-3L–EQ-5D-5L (3L–5L) Descriptive Data
A large multinational dataset that included paired descriptive 3L and 5L data for eight patient groups and a student cohort was used [15, 24]. These data were obtained with the standard 3L and 5L versions for self-report use in adults, describing health on the dimensions of mobility, self-care, usual activities, pain/discomfort and anxiety/depression. The 3L version applied the level descriptors (or labels) ‘no problems’, ‘some/moderate problems’ and ‘extreme problems/unable to’, and the 5L version used ‘no problems’, ‘slight problems’, ‘moderate problems’, ‘severe problems’ and ‘extreme problems/unable to’. For mobility, the most severe response option was changed from ‘confined to bed’ for 3L to ‘unable to walk about’ for 5L. The 3L classification describes 243 unique health states (or health profiles) that are often reported as vectors ranging from 11111 (full health) to 33333 (worst health), whereas the 5L defines 3125 unique health states, with 55555 as the worst health state.
Paper-and-pencil versions of the questionnaires were used in all countries except England, where data collection took place online. Since there were many condition-specific subgroups with small sample sizes, related patient groups were combined, resulting in nine main condition groups. Only respondents who completed both the 3L and 5L without any missing responses were included in the analyses (a 3L–5L comparison of missing values is reported elsewhere [15]). It was assumed that country differences within a specific condition group were not important, so that descriptive data could be pooled.
Paired 3L–5L Value Sets
At the time of this study there were seven countries with both 3L and 5L value sets available, namely Canada, China, England/UK (5L/3L, respectively), Japan, The Netherlands, South Korea and Spain [2, 25–37]. All EQ-5D value sets were obtained using representative samples of the general public, ensuring that they represented the societal perspective. A value set is a set of weights that can convert each health state into an index value on a scale anchored at 1 (referring to full health) and 0 (referring to a state as bad as being dead), allowing for negative values for health states considered to be worse than dead. Most 3L valuation studies followed similar protocols, although there were notable differences with regard to the sampling of respondents (affecting representation), sample size and health state design (varying from 17 to 101 valued health states) [38, 39]. All 3L valuation studies were performed with face-to-face interviews and paper-and-pencil methods except for Canada where a web survey was used. All 3L value sets were based on time trade-off (TTO) data. With the introduction of 5L a standardised valuation protocol was developed, the EQ-VT (EuroQol Valuation Technology Platform) [13]. In addition to standardisation in terms of health state design, valuation methodology and a computer-assisted personal interview mode of administration, a strict protocol of interviewer training and quality control during the entirety of the data collection process was developed and implemented [14]. Discrete choice experiment (DCE) methodology was introduced in the EQ-VT, along with composite TTO as the main valuation method. Since there is no standardised analytic protocol, some 5L value sets were based on hybrid models utilising both TTO and DCE data while others were based on TTO data only.
After the initial valuation studies were performed using EQ-VT version 1.0 (Canada, China, England, The Netherlands, Spain) some data quality issues and interviewer effects were apparent and a cyclic quality control process was introduced in version 1.1, which led to a substantial improvement [14].
Usually country-specific utility values are used to conduct analyses in a population or patient sample from that particular country, reflecting the appropriate preferences. Since our research questions were of a methodological nature, aiming at making generalisations across value set characteristics, we used the pooled multi-country dataset to compare the characteristics of 14 country-specific 3L and 5L value sets.
Analyses
3L and 5L Value Sets for Seven Countries
Characteristics of all value sets were reported in terms of model parameters and model characteristics, such as the modelled value range, intercept, interaction parameters and histograms of all possible values (3L: 243; 5L: 3125), which may be responsible for differences in performance between 3L and 5L (see Table 1).
Table 1 Characteristics of EQ-5D-3L and EQ-5D-5L value sets from seven countries
Distributional Analyses of 3L and 5L Utility Values
Country-specific 3L and 5L utility values were calculated for each value set for all condition groups combined and described numerically and graphically using histograms. We examined clusters and discontinuities (‘gaps’) in the histograms as such patterns theoretically diminish the sensitivity and the accuracy of the instruments and might lead to estimation problems [40].
To assess the frequency and efficiency of use of the utility scale we applied Shannon’s indices as a means of assessing distributional evenness [17, 18, 21, 22]. While Shannon’s H′ captures absolute informativity and is simultaneously powered by evenness and the number of categories used, Shannon’s J′ index of relative informativity solely reflects the evenness of a distribution [41]. Since Shannon’s J′ corrects for the total number of possible categories (here: possible utility values), which could be close (or equal) to 243 for 3L and 3125 for 5L, a direct comparison of J′ between 3L and 5L was not considered fair (we expected J′ to be higher for 3L for this reason). Hence, we also calculated both indices after subdividing the scale range into categories (‘bins’) with a width of 0.05, making the number of categories for 3L and 5L more comparable.
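As an illustration of the binned Shannon indices, the following is a minimal sketch rather than the analysis code used in this study. It assumes the common definitions H′ = −Σ p·log2(p) and J′ = H′/log2(number of possible categories); the function names, the scale range of −0.6 to 1.0 and the utility values are hypothetical.

```python
import math
from collections import Counter


def shannon_indices(categories, n_possible):
    """Return (H', J') for a sequence of categorical observations.

    H' = -sum(p_i * log2(p_i)) over the categories that were observed.
    J' = H' / log2(n_possible), with n_possible the number of categories
    that could have been used (assumed normalisation, base-2 logarithm).
    """
    n = len(categories)
    counts = Counter(categories)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    j = h / math.log2(n_possible) if n_possible > 1 else 0.0
    return h, j


def bin_utilities(utilities, lo, hi, width=0.05):
    """Assign each utility to a bin of the given width on [lo, hi]."""
    eps = 1e-9  # tolerance for floating-point noise at bin edges
    n_bins = math.ceil((hi - lo) / width - eps)
    bins = [min(int((u - lo) / width + eps), n_bins - 1) for u in utilities]
    return bins, n_bins


# Made-up utilities and scale range, for illustration only.
utilities = [1.0, 1.0, 0.85, 0.85, 0.72, 0.43, 0.10, -0.05]
bins, n_bins = bin_utilities(utilities, lo=-0.6, hi=1.0)
h_binned, j_binned = shannon_indices(bins, n_bins)
print(n_bins, round(h_binned, 3), round(j_binned, 3))
```

Binning gives 3L and 5L the same denominator in J′ whenever their scale ranges are similar, which is the motivation given above for the 0.05-wide categories.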
Subsequently, we presented mean utility values (and standard deviations [SDs]) by condition group for all 14 value sets, with the addition of an equal weighting score (Level Sum Score [LSS] transformed to a 0–1 scale) in order to assess the impact of the descriptive data without the effect of utility weights. The transformed LSS (tLSS) was calculated by summing the level scores for the five dimensions and performing a linear transformation on this sum score to a 0–1 scale so that the value for 11111 is equal to 1.0 and 33333 (for 3L) or 55555 (for 5L) is equal to 0.
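The linear transformation described above can be sketched as follows. The function name `tlss` and the five-digit string representation of a health state are assumptions made for illustration.

```python
def tlss(state, n_levels):
    """Transformed Level Sum Score on a 0-1 scale.

    state    : five-digit health state string, e.g. "11223"
    n_levels : 3 for EQ-5D-3L, 5 for EQ-5D-5L
    """
    levels = [int(ch) for ch in state]
    if len(levels) != 5 or any(not 1 <= lv <= n_levels for lv in levels):
        raise ValueError("state must have five digits between 1 and n_levels")
    lss = sum(levels)              # Level Sum Score
    best, worst = 5, 5 * n_levels  # LSS of 11111 and of the worst state
    return (worst - lss) / (worst - best)


print(tlss("11111", 3), tlss("33333", 3), tlss("55555", 5))
```

Under this transformation each one-level step is worth 0.10 on 3L and 0.05 on 5L, irrespective of the dimension.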
Discriminatory Performance of 5L Versus 3L
Two tests of discriminatory power were conducted, accommodating different distributional assumptions with respect to utility values: one based on the F statistic (parametric), the second on receiver-operating characteristics (non-parametric).
Discriminatory power was assessed using the F statistic derived from analysis of variance (ANOVA) to test the equality of means. The F statistic is widely used to assess the relative efficiency of patient-reported outcome measures [21, 42, 43] and expresses the variation between group means relative to the variation within groups. A higher F statistic means a higher likelihood for a measure to show statistical significance when used to compare groups. Hence, higher F statistic values indicate higher discriminatory power. To express the discriminatory power of 5L relative to 3L we computed the ratio of their F statistics resulting from comparisons of different condition groups, in such a way that a ratio higher than 1.0 indicated that 5L was more discriminative than 3L: relative efficiency = F_5L / F_3L, where F_5L and F_3L denote the F statistics obtained with 5L and 3L, respectively.
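The F statistic ratio can be sketched with a plain one-way ANOVA, as below. This is an illustrative sketch only: the function names and the utility values are hypothetical, and the software used for the actual analyses is not specified here.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (lists of numbers)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def relative_efficiency(groups_5l, groups_3l):
    """Ratio F_5L / F_3L; values above 1.0 favour 5L."""
    return anova_f(groups_5l) / anova_f(groups_3l)


# Made-up utilities for a mild and a moderate/severe condition group.
mild_3l, severe_3l = [1.0, 0.85, 0.85, 0.76], [0.76, 0.69, 0.52, 0.52]
mild_5l, severe_5l = [1.0, 0.92, 0.88, 0.84], [0.71, 0.66, 0.58, 0.49]
print(relative_efficiency([mild_5l, severe_5l], [mild_3l, severe_3l]))
```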
Comparisons were made between (1) the eight disease groups and the student cohort, assuming the students were a valid proxy for a healthy population sample; and (2) patients with a mild condition versus those with a moderate or severe condition. Using the observed mean EQ-5D visual analogue scale (EQ VAS) ratings as reference, we defined diabetes and liver disease as mild conditions (relative to the other conditions), and the remaining six as moderate to severe conditions. Since our main aim was to compare measurement properties of 3L and 5L, we considered this method to be suitable for assessing their ability to distinguish between mild and moderate/severe condition groups.
As a second analysis, we calculated the area under the receiver-operating characteristics curve (AUROC) as a non-parametric method of assessing discriminatory power. AUROC analyses were performed for each pair of condition group comparisons using pooled data on the groups, with group membership being the outcome and the 3L/5L utility score being the exposure. AUROCs for 3L and 5L were calculated and the ratio (5L/3L) was used as the measure of discriminatory power. The AUROC value can range from 0.5 (no prediction) to 1.0 (perfect prediction). Consequently, a 5L/3L AUROC ratio > 1.0 indicates 5L to be more discriminative than 3L. While the F statistic is directly based on means and dispersion, the AUROC employs the full distribution.
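For two groups, the AUROC equals the probability that a randomly chosen member of the healthier group has a higher utility than a randomly chosen member of the other group, with ties counted as one half. The sketch below uses this pairwise definition. It is an assumption that this matches the estimator used in the analyses, and the function names and data are hypothetical.

```python
def auroc(healthier, sicker):
    """AUROC for separating two groups by utility score.

    Computed as P(u_healthier > u_sicker) + 0.5 * P(u_healthier == u_sicker)
    over all pairs of respondents, one from each group.
    """
    wins = 0.0
    for a in healthier:
        for b in sicker:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(healthier) * len(sicker))


def auroc_ratio(healthier_5l, sicker_5l, healthier_3l, sicker_3l):
    """Ratio AUROC_5L / AUROC_3L; values above 1.0 favour 5L."""
    return auroc(healthier_5l, sicker_5l) / auroc(healthier_3l, sicker_3l)


# Made-up utilities, for illustration only.
print(auroc([0.9, 0.5], [0.7, 0.5]))
```

Because only the ordering of scores enters the calculation, ties (which are more frequent on a coarser scale) pull the AUROC towards 0.5.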
For all comparisons 95% confidence intervals (CIs) of the F statistic and AUROC ratios were calculated using 3000 bootstrap samples, enabling us to test whether the ratio was statistically different from 1.0.
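A bootstrap interval for the AUROC ratio could be obtained along the following lines. The sketch assumes a percentile interval and resampling of respondents (with their paired 3L and 5L utilities) within each group; neither detail is specified in the text, and all names and data are hypothetical. The F statistic ratio could be handled in the same way by substituting the statistic.

```python
import math
import random


def auroc(healthier, sicker):
    """Pairwise AUROC with ties counted as one half."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in healthier for b in sicker)
    return wins / (len(healthier) * len(sicker))


def bootstrap_auroc_ratio_ci(group_a, group_b, n_boot=3000, alpha=0.05, seed=0):
    """Percentile CI for AUROC_5L / AUROC_3L.

    group_a, group_b: lists of (u3l, u5l) pairs, one pair per respondent,
    so the 3L-5L pairing is kept intact in every resample (assumption).
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        a = rng.choices(group_a, k=len(group_a))  # resample with replacement
        b = rng.choices(group_b, k=len(group_b))
        auc_3l = auroc([p[0] for p in a], [p[0] for p in b])
        auc_5l = auroc([p[1] for p in a], [p[1] for p in b])
        if auc_3l > 0:  # skip degenerate resamples
            ratios.append(auc_5l / auc_3l)
    ratios.sort()
    lo_idx = math.floor(alpha / 2 * len(ratios))
    return ratios[lo_idx], ratios[len(ratios) - 1 - lo_idx]


# Made-up paired utilities (3L, 5L), for illustration only.
mild = [(1.0, 1.0), (1.0, 0.92), (0.85, 0.88), (0.85, 0.80), (0.76, 0.84)]
severe = [(0.85, 0.71), (0.69, 0.66), (0.52, 0.58), (0.76, 0.74), (0.52, 0.49)]
print(bootstrap_auroc_ratio_ci(mild, severe))
```

An interval that excludes 1.0 would indicate a statistically significant difference in discriminatory power between 5L and 3L.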
Exploration of Factors Affecting Discriminatory Power
At least three separate factors determine discriminatory power results:
1. The effects of the descriptive system, involving choice of dimensions, number of levels and corresponding labels, translation effects and reporting heterogeneity.
2. Valuation effects, relating to the valuation protocol, the valuation study (interviewer effects, quality control, etc.) but also to the modelling of the valuation data. Valuation effects also encompass true country-specific variation in preferences, which may be caused by many underlying factors, e.g. cultural, geographical or related to demographics, language or health system.
3. A third factor is related to the ability of any scale to capture the location of a respondent on the true latent scale. The precision of measuring this location will have an impact on the descriptive data and consequently the utility distribution of any study sample. As it appears this important factor is often ignored, we discuss this in some detail.
A graphical example can illustrate potential misclassification effects due to distributional descriptive 3L–5L effects (Fig. 1). The general methodology has been widely discussed in research on reporting heterogeneity [44–48]. Imagine a health dimension scaled with three levels of granularity: 3L, 5L and 10L (3, 5 and 10 levels respectively). In this example we do not take specific labels into account (although ‘1’ refers to no problems). There is an underlying unobservable latent scale which is assumed to be continuous: all three measurement systems (3L, 5L, 10L) will only be approximations of the true latent value. The transition area of two adjacent categories is called the cut-off point (or ‘cut-point’), and in the development of measurement scales one strives for clearly defined cut-points with little overlap (as defined by the labels), to avoid error. The distribution of observed scores of the 3L, 5L and 10L ordinal scales depends on the cut-points. Random error may occur at the cut-points when overlap exists, and this overlap may differ between 3L, 5L and 10L. Note that random error may cause a shift of average values for the extreme categories of the scale, as misclassification can only be towards the middle level of the scale due to the censored nature of the EQ-5D dimensions. Also note that when labels are applied, the middle category of 3L does not necessarily coincide with the middle level of 10L, nor does it necessarily have the same latent midpoint, i.e. the middle point of the category, equidistant from both cut-points. Various types of misclassification may occur between the three systems. Imagine five different locations on the latent scale (A through to E), which we here refer to as respondents, although these also might indicate group averages. For respondent A there is no discrepancy between 3L, 5L and 10L: no problems are scored in all three systems.
For respondent B both 3L and 5L lack refinement (no problems) as evidently there are reported problems on 10L. Respondent C illustrates the reduced ceiling effect with the introduction of 5L over 3L: no problems are reported in 3L whereas problems are reported on 5L. Respondent D might contribute to an overestimation of reported health problems in 3L when compared to 5L: the middle 3L category is chosen whereas a milder category is chosen for 5L. The distance from the 3L midpoint to the true latent value (X) is larger than the distance from the 5L midpoint to the latent value (Y) and smallest with 10L (Z). The same goes for respondent E: the most extreme category is chosen for 3L whereas a less severe category is scored on 5L. As mentioned, these location effects may also apply to group means, potentially leading to misclassification, especially when the group is rather homogeneous. Random error will increase if the mass of observations of a group is close to a cut-point of the scale such as location D, and may then have a strong impact on a crude scale such as 3L, but may only have a small effect on a more refined scale such as 5L, and even less on 10L. Generally, we assume that more levels theoretically will lead to less measurement bias.
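The relation between granularity and the distance to the category midpoint can be illustrated numerically. The sketch below is not a reconstruction of Fig. 1: it assumes a latent scale from 0 (no problems) to 1 with equally spaced cut-points and no overlap, and the latent location 0.36 is hypothetical.

```python
def classify(latent, n_levels):
    """Return (level, distance to category midpoint) for a latent value in [0, 1].

    Assumes equally spaced cut-points, with 0 meaning no problems.
    """
    width = 1.0 / n_levels
    level = min(int(latent / width), n_levels - 1) + 1
    midpoint = (level - 0.5) * width
    return level, abs(latent - midpoint)


latent = 0.36  # hypothetical location on the latent scale
for n_levels in (3, 5, 10):
    level, distance = classify(latent, n_levels)
    print(f"{n_levels}L: level {level}, distance to midpoint {distance:.2f}")
```

With equally spaced cut-points the largest possible distance is half a category width, so the bound shrinks as levels are added, although for an individual location a finer scale is not necessarily closer (a latent value of 0.5, for example, lies on the 3L midpoint but not on a 10L midpoint).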
With regard to factor 2, specific modelling outcomes on the intercept and dimension coefficients and the use of interaction terms such as the N3 term (representing whether any dimension is at level 3) will affect the resulting utility distributions and may subsequently affect discriminatory power. To explore the role of these modelling effects we studied the impact of altering the models (based on the original valuation data) by performing a sensitivity analysis in which we excluded the N3 term for two 3L value sets (The Netherlands, UK).
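To illustrate how an N3 term shapes the utility distribution, the sketch below applies an additive 3L model with and without the term. The decrements are hypothetical round numbers, not a published value set, and it is assumed that the intercept applies to every state other than 11111. Note that simply switching the term off, as done here, is not equivalent to the sensitivity analysis described above, in which the models were re-estimated on the original valuation data.

```python
# Hypothetical decrements for illustration only; NOT a published value set.
COEF = {"intercept": 0.05, "level2": 0.10, "level3": 0.25, "n3": 0.20}


def utility_3l(state, coef=COEF, include_n3=True):
    """Utility of a 3L health state under a simple additive model."""
    if len(state) != 5 or any(ch not in "123" for ch in state):
        raise ValueError("state must be five digits in 1-3, e.g. '11223'")
    if state == "11111":
        return 1.0
    value = 1.0 - coef["intercept"]  # applied to any deviation from 11111
    for ch in state:
        if ch == "2":
            value -= coef["level2"]
        elif ch == "3":
            value -= coef["level3"]
    if include_n3 and "3" in state:  # N3: any dimension at level 3
        value -= coef["n3"]
    return value


for state in ("11112", "11113"):
    print(state, utility_3l(state), utility_3l(state, include_n3=False))
```

In this toy model the step from 11112 to 11113 is larger with the N3 term than without it, which is the kind of discontinuity in the utility distribution that the sensitivity analysis is meant to probe.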
We explored the role of factors 1–3 both numerically and graphically. The point of departure was the LSS of the descriptive data, both by dimension and summed over all dimensions. From the LSS, difference scores between 3L and 5L were calculated by condition. We investigated how various value set characteristics contributed to discriminatory power results using tLSS (LSS transformed to a 0–1 scale) as a reference.
As a way of disentangling the intertwined effects of various factors affecting discriminatory power, we performed a multiple regression analysis with the F statistic and AUROC as dependent variables and the following variables representing value set or descriptive system characteristics as independent variables: intercept (continuous), modelled range (continuous), N3 (continuous, we included only N3 since this was the most prominent interaction term), version (with 3L as reference) and country (with Canada as reference).