Data
Data were drawn from the General Practice Patient Survey (GPPS), a survey designed to provide information on individuals’ experience of services provided by general practices in England [6]. Approximately 2.6 million surveys are sent annually to randomly selected adults (18+ years) who must have been registered with a practice for at least 6 months [7]. As almost the entirety of the English population is registered with a GP practice [8], the GPPS represents a general population sample. The annual response rate is approximately 35%, resulting in approximately 1 million individuals per year. The GPPS is a repeated cross-section, meaning that although it is possible for individuals to be randomly selected in different years, the same individual cannot be tracked over time.
The GPPS collects information on individual sociodemographic characteristics and multiple measures of individuals’ health, including the EQ-5D. Our study used data from the 2011 and 2012 survey years. Importantly, between these years the EQ-5D version changed from the EQ-5D-3L to the EQ-5D-5L.
We restricted our sample to individuals with complete data on EQ-5D and relevant health and sociodemographic characteristics. Missing data were not imputed. The total eligible sample size was 1,411,680 individuals (2011: 718,239; 2012: 693,441).
Matching
We used coarsened exact matching (CEM) to ensure individuals in the 2011 (EQ-5D-3L) and 2012 (EQ-5D-5L) samples were identical on characteristics that are either predictive of health or predictive of how health is reported [9]. CEM first coarsens continuous variables into bins, with binary or categorical variables left uncoarsened. Strata are then created for each unique combination of the coarsened and binary/categorical variables and exact matching is applied to these strata. Strata containing individuals in the 2011 sample only or 2012 sample only were pruned from the study sample. k-to-k matching was used to ensure the same number of individuals from the 2011 and 2012 years were included in each stratum.
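The matching steps described above can be illustrated with a minimal Python sketch (the analyses themselves were run in Stata). The function names, the dictionary-based rows, and the cutpoints are hypothetical, and the random pruning used to equalise stratum sizes is an assumption, as the text does not state how surplus individuals were dropped.

```python
import random
from collections import defaultdict

def coarsen(value, cutpoints):
    """Return the 1-based bin index of `value` given ascending cutpoints."""
    return 1 + sum(value > c for c in cutpoints)

def cem_k_to_k(sample_a, sample_b, key, seed=0):
    """Exact-match two samples on `key`, keeping equal numbers per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(lambda: ([], []))
    for row in sample_a:
        strata[key(row)][0].append(row)
    for row in sample_b:
        strata[key(row)][1].append(row)
    matched = {}
    for stratum, (a_rows, b_rows) in strata.items():
        k = min(len(a_rows), len(b_rows))
        if k == 0:
            continue  # prune strata observed in one sample only
        # Assumption: surplus individuals are dropped at random (k-to-k).
        matched[stratum] = (rng.sample(a_rows, k), rng.sample(b_rows, k))
    return matched
```

Here `key` would return the tuple of uncoarsened categorical variables together with the coarsened IMD bin, so that each distinct tuple defines one stratum.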
CEM improves on exact matching by reducing the number of unmatched observations when matching on continuous variables [10], and outperforms other matching methods on a variety of other criteria [11]. Information on characteristics used in the matching process and the chosen degree of coarsening are provided in Table 1, including individual-level sociodemographic characteristics and measures of health and health behaviours. Area-level deprivation was measured using the overall score of the 2010 Index of Multiple Deprivation (IMD) [12]. The sole continuous variable, the IMD score, was coarsened into quintiles.
Table 1 Matching variables and the degree of coarsening
To measure balance in the distributions of characteristics following matching, we computed differences in the mean, median, range, and 25th and 75th percentiles between the 2011 and 2012 samples. We additionally used the L1 statistic to examine balance on the joint distribution of all characteristics [9]. Further details on CEM and the L1 statistic are provided in Electronic Supplementary Appendix 1.
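As an illustration of the joint-balance measure, the sketch below computes the L1 statistic in its usual form: half the sum of absolute differences between the two samples' relative frequencies over all cells of the multivariate histogram. The function name and input layout (one stratum label per individual) are hypothetical, and the exact implementation in the supplementary appendix may differ.

```python
from collections import Counter

def l1_statistic(cells_a, cells_b):
    """L1 imbalance: 0 means identical joint distributions, 1 means no overlap."""
    freq_a, freq_b = Counter(cells_a), Counter(cells_b)
    n_a, n_b = len(cells_a), len(cells_b)
    cells = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a[c] / n_a - freq_b[c] / n_b) for c in cells)
```

After k-to-k matching on the same cells, every stratum holds equal counts from each year, so the statistic computed on the matching cells is zero by construction.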
Analysis
We conducted all analyses both for the whole sample (i.e. general population analysis) and for individuals with multimorbidity. We followed the most commonly used definition of multimorbidity, i.e. an individual with two or more long-term health conditions [13]. All analyses were conducted in Stata version 14 (StataCorp LLC, College Station, TX, USA).
Distributional Properties of EQ-5D Responses
We first examined how the change to the EQ-5D-5L altered average response patterns. This was done by comparing the percentage of respondents reporting each EQ-5D level in both descriptive systems. We also explored whether the introduction of additional levels reduced ceiling effects present in the EQ-5D-3L, by comparing the proportion of individuals reporting ‘no problems’ across all dimensions.
Inconsistency
We compared 3L and 5L responses within pairs of randomly matched individuals within an identical stratum and used matched responses to examine inconsistency. We recoded 3L responses to the equivalent for the 5L, e.g. 1 = 1, 2 = 3, and 3 = 5, and assumed that in the absence of the version change, the 5L and recoded 3L responses within matched pairs would have been identical. Thus, any differences in responses can be attributed to the version change.
Following Janssen et al. [14], we defined an inconsistent response for a domain as a response to the EQ-5D-3L that is at least two levels away from the EQ-5D-5L, e.g. an individual reporting ‘no pain or discomfort’ on the EQ-5D-3L but their matched respondent reporting ‘moderate pain or discomfort’ on the EQ-5D-5L. For total inconsistency, we defined an inconsistent profile as one where any individual domain is different by more than two levels. For example, the EQ-5D-3L (coded to EQ-5D-5L) profile 12221 would be consistent with the EQ-5D-5L profile 12223 but would be inconsistent with the EQ-5D-5L profile 12224.
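The recoding and the two inconsistency rules can be sketched in Python as follows. The function names are hypothetical, and the thresholds follow the wording above literally: at least two levels apart for a single domain, and more than two levels apart in any domain for a profile.

```python
# Recoding of 3L levels onto the 5L scale, as described in the text.
RECODE_3L_TO_5L = {1: 1, 2: 3, 3: 5}

def recode_profile(profile_3l):
    """Map a 3L profile string such as '123' onto the 5L scale."""
    return [RECODE_3L_TO_5L[int(ch)] for ch in profile_3l]

def domain_inconsistent(level_3l, level_5l):
    """Domain rule: recoded 3L level is at least two levels from the 5L level."""
    return abs(RECODE_3L_TO_5L[level_3l] - level_5l) >= 2

def profile_inconsistent(recoded_3l, profile_5l):
    """Profile rule as worded: any domain differs by more than two levels."""
    return any(abs(a - b) > 2 for a, b in zip(recoded_3l, profile_5l))
```

Applied to the examples in the text, `domain_inconsistent(1, 3)` flags the 'no pain' versus 'moderate pain' pair, while the profile 12221 is consistent with 12223 and inconsistent with 12224.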
For calculations of inconsistency, we repeated (n = 10) the process of randomly matching individuals within strata. All results are based on mean values across all randomisations. We examined variation in inconsistency across different randomisations as an indirect test of the matching assumption. Similarity in changes when 3L respondents are matched to different 5L respondents (with potentially different unobserved characteristics) provides evidence supporting the matching assumption.
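The repeated randomisation can be sketched as below, assuming each stratum holds equally sized lists of 3L and 5L responses (as k-to-k matching guarantees). The function names, the data layout, and the use of the standard deviation to summarise variation across randomisations are illustrative assumptions rather than the procedure actually coded in Stata.

```python
import random
from statistics import mean, pstdev

def random_pairs(rows_3l, rows_5l, rng):
    """Pair each 3L respondent with a randomly chosen 5L respondent."""
    shuffled = rows_5l[:]
    rng.shuffle(shuffled)
    return list(zip(rows_3l, shuffled))

def repeated_inconsistency(strata, is_inconsistent, n_repeats=10, seed=0):
    """strata: dict mapping stratum -> (3L responses, 5L responses)."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_repeats):
        flags = [is_inconsistent(a, b)
                 for rows_3l, rows_5l in strata.values()
                 for a, b in random_pairs(rows_3l, rows_5l, rng)]
        rates.append(sum(flags) / len(flags))
    # Mean rate across randomisations, plus its spread between randomisations.
    return mean(rates), pstdev(rates)
```

A small spread relative to the mean would correspond to the similarity across randomisations that the text treats as support for the matching assumption.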
Informativity
We used Shannon’s indices to examine the relative discriminatory power of the 3L and 5L [3]. These indices are derived from information theory and assume that a measure provides the most information if responses are distributed equally across all response categories. They also provide a quantitative measure of the degree of response redistribution due to the additional categories. We first calculated the Shannon–Weaver index (H′), which is calculated as (Eq. 1):
$$ H^{\prime} = - \sum_{i = 1}^{L} p_{i} \log_{2} p_{i}, \tag{1} $$
where \( L \) denotes the total number of response categories, and \( p_{i} \) is the proportion of individuals reporting health in response category \( i \). \( H^{\prime} \) ranges from zero (least informative) to \( \log_{2} L \) (most informative), the latter indicating proportions in each response category are identical. However, it is difficult to compare \( H^{\prime} \) across measures as its maximum level is determined by the number of levels (\( \log_{2} 3 = 1.58 \) for the 3L and \( \log_{2} 5 = 2.32 \) for the 5L). We therefore additionally computed the Shannon’s Evenness index (\( J^{\prime} \)), which scales \( H^{\prime} \) by the maximum \( H^{\prime} \) for a measure with the same number of response categories (Eq. 2):
$$ J^{\prime} = \frac{H^{\prime}}{H^{\prime}_{\max}}. \tag{2} $$
We computed \( H^{\prime} \) and \( J^{\prime} \) for response categories for each individual domain.
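Equations 1 and 2 can be computed for a single domain with the short Python sketch below. The function name and input layout (a list of reported levels) are hypothetical. Unused categories are simply skipped, which is equivalent to treating \( p_{i} \log_{2} p_{i} \) as zero when \( p_{i} = 0 \).

```python
from collections import Counter
from math import log2

def shannon_indices(responses, n_levels):
    """Return (H', J') for one domain with `n_levels` response categories."""
    counts = Counter(responses)
    n = len(responses)
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    j = h / log2(n_levels)  # H'_max = log2(L)
    return h, j
```

For example, responses split evenly between two of the five 5L levels give \( H^{\prime} = 1 \) and \( J^{\prime} = 1 / \log_{2} 5 \approx 0.43 \), whereas the same split on the 3L gives \( J^{\prime} = 1 / \log_{2} 3 \approx 0.63 \).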
We also explored how the full descriptive systems are used by calculating the proportion of health states selected by individuals for the EQ-5D-3L and EQ-5D-5L. We ranked the health states in terms of frequency selected and plotted the cumulative frequency.
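A minimal sketch of these two summaries is given below, assuming profiles are stored as five-character strings such as '11121'. The function names are hypothetical, and the denominator for the share of states used is taken to be all possible profiles (\( 3^{5} = 243 \) for the 3L and \( 5^{5} = 3125 \) for the 5L), which is an interpretation of the text.

```python
from collections import Counter

def share_of_states_used(profiles, n_levels, n_domains=5):
    """Proportion of all possible health states observed at least once."""
    return len(set(profiles)) / n_levels ** n_domains

def cumulative_state_shares(profiles):
    """Cumulative share of respondents, states ranked from most to least frequent."""
    counts = Counter(profiles)
    n = len(profiles)
    running, shares = 0, []
    for _, count in counts.most_common():
        running += count
        shares.append(running / n)
    return shares
```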
Response Change
We first measured the impact on responses by assessing the sensitivity of the two measures for picking up small movements away from full health for both the general population and for those reporting having no chronic health conditions. We calculated the proportion of individuals reporting ‘no problems’ for each domain and stratified the results by the response to the question “Have your activities been limited today because you have recently become unwell or been injured? (‘no limitations’/‘some limitations’)”. We calculated the change in the proportion reporting ‘no problems’ when moving from no limitations to some limitations for each domain and then calculated the difference in the changes between the 3L and the 5L to calculate a difference-in-difference. A positive difference-in-difference is indicative of greater sensitivity of the 5L compared with the 3L.
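The difference-in-difference for one domain can be sketched as follows. The function names and inputs (lists of reported levels, where level 1 is 'no problems') are hypothetical. The sign convention is an assumption chosen to match the interpretation above: each change is the 'some limitations' proportion minus the 'no limitations' proportion, and the 5L change is subtracted from the 3L change, so a larger fall under the 5L yields a positive value.

```python
def prop_no_problems(levels):
    """Share of respondents reporting level 1 ('no problems') on a domain."""
    return sum(1 for level in levels if level == 1) / len(levels)

def sensitivity_did(no_lim_3l, some_lim_3l, no_lim_5l, some_lim_5l):
    """Difference-in-difference in the share reporting 'no problems'."""
    change_3l = prop_no_problems(some_lim_3l) - prop_no_problems(no_lim_3l)
    change_5l = prop_no_problems(some_lim_5l) - prop_no_problems(no_lim_5l)
    # Assumed sign convention: positive when the 5L share falls by more.
    return change_3l - change_5l
```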
To provide a consideration of response change across all levels, we also used matched pairs from the inconsistency analysis to examine whether the version change caused individuals overall to select levels indicative of poorer or better health. We did this first at a domain-level by comparing mean level responses between the recoded EQ-5D-3L (1 = 1; 2 = 3; 3 = 5) and the EQ-5D-5L. Mean level response differences are calculated for matched pairs by subtracting EQ-5D-3L from EQ-5D-5L responses, with positive scores implying lower levels of health are reported with the EQ-5D-5L and negative scores implying lower levels of health are reported on the EQ-5D-3L. We then compared the completion of the total profile by comparing EQ-5D-3L and EQ-5D-5L respondents on a ‘misery index’ that is equivalent to the sum of the levels across all domains. Calculations of response change are based on randomly matched samples of individuals within strata.
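Both response-change measures reduce to simple arithmetic on matched pairs, sketched below with hypothetical function names. Whether the 3L profile is recoded onto the 5L scale before the misery index is summed is not stated in the text, so `misery_index` is written to accept any sequence of levels and leaves that choice to the caller.

```python
# Recoding of 3L levels onto the 5L scale, as described in the text.
RECODE_3L_TO_5L = {1: 1, 2: 3, 3: 5}

def mean_level_difference(pairs):
    """pairs: (3L level, 5L level) for matched individuals on one domain.

    Positive values mean poorer health is reported on the EQ-5D-5L.
    """
    return sum(l5 - RECODE_3L_TO_5L[l3] for l3, l5 in pairs) / len(pairs)

def misery_index(levels):
    """Sum of levels across all domains of one profile."""
    return sum(int(level) for level in levels)
```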
For a selection of matched strata where total profile inconsistency was below 2.5%, we also depicted flows in responses between the EQ-5D-3L and EQ-5D-5L. To do this, we randomly matched individuals across the EQ-5D-3L and EQ-5D-5L datasets who were in the same stratum, i.e. those who have identical matched characteristics apart from being given the two different EQ-5D versions across two different years. We then calculated cross-tabulations, showing for each level and domain of the EQ-5D-3L the corresponding level and domain for the EQ-5D-5L. We then depicted these cross-tabulations graphically using ‘Sankey’ diagrams [15, 16].
Impact on Utility Indices
We examined the consequences of the EQ-5D version change on the distribution of utility scores by generating utility values from the EQ-5D-3L using the value set derived by Dolan [17], and values for the EQ-5D-5L using the mapping algorithm developed by van Hout et al. [18]. We also calculated utility values for 5L respondents using the value set derived by Devlin et al. [19]. We compared the utility distributions by examining differences in kernel density and exploring the impact on mean utility for condition counts of up to five concurrent conditions.