Psychometric properties of the EQ-5D-5L: a systematic review of the literature

Purpose Although the EQ-5D has a long history of use in a wide range of populations, the newer five-level version (EQ-5D-5L) has not yet been used as extensively. This systematic review summarizes the available published scientific evidence on the psychometric properties of the EQ-5D-5L. Methods Pre-determined key words and exclusion criteria were used to systematically search publications from 2011 to 2019. Information on study characteristics and psychometric properties was extracted: specifically, EQ-5D-5L distribution (including ceiling and floor), missing values, reliability (test–retest), validity (convergent, known-groups, discriminant) and responsiveness (distribution, anchor-based). EQ-5D-5L index value means, ceiling and correlation coefficients (convergent validity) were pooled across the studies using random-effects models. Results Of the 889 identified publications, 99 were included for review, representing 32 countries. Musculoskeletal/orthopedic problems and cancer (n = 8 each) were most often studied. Papers generally found missing values (17 of 17 papers) and floor effects (43 of 48 papers) to be unproblematic. While the index was found to be reliable (9 of 9 papers), individual dimensions exhibited instability over time. Index values and dimensions demonstrated moderate to strong correlations with global health measures, other multi-attribute utility instruments, physical/functional health, pain, activities of daily living, and clinical/biological measures. The instrument correlated only weakly with life satisfaction and cognition/communication measures. Responsiveness was addressed by 15 studies, finding moderate effect sizes when confined to studied subgroups with improvements in health. Conclusions The EQ-5D-5L exhibits excellent psychometric properties across a broad range of populations, conditions and settings. Rigorous exploration of its responsiveness is needed.
Electronic supplementary material The online version of this article (10.1007/s11136-020-02688-y) contains supplementary material, which is available to authorized users.


Background
The EQ-5D is a broadly used generic multi-attribute health utility instrument. In addition to a thermometer-like visual analog scale (VAS) anchored by 0 (worst imaginable health) and 100 (best imaginable health), the EQ-5D's descriptive system comprises five dimensions with one item per dimension: mobility (MO), self-care (SC), usual activities (UA), pain/discomfort (PD) and anxiety/depression (AD). Responses to these items can be converted into a single measure of health utility using preference-based (typically country-specific) weights. Preference weights are derived from preference elicitation studies using hypothetical EQ-5D health profiles [1], typically sampling a general population. Until 2005, respondents could select from three response levels of function or symptoms for each dimension (the EQ-5D-3L; 3L). However, due to evidence of notable ceiling effects of the EQ-5D-3L in some populations [2][3][4][5] and concerns regarding the instrument's sensitivity to certain patient-relevant changes [6][7][8][9][10], a five response level version of the instrument was developed by the EuroQol group in 2005 [11,12]. The five-level version (EQ-5D-5L; 5L) added two response levels: one between "no problems" (level 1) and "moderate/some problems" (level 2 in 3L, level 3 in 5L), and another one between "moderate/some problems" and "severe problems" (level 3 in 3L, level 5 in 5L). The EQ-5D-5L also replaced the EQ-5D-3L's term "some" with "moderate" in the middle response level for the first three dimensions, while the most severe response level for MO was changed from "confined to bed" to "unable to walk about". Additionally, the instructions for marking overall health today on the VAS differed between the two versions until 2019.
The EQ-5D-5L is currently available in more than 130 languages [13] and has been formally tested against the EQ-5D-3L in numerous studies, demonstrating improved psychometric properties over the EQ-5D-3L [14]. An interim scoring strategy that applies existing EQ-5D-3L preference weights to the EQ-5D-5L can be used if EQ-5D-5L preference weights for certain populations are not yet available [A4].
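As an illustration of how a five-digit health profile is converted into a single index value, the following minimal sketch assumes a purely additive model in which each response level subtracts a fixed decrement from full health (1.0). The decrements, the dictionary name and the function name are hypothetical and invented for this example; they are not taken from any published value set, and real value sets are country-specific and may contain additional terms.

```python
# Hypothetical per-level utility decrements for levels 1..5 of each dimension.
# These numbers are invented for illustration and are NOT a published value set.
HYPOTHETICAL_DECREMENTS = {
    "MO": [0.00, 0.05, 0.10, 0.20, 0.30],
    "SC": [0.00, 0.04, 0.08, 0.16, 0.25],
    "UA": [0.00, 0.04, 0.08, 0.15, 0.22],
    "PD": [0.00, 0.06, 0.12, 0.25, 0.35],
    "AD": [0.00, 0.05, 0.11, 0.22, 0.30],
}
DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")


def index_value(profile, decrements=HYPOTHETICAL_DECREMENTS):
    """Convert a profile such as '21231' into an index value (additive sketch)."""
    if len(profile) != 5 or any(level not in "12345" for level in profile):
        raise ValueError("profile must be five digits, each between 1 and 5")
    # Sum the decrement attached to the reported level of each dimension.
    total = sum(
        decrements[dim][int(level) - 1]
        for dim, level in zip(DIMENSIONS, profile)
    )
    return round(1.0 - total, 3)
```

Under these made-up weights the profile '11111' maps to 1.0 and the worst profile '55555' maps to a value below zero, which mirrors the fact that preference-based value sets can assign negative values to very poor health states.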
Although its use has expanded to a wide range of settings and research purposes, no study has yet comprehensively reviewed the measurement properties of the EQ-5D-5L. This review will be informative for researchers interested in economic evaluation and preference measurement, decision makers, users of the EQ-5D-5L as a patient-reported outcome measure for improving health care, and readers who need to interpret the findings from studies incorporating the EQ-5D-5L. The 5L instrument has now enjoyed over a decade of use and this paper aims to summarize the existing evidence on the psychometric properties of the EQ-5D-5L. A second objective of this review is to identify knowledge gaps regarding the psychometric properties of the EQ-5D-5L, and to highlight important areas for future research.

Methods
This literature search and review was guided by the PRISMA guidance on systematic reviews and meta-analyses [15]. This review focuses on the descriptive system of the EQ-5D-5L (the five items) as it was not always clear which version of the EQ-VAS was used in extracted studies.

Literature search
Four online databases were searched: PubMed (MEDLINE), PsycINFO, Excerpta Medica Database (EMBASE), and the EuroQol website. The pre-determined search terms were "EQ-5D," "EQ-5D-5L," "5L," "EuroQol" and "5 Level." The search included publications up to January 2019. Duplicates were assessed using author names, titles and journals. The exact search strategy and terms can be found in Supplementary Table 1.
Two screening phases were conducted: (1) title and abstract, and (2) full text. Two researchers experienced in psychometric research methods and the EQ-5D instruments (IB and YF) independently screened the publications and reached consensus on any disagreements to determine inclusion. When consensus could not be reached, two senior researchers with extensive experience in psychometric research, health-related quality of life (HRQoL) measurement and the EQ-5D instrument were consulted for a final decision (TK and MFJ).
The a priori exclusion criteria were:
1. does not study humans 18 years or older;
2. publication language is other than German or English;
3. does not assess the official version of the EQ-5D-5L (including studies that used an experimental version of the 5L);
4. published prior to 2005 (prior to development of the 5L);
5. not a peer-reviewed primary study, literature review or conference paper (conference papers were included but other conference proceedings such as presentations or posters were excluded); and
6. does not evaluate the measurement and psychometric properties of the EQ-5D-5L.

Data extraction
Publications selected for inclusion were reviewed and data were entered into pre-determined tables by either YF or IB. In some cases, values had to be estimated from the available information: when means and standard deviations were not reported but other sufficient data were (such as range or median), the mean and standard deviation were estimated using the recommendations of Wan et al. 2014 [16]. When multiple studies used the same underlying dataset, data were extracted only once (e.g., [A20, A26, A31, A36-A38, A49, A53, A77, A79, A96]).
General study characteristics, including sample size, study design, sample characteristics and version of the EQ-5D-5L, were extracted, as was information on distributional properties for dimensions as well as the health profile: means, percent reporting best health ("no problems" on dimensions or '11111' across the health profile), percent reporting worst health ("extreme" or "unable to" on dimensions or '55555' across the health profile) and missing values. Although no guidance specifies what level of missing values indicates the feasibility of an instrument, ≤ 5% has been found to be acceptable for multiple imputation [17]. Missing values ≤ 5% and floor ≤ 15% are considered acceptable [18].
Reliability is the consistency of an instrument, both internally (the extent to which subscale items are interrelated) and across time (whether the instrument produces similar results in stable conditions). Internal consistency is not a relevant psychometric property for the EQ-5D instruments and was therefore not included in this review. Agreement between two applications of the instrument over a period in which the measured health state should be stable (test-retest) is usually evaluated using Cohen's Kappa (κ) for categorical items (EQ-5D-5L items) or the intraclass correlation coefficient (ICC) for continuous values (EQ-5D-5L index value), with a level of ≥0.8 and ≥0.7 determined as acceptable, respectively [19][20][21].
We relied on the guidance from Cicchetti 1994 [22] to define Kappa and ICC: < 0.40 = poor, 0.40-0.59 = fair, 0.60-0.74 = good, 0.75-1.00 = excellent. Other methods such as area under the receiver operating characteristic curve (AUROC) were also reported [23,24].
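The item-level agreement statistic and the interpretation bands above can be sketched as follows. This is a minimal illustration rather than the procedure used by any included study: it computes unweighted Cohen's Kappa (whether individual studies used weighted or unweighted Kappa is not stated here), and the function names are hypothetical.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Unweighted Cohen's Kappa for paired categorical responses (test, retest)."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty response lists of equal length")
    n = len(first)
    # Observed agreement: share of respondents giving the same level twice.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of the marginal proportions, summed over levels.
    freq_first, freq_second = Counter(first), Counter(second)
    categories = set(first) | set(second)
    expected = sum(freq_first[c] * freq_second[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when all responses share one level")
    return (observed - expected) / (1.0 - expected)


def cicchetti_label(value):
    """Map a Kappa or ICC value to the Cicchetti (1994) bands quoted above."""
    if value < 0.40:
        return "poor"
    if value < 0.60:
        return "fair"
    if value < 0.75:
        return "good"
    return "excellent"
```

For example, `cicchetti_label(cohens_kappa(test_levels, retest_levels))` would label the agreement for one dimension, where `test_levels` and `retest_levels` are the response levels (1 to 5) reported by the same respondents at the two time points.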
In general, validity refers to the degree to which a measurement tool captures the underlying construct of interest. We extracted all information regarding different forms of validity from included publications. The most commonly investigated form was convergent validity (a specific subtype of construct validity), which examines how closely two instruments that are intended to measure the same construct are related. This is most often done by testing the correlation between the EQ-5D-5L and other measures of health or health-related quality of life (including those measuring pain, and mental or physical health or HRQoL). Other validity results extracted include known-groups validity (examining whether the 5L can distinguish between a priori determined groups).
Responsiveness is the ability of an instrument to capture true changes (e.g., due to a health intervention) in the construct of interest over time. Some argue that responsiveness is a subtype of validity or reliability [25]. Responsiveness is of particular importance for the EQ-5D-5L: one of the reasons the instrument was created was to address criticisms that the EQ-5D-3L was not sufficiently sensitive to change [26]. Responsiveness can be specific to population and context, and depends on the direction of change in the underlying construct [27]. In the case of the EQ-5D-5L, responsiveness addresses the question of whether the index value or individual items can detect relevant changes in underlying health. Preliminary research conducted on experimental five-level versions of the EQ-5D found its index value to be sensitive to change. Commonly used methods for evaluating responsiveness include the standardized effect size (SES) and/or the standardized response mean (SRM) [25,27,28]. Both standardize the difference in means between two measurement points by dividing by a standard deviation (of the baseline scores for the SES, or of the change scores for the SRM). An SES of 0.2 to 0.3 is considered a small, ≈ 0.5 a medium and ≥ 0.8 a large effect size [29]. Some studies examined the EQ-5D-5L's ability to detect a change as defined by an external criterion, or anchor, to estimate the minimally important difference (MID), the smallest change in score that is beneficial or relevant for patients [27,28,30]. The external anchor is usually a patient assessment.
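The two distribution-based statistics can be sketched as below. This is a minimal illustration under the usual convention that the SES divides the mean change by the standard deviation of the baseline scores and the SRM divides it by the standard deviation of the individual change scores; the function names are hypothetical, and paired baseline and follow-up index values for the same respondents are assumed.

```python
from statistics import mean, stdev


def standardized_effect_size(baseline, follow_up):
    """SES: mean change divided by the sample SD of the baseline scores."""
    if len(baseline) != len(follow_up) or len(baseline) < 2:
        raise ValueError("need paired scores for at least two respondents")
    return (mean(follow_up) - mean(baseline)) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """SRM: mean change divided by the sample SD of the change scores."""
    if len(baseline) != len(follow_up) or len(baseline) < 2:
        raise ValueError("need paired scores for at least two respondents")
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)
```

Because the SRM denominator shrinks when respondents change by similar amounts, the SRM can be considerably larger than the SES for the same data, which is one reason the two statistics are not interchangeable when comparing studies.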

Analysis
Due to the heterogeneity of studies and outcomes included, we were only able to summarize three outcomes across studies: proportion of respondents reporting the best health, mean index values, and the EQ-5D-5L's correlations with other measures (Spearman's rho or Pearson's r). When multiple index scores were reported in a study, the most up-to-date (EQ-5D-5L value set as opposed to the interim or 'crosswalk' scoring) or most appropriate (closest to the sampled population) index scores were extracted. The signs of correlation coefficients were changed if authors had not corrected for the directionality of the scales. Subgroup analysis was performed when there were at least three studies representing a relevant subgroup.
Data were pooled using random-effects models with inverse-variance weighting. Pooling was based on Fisher's z transformation of correlation coefficients and logit transformation of proportions. Microsoft Excel was used for data extraction, while R was used for data analysis [31]. The R package "meta" was used to estimate pooled values [32].
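The pooling approach described above can be sketched in a few lines. The analysis itself was run with the R package "meta"; the Python below is only an illustrative re-implementation with hypothetical function names, and it assumes the DerSimonian-Laird estimator for the between-study variance, which is not stated in the text. Proportions must lie strictly between 0 and 1 for the logit transformation to be defined.

```python
import math


def pool_random_effects(estimates, variances):
    """Inverse-variance random-effects pooled estimate (DerSimonian-Laird tau^2)."""
    k = len(estimates)
    weights = [1.0 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Cochran's Q measures the heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add the between-study variance to each variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    return sum(w * y for w, y in zip(re_weights, estimates)) / sum(re_weights)


def pool_correlations(correlations, sample_sizes):
    """Pool correlations on Fisher's z scale and back-transform to r."""
    z_values = [math.atanh(r) for r in correlations]
    variances = [1.0 / (n - 3) for n in sample_sizes]
    return math.tanh(pool_random_effects(z_values, variances))


def pool_proportions(events, sample_sizes):
    """Pool proportions on the logit scale and back-transform to a proportion."""
    logits = [math.log(e / (n - e)) for e, n in zip(events, sample_sizes)]
    variances = [1.0 / e + 1.0 / (n - e) for e, n in zip(events, sample_sizes)]
    pooled = pool_random_effects(logits, variances)
    return 1.0 / (1.0 + math.exp(-pooled))
```

Working on the transformed scales keeps the pooled correlation within (-1, 1) and the pooled proportion within (0, 1) after back-transformation, which is the main reason for transforming before weighting.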

Results
We identified 496 papers during the initial search and an additional 397 papers during the updates in 2018 and 2019, of which 99 papers were included for review (Fig. 1). These papers covered general population samples (n = 32) and patient samples (n = 58) from 32 countries (see Table 1). The most studies were conducted in the UK/England (n = 18), followed by Canada, Germany, Singapore and the USA (n = 8 each). The patient groups represented by the most studies were musculoskeletal/orthopedic (n = 8), cancer (n = 8) and lung/respiratory diseases (n = 7). The Multi-Instrument Comparison study (MIC) [A20, A26, A31, A36-A38, A49, A53, A77, A79, A96] and the study that developed a method of deriving 5L interim index values from 3L value sets [A4, A6, A83] were represented by 11 and 3 papers, respectively. General characteristics of included studies can be found in Supplementary Table 2.

Distribution properties
Missing values (17 of 17 papers) and reports of the most severe health state (43 of 48 papers) were under 5% and 15%, respectively, showing the 5L to be feasible and largely free from floor effects (Table 1). Studies in which more than 15% reported the most severe level (in certain dimensions) were those of patients with stroke [A28, A46], patients with spinal cord injury [A56], women just after giving birth [A84] and patients with chronic illnesses [A83]. These respondents reported severe health impairments in MO, SC, and/or UA. Forty-eight studies reported enough information to pool the proportion reporting the best health state '11111', which was 23% for patients, ranging from 2% (musculoskeletal diseases) to 36% (cancer; Fig. 2a). A pooled proportion of over 15% at full health was observed for patients with diabetes, cancer, liver diseases, kidney diseases and skin diseases. In general population and healthy population studies, 48% and 41% reported full health, respectively (Fig. 2b).
By dimension, proportions reporting "no problems" were smallest across the board for stroke, while SC consistently had large ceilings except for patients with stroke, diseases of the nervous system and diseases of the musculoskeletal system (pooled proportion reporting "no problems" in EQ-5D-5L dimensions can be found in Supplementary Table 3). Konnopka and Koenig (2017) also found SC to be most problematic in terms of percentage at the ceiling, even for those reporting four or more diseases and needing one or more hours of daily care [A61].
Index value means could be pooled from 58 publications. They were generally lower for disease groups than for healthy populations, and lower for groups of lower socio-economic/socio-demographic status than for those of higher status (Fig. 3a, b).
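The distributional checks used in this section (missing values, ceiling at '11111' and floor at '55555') can be sketched at the profile level as follows. The function name, the input format and the choice to compute ceiling and floor among complete responses only are assumptions made for this illustration; the thresholds are the ≤ 5% and ≤ 15% values given in the Methods.

```python
def distribution_summary(profiles):
    """Summarize missing, ceiling and floor percentages for a list of profiles.

    Each entry is a five-digit string such as '21231', or None when the
    respondent did not provide a complete profile.
    """
    total = len(profiles)
    complete = [p for p in profiles if p is not None]
    if total == 0 or not complete:
        raise ValueError("need at least one complete profile")
    missing_pct = 100.0 * (total - len(complete)) / total
    # Ceiling and floor are computed among complete responses (an assumption).
    ceiling_pct = 100.0 * sum(p == "11111" for p in complete) / len(complete)
    floor_pct = 100.0 * sum(p == "55555" for p in complete) / len(complete)
    return {
        "missing_pct": missing_pct,
        "ceiling_pct": ceiling_pct,
        "floor_pct": floor_pct,
        "missing_acceptable": missing_pct <= 5.0,
        "floor_acceptable": floor_pct <= 15.0,
    }
```

The same counting logic applies to a single dimension by comparing one digit of each profile against '1' (no problems) or '5' (extreme problems or unable to) instead of the whole string.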

Reliability
Nine papers addressed test-retest reliability: eight found the scale agreement (ICC) to be excellent and the remaining one reported an ICC of 0.7. However, five studies found only fair agreement at the item level (Cohen's Kappa) for certain dimensions; Kappa values tended to be lowest for PD and highest for MO (Table 1).

Validity
Studies examining construct validity typically compared the EQ-5D-5L to the EQ-5D-3L; the focus has been on the response categories, as the items themselves were identical. As we did not include studies with experimental versions of the 5L, most of the earlier studies examining the construct validity of various response options of the 5L have not been included. One included study used exploratory factor analysis to examine the structure of the EQ-5D-5L, the Satisfaction with Life Scale and the MacNew questionnaire [A96]. It found MO, SC, UA, and PD to load onto one factor with other physical health and usual activity items, and AD to load onto a second factor including items addressing mood, depression, and confidence.
Of the five included papers addressing content validity, three used qualitative methods. Keeley et al. (2013) sampled research professionals, who found the SC item to be too narrowly defined and the UA item to be too broad, while deeming PD and AD the dimensions most relevant to health-related quality of life [A7]. Whitehurst et al. (2014) sampled patients with spinal cord injuries, who generally found the 5L to be relevant for their health problems [A21]. However, some found the instrument to lack coverage of specific aspects of spinal cord injury. A more recent qualitative study found the EQ-5D-5L to lack relevance for asthma patients except for some physical limitations, but also praised the instrument for its generic nature [A92]. Craig et al. (2014) found via regression analysis that the 5L encompasses a slightly larger range of EQ-VAS scores from best to worst health state compared to the 3L [A15]. Another study also investigated the distance between the 3L and 5L levels, using a direct approach in which patients were asked to place the labels onto a horizontal VAS, and likewise found a larger range covered by the 5L [A83].
Convergent validity was assessed by the greatest number of papers (n = 33), usually by examining correlations of the EQ-5D-5L with other measures of health using Pearson's correlation or Spearman's rank correlation coefficient (rho). Figure 4a-c illustrates pooled correlations of the EQ-5D-5L index value with other measures of physical health, mental/social/cognitive health and global health. The strongest correlations were observed for multi-attribute utility instruments (pooled rho = 0.756), physical/functional measures (pooled rho = 0.582) and pain/discomfort measures (pooled rho = 0.595). The EQ-5D-5L index value correlated poorly with measures of satisfaction (pooled rho = 0.335) and cognition/communication (pooled rho = 0.259). Bhadhuri et al. (2017) examined the EQ-5D-5L's ability to measure spillover effects and found strong correlations between EQ-5D-5L scores of family members of meningitis survivors and survivors' social lives (Spearman's rho = 0.52, 0.45), exercise (rho = 0.55, 0.82), and personal health (rho = 0.88, 0.95) [A57]. Poor correlations were found between carers' and survivors' EQ-5D-5L dimensions (rho = 0.07 to 0.24), index (rho = 0.19, 0.26), and EQ-VAS (rho = 0.22, 0.24).
Table 2 summarizes studies that examined forms of validity other than convergent validity. Generally, the 5L can distinguish across disease groups, disease severity, symptoms, and related groups, and also across age and education. However, it does not consistently distinguish across groups differing in certain clinical outcomes (e.g., presence of deformities in the spine, frequency of medication use), gender, use of health services, and marital status.
Table 1 notes: Blank cells imply that the study did not investigate and/or report on the psychometric property. a Floor defined as reporting worst health response level 5 ("extreme problems" or "unable to") for EQ-5D-5L items (Mobility MO, Self-Care SC, Usual Activities UA, Pain/Discomfort PD, Anxiety/Depression AD) and on the profile ('55555'). When not specified, reports of the worst health level for all dimensions and the profile were below 15%. b Generally assessed with effect size or tests of difference in means. c Kappa and ICC defined as … Table 4.

Responsiveness
Fifteen studies examined whether the EQ-5D-5L captures change in health over time. Most of these papers reported SES and/or SRM; for two papers that did not, the SES could be calculated from reported information [A71, A84]. Six assessed results across respondents who improved, remained stable or deteriorated over time based on an anchor measure [A28, A39, A57, A59, A68, A87]. Four papers also reported MID [A46, A50, A71, A85]. Two used retrospective items to define change [A50, A71]. Table 4 summarizes the responsiveness results; when available, the SES and SRM are used for ease of interpretability. The EQ-5D-5L index values typically had moderate effect sizes for improved patients and those expected to improve (over the course of medical or therapeutic intervention). The largest effect sizes were observed for patients in the days and weeks after giving birth [A84]. Compared to other instruments, the 5L generally performs as well or better. Two additional papers addressed dimension-level changes [A23, A74], both finding the 5L to be more sensitive than the 3L. Crick et al. (2018) examined only the AD dimension and noted that both the 3L and 5L were limited in responsiveness [A74].

Discussion
The EQ-5D is a generic preference-based health status instrument that has enjoyed widespread use since its creation in the 1980s [33]. The psychometric properties of the three-level version have been well established [34][35][36][37][38][39][40]. Any reluctance to use the more recently developed five-level version might come in part from limited experience and evidence for validity, reliability or responsiveness in different populations [41]. This review summarized published evidence on the psychometric properties of the EQ-5D-5L, which has been investigated in a broad array of countries, populations and contexts in the past decade. No studies found missing values to be problematic for the instrument, demonstrating feasibility. Test-retest results show potential problems with stability over time on an item level, but not at the instrument (index score) level. Note that internal consistency is not a relevant psychometric property for the EQ-5D-5L since its index score is based on a completely different measurement framework (as a preference-based measure). Rather large proportions of respondents reporting the best health profile were observed for general population studies but less so for patient populations. The EQ-5D was conceptualized to measure deviations from full health (or negative health) and is more prone to larger ceilings than instruments that include positive health dimensions (e.g., the SF-6D). Therefore, in studies with samples for which impact on the functions covered by the EQ-5D-5L is less relevant (e.g., recovered cancer patients, liver disease, diabetes), other disease-specific instruments should be used in conjunction.
On the item level, most studies, even those with populations in poorer health, reported a substantial ceiling with the dimension "self-care", although the ceiling for self-care was low for respondents who were expected to have limitations with this function (e.g., patients before hip replacement surgery, patients shortly after cesarean section, patients with spinal cord injury [A21, A24, A84]). These results suggest that while most populations may not report problems in "self-care", it is relevant for particular patient groups.
Overall, our results solidly establish the validity of the EQ-5D-5L, as supported by observed trends across subgroups (pooled means, known-group validity) as well as convergent validity (correlation of items and index to other measures of health-related quality of life). Index values as well as the dimensions show moderate to strong correlations with physical/functional measures, pain, measures of mental and emotional health, activities of daily living and clinical/biological measures as well as with other multi-attribute utility measures. On the other hand, the 5L was found to be only weakly correlated with satisfaction with life and cognition/communication measures. Indeed, current efforts investigating the addition of dimensions (so-called "bolt-ons") to the 5L have identified cognition as an important dimension missing from the EQ-5D [42][43][44].
Included studies on responsiveness are heterogeneous in terms of the population, whether and which anchors were used, whether a health intervention was administered, and stratification of results across subgroups. This is not a problem unique to the EQ-5D-5L as, unlike other psychometric properties, there is not a set of recommended analyses to address responsiveness [25,30]. Therefore, it is difficult to elucidate whether the EQ-5D-5L has problems with sensitivity to change in certain populations or with certain treatments. Despite this limitation, responsiveness is found to be acceptable by all included studies. A previous review found the EQ-5D-5L to be responsive to half of the conditions included, but found mixed evidence for the other half [26]. Responsiveness and sensitivity to changes in health is clearly an area that needs further investigation. Future studies could benefit from defining what a relevant change is for the EQ-5D-5L (MID) and defining appropriate anchor measures that can be used across populations (e.g., a level of change in EQ-VAS scores or a single self-rated health item).
Parkin and colleagues (2016) demonstrated the EQ-5D-5L distribution to be affected both by the descriptive system and the value set applied [45]. Although not a focus of this study, the valuation method and applied utility scores are as important as the descriptive system when assessing responsiveness of index values. It has been shown that choice of value set has an impact on utility scores [46][47][48][49] and may change results of cost-utility analyses [48,50,51]. Other results show that the effect of value sets on utility scores is relatively small [A37, A83]. Due to the heterogeneity of studies found in this review, we have insufficient information to evaluate how value sets impact responsiveness. Future research will benefit from systematically examining responsiveness of the descriptive system and how choice of value set further impacts responsiveness.
This review included nearly one hundred studies published in the past decade that investigated the psychometric properties of the EQ-5D-5L, the majority of which sampled populations from western Europe, OECD countries and secondarily, from East Asia. This clearly reflects where the EQ-5D-5L is currently used [52]. However, almost a third of new user registrations in 2018 came from countries accounting for less than 1.5% of total registrations, demonstrating widespread as opposed to concentrated use of the instrument [52]. For instance, two reviews report rapid uptake of the instrument in Eastern Europe [53,54]. Establishing validity in other regions is crucial as the EQ-5D-5L expands in its use. Similarly, as the EQ-5D instrument has expanded in its application, it would also be important to assess how well it performs in particular settings and applications, such as when it is used to inform clinical practice, in health services research or in health surveillance programs.

Study limitations
A limitation of this study is that studies using experimental versions of the EQ-5D-5L were excluded. Early experimental work on the content validity of the instrument [55][56][57][58][59][60][61][62] and investigations of bolt-on items [63] are therefore not captured by this review. Similarly, due to the very large number and range of quality of studies identified, we did not include application studies of the EQ-5D-5L which did not explicitly address psychometric properties, and therefore are missing distributional and perhaps responsiveness information that may have been captured by those publications. As already discussed, choice of value set and valuation methodology are as important as the descriptive system in the case of the EQ-5D. This review does not address valuation methods and therefore does not tackle a crucial component of the instrument and its index value. A previous review of valuation methodology provides valuable information on this topic [64].

Conclusions
The EQ-5D-5L is a reliable and valid generic instrument that describes health status and can be applied to a broad range of populations and settings. The assessment of responsiveness, in particular, needs further and more rigorous exploration. Rather large ceilings persist in general population samples, reflecting the conceptualization of the EQ-5D instrument, which focuses on limitations in function and symptoms, and does not include positive aspects of health such as energy or well-being.

References
22. Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4).
… Application and measurement properties of EQ-5D to measure quality of life in patients with upper extremity orthopaedic disorders: A systematic literature review. Archives of Orthopaedic and Trauma Surgery, 138(7), 953-961.
38. Pickard, A. S., Wilke, C. T., Lin, H. W., & Lloyd, A. (2007).