Combining EQ-5D-5L items into a level summary score: demonstrating feasibility using non-parametric item response theory using an international dataset

Feng, You-Shan; Jiang, Ruixuan; Pickard, A. Simon; Kohlmann, Thomas

doi:10.1007/s11136-021-02922-1

Combining EQ-5D-5L items into a level summary score: demonstrating feasibility using non-parametric item response theory using an international dataset

Special Section: Non-parametric IRT
Open access
Published: 08 July 2021

Volume 31, pages 11–23, (2022)
Cite this article

Download PDF

You have full access to this open access article

Quality of Life Research Aims and scope Submit manuscript

Combining EQ-5D-5L items into a level summary score: demonstrating feasibility using non-parametric item response theory using an international dataset

Download PDF

You-Shan Feng ORCID: orcid.org/0000-0003-1509-3409^1,2,
Ruixuan Jiang³,
A. Simon Pickard⁴ &
…
Thomas Kohlmann²

3210 Accesses
13 Citations
4 Altmetric
Explore all metrics

Abstract

Background

The EQ-5D-5L is a well-established health questionnaire that estimates health utilities by applying preference-based weights. Limited work has been done to examine alternative scoring approaches when utility weights are unavailable or inapplicable. We examined whether the Mokken scaling approach can elucidate 1) if the level summary score is appropriate for the EQ-5D-5L and 2) an interpretation of such a score.

Methods

The R package “mokken” was used to assess monotonicity (scaling coefficients H, automated item selection procedure) and manifest invariant item ordering (MIIO: paired item response functions [IRF], H^T). We used a rich dataset (the Multiple Instrument Comparison, MIC) which includes EQ-5D-5L data from six Western countries.

Results

While all EQ-5D-5L items demonstrated monotonicity, the anxiety/depression (AD) item had weak scalability (H_i = 0.377). Without AD, scalability improved from H_s = 0.559 to H_s = 0.714. MIIO revealed that the 5 items can be ordered, and the ordering is moderately accurate in the MIC data (H^T = 0.463). Excluding AD, H^T improves to 0.743. Results were largely consistent across disease and country subgroups.

Discussion

The 5 items of the EQ-5D-5L form a moderate to strong Mokken scale, enabling persons to be ordered using the level summary score. Item ordering suggests that the lower range of the score represents mainly problems with pain and anxiety/depression, the mid-range indicates additional problems with mobility and usual activities, and middle to higher range of scores reveals additional limitations with self-care. Scalability and item ordering are even stronger when the anxiety/depression item is not included in the scale.

Scoring the EQ-HWB-S: can we do it without value sets? A non-parametric item response theory analysis

Article Open access 21 February 2024

Head-to-head comparison between the EQ-5D-5L and the EQ-5D-3L in general population health surveys

Article Open access 16 August 2018

ExternalRefStart http://www.common-metrics.org www.common-metrics.org ExternalRefEnd : a web application to estimate scores from different patient-reported outcome measures on a common scale

Article Open access 19 October 2016

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Background

The EQ-5D is a widely used generic measure of health [1, 2]. As it is brief and not disease specific, the EQ-5D is applied in a broad range of settings, including measurement of health status in clinical practice, population health surveillance, assessment of healthcare quality, medical decision making, and patient communication [3,4,5,6,7,8,9]. The EQ-5D-5L expanded the response levels to five from the original three-level version (EQ-5D-3L) [10].

The EQ-5D is best known for the generation of quality-adjusted life years (QALY) in cost-utility analysis, used to inform drug reimbursement and pricing decisions in some countries/regions. Utility values, which are used to estimate QALYs, are calculated for EQ-5D-5L health states by applying a societal value set. Societal value sets are preference-based scoring weights estimated using valuation studies [11]. In valuation studies, hypothetical EQ-5D-5L health states are valued using choice-based methods, such as the time trade-off. These studies are generally conducted using representative, location/region-specific population samples. However, for many applications of the EQ-5D, population/country-specific utility scores may be unjustifiable or even introduce additional statistical biases [7, 9, 12]. An alternative method to summarize the instrument, relevant when utility weights are unavailable or unsuitable (e.g., EQ-5D-Y), is a total sum score of the severity levels on each dimension. Because each item of the EQ-5D has the same number of response levels, all items and severity levels contribute equally to this additive score. This approach has been termed “equally weighted” score [13], “unweighted” scoring approach [14, 15], and informally the “misery” score/index [16,17,18]. The term “level sum score” (LSS) was used in the recently published guidebook for analyzing EQ-5D data [16] and will be used for the remainder of this paper for consistency and clarity. The appeal of the LSS is its simplicity and consistency across populations (i.e., the same scoring system for all countries and populations).

Both the LSS and utility values are summary scores with similar limitations in interpretation; two patients may have the same summary score, but one may have extreme problems in a single dimension, whereas the other may have slight problems in several dimensions. Utility scores have found widespread acceptance over the LSS for the EQ-5D, potentially due to the rigorous development of preference elicitation.

The LSS has one major merit over utility scores when societal preference scores are unnecessary (i.e., non-economic applications): no algorithm is required to estimate the LSS, the end-user does not need to choose a specific value set to use (e.g., in multinational studies). Although previous investigations into the use of the EQ-5D LSS found substantial agreement and similar psychometric properties between the LSS and utility scores [13,14,15], the high correlations (ICC/Rho > 0.9) do not prove LSS accurately describes HRQoL or is appropriate for statistical inference. There is a dearth of literature specifically assessing the appropriateness of the LSS to describe HRQoL.

Item response theory (IRT) comprises a large set of models used to aid the construction and evaluation of multi-item scales. In general, these models assess the relationship between a latent variable of interest (θ) and the manifest/observable response patterns of a set of items. The probability of endorsing a particular response level on items of a scale is dependent on the respondent’s θ level. Parametric IRT has been previously applied to study the EQ-5D, although not to elucidate scoring [19,20,21,22]. Non-parametric item response theory (NP-IRT) approaches do not make strict assumptions about the shape of the function that describes the relationship between the response probability and the latent variable [23]. NP-IRT investigates whether the ordering of respondents along the summary score reflects the stochastic ordering of persons along θ [23, 24] instead of estimating θ. If the LSS is a proxy for θ (i.e., underlying health), then ordering of persons along the summary score is the ordering of persons along θ. Mokken scaling is a scaling approach comprising of a set of methods to assess whether the data fit a set of NP-IRT models. Two nested NP-IRT models included in Mokken scaling are as follows: the monotone homogeneity model (MHM), which examines ordering of persons along θ; and double monotonicity model (DMM), which examines ordering of persons and items along θ [25, 26]. If EQ-5D-5L data fit the MHM or DMM, then the use of LSS to represents underlying health can be justified and interpreted. The EQ-5D-5L is a good candidate for applying Mokken scaling as all items have the same number of ordered response categories with analogous adjectives.

The aims of these analyses were to investigate whether the MHM and DMM fit EQ-5D-5L data in order to 1) determine whether the LSS can be justified for the EQ-5D-5L and 2) examine whether an interpretation can be applied to such a score.

Methods

EQ-5D-5L

The EQ-5D health profile includes the items mobility (MO), self-care (SC), usual activities (UA), pain/discomfort (PD) and anxiety/depression (AD) [2]. The EQ-5D-5L asks respondents to endorse one of five response levels for each item: “no problems,” “slight problems,” “moderate problems,” “severe problems,” and “extreme problems”/ “unable to” [20, 27], describing 3125 (5⁵) health state profiles. The instrument also includes a visual analog scale (VAS) anchored by 0 (worst imaginable health) and 100 (best imaginable health) that is usually analyzed separately from the health profile.

The LSS is typically calculated by assigning a numerical value to each response level (i.e., 1 for “no problems”, 5 for “extreme problems”/”unable to”) and summing these values across the five items, resulting in a score from 5 (11,111, no problems on any dimension) to 25 (55,555, extreme problems on all dimensions) for the EQ-5D-5L.

Dataset

The Multi Instrument Comparison (MIC) project surveyed six countries in 2012 (Australia, Canada, Germany, Norway, UK, and USA), sampling respondents who self-reported seven chronic illnesses plus a healthy sample with no self-reported chronic conditions [28, 29]. Respondents completed a battery of health status, subjective well-being and capability measures, including the EQ-5D-5L. This dataset provides an opportunity to assess the scaling properties of the EQ-5D-5L in a large sample across disease and country subgroups. The disease groups chronic obstructive pulmonary disease and stroke were only sampled in the Australia and therefore excluded from analysis. All analyses were repeated by the subgroups self-reported disease and country.

Data management and descriptive statistics were handled in Microsoft Excel and Stata SE 13 [30], while all other analyses were conducted using the statistical language and environment R [31] with Van der Ark’s package “mokken” [32, 33]. The R script is included as supplementary material A. Permission to use the MIC dataset can be obtained here: https://www.aqol.com.au/index.php/mic-data.

Mokken scale analysis

We investigated the assumptions of two nested NP-IRT models that examine the ordinal location of patients and items along a single latent variable θ: respondents were ordered according to their LSS and items are ordered according to mean item scores [23, 25, 26]. The polytomous MHM and DMM models are extended from the dichotomous models [34, 35]. The MHM can elucidate whether a summary score can be used to order individuals along the latent variable. The more restrictive DMM is nested within the MHM and can further elucidate whether the items (i.e., EQ-5D-5L dimensions in these analyses) can be ordered invariantly along the latent variable. We examined how well polytomous MHM and DMM models fit EQ-5D-5L data.

Assessment of fit of the monotone homogeneity model

The MHM has three assumptions:

1.
Unidimensionality: items within the scale measure the same underlying latent variable;
2.
Local independence: responses to scale items are influenced only on level by θ; and
3.
Monotonicity: probability of endorsing particular response levels is monotonically non-decreasing as θ increases.

Loevinger’s homogeneity coefficients, automated item selection procedure, and manifest monotonicity were used to assess the fit of the MHM to EQ-5D-5L data. Additionally, we examined scale reliability using Molenaar and Sijtsma’s rho (ρ) [36] and Guttman’s lamda-2 (λ-2) [37, 38].

Scalability of the EQ-5D-5L items was assessed using Loevinger’s scalability coefficients H, for which H values reflect item fit within a scale. H is measured on the item pair (H_ij), item (H_i), and scale (H_S) levels. H_ij is the normed covariance between a pair of item scores while H_i is the normed covariance between item and rest scores [23, 32]. H_S is a weighted mean of H_i. Negative H_ij and H_i coefficients indicate an item violates MHM. The closer H_i is to 1, the better an item can discriminate subjects along θ. On the item level, H_i > 0.3 is considered sufficient, while H_i > 5 indicates a strongly discriminating item. The commonly accepted rules of thumb for interpreting H_S were applied: H_S < 0.3 indicates the item set is unscalable, H_S between 0.3 and 0.4 indicates a weak scale, H_S between 0.4 and 0.5 indicates moderate, and H_S ≥ 0.5 indicates strong [25]. H_ij > 0 indicates that the data fit the MHM. We also used the H_ij to examine which item pairs are more strongly related than other pairs.

Automated item selection procedure (AISP) is a standard feature of the “mokken” package which selects subsets of items from a larger set that can represent attributes on which respondents can be ordered by total scores [32]. Although the lower bound of 0.3 is suggested for accepting items in a scale, it was more informative to determine at which level of H_i was items no longer scalable. Therefore, we first executed the AISP 12 times with the lower bound for H_i set between 0 and 0.5, increasing in steps of 0.05 [23, 32]. Then we pinpointed the level of H_i at which each of the five items was no longer appropriate for the scale by decreasing H_i in steps of 0.001 from the cutoff identified in the previous step.

Monotonicity

Latent monotonicity generally also implies manifest monotonicity, which is observable in the data [32] Therefore, if the LSS is a proxy for θ, then ordering of persons along the LSS reflects the ordering of persons along θ. Manifest monotonicity was assessed by examining whether the cumulative probability for a dimension-level rating at or above each dimension-level rating does not decrease across rest score groups. Rest scores are calculated by subtracting the item of interest from the LSS. Rest score groups are created automatically based on minimum sample size requirements for each group [32, 33]. Only violations greater than the default minimum (minvi = 0.03 for the function check.monotonicity of the R package “mokken”) were reported [32]. Furthermore, item step response functions (ISRFs) and item response functions (IRFs) were visually inspected for monotonicity. ISRF plots the probability for endorsing a response level or higher across the latent variable. IRF for polytomous items is the sum of an item’s ISRFs.

Assessment of invariant item ordering

The DMM model is a special case of MHM for which all assumptions of the MHM hold with an additional assumption that the IRF or ISRF of items does not intersect. Non-interception of ISRF is not necessarily evidence of item order [39] and would not be meaningful for interpretation of the LSS. Therefore, we did not examine non-interception of ISRF as a measure of DMM fit, rather focusing on invariant item ordering. Invariant item ordering can provide an interpretation: If the items have the same ordering along θ, then the summary score might be interpreted based on that order [32, 33, 39]. We therefore examined manifest invariant item ordering (MIIO) as suggested by Ligtvoet et al. (2010, 2011) [40, 41].

We assessed MIIO using the check.iio function of the R package “mokken,” which orders items by their conditional mean scores and checks each item pair for violations of ordering for rest score groups. Violations that exceed the default minimum value (number of ISRFs times 0.03) are reported [33, 41]. Coefficient H^T gives an indication of the degree to which the sample follows item ordering. We applied the rules of thumb that H^T < 0.3 implies the item ordering accuracy is too low, H^T between 0.3 and 0.4 as ordering with low accuracy, H^T between 0.4 and 0.5 as moderate accuracy, and H^T > 0.5 as highly accurate item ordering [41].

Results

The included 7,933 subjects of the MIC reported 566 of the 3125 possible response patterns on the EQ-5D-5L; “11,111” (full health) and slight problems with PD with no problems on the other dimensions (“11,121”) were the first and second most often endorsed (19.3% and 14.3%, respectively). Subjects without chronic conditions were most homogeneous in regard to health profile (94 unique profiles), while those with diabetes reported the most diverse range of health (239 unique profiles; supplementary materials B and C). Number of distinct health profiles ranged from 164 (Norway) to 276 (UK) across country samples. Although over 8% of MIC respondents noted their general health as “poor,” endorsements of the most severe EQ-5D-5L levels were rare, especially for MO and SC (Table 1).

Table 1 Characteristics of the study sample (MIC)

Full size table

AISP and scalability

The EQ-5D-5L is a reliable scale, with ρ = 0.822 and λ-2 = 0.819. AISP placed all five items onto a single latent variable when the lower bound for H_i was set at the default 0.3, even when considering the 95% confidence interval (derived from standard errors). AD was identified as an unscalable item at H_i ≥ 0.378. PD was rejected from the scale at H_i ≥ 0.685, SC at H_i ≥ 0.721, and no items could be scaled at H_i ≥ 0.75 (Table 2).

Table 2 Item characteristics of the EQ-5D-5L

Full size table

H_i values were above 0.6 for all items except for AD, which had a H_i of 0.377. H_ij of AD with the other items ranged from 0.292 (MO) to 0.448 (UA) (Table 3). H_ij of SC and PD was larger than all AD item pairs, but smaller than 0.7, while all other item pairs had H_ij above 0.7. Because the H_i of AD was close to 0.3, the value of acceptability for H_i, we decided to assess scalability by omitting this item. If the reduced item set would yield a much stronger scale, this would be an important finding. Researchers would possibly decide to employ the reduced items set in studies where a scale with increased scalability is needed, such as in instances where item ordering must be strictly maintained. When AD was removed from the model, H_S increased from 0.559 to 0.714, and the H_i of the four remaining items also increased (Table 2).

Table 3 Scalability coefficients and standard error for item pairs of the EQ-5D-5L

Full size table

Fit of the MHM model

Figure 1 illustrates the IRF and ISRF charted over rest score groups for the five items of the EQ-5D-5L. All IRFs and ISRFs increased monotonically with no violations of manifest monotonicity observed (Table 2). Critical values of all items were zero, showing no misfit of the MHM.

Fit of MIIO

Two violations of MIIO were observed between 1) AD and MO, and 2) AD and UA (Table 4). AD had the highest critical value, and in backward selection was recommended for exclusion. Due to this recommendation for exclusion, we examined MIIO excluding the AD item, after which no violations of MIIO remained.

Table 4 EQ-5D-5L Item scaling coefficients stratified by disease type and country

Full size table

In order to visualize the IRF of all items in one figure, we selected item-pair results from the check.restscore function (Fig. 2). IRF charted over rest score groups indicate that the lower rest scores (≤ 3) were driven by PD and secondarily by AD. In the slightly higher rest score groups (2–4), the IRFs of MO and UA equally increased and overlapped, while AD’s IRF flattened. IRF of AD crossed both MO and US at rest scores 4–5. The IRF of SC did not increase until reaching higher rest score groups (4–5). Moderate item ordering was observed for the complete MIC sample (H^T = 0.463) (Table 4).

Stratified analysis across subgroups

H coefficients were estimated for disease and country subgroups for the complete EQ-5D-5L scale as well as for the scale omitting the AD item as AD was recommended for exclusion by the check.iio procedure for many subgroups (Table 4). For the complete scale, H_s was weak for the healthy subsample (0.363), moderate for subjects with hearing problems and from Norway (0.496, 0.476, respectively). H_s for all other subgroups was “strong” but was all below 0.6.

H^T for the full scale ranged from 0.373 (Norway) to 0.747 (depression). Violations were found in the AD and MO and AD and UA pairs consistently across all subgroups except for respondents without self-reported chronic illness, depression, arthritis, and Canadian respondents (Table 4). Backward item selection recommended excluding AD for all subsamples that detected violations except for Norway. Critical values for Norway were 34 for UA and 50 for AD, demonstrating non-serious misfit. Figure 3 plots IRF of item pairs AD/MO and AD/UA for subgroups which did not recommend AD for removal. Not surprisingly, AD was easier to endorse at all rest score groups than MO or UA for the subgroup with depression, and the IRFs are far enough apart that they do not intersect.

H_ij tends to be largest between AD and UA, AD and PD across all subsamples except for healthy respondents, those reporting hearing problems and the Australian sample, showing that AD is more closely related to UA and PD than MO and SC (Table 5). H_ij between AD and all the other EQ-5D-5L items was particularly small for the healthy subsample.

Table 5 Item pair coefficient for anxiety/depression

Full size table

Discussion

The EQ-5D-5L items form a strong Mokken scale, fitting the MHM and thus demonstrating that LSS, an additive summary score independent of population value sets, is acceptable and meaningful for measurement. These results empirically demonstrate that the EQ-5D-5L LSS orders respondents along a latent variable of health, with higher score indicating poorer health. The MHM fit of the EQ-5D-5L data reflects the rigorous work in questionnaire development, especially with refinement of the response levels [19, 27, 42]. Meijer and colleagues cautioned that sometimes strong Mokken scales are not optimal because they could reflect items covering similar or overlapping content [43, 44]. However, the EQ-5D is a brief scale with items covering diverse aspects of function and symptoms, so this concern is minimized.

MIIO results suggest that an interpretation of functional limitations and health symptoms can also be applied to the LSS: the low range of the score represents mainly problems with PD and AD, the lower to mid-range scores indicate additional problems with MO and UA, while the middle to higher scores reveal limitations in SC. The ordering of these items was found to be moderate. The finding that item ordering was not accurate for the healthy sub-sample reflected the observation of less variation in EQ-5D-5L responses in that subsample.

Our results empirically demonstrate what is conceptually understood: the LSS of the EQ-5D-5L orders persons by their levels of health. The relatively consistent performance of the EQ-5D-5L scale across countries is encouraging for the purpose of providing evidence to support the use of the LSS to compare the EQ-5D across countries. This is important because the EQ-5D has historically been scored using weights based on country-specific societal preferences. The LSS is used to describe data quality of valuation studies [45, 46] but has yet seen broader acceptance. A summary scoring function independent of population-specific value sets that is simple, psychometrically valid, and international in its applicability has tremendous advantages for researchers and population health scientists who wish to have a composite indicator of health for international comparisons using a measure available in hundreds of languages and is freely licensed and distributed by the EuroQol by non-profit organizations.

Although AD was initially retained in the scale as its H_i was above the commonly accepted cutoff of 0.3, it was excluded when the cutoff was only raised to above 0.378. Additionally, AD was found to violate MIIO in most subgroups—its IRF crosses the UA and MO IRFs at rest scores 3–4—and AD removal from the scale was suggested in backward model selection. The determination of whether an item should remain in a scale is not based solely on H_i but depends on conceptual and empirical considerations and the application of the instrument. When AD was omitted, H_s and H^T improved to above 0.7 to indicate very strong person and item ordering. Therefore, in applications where scalability or item ordering is required to be strong, one could apply the LSS to only the four physical items of the EQ-5D and assess the AD item separately. Although the EQ-5D is rarely used as a diagnostic tool on the level of individual patients, item ordering can still be relevant for group level applications. For example, although patient groups with mainly physical symptoms do not suffer from anxiety/depressive problems more than the general population, the AD item may be more difficult to endorse than the physical items at moderate or more severe levels of disease (as indicated in these results). However, for conditions for which mental health is affected, the AD item could be easier to endorse than MO, SC and UA across the scale (as supported by our findings of MIIO in the subgroup with depression). The relationship between items may also be modified by other factors such as age or gender. This is an area needing future research.

IRT approaches to evaluating the EQ-5D have been relatively scarce in the literature: our results are comparable to available evidence. A recent investigation of the EQ-5D using Rasch rating scale model reported similar item ordering as our findings: PD was the easiest to endorse, UA, AD, and MO are at middle levels of difficulty of endorsement, and SC was the most difficult to endorse item [21]. Our scalability results were similar to previously published results for the physical function subscale of the SF-36—H_S of 0.69 and H^T of 0.53 [44].

IRT assumes items are indicators of a single latent variable. However, the EQ-5D was constructed using five different dimensions to create a composite measure of health status. AD conceptually measures mental health, while the other four items address physical health [48,49,50]. A previous study revealed that when several health measures were modeled with the EQ-5D-5L, MO, SC, and UA belonged to one dimension, AD to a second, and PD to a third [51]. However, other investigations found sufficient evidence that self-reported physical and mental health can be summarized using a single score [52]. Recent confirmatory factor analysis found the model including all five EQ-5D-5L items to have acceptable fit statistics [47]. These previous findings along with this study illustrate the tension between the multidimensional nature of health and summarizing health as a single latent construct. The theoretical measurement model, such as whether the EQ-5D is a formative or reflective measurement [47, 54, 55], must be considered when applying scoring approaches.

A limitation of this study was that the dataset only included adult participants from Western, developed countries. If person and item ordering are dependent on how item descriptions and response categories are interpreted, then these results may not extend to other populations. Further, the data were collected via online survey panels, and such participants may differ from the general population [29]. There is also a pressing need to conduct similar research in children. Due to ethical, methodological, and conceptual problems involved in eliciting preferences for children, the version of the EQ-5D for children and adolescents (EQ-5D-Y) does not have a preference value set [53]. Therefore, application of the LSS may be particularly relevant for the EQ-5D-Y as its use expands.

Conclusion

A conceptually cohesive scale of health can be operationalized using the LSS using all five items of the EQ-5D-5L as higher LSS scores indicate worse health and more severe functional limitations. In general, lower range of the score represents mainly problems with pain, the mid-range indicates additional problems with mobility and usual activities, and middle to higher range of scores reveals additional limitations with self-care. Anxiety/depression is easier to endorse than MO or UA at the lower range of scores, but at moderate and higher scores becomes more difficult to endorse. Compared to utility scores, LSS scores have advantages depending on the application and subgroup/population. However, the scale is weak in the healthy subsample, indicating it may be less informative in such populations. More work must be done to investigate whether person and item order holds for other populations, especially for children and adolescents.

References

Brazier, J., Ara, R., Rowen, D., & Chevrou-Severac, H. (2017). A review of generic preference-based measures for use in cost-effectiveness models. PharmacoEconomics, 35(1), 21–31.
Article PubMed Google Scholar
van Reenen, M., & Janssen, B. (2015, April 2015). EQ-5D-5L User guide: Basic information on how to use the EQ-5D-5L instrument. 2.1. Retrieved January 23, 2017, from http://www.euroqol.org/fileadmin/user_upload/Documenten/PDF/Folders_Flyers/EQ-5D-5L_UserGuide_2015.pdf.
APERSU - Alberta PROMS and EQ-5D Research and Support Unit. from http://apersu.ca/.
Brooks, R. (2013). EuroQol Group after 25 Years. Rotterdam, The Netherlands: Springer.
Book Google Scholar
Devlin, N., & Appleby, J. (2010). Getting the most out of PROMS - Putting health outcomes at the heart of NHS decision-making. Retrieved January 5, 2017, from https://www.kingsfund.org.uk/sites/files/kf/Getting-the-most-out-of-PROMs-Nancy-Devlin-John-Appleby-Kings-Fund-March-2010.pdf.
Devlin, N. J., & Brooks, R. (2017). EQ-5D and the EuroQol group: Past, present and future. Applied Health Economics and Health Policy, 15(2), 127–137.
Article PubMed PubMed Central Google Scholar
Devlin, N. J., Parkin, D., & Browne, J. (2010). Patient-reported outcome measures in the NHS: New methods for analysing and reporting EQ-5D data. Health Economics, 19(8), 886–905.
Article PubMed Google Scholar
Hostetter, M., & Klein, S. (2012). Using Patient-Reported Outcomes to Improve Health Care Quality. Retrieved January 5, 2017, from http://www.commonwealthfund.org/publications/newsletters/quality-matters/2011/december-january-2012/in-focus.
Parkin, D., Rice, N., & Devlin, N. (2010). Statistical analysis of EQ-5D profiles: Does the use of value sets bias inference? Medical Decision Making, 30(5), 556–565.
Article PubMed Google Scholar
Hernandez, G., Garin, O., Pardo, Y., Vilagut, G., Pont, A., Suarez, M., Neira, M., Rajmil, L., Gorostiza, I., Ramallo-Farina, Y., Cabases, J., Alonso, J., & Ferrer, M. (2018). Validity of the EQ-5D-5L and reference norms for the Spanish population. Quality of Life Research, 27(9), 2337–2348.
Article PubMed Google Scholar
Stolk, E., Ludwig, K., Rand, K., van Hout, B., & Ramos-Goni, J. M. (2019). Overview, update, and lessons learned from the international EQ-5D-5L valuation Work: Version 2 of the EQ-5D-5L valuation protocol. Value in Health, 22(1), 23–30.
Article PubMed Google Scholar
Gutacker, N., Bojke, C., Daidone, S., Devlin, N., & Street, A. (2013). Hospital variation in patient-reported outcomes at the level of EQ-5D dimensions: Evidence from England. Medical Decision Making, 33(6), 804–818.
Article PubMed Google Scholar
Wilke, C. T., Pickard, A. S., Walton, S. M., Moock, J., Kohlmann, T., & Lee, T. A. (2010). Statistical implications of utility weighted and equally weighted HRQL measures: An empirical study. Health Economics, 19(1), 101–110.
PubMed Google Scholar
Lamu, A. N., Gamst-Klaussen, T., & Olsen, J. A. (2017). Preference weighting of health state values: What difference does it make, and why? Value Health, 20(3), 451–457.
Article PubMed Google Scholar
Prieto, L., & Sacristan, J. A. (2004). What is the value of social values? The uselessness of assessing health-related quality of life through preference measures. BMC Medical Research Methodology, 4, 10.
Article PubMed PubMed Central Google Scholar
Devlin, N., Parkin, D., & Janssen, B. (2020). Analysis of EQ-5D Profiles. Methods for Analysing and Reporting EQ-5D Data (pp. 23–49). Cham: Springer International Publishing.
Chapter Google Scholar
Geraerds, A. J. L. M., Bonsel, G. J., Janssen, M. F., de Jongh, M. A., Spronk, I., Polinder, S., & Haagsma, J. A. (2019). The added value of the EQ-5D with a cognition dimension in injury patients with and without traumatic brain injury. Quality of Life Research, 28(7), 1931–1939.
Article CAS PubMed PubMed Central Google Scholar
Yang, Z. H., Luo, N., Bonsel, G., Busschbach, J., & Stolk, E. (2019). Effect of health state sampling methods on model predictions of EQ-5D-5L values: Small designs can suffice. Value in Health, 22(1), 38–44.
Article PubMed Google Scholar
Pickard, A. S., Kohlmann, T., Janssen, M. F., Bonsel, G., Rosenbloom, S., & Cella, D. (2007). Evaluating equivalency between response systems: Application of the Rasch model to a 3-level and 5-level EQ-5D. Medical Care, 45(9), 812–819.
Article PubMed Google Scholar
van Hout, B., Janssen, M. F., Feng, Y. S., Kohlmann, T., Busschbach, J., Golicki, D., Lloyd, A., Scalone, L., Kind, P., & Pickard, A. S. (2012). Interim scoring for the EQ-5D-5L: Mapping the EQ-5D-5L to EQ-5D-3L value sets. Value Health, 15(5), 708–715.
Article PubMed Google Scholar
Wahlberg, M., Zingmark, M., Stenberg, G., & Munkholm, M. (2021). Rasch analysis of the EQ-5D-3L and the EQ-5D-5L in persons with back and neck pain receiving physiotherapy in a primary care context. European Journal of Physiotherapy, 23(2), 102–109.
Article Google Scholar
Pickard, A. S., De Leon, M. C., Kohlmann, T., Cella, D., & Rosenbloom, S. (2007). Psychometric comparison of the standard EQ-5D to a 5 level version in cancer patients. Medical Care, 45(3), 259–263.
Article PubMed Google Scholar
Sijtsma, K., & van der Ark, L. A. (2017). A tutorial on how to do a Mokken scale analysis on your test and questionnaire data. British Journal of Mathematical & Statistical Psychology, 70(1), 137–158.
Article Google Scholar
van der Ark, L. A., & Bergsma, W. P. (2010). A note on stochastic ordering of the latent trait using the sum of polytomous item scores. Psychometrika, 75(2), 272–279.
Article Google Scholar
Sijtsma, K., & Molenaar, I. W. (2002). Introduction to Nonparametric Item Response Theory. Thousand Oaks, CA: SAGE Publications Inc.
Book Google Scholar
van Schuur, W. H. (2003). Mokken scale analysis: Between the Guttman scale and parametric item response theory. Political Analysis, 11(2), 139–163.
Article Google Scholar
Herdman, M., Gudex, C., Lloyd, A., Janssen, M., Kind, P., Parkin, D., Bonsel, G., & Badia, X. (2011). Development and preliminary testing of the new five-level version of EQ-5D (EQ-5D-5L). Quality of Life Research, 20(10), 1727–1736.
Article CAS PubMed PubMed Central Google Scholar
Richardson, J., Khan, M. A., Iezzi, A., & Maxwell, A. (2015). Comparing and explaining differences in the magnitude, content, and sensitivity of utilities predicted by the EQ-5D, SF-6D, HUI 3, 15D, QWB, and AQoL-8D multiattribute utility instruments. Medical Decision Making, 35(3), 276–291.
Article PubMed Google Scholar
Richardson, J. L., & Angelo; Maxwell, Aimee;. . (2012). Cross-national comparison of twelve quality of life instruments: MIC paper 1: Background, questions, instruments, research paper 76. Melbourne, Australia: Monash University.
Google Scholar
StataCorp. . (2013). Stata Statistical Software: Release 13. College Station, TX: StataCorp LP.
Google Scholar
R Development Core Team. (2018). R: A Language and Environment for Statistical Computing (Version 3.5.2). Vienna, Austria: R Foundation for Statistical Computing.
Google Scholar
Van der Ark, L. A. (2007). Mokken Scale Analysis in R. 2007, 20(11), 19.
van der Ark, L. A. (2012). New Developments in Mokken Scale Analysis in R. 2012, 48(5), 27.
Molenaar, I. (1997). Nonparametric Models for Polytomous Responses. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of Modern Item Response Theory (pp. 369–380). New York, NY: Springer.
Chapter Google Scholar
Wind, S. A. (2017). An instructional module on mokken scale analysis. Educational Measurement-Issues and Practice, 36(2), 50–66.
Article Google Scholar
Sijtsma, K., & Molenaar, I. W. (1987). Reliability of Test-scores in nonparametric item response theory. Psychometrika, 52(1), 79–97.
Article Google Scholar
Callender, J., & Osburn, H. (2005). An empirical comparison of coefficient alpha, Guttman’s Lambda-2, and MSPLIT maximized split-half reliability estimates. Journal of Educational Measurement, 16, 89–99.
Article Google Scholar
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4), 255–282.
Article CAS Google Scholar
Sijtsma, K., Meijer, R., & van der Ark, A. (2011). Mokken scale analysis as time goes by: An update for scaling practitioners. Personality and Individual Differences, 50, 31–37.
Article Google Scholar
Ligtvoet, R., van der Ark, A., Bergsma, W., & Sijtsma, K. (2011). Polytomous latent scales for the investigation of the ordering of items. Psychometrika, 76, 200–216.
Article Google Scholar
Ligtvoet, R., van der Ark, L. A., te Marvelde, J. M., & Sijtsma, K. (2010). Investigating an invariant item ordering for polytomously scored items. Educational and Psychological Measurement, 70(4), 578–595.
Article Google Scholar
Luo, N., Li, M., Liu, G. G., Lloyd, A., de Charro, F., & Herdman, M. (2013). Developing the Chinese version of the new 5-level EQ-5D descriptive system: The response scaling approach. Quality of Life Research, 22(4), 885–890.
Article PubMed Google Scholar
Meijer, R. R., & Baneke, J. J. (2004). Analyzing psychopathology items: A case for nonparametric item response theory modeling. Psychological Methods, 9(3), 354–368.
Article PubMed Google Scholar
Meijer, R. R., & Egberink, I. J. L. (2012). Investigating invariant item ordering in personality and clinical scales: Some empirical findings and a discussion. Educational and Psychological Measurement, 72(4), 589–607.
Article Google Scholar
Golicki, D., Jakubczyk, M., Graczyk, K., & Niewada, M. (2019). Valuation of EQ-5D-5L health states in Poland: The first EQ-VT-based study in central and Eastern Europe. PharmacoEconomics, 37(9), 1165–1176.
Article PubMed PubMed Central Google Scholar
Pickard, A. S., Law, E. H., Jiang, R., Oppe, M., Shaw, J. W., Xie, F., Boye, K. S., Gong, C. L., Chapman, R. H., & Balch, A. (2018). United States valuation of EQ-5D-5L health States: An initial model using a standardized protocol. Value in Health, 21, S4–S5.
Article Google Scholar
Feng, Y. S., Jiang, R., Kohlmann, T., & Pickard, A. S. (2019). Exploring the internal structure of the EQ-5D using non-preference-based methods. Value Health, 22(5), 527–536.
Article PubMed Google Scholar
Davis, J. C., Liu-Ambrose, T., Richardson, C. G., & Bryan, S. (2013). A comparison of the ICECAP-O with EQ-5D in a falls prevention clinical setting: Are they complements or substitutes? Quality of Life Research, 22(5), 969–977.
Article PubMed Google Scholar
Keeley, T., Coast, J., Nicholls, E., Foster, N. E., Jowett, S., & Al-Janabi, H. (2016). An analysis of the complementarity of ICECAP-A and EQ-5D-3 L in an adult population of patients with knee pain. Health and Quality of Life Outcomes, 14, 36.
Article CAS PubMed PubMed Central Google Scholar
Wittrup-Jensenm, K. L., & Jørgen. (2008). An Assessment of Two Generic Health-Related Quality of Life (HRQoL) Instruments in Patients Suffering from Low Back Pain. Odense: University of Southern Denmark.
Google Scholar
Finch, A. P., Brazier, J. E., Mukuria, C., & Bjorner, J. B. (2017). An exploratory study on using principal-component analysis and confirmatory factor analysis to identify bolt-on dimensions: The EQ-5D case study. Value Health, 20(10), 1362–1375.
Article PubMed Google Scholar
Yin, S., Njai, R., Barker, L., Siegel, P., & Liao, Y. (2016). Summarizing health-related quality of life (HRQOL): Development and testing of a one-factor model. Population Health Metrics, 14(1), 22.
Article PubMed PubMed Central Google Scholar
Kreimeier, S., & Greiner, W. (2019). EQ-5D-Y as a health-related quality of life instrument for children and adolescents: The instrument’s characteristics, development, current use, and challenges of developing its value set. Value Health, 22(1), 31–37.
Article PubMed Google Scholar
Costa, D. S. (2015). Reflective, causal, and composite indicators of quality of life: A conceptual or an empirical distinction? Quality of Life Research, 24(9), 2057–2065.
Article PubMed Google Scholar
Gamst-Klaussen, T., Gudex, C., & Olsen, J. A. (2018). Exploring the causal and effect nature of EQ-5D dimensions: An application of confirmatory tetrad analysis and confirmatory factor analysis. Health and quality of life outcomes, 16(1), 153–215.
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

EuroQol group for funding this research and investigators of the Multiple Comparison Project for sharing their data. This project was supported by the EuroQol Research Foundation (Grant Number: EQ Project 20170130). The submitted manuscript was not censored or directed by the foundation. The views expressed by the authors in the publication do not necessarily reflect the view of the EuroQol Group.

Supporting Information

Supplementary Material Table 1B: EQ-5D-5L Characteristics Across Health Conditions of the MIC Dataset

Supplementary Material Table 1C: EQ-5D-5L Characteristics Across Country Subsamples of the MIC Dataset

Funding

Open Access funding enabled and organized by Projekt DEAL. EuroQol group fully funded this project (Grant ID EQ Project 20170130).

Author information

Authors and Affiliations

Institute for Clinical Epidemiology and Applied Biometrics, Medical University of Tübingen, Silcherstraße 5 72076, Tübingen, Germany
You-Shan Feng
Institute for Community Medicine, University of Greifswald, Greifswald, Germany
You-Shan Feng & Thomas Kohlmann
Center for Observational and Real-World Evidence, Merck & Co, Kenilworth, NJ, USA
Ruixuan Jiang
College of Pharmacy, University of Illinois At Chicago, Chicago, IL, USA
A. Simon Pickard

Authors

You-Shan Feng
View author publications
You can also search for this author in PubMed Google Scholar
Ruixuan Jiang
View author publications
You can also search for this author in PubMed Google Scholar
A. Simon Pickard
View author publications
You can also search for this author in PubMed Google Scholar
Thomas Kohlmann
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to You-Shan Feng.

Ethics declarations

Conflict of Interest

All authors are members of the EuroQol group. Outside of scientific meetings, group members do not receive any financial support. Ruixuan Jiang is an employee of Merck; however, conceptualization and most of study analyses were completed during her graduate studies.

Ethical approval

This paper only used secondary data and authors did not contain human or animal data collection performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (R 12 KB)

Supplementary file2 (DOCX 48 KB)

Supplementary file3 (DOCX 24 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Feng, YS., Jiang, R., Pickard, A.S. et al. Combining EQ-5D-5L items into a level summary score: demonstrating feasibility using non-parametric item response theory using an international dataset. Qual Life Res 31, 11–23 (2022). https://doi.org/10.1007/s11136-021-02922-1

Download citation

Accepted: 19 June 2021
Published: 08 July 2021
Issue Date: January 2022
DOI: https://doi.org/10.1007/s11136-021-02922-1

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Combining EQ-5D-5L items into a level summary score: demonstrating feasibility using non-parametric item response theory using an international dataset

Abstract

Background

Methods

Results

Discussion

Similar content being viewed by others

Scoring the EQ-HWB-S: can we do it without value sets? A non-parametric item response theory analysis

Head-to-head comparison between the EQ-5D-5L and the EQ-5D-3L in general population health surveys

ExternalRefStart http://www.common-metrics.org www.common-metrics.org ExternalRefEnd : a web application to estimate scores from different patient-reported outcome measures on a common scale

Background

Methods

EQ-5D-5L

Dataset

Mokken scale analysis

Assessment of fit of the monotone homogeneity model

Monotonicity

Assessment of invariant item ordering

Results

AISP and scalability

Fit of the MHM model

Fit of MIIO

Stratified analysis across subgroups

Discussion

Conclusion

References

Acknowledgements

Supporting Information

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of Interest

Ethical approval

Additional information

Publisher's Note

Supplementary Information

Supplementary file1 (R 12 KB)

Supplementary file2 (DOCX 48 KB)

Supplementary file3 (DOCX 24 KB)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation