Combining EQ-5D-5L items into a level summary score: demonstrating feasibility using non-parametric item response theory using an international dataset

Background The EQ-5D-5L is a well-established health questionnaire that estimates health utilities by applying preference-based weights. Limited work has been done to examine alternative scoring approaches when utility weights are unavailable or inapplicable. We examined whether the Mokken scaling approach can elucidate 1) if the level summary score is appropriate for the EQ-5D-5L and 2) an interpretation of such a score. Methods The R package “mokken” was used to assess monotonicity (scaling coefficients H, automated item selection procedure) and manifest invariant item ordering (MIIO: paired item response functions [IRF], HT). We used a rich dataset (the Multiple Instrument Comparison, MIC) which includes EQ-5D-5L data from six Western countries. Results While all EQ-5D-5L items demonstrated monotonicity, the anxiety/depression (AD) item had weak scalability (Hi = 0.377). Without AD, scalability improved from Hs = 0.559 to Hs = 0.714. MIIO revealed that the 5 items can be ordered, and the ordering is moderately accurate in the MIC data (HT = 0.463). Excluding AD, HT improves to 0.743. Results were largely consistent across disease and country subgroups. Discussion The 5 items of the EQ-5D-5L form a moderate to strong Mokken scale, enabling persons to be ordered using the level summary score. Item ordering suggests that the lower range of the score represents mainly problems with pain and anxiety/depression, the mid-range indicates additional problems with mobility and usual activities, and middle to higher range of scores reveals additional limitations with self-care. Scalability and item ordering are even stronger when the anxiety/depression item is not included in the scale. Supplementary Information The online version contains supplementary material available at 10.1007/s11136-021-02922-1.


Background
The EQ-5D is a widely used generic measure of health [1,2]. As it is brief and not disease specific, the EQ-5D is applied in a broad range of settings, including measurement of health status in clinical practice, population health surveillance, assessment of healthcare quality, medical decision making, and patient communication [3][4][5][6][7][8][9]. The EQ-5D-5L expanded the response levels to five from the original threelevel version (EQ-5D-3L) [10].
The EQ-5D is best known for the generation of qualityadjusted life years (QALY) in cost-utility analysis, used to inform drug reimbursement and pricing decisions in some countries/regions. Utility values, which are used to estimate QALYs, are calculated for EQ-5D-5L health states by applying a societal value set. Societal value sets are 1 3 preference-based scoring weights estimated using valuation studies [11]. In valuation studies, hypothetical EQ-5D-5L health states are valued using choice-based methods, such as the time trade-off. These studies are generally conducted using representative, location/region-specific population samples. However, for many applications of the EQ-5D, population/country-specific utility scores may be unjustifiable or even introduce additional statistical biases [7,9,12]. An alternative method to summarize the instrument, relevant when utility weights are unavailable or unsuitable (e.g., EQ-5D-Y), is a total sum score of the severity levels on each dimension. Because each item of the EQ-5D has the same number of response levels, all items and severity levels contribute equally to this additive score. This approach has been termed "equally weighted" score [13], "unweighted" scoring approach [14,15], and informally the "misery" score/index [16][17][18]. The term "level sum score" (LSS) was used in the recently published guidebook for analyzing EQ-5D data [16] and will be used for the remainder of this paper for consistency and clarity. The appeal of the LSS is its simplicity and consistency across populations (i.e., the same scoring system for all countries and populations).
Both the LSS and utility values are summary scores with similar limitations in interpretation; two patients may have the same summary score, but one may have extreme problems in a single dimension, whereas the other may have slight problems in several dimensions. Utility scores have found widespread acceptance over the LSS for the EQ-5D, potentially due to the rigorous development of preference elicitation.
The LSS has one major merit over utility scores when societal preference scores are unnecessary (i.e., non-economic applications): no algorithm is required to estimate the LSS, the end-user does not need to choose a specific value set to use (e.g., in multinational studies). Although previous investigations into the use of the EQ-5D LSS found substantial agreement and similar psychometric properties between the LSS and utility scores [13][14][15], the high correlations (ICC/Rho > 0.9) do not prove LSS accurately describes HRQoL or is appropriate for statistical inference. There is a dearth of literature specifically assessing the appropriateness of the LSS to describe HRQoL.
Item response theory (IRT) comprises a large set of models used to aid the construction and evaluation of multiitem scales. In general, these models assess the relationship between a latent variable of interest (θ) and the manifest/ observable response patterns of a set of items. The probability of endorsing a particular response level on items of a scale is dependent on the respondent's θ level. Parametric IRT has been previously applied to study the EQ-5D, although not to elucidate scoring [19][20][21][22]. Non-parametric item response theory (NP-IRT) approaches do not make strict assumptions about the shape of the function that describes the relationship between the response probability and the latent variable [23]. NP-IRT investigates whether the ordering of respondents along the summary score reflects the stochastic ordering of persons along θ [23,24] instead of estimating θ. If the LSS is a proxy for θ (i.e., underlying health), then ordering of persons along the summary score is the ordering of persons along θ. Mokken scaling is a scaling approach comprising of a set of methods to assess whether the data fit a set of NP-IRT models. Two nested NP-IRT models included in Mokken scaling are as follows: the monotone homogeneity model (MHM), which examines ordering of persons along θ; and double monotonicity model (DMM), which examines ordering of persons and items along θ [25,26]. If EQ-5D-5L data fit the MHM or DMM, then the use of LSS to represents underlying health can be justified and interpreted. The EQ-5D-5L is a good candidate for applying Mokken scaling as all items have the same number of ordered response categories with analogous adjectives.
The aims of these analyses were to investigate whether the MHM and DMM fit EQ-5D-5L data in order to 1) determine whether the LSS can be justified for the EQ-5D-5L and 2) examine whether an interpretation can be applied to such a score.
The LSS is typically calculated by assigning a numerical value to each response level (i.e., 1 for "no problems", 5 for "extreme problems"/"unable to") and summing these values across the five items, resulting in a score from 5 (11,111, no problems on any dimension) to 25 (55,555, extreme problems on all dimensions) for the EQ-5D-5L.

Dataset
The Multi Instrument Comparison (MIC) project surveyed six countries in 2012 (Australia, Canada, Germany, Norway, UK, and USA), sampling respondents who self-reported 1 3 seven chronic illnesses plus a healthy sample with no selfreported chronic conditions [28,29]. Respondents completed a battery of health status, subjective well-being and capability measures, including the EQ-5D-5L. This dataset provides an opportunity to assess the scaling properties of the EQ-5D-5L in a large sample across disease and country subgroups. The disease groups chronic obstructive pulmonary disease and stroke were only sampled in the Australia and therefore excluded from analysis. All analyses were repeated by the subgroups self-reported disease and country.
Data management and descriptive statistics were handled in Microsoft Excel and Stata SE 13 [30], while all other analyses were conducted using the statistical language and environment R [31] with Van der Ark's package "mokken" [32,33]. The R script is included as supplementary material A. Permission to use the MIC dataset can be obtained here: https:// www. aqol. com. au/ index. php/ mic-data.

Mokken scale analysis
We investigated the assumptions of two nested NP-IRT models that examine the ordinal location of patients and items along a single latent variable θ: respondents were ordered according to their LSS and items are ordered according to mean item scores [23,25,26]. The polytomous MHM and DMM models are extended from the dichotomous models [34,35]. The MHM can elucidate whether a summary score can be used to order individuals along the latent variable. The more restrictive DMM is nested within the MHM and can further elucidate whether the items (i.e., EQ-5D-5L dimensions in these analyses) can be ordered invariantly along the latent variable. We examined how well polytomous MHM and DMM models fit EQ-5D-5L data.

Assessment of fit of the monotone homogeneity model
The MHM has three assumptions: 1. Unidimensionality: items within the scale measure the same underlying latent variable; 2. Local independence: responses to scale items are influenced only on level by θ; and 3. Monotonicity: probability of endorsing particular response levels is monotonically non-decreasing as θ increases.
Scalability of the EQ-5D-5L items was assessed using Loevinger's scalability coefficients H, for which H values reflect item fit within a scale. H is measured on the item pair (H ij ), item (H i ), and scale (H S ) levels. H ij is the normed covariance between a pair of item scores while H i is the normed covariance between item and rest scores [23,32]. H S is a weighted mean of H i . Negative H ij and H i coefficients indicate an item violates MHM. The closer H i is to 1, the better an item can discriminate subjects along θ. On the item level, H i > 0.3 is considered sufficient, while H i > 5 indicates a strongly discriminating item. The commonly accepted rules of thumb for interpreting H S were applied: H S < 0.3 indicates the item set is unscalable, H S between 0.3 and 0.4 indicates a weak scale, H S between 0.4 and 0.5 indicates moderate, and H S ≥ 0.5 indicates strong [25]. H ij > 0 indicates that the data fit the MHM. We also used the H ij to examine which item pairs are more strongly related than other pairs. Automated item selection procedure (AISP) is a standard feature of the "mokken" package which selects subsets of items from a larger set that can represent attributes on which respondents can be ordered by total scores [32]. Although the lower bound of 0.3 is suggested for accepting items in a scale, it was more informative to determine at which level of H i was items no longer scalable. Therefore, we first executed the AISP 12 times with the lower bound for H i set between 0 and 0.5, increasing in steps of 0.05 [23,32]. Then we pinpointed the level of H i at which each of the five items was no longer appropriate for the scale by decreasing H i in steps of 0.001 from the cutoff identified in the previous step.

Monotonicity
Latent monotonicity generally also implies manifest monotonicity, which is observable in the data [32] Therefore, if the LSS is a proxy for θ, then ordering of persons along the LSS reflects the ordering of persons along θ. Manifest monotonicity was assessed by examining whether the cumulative probability for a dimension-level rating at or above each dimension-level rating does not decrease across rest score groups. Rest scores are calculated by subtracting the item of interest from the LSS. Rest score groups are created automatically based on minimum sample size requirements for each group [32,33]. Only violations greater than the default minimum (minvi = 0.03 for the function check. monotonicity of the R package "mokken") were reported [32]. Furthermore, item step response functions (ISRFs) and item response functions (IRFs) were visually inspected for monotonicity. ISRF plots the probability for endorsing a response level or higher across the latent variable. IRF for polytomous items is the sum of an item's ISRFs.

Assessment of invariant item ordering
The DMM model is a special case of MHM for which all assumptions of the MHM hold with an additional assumption that the IRF or ISRF of items does not intersect. Non-interception of ISRF is not necessarily evidence of item order [39] and would not be meaningful for interpretation of the LSS. Therefore, we did not examine noninterception of ISRF as a measure of DMM fit, rather focusing on invariant item ordering. Invariant item ordering can provide an interpretation: If the items have the same ordering along θ, then the summary score might be interpreted based on that order [32,33,39]. We therefore examined manifest invariant item ordering (MIIO) as suggested by Ligtvoet et al. (2010Ligtvoet et al. ( , 2011 [40,41]. We assessed MIIO using the check.iio function of the R package "mokken," which orders items by their conditional mean scores and checks each item pair for violations of ordering for rest score groups. Violations that exceed the default minimum value (number of ISRFs times 0.03) are reported [33,41]. Coefficient H T gives an indication of the degree to which the sample follows item ordering. We applied the rules of thumb that H T < 0.3 implies the item ordering accuracy is too low, H T between 0.3 and 0.4 as ordering with low accuracy, H T between 0.4 and 0.5 as moderate accuracy, and H T > 0.5 as highly accurate item ordering [41].

Results
The included 7,933 subjects of the MIC reported 566 of the 3125 possible response patterns on the EQ-5D-5L; "11,111" (full health) and slight problems with PD with no problems on the other dimensions ("11,121") were the first and second most often endorsed (19.3% and 14.3%, respectively). Subjects without chronic conditions were most homogeneous in regard to health profile (94 unique profiles), while those with diabetes reported the most diverse range of health (239 unique profiles; supplementary materials B and C). Number of distinct health profiles ranged from 164 (Norway) to 276 (UK) across country samples. Although over 8% of MIC respondents noted their general health as "poor," endorsements of the most severe EQ-5D-5L levels were rare, especially for MO and SC (Table 1).

AISP and scalability
The EQ-5D-5L is a reliable scale, with ρ = 0.822 and λ-2 = 0.819. AISP placed all five items onto a single latent variable when the lower bound for H i was set at the default 0.3, even when considering the 95% confidence interval (derived from standard errors). AD was identified as an unscalable item at H i ≥ 0.378. PD was rejected from the scale at H i ≥ 0.685, SC at H i ≥ 0.721, and no items could be scaled at H i ≥ 0.75 (Table 2).  Table 3). H ij of SC and PD was larger than all AD item pairs, but smaller than 0.7, while all other item pairs had H ij above 0.7. Because the H i of AD was close to 0.3, the value of acceptability for H i , we decided to assess scalability by omitting this item. If the reduced item set would yield a much stronger scale, this would be an important finding. Researchers would possibly decide to employ the reduced items set in studies where a scale with increased scalability is needed, such as in instances where item ordering must be strictly maintained. When AD was removed from the model, H S increased from 0.559 to 0.714, and the H i of the four remaining items also increased ( Table 2). Figure 1 illustrates the IRF and ISRF charted over rest score groups for the five items of the EQ-5D-5L. All IRFs and ISRFs increased monotonically with no violations of manifest monotonicity observed (Table 2). Critical values of all items were zero, showing no misfit of the MHM.

Fit of MIIO
Two violations of MIIO were observed between 1) AD and MO, and 2) AD and UA (Table 4). AD had the highest critical value, and in backward selection was recommended for exclusion. Due to this recommendation for exclusion, we examined MIIO excluding the AD item, after which no violations of MIIO remained.
In order to visualize the IRF of all items in one figure, we selected item-pair results from the check.restscore function (Fig. 2). IRF charted over rest score groups indicate that the lower rest scores (≤ 3) were driven by PD and secondarily by AD. In the slightly higher rest score groups (2-4), the IRFs of MO and UA equally increased and overlapped, while AD's IRF flattened. IRF of AD crossed both MO and US at rest scores 4-5. The IRF of SC did not increase

Stratified analysis across subgroups
H coefficients were estimated for disease and country subgroups for the complete EQ-5D-5L scale as well as for the scale omitting the AD item as AD was recommended for exclusion by the check.iio procedure for many subgroups (Table 4). For the complete scale, H s was weak for the healthy subsample (0.363), moderate for subjects with hearing problems and from Norway (0.496, 0.476, respectively). H s for all other subgroups was "strong" but was all below 0.6. H T for the full scale ranged from 0.373 (Norway) to 0.747 (depression). Violations were found in the AD and MO and AD and UA pairs consistently across all subgroups except for respondents without self-reported chronic illness, depression, arthritis, and Canadian respondents (Table 4). Backward item selection recommended excluding AD for all subsamples that detected violations except for Norway. Critical values for Norway were 34 for UA and 50 for AD, demonstrating non-serious misfit. Figure 3 plots IRF of item pairs AD/MO and AD/UA for subgroups which did not recommend AD for removal. Not surprisingly, AD was easier to endorse at all rest score groups than MO or UA for the subgroup with depression, and the IRFs are far enough apart that they do not intersect.
H ij tends to be largest between AD and UA, AD and PD across all subsamples except for healthy respondents, those reporting hearing problems and the Australian sample, showing that AD is more closely related to UA and PD than MO and SC (Table 5). H ij between AD and all the other EQ-5D-5L items was particularly small for the healthy subsample.

Discussion
The EQ-5D-5L items form a strong Mokken scale, fitting the MHM and thus demonstrating that LSS, an additive summary score independent of population value sets, is acceptable and meaningful for measurement. These results empirically demonstrate that the EQ-5D-5L LSS orders respondents along a latent variable of health, with higher score indicating poorer health. The MHM fit of the EQ-5D-5L data reflects the rigorous work in questionnaire development, especially with refinement of the response levels [19,27,42]. Meijer and colleagues cautioned that sometimes strong Mokken scales are not optimal because they could reflect items covering similar or overlapping content [43,44]. However, the EQ-5D is a brief scale with items covering diverse aspects of function and symptoms, so this concern is minimized.
MIIO results suggest that an interpretation of functional limitations and health symptoms can also be applied to the LSS: the low range of the score represents mainly problems with PD and AD, the lower to mid-range scores indicate additional problems with MO and UA, while the middle to higher scores reveal limitations in SC. The ordering of these items was found to be moderate. The finding that item ordering was not accurate for the healthy sub-sample reflected the observation of less variation in EQ-5D-5L responses in that subsample.
Our results empirically demonstrate what is conceptually understood: the LSS of the EQ-5D-5L orders persons by their levels of health. The relatively consistent performance of the EQ-5D-5L scale across countries is encouraging for the purpose of providing evidence to support the use of the LSS to compare the EQ-5D across countries. This is important because the EQ-5D has historically been scored using weights based on country-specific societal preferences. The LSS is used to describe data quality of valuation studies [45,46] but has yet seen broader acceptance. A summary scoring function independent of population-specific value sets that is simple, psychometrically valid, and international in its applicability has tremendous advantages for researchers and population health scientists who wish to have a composite indicator of health for international comparisons using a measure available in hundreds of languages and is freely licensed and distributed by the EuroQol by non-profit organizations.
Although AD was initially retained in the scale as its H i was above the commonly accepted cutoff of 0.3, it was excluded when the cutoff was only raised to above 0.378. Additionally, AD was found to violate MIIO in most  subgroups-its IRF crosses the UA and MO IRFs at rest scores 3-4-and AD removal from the scale was suggested in backward model selection. The determination of whether an item should remain in a scale is not based solely on H i but depends on conceptual and empirical considerations and the application of the instrument. When AD was omitted, H s and H T improved to above 0.7 to indicate very strong person and item ordering. Therefore, in applications where scalability or item ordering is required to be strong, one could apply the LSS to only the four physical items of the EQ-5D and assess the AD item separately. Although the EQ-5D is rarely used as a diagnostic tool on the level of individual patients, item ordering can still be relevant for group level applications. For example, although patient groups with mainly physical symptoms do not suffer from anxiety/depressive problems more than the general population, the AD item may be more difficult to endorse than the physical items at moderate or more severe levels of disease (as indicated in these results). However, for conditions for which mental health is affected, the AD item could be easier to endorse than MO, SC and UA across the scale (as supported by our findings of MIIO in the subgroup with depression). The relationship between items may also be modified by other factors such as age or gender. This is an area needing future research.
IRT approaches to evaluating the EQ-5D have been relatively scarce in the literature: our results are comparable to available evidence. A recent investigation of the EQ-5D using Rasch rating scale model reported similar item ordering as our findings: PD was the easiest to endorse, UA, AD, and MO are at middle levels of difficulty of endorsement, and SC was the most difficult to endorse item [21]. Our scalability results were similar to previously published results for the physical function subscale of the SF-36-H S of 0.69 and H T of 0.53 [44].
IRT assumes items are indicators of a single latent variable. However, the EQ-5D was constructed using five different dimensions to create a composite measure of health status. AD conceptually measures mental health, while the other four items address physical health [48][49][50]. A previous study revealed that when several health measures were modeled with the EQ-5D-5L, MO, SC, and UA belonged to one dimension, AD to a second, and PD to a third [51]. However, other investigations found sufficient evidence that self-reported physical and mental health can be summarized using a single score [52]. Recent confirmatory factor analysis found the model including all five EQ-5D-5L items to have acceptable fit statistics [47]. These previous findings along with this study illustrate the tension between the multidimensional nature of health and summarizing health as a single latent construct. The theoretical measurement model, such as whether the EQ-5D is a formative or reflective measurement [47,54,55], must be considered when applying scoring approaches.
A limitation of this study was that the dataset only included adult participants from Western, developed countries. If person and item ordering are dependent on how item descriptions and response categories are interpreted, then these results may not extend to other populations. Further, the data were collected via online survey panels, and such participants may differ from the general population [29]. There is also a pressing need to conduct similar research in children. Due to ethical, methodological, and conceptual problems involved in eliciting preferences for children, the version of the EQ-5D for children and adolescents (EQ-5D-Y) does not have a preference value set [53]. Therefore, application of the LSS may be particularly relevant for the EQ-5D-Y as its use expands.

Conclusion
A conceptually cohesive scale of health can be operationalized using the LSS using all five items of the EQ-5D-5L as higher LSS scores indicate worse health and more severe functional limitations. In general, lower range of the score represents mainly problems with pain, the mid-range indicates additional problems with mobility and usual activities, and middle to higher range of scores reveals additional limitations with self-care. Anxiety/depression is easier to endorse than MO or UA at the lower range of scores, but at moderate and higher scores becomes more difficult to endorse. Compared to utility scores, LSS scores have advantages depending on the application and subgroup/population. However, the scale is weak in the healthy subsample, indicating it may be less informative in such populations. More work must be done to investigate whether person and item order holds for other populations, especially for children and adolescents.

Declarations
Conflict of Interest All authors are members of the EuroQol group. Outside of scientific meetings, group members do not receive any financial support. Ruixuan Jiang is an employee of Merck; however, conceptualization and most of study analyses were completed during her graduate studies.
Ethical approval This paper only used secondary data and authors did not contain human or animal data collection performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.