Introduction

Atopic dermatitis (AD) is a common chronic inflammatory skin disorder affecting up to 10% of adults [1, 2]. It can appear on any area of the body, but predilection sites are the face, hands, and flexural surfaces of the extremities [1]. Clinical symptoms include recurrent eczematous lesions and intense itch that may considerably decrease patients’ health-related quality of life (HRQoL). The excessive dryness, itching and scratching may cause substantial limitations in daily functioning, social interactions, leisure activities, and may lead to sleep disturbance [3,4,5,6]. Treatments include topical emollients, topical corticosteroids, calcineurin inhibitors and systemic immunosuppressants (e.g. corticosteroids, cyclosporine A, azathioprine, mycophenolate mofetil or methotrexate), according to disease severity [7, 8]. Recently, an increasing number of new treatment options have become available for moderate-to-severe AD, such as targeted biological therapies (dupilumab and tralokinumab) and small molecules (baricitinib, abrocitinib and upadacitinib) [9].

AD represents a large burden on patients and society with an average annual total cost per patient of up to €20,000 [10,11,12]. New treatments typically require more health resources, and thus, providing evidence on their cost-effectiveness is important to show their value for money and to support financial decision-making in healthcare [13]. In these economic evaluations, quality-adjusted life year (QALY) is used as a summary outcome that combines quantity and quality of life. Using generic preference-accompanied instruments is the most common way to assess HRQoL to generate QALYs. These measures consist of a descriptive system and a set of utility values [14]. The most commonly used generic preference-accompanied measure is the EQ-5D [15]. Over the past three decades, it has been used in over 10,000 studies and by now, it has become a preferred instrument to estimate QALYs in pharmacoeconomic guidelines in nearly 30 countries [16,17,18].

The EQ-5D has two versions for adults, the three-level (EQ-5D-3L, hereafter 3L) [19] and the five-level (EQ-5D-5L, hereafter 5L) [20]. Both have been increasingly used in dermatological patient populations [21,22,23,24]. The major difference between the two adult EQ-5D questionnaires is that the 5L includes five rather than three levels in each dimension and uses a standardised wording across dimensions. In many countries, including Hungary, both adult questionnaires are recommended by pharmacoeconomic guidelines [18]. However, the two versions may lead to different cost-effectiveness outcomes; understanding their psychometric properties in different contexts and settings is therefore critical to inform the debate about the choice of instrument.

Several previous studies in different health condition groups and general population samples showed improved measurement properties of the 5L descriptive system, such as reduced ceiling effect, better informativity and construct validity [25, 26]. Among dermatological conditions, the measurement properties of the 3L and 5L have been compared in psoriasis [27] and hidradenitis suppurativa (HS) [28]; however, no comparative study is available in AD. There could be large differences in how the descriptive systems perform across different health conditions, even among chronic skin diseases. Furthermore, it is important to examine how measurement properties of the descriptive systems translate into the discriminatory power of utilities, as this has a direct impact on QALYs.

This study therefore seeks to compare the psychometric properties of the 3L and 5L with regard to both the descriptive systems and utilities (hereafter referred to as index scores in the context of the EQ-5D) in adult patients with AD. Index scores will be estimated using the Hungarian 3L and 5L value sets, which were developed in a parallel valuation study using the same respondents (n = 1000), protocol (i.e. EQ-VT), valuation method (i.e. composite time trade-off) and modelling approach (i.e. heteroscedastic Tobit) [29]. This gives us a unique opportunity to compare not only the descriptive systems but also utilities using real-world patient data. We focus on the following psychometric properties: ceiling and floor effects, agreement, redistribution properties, informativity, convergent validity and known-groups validity.

Methods

Study design and patients

Between March 2018 and January 2021, a cross-sectional, multicentre study was conducted in Hungary among consecutive adult AD patients. Data were collected at two university dermatology clinics in Budapest and Debrecen and an outpatient centre in Pannonhalma. In each study site, patients were asked to read and sign an informed consent form on paper before participating in the study. Ethical approval was granted by the Scientific and Ethical Committee of the Medical Research Council in Hungary (reference No.: 29655/2018/EKU). Eligible patients were aged 18 years or over and had a diagnosis of AD confirmed by a dermatologist. Patients completed multiple generic and skin-specific HRQoL measures in a fixed order: Dermatology Life Quality Index (DLQI) [30], EQ-5D-5L (5L) [20], Skindex-16 [31], and EQ-5D-3L (3L) [19]. The 5L was placed before the 3L within the questionnaire to prevent the underuse of the second and fourth levels in the 5L [32]. The EuroQol visual analogue scale (EQ VAS) was completed only once, as part of the 5L.

Measures

A detailed description of the HRQoL measures used in the study is provided in Table 1, including their items, response levels, scoring and interpretation. In addition to HRQoL instruments, patients were asked to assess their level of itching and sleep disturbance for the past 1 month and their current disease severity (PtGA) using 11-point visual analogue scales (VAS). Demographic and medical history data were obtained from patients, including age, sex, education, employment, family history of AD and disease duration. Dermatologists assessed patients’ disease severity using the Investigator Global Assessment (IGA) [33], the objective SCORing Atopic Dermatitis (oSCORAD) [34], and the Eczema Area and Severity Index (EASI) scales [35], and provided information about treatments. These severity scales are widely used in clinical trials, treatment guidelines and core outcome sets [36,37,38]. We used the cut-off values for the interpretation of EASI and oSCORAD scores as published in Chopra et al. [39], and for DLQI as suggested by Hongbo et al. [40, 41].

Table 1 Generic and skin-specific HRQoL measures used in the study

Statistical analyses

We built on methods established in previous psychometric studies comparing the performance of the 3L and 5L across different healthy and patient populations [25, 27, 28, 32, 42].

Feasibility and ceiling

The feasibility was assessed by comparing the number of missing responses for the two EQ-5D questionnaires. Missing values were not imputed. Due to the two additional response levels, a reduced ceiling was expected in the 5L compared to the 3L. First, we computed the difference in the proportion of respondents scoring no problems (absolute ceiling reduction). Then, we calculated the relative reduction as (ceiling3L-ceiling5L)/ceiling3L. We compared the difference in ceiling between the 3L and 5L using McNemar’s test. The distributions of index scores were visualised using histograms, and the proportion of patients reporting no problems across all five EQ-5D dimensions was calculated to estimate the ceiling.
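The ceiling calculations described above can be sketched in a few lines of Python. This is only an illustration, not the study's code (the analyses were run in R): the function names are hypothetical, the exact binomial form of McNemar's test is an assumption (the text does not say which variant was used), and the discordant counts in the example are invented.

```python
from math import comb


def ceiling_reduction(ceiling_3l, ceiling_5l):
    """Return (absolute, relative) ceiling reduction from the 3L to the 5L."""
    absolute = ceiling_3l - ceiling_5l
    relative = absolute / ceiling_3l  # (ceiling3L - ceiling5L) / ceiling3L
    return absolute, relative


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: patients at the ceiling on the 3L but not on the 5L
    c: patients at the ceiling on the 5L but not on the 3L
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) probability of k or fewer events, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Overall ceiling (profile 11111) reported in this study: 27.5% vs. 22.5%.
absolute, relative = ceiling_reduction(0.275, 0.225)
print(f"absolute: {absolute:.3f}, relative: {relative:.3f}")

# Hypothetical discordant counts, for illustration only.
print(f"p = {mcnemar_exact(15, 4):.4f}")
```

Because the inputs here are rounded percentages, the relative reduction may differ in the last digit from a value computed on the underlying patient counts.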

Agreement

The difference between 3L and 5L index scores was tested using the Wilcoxon signed-rank test. The agreement between the 3L and 5L was displayed using a Bland–Altman plot [43], with the mean of the 3L and 5L index scores on the x-axis and their difference on the y-axis. The 95% limits of agreement were calculated as the mean difference ± 1.96 × standard deviation (SD) of the differences. Points outside the upper and lower limits were considered outliers. We used the intraclass correlation coefficient (ICC) to test parallel forms reliability, which reflects both the agreement and degree of correlation between the two descriptive systems [44]. A two-way random model with absolute agreement was used to estimate ICCs [45]. We classified ICC values as follows: poor: 0–0.39, fair: 0.40–0.59, good: 0.60–0.74 and excellent: 0.75–1 [46]. Good or excellent agreement was expected between 3L and 5L [25].
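As a rough illustration of the agreement statistics described above, the sketch below computes the Bland–Altman limits and a two-way random, absolute-agreement ICC for two paired score lists. It is not the study's R code. The function names are hypothetical, the example scores are invented, and the single-measure form of the ICC is an assumption, since the text does not state whether single or average measures were reported.

```python
from statistics import mean, stdev


def bland_altman_limits(x, y):
    """Mean difference and limits (mean difference +/- 1.96 SD of the differences)."""
    diffs = [a - b for a, b in zip(x, y)]
    d_mean, d_sd = mean(diffs), stdev(diffs)
    return d_mean, d_mean - 1.96 * d_sd, d_mean + 1.96 * d_sd


def icc_absolute_agreement(x, y):
    """Single-measure ICC, two-way random effects, absolute agreement (two measures)."""
    n, k = len(x), 2
    grand = mean(x + y)
    row_means = [(a + b) / k for a, b in zip(x, y)]  # one row per patient
    col_means = [mean(x), mean(y)]                   # one column per instrument
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in x + y)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical index scores for six patients (not study data).
scores_3l = [1.00, 0.92, 0.85, 0.80, 0.71, 0.55]
scores_5l = [1.00, 0.94, 0.83, 0.76, 0.66, 0.41]

d_mean, lower, upper = bland_altman_limits(scores_3l, scores_5l)
print(f"mean difference {d_mean:.3f}, limits {lower:.3f} to {upper:.3f}")
print(f"ICC {icc_absolute_agreement(scores_3l, scores_5l):.3f}")
```

Absolute agreement penalises a systematic shift between the two instruments, which a consistency-type ICC would ignore; that is why the between-instrument mean square appears in the denominator.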

Redistribution properties

We calculated the proportion of consistent and inconsistent 3L–5L response pairs using cross-tabulations. To compare responses, 3L responses were recoded to a 5L scale (level 1 on the 3L = level 1 on the 5L, level 2 on the 3L = level 3 on the 5L and level 3 on the 3L = level 5 on the 5L). A 5L response at least two levels away from its recoded 3L pair was considered inconsistent (e.g. a respondent chooses no problems [level 1] on the 3L and moderate problems [level 3] on the 5L) [32]. The average size of inconsistency was calculated using the following formula: |3L − 5L| − 1 [32].
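The recoding and the size formula can be sketched as follows. The function names and the example response pairs are hypothetical; the recoding map and the |3L − 5L| − 1 formula are taken from the text, and a pair is flagged as inconsistent when the resulting size is at least 1 (that is, when the recoded responses are two or more levels apart).

```python
# 3L levels mapped onto the 5L scale: 1 -> 1, 2 -> 3, 3 -> 5.
RECODE_3L_TO_5L = {1: 1, 2: 3, 3: 5}


def inconsistency_size(level_3l, level_5l):
    """Return |3L - 5L| - 1 after recoding the 3L response to the 5L scale."""
    return abs(RECODE_3L_TO_5L[level_3l] - level_5l) - 1


def summarise_inconsistency(pairs):
    """pairs: iterable of (3L level, 5L level).

    Returns (share of inconsistent pairs, mean size among inconsistent pairs).
    """
    sizes = [inconsistency_size(l3, l5) for l3, l5 in pairs]
    inconsistent = [s for s in sizes if s >= 1]  # at least two levels apart
    share = len(inconsistent) / len(sizes)
    mean_size = sum(inconsistent) / len(inconsistent) if inconsistent else 0.0
    return share, mean_size


# Hypothetical response pairs for one dimension (not study data).
pairs = [(1, 1), (1, 2), (1, 3), (2, 3), (2, 5), (3, 5), (3, 2)]
share, mean_size = summarise_inconsistency(pairs)
print(f"{share:.1%} inconsistent, average size {mean_size:.2f}")
```

Averaging the size over inconsistent pairs only means the smallest possible average is 1, which is how an average of 1.00 for a dimension should be read.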

Informativity

Informativity reflects the ability of an instrument to discriminate between different levels of health [47]. The informativity of the five dimensions of 3L and 5L was determined using Shannon’s (H′) and Shannon’s evenness (J′) indices [47, 48]. H′ expresses the absolute information content (the number of possible responses) combined with how evenly the information is distributed across all responses, while J′ represents the evenness of distribution exclusively. Our hypothesis was that the 5L, with its two additional levels, improves informativity relative to the 3L [49]. We calculated the two indices according to the following formulae (L: number of levels in one dimension of the EQ-5D; p_i: proportion of patients choosing the ith level):

$$H^{\prime} = -\sum_{i=1}^{L} p_{i} \log_{2} p_{i}$$
$$J^{\prime} = \frac{H^{\prime}}{H^{\prime}_{\max}}, \text{ where } H^{\prime}_{\max} = \log_{2} L$$

Higher H′ indicates better informativity (range: 0 to log2 L, where log2 L is 1.58 for the 3L and 2.32 for the 5L). The value of J′ ranges from 0 to 1, whereby 0 corresponds to the worst discriminatory power, when all responses are in the same response level, and 1 indicates the best discriminatory power, with an even distribution of responses among all levels [25].
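A minimal sketch of the two indices, assuming responses for one dimension are available as counts per level; the function name and the example counts are hypothetical. Empty levels are skipped, following the usual convention that 0 × log2(0) is treated as 0.

```python
from math import log2


def shannon_indices(counts):
    """counts: number of respondents per level (length L). Returns (H', J')."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]  # empty levels contribute 0
    h = sum(p * log2(1 / p) for p in props)       # same as -sum(p * log2(p))
    return h, h / log2(len(counts))


# Hypothetical response counts for one dimension (not study data).
h3, j3 = shannon_indices([120, 90, 8])          # 3L: three levels
h5, j5 = shannon_indices([105, 60, 35, 15, 3])  # 5L: five levels
print(f"3L: H' = {h3:.2f}, J' = {j3:.2f}")
print(f"5L: H' = {h5:.2f}, J' = {j5:.2f}")
```

Since J′ divides by log2 L, a five-level dimension can carry more absolute information (higher H′) and still score lower on J′ if its responses cluster in a few levels.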

Convergent validity

Convergent validity was analysed by calculating Spearman’s rank-order correlation coefficients (rs) between the 3L and 5L dimensions and index scores and other previously validated measures. Based on earlier studies, we hypothesized at least moderate correlations between the EQ-5D dimensions and index scores and EQ VAS, DLQI, and Skindex-16 [50], and weak correlations with severity measures, including IGA, oSCORAD, EASI, and PtGA VAS [51]. In general, we expected most EQ-5D dimensions and index scores to correlate weakly or very weakly with sleep disturbance and itching VAS, as these are not part of the EQ-5D descriptive system [52]. The only exception was itching, for which we assumed a moderate correlation with the pain/discomfort dimension [53]. We expected the 5L to be more strongly related to these disease severity and skin-specific HRQoL measures than the 3L. We interpreted absolute correlation coefficients as follows: very weak < 0.20, weak 0.20–0.39, moderate 0.40–0.59, strong 0.60–0.79 and very strong ≥ 0.80 [54].
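The correlation step can be illustrated with a small, dependency-free sketch. The function names and the example data are hypothetical; Spearman's coefficient is computed here as the Pearson correlation of average ranks (so ties are handled), and the strength labels follow the bands given in the text, applied to the absolute value of the coefficient.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rs; undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def strength(rs):
    """Label |rs| using the bands from the text."""
    r = abs(rs)
    if r < 0.20:
        return "very weak"
    if r < 0.40:
        return "weak"
    if r < 0.60:
        return "moderate"
    if r < 0.80:
        return "strong"
    return "very strong"


# Hypothetical data: EQ-5D index scores and DLQI totals for six patients.
index_scores = [0.95, 0.88, 0.80, 0.91, 0.60, 0.72]
dlqi_totals = [2, 9, 14, 7, 25, 12]
rs = spearman(index_scores, dlqi_totals)
print(f"rs = {rs:.3f} ({strength(rs)})")
```

A negative rs is expected in this pairing because higher DLQI scores indicate worse HRQoL while higher index scores indicate better health.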

Known-group validity

Due to the skewed distribution of EQ-5D index scores, non-parametric Mann–Whitney and Kruskal–Wallis tests were used to assess and compare the ability of 3L and 5L to distinguish between known groups of patients defined by severity scores on IGA, oSCORAD, and EASI or skin-specific HRQoL on DLQI. We hypothesized that patients with higher disease severity or worse skin-specific HRQoL have significantly lower index scores and the 5L is able to better differentiate across known groups.

Effect sizes (ES) were calculated as follows:

$$\mathrm{ES}\left(Z\right)=\frac{{\left(\mathrm{Mann}\text{-Whitney} \, {\text{Z}}\right)}^{2}}{n-1}$$
$$\mathrm{ES}\left(H\right)=\frac{\mathrm{Kruskal}\text{-Wallis} \, {\text{H}} \, -k+1}{n-k}$$

where k is the number of groups, and n is the sample size. We interpreted ESs ≥ 0.01 as small, ≥ 0.06 as moderate and ≥ 0.14 as large [55]. Relative efficiency (RE) was computed as the ratio of the ESs of 5L and 3L index scores. An RE larger than 1 indicated that the 5L was more efficient in distinguishing between known groups. Data analysis was carried out in R Statistical Software (v4.1.2, Vienna, Austria) [56]. A p < 0.05 was considered statistically significant and all tests were two-sided.
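The Kruskal–Wallis effect size, its interpretation and the relative efficiency can be sketched as below. The function names, the test statistics in the example and the "negligible" label for values under 0.01 are assumptions made for illustration; the formula and the thresholds are those given in the text.

```python
def es_kruskal_wallis(h, k, n):
    """ES(H) = (H - k + 1) / (n - k), with k groups and n patients."""
    return (h - k + 1) / (n - k)


def interpret_es(es):
    """Label an effect size; 'negligible' (< 0.01) is an added assumption."""
    if es >= 0.14:
        return "large"
    if es >= 0.06:
        return "moderate"
    if es >= 0.01:
        return "small"
    return "negligible"


def relative_efficiency(es_5l, es_3l):
    """RE > 1 means the 5L separates the known groups more efficiently."""
    return es_5l / es_3l


# Hypothetical Kruskal-Wallis H statistics for three severity groups, n = 218.
es_3l = es_kruskal_wallis(h=36.0, k=3, n=218)
es_5l = es_kruskal_wallis(h=40.0, k=3, n=218)
print(f"3L: {es_3l:.3f} ({interpret_es(es_3l)})")
print(f"5L: {es_5l:.3f} ({interpret_es(es_5l)})")
print(f"RE: {relative_efficiency(es_5l, es_3l):.3f}")
```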

Results

Overall, 224 adult AD patients were invited to the study, four of whom declined to participate and another two patients did not finish the questionnaire. Thus, a total of 218 patients completed the questionnaire. No respondents were excluded from the data analysis. Demographic and clinical characteristics of patients are summarized in Table 2. Overall, 57.8% were women and the mean age was 31.3 ± 11.7 years (range 18–73). According to oSCORAD, 21.1%, 33.5% and 45.4% had clear/mild, moderate and severe AD, respectively. Nearly two-thirds of the patients (63.3%) were treated by systemic non-biological therapy at the time of the survey, while 23.4% received topical therapy only and a minority (9.6%) were untreated. Patients reported substantial impairment in their skin-specific HRQoL with mean DLQI score of 13.4 ± 8.5 and Skindex-16 total score of 56.8 ± 27.5 (Table 3).

Table 2 Demographic and clinical characteristics of patients with AD
Table 3 Disease severity and health-related quality of life scores of AD patients

Feasibility and ceiling

There were no missing responses across the 3L or 5L descriptive systems; however, one patient left the EQ VAS blank. A total of 33 different health states were reported on the 3L and 84 on the 5L.

The frequencies and percentages of patients reporting a ceiling are presented in Table 4. A statistically significant reduction in ceiling effect from the 3L to the 5L was observed in the mobility (4.6%), self-care (11.5%) and usual activities (9.2%) dimensions, while in the anxiety/depression dimension the ceiling slightly increased (2.3%), although the difference between 3L and 5L was not statistically significant. The largest relative ceiling effect reduction was found for usual activities (17.2%), followed by self-care (13.2%) and pain/discomfort (12.8%). The proportion of patients reporting no problems in each dimension (11111) decreased from 27.5% on the 3L to 22.5% on the 5L (p = 0.029). There were a total of 6 (2.8%) ‘best health you can imagine’ (= 100) responses on the EQ VAS.

Table 4 Ceiling effect, inconsistencies and informativity of the EQ-5D-3L and EQ-5D-5L in AD

Agreement

The distribution of 3L and 5L index scores is shown in Fig. 1. One patient had a negative index score on the 5L, while no negative values were observed on the 3L. The mean 5L index score was lower than that of the 3L, although the difference was not statistically significant (0.82 ± 0.22 vs. 0.85 ± 0.15, p = 0.928). Agreement between the two measures was excellent according to the predefined classification, with an ICC of 0.815 (95% CI 0.758–0.859, p < 0.001). This was confirmed by the Bland–Altman plot in Fig. 2. The differences between 3L and 5L index scores tended to be larger for more severe health states.

Fig. 1 Distribution of EQ-5D-3L and EQ-5D-5L index scores in AD patients. AD atopic dermatitis

Fig. 2 Bland–Altman plot of EQ-5D-3L and EQ-5D-5L index scores in AD patients. AD atopic dermatitis

Redistribution properties and inconsistencies

The percentages of consistent and inconsistent response pairs in each level of 3L and 5L are shown in Table 5. A total of 64 (5.9%) inconsistent response pairs were reported by 50 (22.9%) patients. The average size of inconsistency was very small (1.09). The highest proportion of inconsistent response pairs (9.2%) and largest average inconsistency (1.15) were present in the anxiety/depression dimension (Table 4). The fewest inconsistent responses occurred in the mobility dimension (1.4%) with an average size of 1.00.

Table 5 Redistribution properties: cross-tabulation of EQ-5D-3L and EQ-5D-5L responses

Informativity

The informativity results are provided in Table 4. The 5L increased the absolute informativity (H′) across all dimensions (3L 0.53–1.27 vs. 5L 0.81–1.98), suggesting the usefulness of the two additional response levels in the 5L. Relative informativity (J′) increased for the first four dimensions (3L 0.33–0.74 vs. 5L 0.35–0.85), but not for anxiety/depression (3L 0.80 vs. 5L 0.73).

Convergent validity

Table 6 shows the correlations between EQ-5D dimensions and index scores with other instruments and scales. The results provide support for most of our hypotheses. The EQ-5D mobility and self-care dimensions showed weak or no correlations with other measures. The usual activities, pain/discomfort and anxiety/depression dimensions were moderately or strongly correlated with the DLQI and Skindex-16 subscale and total scores (rs = 0.429–0.670). The only exception was the symptoms subscale of Skindex-16 that correlated weakly with anxiety/depression. As expected, the itching experienced in the past 1 month exhibited the strongest correlation with the pain/discomfort dimension (3L 0.351 vs. 5L 0.476). Similarly, sleep VAS score for the past 1 month correlated moderately with pain/discomfort (3L 0.381 vs. 5L 0.484). Both itching and sleep VAS showed weak correlations with 3L and moderate correlations with 5L index scores.

Table 6 Convergent validity: Spearman’s correlation coefficients

Moderate or strong correlations were detected between the EQ-5D index scores and DLQI and Skindex-16 total scores (rs = −0.731 to −0.622). Both the 3L and 5L index scores produced strong correlations with the EQ VAS (rs = 0.626 vs. 0.665). In line with our hypotheses, weak correlations were observed between index scores and disease severity measured by IGA, oSCORAD, and EASI (rs = −0.359 to −0.274). PtGA VAS scores, however, showed moderate rather than the expected weak correlations with 3L and 5L index scores (rs = −0.531 vs. −0.583). With very few exceptions, the 5L demonstrated stronger correlations than the 3L with all instruments and scales. The difference between the 3L and 5L was particularly pronounced for the pain/discomfort dimension.

Known-group validity

Results of the known-groups validity analyses are presented in Table 7. Both the 3L and 5L were able to distinguish across predefined groups of patients based on severity and skin-specific HRQoL (i.e. DLQI score bands) with moderate to large effect sizes (0.080–0.489). Patients with more severe disease and worse skin-specific HRQoL had lower EQ-5D index scores (p < 0.001). The 5L more efficiently discriminated across EASI (RE 1.033) and DLQI groups (RE 1.275), while the 3L slightly outperformed the 5L in the case of IGA (RE 0.978) and oSCORAD groups (RE 0.966).

Table 7 Known-groups validity of EQ-5D-3L and EQ-5D-5L index scores in AD patients

Discussion

This is the first study to compare the psychometric properties of the two adult versions of the EQ-5D in patients with AD. Both the 3L and 5L exhibited overall good psychometric properties in AD; however, the 5L was superior in terms of ceiling effects, informativity and convergent validity. Similar head-to-head 3L vs. 5L comparative studies were previously carried out in Hungary in two other chronic inflammatory dermatological conditions, psoriasis and hidradenitis suppurativa, which allows for direct comparisons. In line with these prior studies of similar sample size, the 5L resulted in a much richer set of responses with more than twice as many unique health state profiles (psoriasis 86 vs. 30 [27], hidradenitis suppurativa 101 vs. 43 [28], and AD 84 vs. 33). Further similarities across these studies include a substantial relative ceiling effect reduction with the 5L (psoriasis 11.4%, hidradenitis suppurativa 14.6% and AD 18.3%), the low proportion of inconsistent response pairs (psoriasis 3.9%, hidradenitis suppurativa 8.0% and AD 5.9%) and an identical or improved average relative informativity of the 5L (psoriasis 0.61, hidradenitis suppurativa 0.74 and AD 0.64). It therefore seems that the two extra levels of the 5L are used effectively in AD, as in other chronic dermatological diseases, and enable more patients to report health problems.

The improved measurement properties of the 5L descriptive system appear to be translated to the level of utilities, as 5L index scores showed stronger correlations with disease severity and skin-specific HRQoL measures in AD in comparison with the 3L. The strong correlations of the 5L index scores with Skindex-16 (rs = −0.684) and DLQI (rs = −0.731) total scores lend supportive evidence to the excellent validity of the 5L in this patient population. However, known-groups validity across disease severity and skin-specific HRQoL groups was established for both the 3L and 5L, with negligible differences in effect sizes between the two measures.

Several findings of the present study may be explained by the slightly different wording used in the 3L questionnaire compared to the 5L. Some of these changes affect all language versions (e.g. most severe level of mobility is ‘confined to bed’ in the 3L and ‘unable to walk’ in the 5L) and there are a few variations used solely in the Hungarian versions (e.g. the descriptor ‘anxiety/depression’ in the 5L is ‘anxiety/feeling down’ in the 3L) [29]. This latter modification seems to be responsible for the unexpected psychometric properties of the anxiety/depression dimension, including an increase in ceiling effect (3L 45.4% vs. 5L 47.7%), lower relative informativity of the 5L (3L 0.80 vs. 5L 0.73) and the highest rate of inconsistent response pairs in anxiety/depression (9.2%) among the five dimensions. Similar psychometric properties of the anxiety/depression dimension were reported by other studies from Hungary [27, 28, 57].

Itching is considered a hallmark symptom of AD that may adversely affect patients’ HRQoL, including sleep. It is currently debated to what extent the EQ-5D descriptive system is able to capture itching and sleep problems. A recent study with AD patients found a very weak and non-significant correlation between 3L index scores and sleep disturbance as measured on weekly average scores of an 11-point numeric rating scale [58]. In another study with burn patients, the pain/discomfort dimension of the 5L showed moderate correlations with a 10-point itching VAS [53]. Recent qualitative evidence in psoriasis patients also suggests that the discomfort element of the pain/discomfort composite dimension may cover itching to a minor extent [59]. Our findings in AD showed weak correlations for the 3L and moderate correlations for the 5L pain/discomfort dimension and index scores with itching and sleep problems. However, a 1-month recall period was used for the itching and sleep VAS, whereas the EQ-5D asks about ‘today’. These results are also relevant for the currently expanding bolt-on research programme for the EQ-5D. Over the past three decades, several additional dimensions (bolt-ons) have been developed for the EQ-5D to improve the accuracy and precision of the measure in specific populations [60]. Among them, there is a bolt-on aiming to assess sleep problems and another psoriasis-specific bolt-on with two items, one of which, skin irritation, measures the level of itching experienced by the respondent [61, 62].

Another noteworthy finding of this study is that mean index scores were lower in the 5L (5L 0.82 vs. 3L 0.85). As the Hungarian 3L and 5L value sets were developed in a parallel valuation study from a common sample, using the same preference elicitation method and modelling approach, the differences found in index scores reflect the wording differences between the two measures. The difference in mean 3L and 5L index scores was smaller at the top end of the scale near ‘full health’ and there was a widening gap at lower mean index scores. For example, patients with severe AD according to their oSCORAD score had a considerably higher mean 3L than 5L index score (0.80 vs. 0.76). In contrast, the difference was much smaller in both the ‘clear’ (0.98 vs. 0.97) and ‘mild’ (0.90 vs. 0.91) oSCORAD groups. As a result, an assumed improvement from ‘severe’ to ‘clear’ skin may lead to a mean index score gain of 0.18 with the 3L and 0.21 with the 5L, which could result in a lower cost-effectiveness ratio with the 5L for the same AD treatment.
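To make the last point concrete, the sketch below turns the two index score gains into cost-effectiveness ratios. The mean index scores are those reported above; everything else is a simplifying assumption made only for illustration: the incremental cost is a hypothetical figure, the gain is assumed to last exactly one year, and the comparator is assumed to produce no gain.

```python
def utility_gain(score_after, score_before):
    """Index score gain from moving between two health states."""
    return score_after - score_before


def cost_per_qaly(incremental_cost, gain, years=1.0):
    """Incremental cost divided by QALYs gained (gain sustained for `years`)."""
    return incremental_cost / (gain * years)


# Mean index scores for the 'clear' and 'severe' oSCORAD groups (from the text).
gain_3l = utility_gain(0.98, 0.80)
gain_5l = utility_gain(0.97, 0.76)

HYPOTHETICAL_INCREMENTAL_COST = 10_000  # euros; invented for illustration

ratio_3l = cost_per_qaly(HYPOTHETICAL_INCREMENTAL_COST, gain_3l)
ratio_5l = cost_per_qaly(HYPOTHETICAL_INCREMENTAL_COST, gain_5l)
print(f"3L: gain {gain_3l:.2f}, cost per QALY {ratio_3l:,.0f}")
print(f"5L: gain {gain_5l:.2f}, cost per QALY {ratio_5l:,.0f}")
```

With the cost held fixed, the larger 5L gain sits in the denominator, so the same treatment appears more cost-effective when outcomes are measured with the 5L.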

The main strengths of the present study include the multicentre design, the diverse patient population in terms of sociodemographic and clinical background and the wide range of validated skin-specific HRQoL instruments and disease severity scales used. Potential limitations include the cross-sectional design, which did not allow the assessment of test–retest reliability and responsiveness of the instruments. Furthermore, most patients were recruited at university clinics, where patients with mild disease may be underrepresented. Lastly, although the DLQI and Skindex-16 have been extensively validated in AD patients and are the most widely used HRQoL questionnaires in dermatological conditions [63, 64], these are skin-specific and not condition-specific instruments, and their adequacy has been subject to criticism [65,66,67,68,69]. Further research may concentrate on the validation of the EQ-5D against AD-specific HRQoL measures, such as the Quality of Life Index for Atopic Dermatitis (QoLIAD) [70].

In summary, both the 3L and 5L showed an overall good validity in adult AD patients. The superiority of the 5L was confirmed in many aspects, including ceiling effect, informativity and convergent validity. Given the high prevalence and considerable societal burden of AD, our findings fill in an important gap in evidence needed when selecting instruments for economic evaluations. Such analyses have become particularly important with the increasing number of costly new therapies for AD, including biological and small molecule treatments [71,72,73]. Based on our findings, we recommend the use of the 5L to measure health outcomes both in clinical settings and for QALY calculations in adult AD.