Introduction

Developed by the EuroQol group in 2005, the EQ-5D-5L instrument provides a widely used descriptive system for health valuation in multiple languages [1]. This descriptive system expresses a person's health along with five attributes, i.e., mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each attribute has five levels (no problems, slight problems, moderate problems, severe problems, and unable to/ extreme problems) describing the severity of the person's health problems. Using this system of five five-level attributes, health valuation studies may ask respondents about their preferences regarding its 3125 possible health profiles (55).

In general, collecting ordinal responses using choice tasks, such as PC and BWS, is gaining widespread use in health economics and policy [2, 3]. Methodological advances in health preference research (HPR) have been applied successfully in eliciting patient and community preferences for a wide range of health care interventions [4]. Many literature reviews have been conducted that show the gaining interest in HPR [2, 5]. As a potential methodological extension, some researchers proposed including more choice tasks, such as ranking and best–worst scaling (BWS), as complements or alternatives to the time trade-off (TTO) tasks in the EQ-VT protocol [6,7,8,9]. Furthermore, many believe that choice tasks with their ordinal responses were less cognitively burdensome than cardinal tasks with their indifference responses [10,11,12]. The EQ-VT protocol currently includes some PC as a complement to the TTO to better understand preferences between EQ-5D-5L profiles; therefore, there seems to be an opportunity to include additional choice tasks within the protocol. In this project, we conducted a Dutch EQ-5D-5L valuation study, including PC and BWS tasks, to explore a natural extension to the EQ-VT protocol. The valuation is done in a pits scale rather than the conventional QALY due to lack of the life span attribute [13]. We proposed this project in hopes that BWS might serve as a possible alternative or addition for PC tasks in the EQ-VT protocol. Specifically, the single-profile (or case 2) task is one of the three BWS tasks [14]. Unlike a PC, where respondents choose between two EQ-5D-5L profiles, respondents in a case-2 BWS task face a single EQ-5D-5L profile (like a TTO task), making this task more coherent with the TTO task. In the case-2 BWS task, the respondent indicates the best and the worst attribute levels within the given profile. In this study, we hypothesize that the EQ-5D-5L values estimated using the PC and BWS responses agree.

Heteroskedasticity and heterogeneity in health valuation

Heteroskedasticity and heterogeneity have been identified as key limitations to the analysis and interpretation of preference evidence, particularly ordinal responses[15]. A recent review on heterogeneity analyses in HPR showed that most published studies analyzed heterogeneity without controlling for heteroskedasticity or differential scaling [3]. This paper further contributes to HPR by demonstrating the implications of controlling heteroskedasticity and heterogeneity in health valuation as well as separating taste and scale heterogeneity.

Like other observable differences [16], heteroskedasticity refers to differences in variance by observable factors, such as task-level or individual-level factors. In a heteroskedastic logit, its variance may vary between tasks systematically in response to task complexity and the number of choice alternatives, attribute differences, or individual behavioral differences [15, 17]. In this study, we hypothesize that variance varies by task sequence and task type and that controlling this heteroskedasticity can reduce uncertainty in EQ-5D-5L values. Heteroskedasticity is not a form of preference heterogeneity because the difference in variance is derived from a difference in behavior (e.g., task sequence), not preference.

Apart from heteroskedasticity, we also examine two types of preference heterogeneity [18]. First, groups of respondents like or dislike different alternatives in a systematic way that reflects the relative importance of the attributes (i.e., taste classes). Taste heterogeneity refers to differences in the relative effects of each attribute level. For example, some respondents may place a greater weight on functioning and others on feeling (e.g., pain/discomfort, anxiety/depression). Alternatively, there can be a group of respondents who fail to account for any of the EQ-5D-5L attributes, and by summing up their preference information creates coefficient estimates of garbage class. The responses of people who belong to a garbage class may show the probability of choosing the best (11111) over the worst (55555) EQ-5D-5L profile is near 50%. Second, groups may like or dislike alternatives systematically that reflect the absolute value of all attributes (i.e., scale classes). Scale heterogeneity refers to more subtle differences in the absolute effects of all attributes (compared to garbage classes), and scale classes may be related to the respondents' difficulty distinguishing between alternatives (e.g., more indifference with a lower scale value).

Estimating differences in attribute importance between respondents without controlling for scale heterogeneity can often mislead the interpretation of taste heterogeneity, which is confounded by scale heterogeneity [19]. Using the information on the respondents, a scale-adjusted latent class (SALC) model [20] can disentangle taste and scale heterogeneity simultaneously by identifying latent classes of persons who differ in their relative importance (taste classes), as well as latent scale classes—groups of people who differ by how intense (or indifferent) their preferences are. In this paper, heteroskedasticity is associated with observable differences in scale between tasks (e.g., task sequence), and scale heterogeneity is associated with latent differences in scale between individuals (e.g., failing the PC dominance task). The SALC model allows for heteroskedasticity and two forms of heterogeneity, and we hypothesize that controlling for all three can enhance health valuation. Given this background, this study is aimed to run a case analysis on a Dutch EQ-5D-5L valuation dataset with the following objectives. First, we examined the effects of controlling heteroskedasticity by comparing the results of the conditional and heteroskedastic logit. Second, we illustrated the EQ-5D-5L values based on the PC and case-2 BWS responses and assessed their correlation and agreement. Third, we estimated EQ-5D-5L values using the scale-adjusted latent class (SALC) logit models, which control for taste and scale heterogeneity as well as heteroskedasticity. Finally, we compare the Dutch EQ-5D-5L values to the original values produced using the EQ-VT protocol [21].

Methods

Overview

In September 2016, Dutch respondents were recruited from a marketing panel (Survey Sampling International) to complete computer-based interviews via an online survey instrument. We did not aim for a fully representative sample but sampled from groups with known EQ-5D-5L impairments. We aimed to sample 300 subjects stratified by domain and severity of health problems captured by the EQ-5D-5L. To facilitate the analysis of preference heterogeneity, all respondents completed the same PC and case-2 BWS tasks using the same series of EQ-5D-5L profiles. Examples of the PC and BWS tasks can be found in Appendix 1.

EQ-5D-5L profiles

Using the EQ-5D-5L descriptive system, the five five-level attributes can be described by a 5-digit vector of the attribute levels, where the position of the integer refers to the attribute, while the integer itself refers to the attribute level. For example, EQ-5D-5L profile '32512' would describe moderate problems walking about, some problems washing or dressing self, unable to perform usual activities, no pain or discomfort, and some anxiety or depression.

Experimental design

The BWS' Health profile A' design is based on an orthogonal main effects plan (OMEP) [22] that, in the case of the EQ-5D-5L, consists of 25 profiles. With these 25 profiles, in principle, it is possible to estimate 24 individual BWS level parameters. By rotating the OMEP coding,Footnote 1 a design was obtained with the minimal number of only one attribute at an extreme level, resulting in 15 out of 25 best choices with at least two attributes with the same lowest level and 16 out of 25 worst choices with at least two attributes with the same highest level. Therefore, at least 31 − 24 = 7 degrees of freedom to estimate a model for every respondent in case the other 19 choices were non-informative. The chosen health profiles are listed in Appendix 1 (Table 5). Moreover, the final design contained no states with all attributes at the same level, which would make the task excessively difficult, and the PC contained only one dominant comparison out of the 25 comparisons. Overall, it is not a representative sample, but more a stratified sample to the severeness of disease. The questionnaire was designed in a fashion that respondents first were asked to perform the BWS case 2 task with profile A.

Next, for the PC task, the 'Health profile B' that was added as a comparator to the BWS' Health profile A' was always the same profile, namely (24242), a state close to the center of the health-profile continuum (based on Devlin et al. [23]) that has three attributes at the same level, and the other two as well. Such a constant comparator design reduces efficiency to around 40–50% but provides the only currently known compromise possible between the needs of the case 2 BWS and the needs of the PC tasks [24]. This particular dual design appears unusual but is important in that it has properties that reflect the BWS case 2/PC relationship (investigation of "how I rescale my BWS case 2 estimates into PC-space") and practical benefits (minimizing cognitive load in the PC by familiarizing the respondent with profile A, then adding a constant, known, state B). A sample question of both types of tasks can be found in Appendix 1, Fig. 4.

Analysis

The final analysis dropped the dominant task from the PC question. Descriptive statistics were used to summarize respondents' characteristics and response feasibility of PC and BWS tasks. To maximize the use of the available data, we implemented a hybrid modeling approach that incorporated all PC and BWS responses to produce the Dutch EQ-5D-5L value set. Conditional logit model, heteroskedastic conditional logit, and heteroskedastic scale-adjusted latent class (SALC) model were estimated by maximum likelihood to illustrate the values for all 3125 EQ-5D-5L profiles [25]. The main effects of each model are shown as incremental changes in the level of severity on a pits scale where value (55555) = 0 and value (11111) = 1 [13]. Unlike EQ-VT studies, the study did not use the TTO or include any preferences evidence on "dying immediately;" therefore, the main effects cannot be reported on a quality-adjusted life-year (QALY) scale. Statistical analyses were done in R 4.0.2 [26,27,28]. A significance level of 0.05 was considered statistically significant.

Main-effect specification of EQ-5D-5L Values

To aid the interpretation of the BWS responses, we envisioned a profile of '00000' that represents a hypothetical ideal. The BWS specification includes twenty incremental variables, each representing the loss in health values for increasing severity from one level to the next of the same dimension, as well as five ancillary variables associated with a change in level from zero to one, which is outside the EQ-5D-5L descriptive system. The primary difference between the best and the worst responses is that the sign of the incremental variables switches (i.e., for best, the incremental variable is negative; for worst, the variable is positive). The hypothetical ideal is not relevant for the interpretation of the PC responses; therefore, its specification includes only the twenty incremental variables.

The twenty main-effect coefficients describe the value of the EQ-5D-5L profiles on a pits scale. The coefficients of the five ancillary variables have no effect on the EQ-5D-5L values; therefore, these estimated coefficients are reported in Appendix 1. Due to the identification problems of case-2 BWS, only four of the five ancillary parameters can be non-zero; therefore, we constrained the smallest ancillary parameter to zero, which has no effect on the EQ-5D-5L values.

Heteroskedasticity and differences by task

Overall, each PC and BWS response is a multinomial choice (from two and five alternatives, respectively) that reflects a respondent's preferences taking into account the 20 and 25 incremental variables, respectively. The conditional logit model assumes homogeneous preference and independent and identically distributed (IID) errors. Relaxing the IID assumption introduces the heteroskedastic conditional logit (HCL) model [29], where the scale parameter (inversely related to the variance) is an exponential function of observable factors that identify the source of differential variance and constrains the scale to be non-negative. The differential variance may be associated with individual level, choice set/task level, or alternative level characteristic variables. To avoid confounding between heteroskedasticity and scale heterogeneity, the scale parameter in this paper depends on only task-level variables, namely task sequence and task type (e.g., best/worst/paired comparison).

Furthermore, we estimated the heteroskedastic logit by task (i.e., BWS case 2 and PC) characteristics, computed the PC and BWS values using the interaction results, and assessed their correlation and agreement (Pearson's correlation and Lin's concordance, respectively).

Heterogeneity and EQ-5D-5L Values

The SALC model (model formulation in Appendix 2) allows for preference heterogeneity through latent classes. Taste classes represent groups that share the relative effects of each attribute level, and scale classes represent groups that share the absolute effects of all attributes. The likelihood that each individual belongs to a specific group is known as a respondent's grade-of-membership (GOM) and may be associated with their observable characteristics. In the analysis, we hypothesize that individuals' demographics, socio-economic variables, and health conditions are associated with taste class membership. The scale class, which identifies the irregularities and idiosyncratic features of choice behavior that are not particularly associated with any attribute level, rather captures the variability across subjects, tasks, and objects are identified by individual's age, education level, gender, competency level (whether passed the dominant task), and perception on the difficulty level between the two question types.

As an extension of the HCL [30, 31], the standard SALC model [20] identifies differences in scale by latent groups (i.e., scale classes), but scale remains constant within each scale class. [18, 32]. A SALC model can allow heteroskedasticity by letting the scale factor vary by observable factors within each scale class (i.e., heteroskedastic SALC).

As the number of classes both for the scale and taste classes is decided prior to the analysis rather than identified from estimation, a series of classes is usually estimated, and the best-fitted model is based on statistical and substantive criteria (i.e., BIC, AIC, CAIC) [33]. However, in empirical analysis, factors like a smaller size, complexity in the model, and low efficiency may cause identification problems for a higher dimension solution with many latent classes. This study only collected 380 respondents; therefore, the SALC model includes only two taste and two-scale classes.

In order to compare these values with the original Dutch EQ-5D-5L values [21], the original values were transformed to a pits scale, and their relationship was illustrated using a scatter plot and estimates using Pearson's correlation and Lin's concordance.

Results

Demographics

After excluding the dominant pair from the PC task, the analysis included 24 PC tasks and 25 BWS tasks. In total, 385 respondents completed the questionnaires, from which five were excluded due to engaging in click-through on the PC (no variation in their responses), so subsequent analyses are based on the remaining 380 respondents. Fifty-two percent (n = 198) of the respondents were male (Table 1). Respondents were almost equally divided among the age group 16 to 35, 36 to 55, and above 55. More than half of the respondents had a middle education (n = 197) compared to thirty-five percent (n = 131) having high education. Fifty-seven percent (n = 217) reported having a chronic illness.

Table 1 Descriptive statistics sample (n = 380)

Feasibility

Thirty-one percent (n = 117) found the best–worst questions difficult, compared to thirty-eight percent (n = 146) for the PCs (Table 1). Seventy-two out of 380 respondents preferred in terms of difficulty the PC questions over the best–worst questions. Almost half of the respondents (n = 173) had no preference. Remarkably, from those indicating BWS easier than PC rated 9/380 = 2.3% the difficulty of PC lower (less difficult) than BWS; those indicating PC is easier than BWS rated 13/380 = 3.4% the difficulty of BWS lower than PC. And finally, from those indicating no preference in the easiness of BWS or PC gave 42/380 = 11.1% a different level of difficulty to the two methods. Also, 72 of the total respondents failed the dominant task.

Difference between homoskedastic and heteroskedastic results

Table 2 showed the main effect estimates of the conditional logit (CL) model and heteroskedastic conditional logit (HCL) model. The HCL model fitted better by lowering the BIC value by 1666.29 (CL BIC: 64,458.32, and HCL BIC: 62,792.03). The correlation between the 3125 values measured by the CL and HCL estimates showed a high correlation (Pearson's correlation coefficient is 0.9953 (CI: 0.9950–0.9956) and Lin's concordance correlation coefficient 0.9927 (CI: 0.9922–0.9932)) (Appendix 1 Fig. 5). In both models, one incremental coefficient is negative (i.e., the change in severity from severe to extreme under usual activity) but insignificant (CL coefficient: − 0.0003 p-value: 0.956; HCL coefficient: − 0.0011, p-value:0.878). The sequence of completing tasks has a positive effect on the scale parameter (0.8419; p < 0.001), and its square has a negative effect ( − 0.7427, p < 0.001), indicating that scale increased (i.e., less random responses) up to fourteen tasks and decreased after that (Fig. 2) with overall p-value < 0.001. Also, the effect of the PC task on the scale parameter is significantly negative ( − 0.9930, p-value < 0.001), and the effect of the best task is significantly positive (0.2424, p-value < 0.001) (Appendix 1 Table 6). Controlling heteroskedasticity had little effect on the standard errors; the standard error decreased in 8 of the 20 estimated parameters (Appendix 1 Table 6).

Table 2 Conditional, heteroskedastic, and interaction model (controlling heteroskedasticity)

Differences between the PC and BWS results

Table 2 also showed the main-effect coefficients of PC and BWS for the heteroskedastic logit model. In the PC estimates, 17 out of 20 coefficients were significant (p < 0.05); however, two coefficients were negative but insignificantly different from zero. Under BWS, 13 coefficients were significant, with one significant negative estimate. Only four coefficients have shown a significant difference by task, and the largest difference is 0.1027. Converting the 3125 EQ-5D-5L values into a pits scale, we measured the correlation between PC and BWS values. (Fig. 1). Between the two 3125 EQ-5D-5L profiles, Pearson's correlation coefficient is 0.9167 (CI: 0.9109–0.9222), and Lin's concordance correlation coefficient is 0.7658 (CI: 0.7542–0.7769). The median absolute difference in the difference between PC and BWS values has 0.0732 (interquartile range 0.0592 to 0.1565).

Fig. 1
figure 1

Scatter plot of 3125 EQ-5D-5L profiles for heteroskedastic model. *values were estimated in a pits scale where v (55555) = 0 and v (11111) = 1. 95% Confidence interval for Pearson’s correlation 0.9109–0.9222, and for Lin’s concordance: 0.7542–0.7769

Taste and scale heterogeneity

The SALC model increased model fit compared to homogeneous models by achieving the lowest BIC value (56,698.35). Table 3 showed the main-effect coefficients of the two taste classes. Taste class 1 had consistent parameters with non-negative values, and 19 of them were significant (p < 0.050). In all the attributes, changing levels from moderate to severe problems led to the greatest reduction in value. Based on this evidence, taste class 1 is referred to as a Dutch EQ-5D-5L value set on the pits scale.

Table 3 Two taste classes of the scale-adjusted latent class (SALC) model

On the other hand, taste class 2 had few significant parameters and eight inconsistent estimates. In this class, the probability of choosing the best over the worst EQ-5D-5L profile is 0.554 (Table 3), which is much smaller than the near-unanimous probability found in taste class 1 (0.998). Based on this evidence, taste class 2 is referred to as a garbage class.

Around 71% of the individuals belonged to taste class 1 and 29% in taste class 2 (Table 4). Looking at the grade-of-membership results, respondents in the garbage class are less likely to be female (odds ratio: 0.5173 95% CI: 0.3685 to 0.6661) and more likely to be younger (odds ratio: 2.4143; 95% CI: 1.6509 to 3.1777).

Table 4 Grade-of-membership (GOM) of the scale-adjusted latent class (SALC)

The scale is lower in scale class 2 than in scale class 1, which implies scale class 2 has a higher variance (Appendix 1 Table 7). In scale class 1 (less random class), the effect of the sequence of tasks on the scale has a similar pattern as in the heteroskedastic model (Fig. 2). However, the coefficient of the sequence square was not significant (Appendix 1 Table 7 and Fig. 2). The effect of task type (i.e., PC or BWS) is the same across both classes, where PC is negatively associated with scale (i.e., increased uncertainty/randomness) and the best task under BWS is positively associated with scale factor (i.e., reduce uncertainty/randomness) (Appendix 1 Table 7). However, the coefficients were only significant in scale class 1. Around 41% of the respondents belong to scale class 1 and 59% to scale class 2. Respondents in scale class 2 were more likely to fail the PC dominant task (Table 4).

Fig. 2
figure 2

Heteroskedasticity: scale by the task sequence. *Scale coefficients were transformed into the original scale

Differences between the original and new Dutch EQ-5D-5L values

Comparing the twenty main-effect coefficients estimated in this study with those of the original Dutch value set [21], the SALC coefficients had the highest correlation and agreement (Pearson’s correlation: 0.7295, CI: 0.4238–0.8860; Lin’s concordance: 0.6904, CI: 0.4098–0.8516), followed by conditional logit (Pearson’s correlation: 0.6937, CI: 0.3626–0.8693; Lin’s concordance: 0.6601, CI: 0.3554–0.8380) and heteroskedastic conditional logit (Pearson’s correlation: 0.6321, CI: 0.2632–0.8398; Lin’s concordance: 0.5817, CI: 0.2543–0.7894) (Fig. 3) [34]. Looking across the 3125 EQ-5D-5L values, the SALC values had the highest correlation and agreement (Pearson’s correlation: 0.9293, CI: 0.9244–0.9339; Lin’s concordance: 0.8835 CI: 0.8763–0.8903), followed by conditional (Pearson’s correlation: 0.9254, CI: 0.9203–0.9304; Lin’s concordance: 0.8689, CI: 0.8610–0.8764) and, heteroskedastic (Pearson’s correlation: 0.9226, CI: 0.9172–0.9277; Lin’s concordance: 0.8453, CI: 0.8364–0.8537).

Fig. 3
figure 3

Comparing estimated coefficients with the Dutch value set. *Pearson's correlation coefficient for the 20 the conditional (0.6937), heteroskedastic (0.6321), and SALC (0.7295) coefficients

Discussion

Using a population-based sample from the Netherlands, we estimated the value of EQ-5D-5L profiles by task and controlling for heteroskedasticity and heterogeneity. Apart from heteroskedasticity, identifying taste heterogeneity often becomes difficult because of its confounding nature with scales. In this paper, we estimated a heteroskedastic conditional logit and a scale-adjusted latent class model to emphasize three sources of error related to respondent behavior: (1) task sequence, (2) garbage classes, and (3) failing a PC dominance task.

First, heteroskedasticity may occur as individuals' attention span reduces doing tasks consecutively [30, 35]). Interestingly, after controlling for heteroskedasticity, only a few incremental coefficients differ significantly between BWS and PC, which suggests the tasks might be used interchangeably. Second, the members of the garbage class may be indifferent between EQ-5D-5L profiles or respond randomly (i.e., a shuffled deck)[36]. Although these motives are confounded, this class did not express relative attribute importance; therefore, their responses can be disregarded as uninformative. Lastly, respondents who failed the PC dominance task were more likely to belong to a class with a lower scale, which implies that less weight was given to their preference evidence. Overall, the SALC model adjusts the EQ-5D-5L values to better represent the tasks in the middle of the sequence and persons who did not belong to the garbage class or failed the PC dominance test. By controlling heteroskedasticity and heterogeneity, this study produced a Dutch EQ-5D-5L value set on a pits scale that is moderately concordant with the original values. The moderate agreement is in line with our expectation as the study used an online sample with smaller sample size compared to the original study.

This study has several limitations. First, the results of the estimated model were shown on a pits scale rather than on a QALY scale. Second, this study is more of an exploratory study rather than a confirmatory analysis, which complicates the interpretation of p-values or statistical inference more generally. Third, the confounding between taste and scale in choice-based analysis implies that adjusting the scale might not totally control the scale factor from preference coefficients. Also, due to the design with a constant comparator in PC tasks and relatively smaller sample size, our capability to explore heterogeneity in larger dimensions was beyond the scope. Lastly, important variables such as income and time to complete the tasks were missing in the dataset, which would have been good indicators for class membership, as shown in previous studies [18]. Given this, this study is the first attempt to explore heteroskedasticity and heterogeneity in a health valuation study and should aid others considering similar approaches. It is also worth to be mentioned that the SALC model is a certain parametrization of a particular type of disentangling taste and scale. So, the results would be dependent on that particular parametrization and require justified theoretical background.

Conclusions

In conclusion, this study suggests that proper consideration regarding the sources of variation that affect individuals' decision rules can be included to inform the model analysis in health valuation studies. Considering the demonstrated potential of the case-2 BWS task to produce similar values as of PC, this study produced a Dutch EQ-5D-5L value set on a pits scale that is concordant with the original values by controlling heteroskedasticity and heterogeneity. In order to emphasize the importance of controlling the noises in the dataset, this method should be implemented in future studies with larger sample size and with richer behavioral information.