Key Points for Decision Makers

Unstratified efficient design algorithms cannot guarantee adequate coverage of the severity range.

If health state selection bias occurs in DCE-duration studies, the derived values may be too low.

Sampling choice tasks from different severity strata is a way to prevent skewed designs and biased values.

1 Introduction

The use of the discrete choice experiment (DCE) has attracted researchers’ interest as an alternative to more conventional techniques, such as the time trade-off (TTO) method, for deriving quality-adjusted life years (QALYs) in health state valuation. One of the merits of DCE methodologies is that they improve the feasibility of valuation studies. In contrast to TTO valuations, which require organizationally complex and costly face-to-face interviews, DCE valuation surveys can be self-administered [1, 2]. However, using a DCE for valuation introduces an unusual requirement, namely that the estimated values must be anchored at 1 for full health and 0 for death. The validity of the approaches proposed to achieve this has yet to be established. Two proposed strategies are the DCE-death and the DCE-duration approaches. However, results obtained from initial applications of these methods were markedly different. For instance, Norman et al. [3, 4] showed that DCE-duration approaches consistently produce lower values than DCE-death approaches. DCEs have also produced results that diverge from those of conventional health valuation methods. Craig et al. [5] and Jonker et al. [6] reported a considerably wider value range derived from their DCE approaches (minimum values < −1.5) than the range obtained with the conventional TTO for EQ-5D (−0.594 to 1.000) [7]. Researchers now aim to understand why.

We aim to contribute to the body of knowledge on how best to implement DCE methods for health state valuation, with a focus on strategies for developing the experimental design. In this area, methodological advancements have made best practice somewhat of a moving target. On top of that, the best strategy for constructing designs may well be context dependent [8, 9]. Whereas some general considerations always apply, such as the importance of identification and statistical efficiency, other demands can be application specific. The latter may be the case in the field of health state valuation.

A popular approach for the construction of experimental designs in DCEs is the (Bayesian) efficient design approach. These designs exploit prior information to arrive at a design that produces small asymptotic standard errors. Because of the direct link between standard errors and sample size requirements, this is a desirable property [10]. Efficient designs have been frequently used in DCEs for health valuation [2, 6, 11, 12]. However, these designs are not without problems. One potential problem is that designs purely optimized for statistical efficiency can produce more difficult choice sets [13]. As a result, respondents might not always have a clear preference for any of the options, or they may be tempted to use simplifying decision rules that obscure their true preferences and cause bias [14]. A current line of research is whether such concerns can be addressed by introducing constraints in the design generation algorithms for DCEs, for example, by forcing attribute-level overlap in the constructed choice sets [15, 16]. Another potential problem is that the choice sets are not selected at random, but rather chosen to support estimation of a proposed utility function [8, 17]. The algorithm will favor choice tasks that clearly reveal attribute trade-offs and avoid strongly dominant alternatives [18]. As a consequence, each health state has a different probability of being included in the choice tasks [17]. This can cause bias if, owing to model misspecification, choices observed for the included health states fail to predict choices involving health states with a lower inclusion probability.

Currently, it is unknown whether this bias is a problem in DCEs designed to capture the value of health, but we hypothesize that it might be. Because optimization algorithms consider the level of utility balance to improve statistical efficiency [19], the fact that the DCE-death and DCE-duration approaches present respondents with very different fixed alternatives can cause different health states to be favored under each approach. To investigate the issue, we set up an experiment featuring EQ-5D-5L health states. First, we examine whether current best-practice efficient DCE designs (i.e., ‘unstratified’ efficient designs) tend to favor particular types of health states across various DCE formats. Second, we investigate the sensitivity of estimated health state values to the potentially skewed selection of health states by comparing estimates derived from unstratified designs with those from DCE designs that satisfy the requirement that the set of selected health states spans the entire severity scale (i.e., ‘severity-stratified’ designs).

2 Methods

To investigate the issues mentioned above, we proposed a strategy for generating severity-stratified designs and compared them to unstratified efficient designs on (1) the health states selected for inclusion in DCE tasks and (2) the values derived from the DCE tasks. We did this in the context of three different DCE formats: standard DCE, DCE-death, and DCE-duration. Table 1 gives an overview of the six study arms.

Table 1 Overview of the study arms

2.1 The Discrete Choice Experiment (DCE) Choice Tasks

Figure 1 provides an example of the three DCE formats. The health states were defined by the five dimensions of the EQ-5D-5L instrument: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. For each dimension, five levels are used to describe the severity of impairment in monotonic order from ‘no problems’ (level 1) to ‘extreme problems/unable’ (level 5).

Fig. 1

Presentation of choice tasks: a standard DCE; b DCE-death; c DCE-duration. DCE discrete choice experiment

The standard DCE was a forced-choice paired comparison between two health states: respondents were asked to choose between 10 years in health state A and 10 years in health state B. This task focuses on the direct trade-off between the health state attributes and produces values on a latent scale.

In the DCE-death format, each choice task had three alternatives, A, B, and C, and respondents compared A to B and B to C using a so-called ‘matched pairwise choice task’ [6, 15, 20]. The A–B comparison resembled the standard DCE described above. The next question was a forced choice between 10 years in health state B versus immediate death (i.e., B–C comparison). Each choice task thus comprises two pairwise comparisons so that the number of observations will be twice as high as in the standard DCE. However, the cognitive burden is only marginally increased, because option B appears in both comparisons and option C is fixed and easy to imagine.

In the DCE-duration format, each choice task also had three alternatives, A, B, and C, that were compared using a matched pairwise choice task. Respondents were first asked to choose between 10 years in health state A and 10 years in health state B (i.e., the A–B comparison), followed by the B–C comparison, where option C was always health state 11111 (i.e., no problems in any EQ-5D dimension) with a duration shorter than that of option B. Length of life in perfect health was restricted to 12 levels: 2, 4, and 6 months and 1, 2, …, 9 years (footnote 1).

To reduce task complexity and respondent burden, all choice tasks for the A–B comparisons included attribute-level overlap [6, 15]. For each pair of options A and B, a minimum of two out of five dimensions were presented at the same level. In addition, combinations of the first level (no problems) of usual activities with the fifth level (extreme problems) of pain/discomfort and/or anxiety/depression were avoided to make health states easier to imagine and evaluate. Lastly, intensity color coding was used to further reduce task complexity. Imposing attribute-level overlap and color coding, as well as excluding implausible states, is currently considered best practice, given the reduced dropout rates and improved attribute attendance observed in DCEs [15].

2.2 Experimental Designs With/Without Severity-Stratified Restriction

We implemented heterogeneous DCE design algorithms to create, for each study arm, a unique experimental design comprising 168 choice tasks distributed over eight sub-designs [21]. The algorithm optimizes the Bayesian D-error for the total design while simultaneously optimizing the Bayesian D-errors of each of the eight sub-designs. In essence, this strategy produces a blocked design with eight blocks, where the design within each block is optimized in addition to the overall design across blocks. A Latin hypercube sample optimized for the maximum minimum distance between points and a greedy optimization algorithm were used to optimize the weighted average Bayesian D-error, with one-third of the weight assigned to the aggregated efficiency and two-thirds to the individual efficiencies of the sub-designs. Note that the design algorithm controlled for left–right randomization of the two states by including both options A and B in comparison with option C in the Bayesian design criterion, even though only one of the two choice options was presented (in random order) to the survey respondents.
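As a sketch of the weighted criterion described above, the D-error of a conditional logit design and its weighted combination over blocks can be computed as follows. All names and the block setup are illustrative; the actual algorithm was implemented in Julia and, for the Bayesian criterion, additionally averages the D-error over draws from the prior distribution:

```python
import numpy as np

def d_error(design, beta):
    """D-error of a conditional logit design.

    design: array of shape (S, J, K) -- S choice sets, J alternatives,
            K dummy-coded attribute columns.
    beta:   prior parameter vector of length K.
    """
    S, J, K = design.shape
    info = np.zeros((K, K))
    for X in design:                       # X: one choice set, shape (J, K)
        v = X @ beta
        p = np.exp(v - v.max())
        p /= p.sum()                       # logit choice probabilities
        info += X.T @ (np.diag(p) - np.outer(p, p)) @ X
    return np.linalg.det(info) ** (-1 / K)

def weighted_criterion(blocks, beta, w_total=1/3):
    """Weighted D-error: one-third of the weight on the pooled design,
    two-thirds on the mean of the per-block D-errors (as above)."""
    pooled = np.concatenate(blocks, axis=0)
    block_part = np.mean([d_error(b, beta) for b in blocks])
    return w_total * d_error(pooled, beta) + (1 - w_total) * block_part
```

A design search would then repeatedly swap candidate choice tasks in and out, keeping swaps that lower this criterion.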

To obtain an identifiable DCE design at the individual level, each sub-design contained 21 choice tasks, equal to the number of parameters to be estimated in a main-effects model. As Bliemer and Rose [22] suggested, a DCE design optimized for a standard conditional logit model performs well for estimating panel mixed logit models. Therefore, the design was optimized for a conditional logit model, which substantially reduced the computational burden.

Whereas the full candidate set of all possible EQ-5D-5L health states (excluding 225 implausible health states) was used to optimize the unstratified DCE designs, the severity-stratified DCE designs used different candidate sets for each choice task. The creation of severity-stratified designs involved the following steps:

  1. Informative priors were used to predict latent utility values for all health states, which were subsequently used to divide the health states into 21 severity strata (i.e., 3125/21 ≈ 148 states per stratum for each DCE format, so that there were as many severity strata as choice tasks in each DCE design).

  2. A total of 225 implausible health states were removed from the full set of 3125 health states and from each of the 21 strata.

  3. For each stratum, candidate sets were constructed by creating all possible combinations of health states in the stratum with all other possible health states (i.e., 148 × 2899/2 ≈ 0.2 million).

  4. The design algorithm created a DCE design that included exactly one choice task from each candidate set in each sub-design.

Prior values used for the DCE design optimization (and thus also in step 1) were obtained from previous research based on an unstratified DCE design (unpublished to date), which included 350 Dutch respondents per DCE format. The design algorithm was implemented in Julia [23].
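The stratum construction in step 1 can be sketched as follows. The prior decrement table is entirely hypothetical (the real priors came from the earlier pilot data), and the removal of the 225 implausible states is omitted for brevity:

```python
import numpy as np
from itertools import product

# Hypothetical prior decrements per dimension (rows) and level (columns);
# level 1 carries no decrement. Values are made up for illustration only.
PRIOR = np.array([[0.00, 0.05, 0.07, 0.16, 0.20],   # mobility
                  [0.00, 0.06, 0.08, 0.15, 0.19],   # self-care
                  [0.00, 0.05, 0.06, 0.13, 0.16],   # usual activities
                  [0.00, 0.07, 0.11, 0.25, 0.33],   # pain/discomfort
                  [0.00, 0.06, 0.10, 0.22, 0.29]])  # anxiety/depression

def severity_strata(n_strata=21):
    """Split all 3125 EQ-5D-5L profiles into severity strata of ~148 states."""
    states = np.array(list(product(range(5), repeat=5)))     # 0 = level 1
    utility = 1.0 - PRIOR[np.arange(5), states].sum(axis=1)  # predicted value
    order = np.argsort(-utility)                             # mild -> severe
    return np.array_split(states[order], n_strata)
```

Each returned stratum then seeds its own candidate set of choice-task pairs, from which the design algorithm picks exactly one task per sub-design.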

2.3 Statistical Analysis

To analyze the health state preferences, a mixed logit model (footnote 2) was used. For the standard DCE, the utility of respondent i for health state j in choice task t was specified as:

$$U_{ijt} = X_{ijt} \beta_{i} + \epsilon_{ijt}$$
(1)

where \(X_{ijt}\) consists of 20 dummies for the EQ-5D-5L dimensions, with level 1 (no problems) as the reference category for each dimension. The error \(\epsilon_{ijt}\) is assumed independent and identically distributed following an extreme value distribution, and the vector of individual-specific coefficients \(\beta_{i}\) is assumed to follow a multivariate normal distribution with population mean \(\mu\) and covariance matrix \(\Sigma\), that is, \(\beta_{i} \sim {\text{MVN}}\left( {\mu ,\;\Sigma } \right)\). The same utility function was applied to the DCE-death approach; however, \(X_{ijt}\) then also includes a dummy indicating the death option.
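Under Eq. (1), the probability that a respondent chooses state A over state B can be approximated by averaging logit probabilities over simulated draws of \(\beta_{i}\). The sketch below uses illustrative inputs; the estimation in this study used the bayesm package rather than this simulation:

```python
import numpy as np

def choice_prob(x_a, x_b, mu, sigma, n_draws=5000, seed=1):
    """P(choose A over B) under a mixed logit with beta_i ~ MVN(mu, sigma).

    x_a, x_b: dummy-coded attribute vectors of the two health states.
    mu, sigma: population mean and covariance of the coefficients; here
    they would come from the fitted posterior (illustrative inputs only).
    """
    rng = np.random.default_rng(seed)
    betas = rng.multivariate_normal(mu, sigma, size=n_draws)
    dv = betas @ (x_a - x_b)                     # utility difference per draw
    return float(np.mean(1.0 / (1.0 + np.exp(-dv))))  # averaged logit prob
```

When the two states are identical, the utility difference is zero for every draw, so the probability is exactly 0.5.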

For the DCE-duration approach, the utility was specified as the function of the product of the number of life years (\(T_{ijt}\)) and its observed characteristics (\(X_{ijt}\)) and their corresponding coefficients as follows:

$$U_{ijt} = \left( {T_{ijt} X_{ijt} } \right)\beta_{i} + \epsilon_{ijt}$$
(2)

Note that \(X_{ijt}\) consists of dummies for the EQ-5D-5L instrument and an intercept with the value 1, and the coefficient for the duration main effect represents the value respondent i assigns to living in perfect health for 1 year.
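Equation (2) makes the QALY interpretation concrete: utility is multiplicative in duration, and the ratio of a state's one-year utility to the one-year utility of full health is its QALY weight. A numeric sketch with made-up coefficients:

```python
import numpy as np

def duration_utility(T, x, beta):
    """Eq. (2): utility of living T years in the state coded by x.

    x includes an intercept of 1 plus the EQ-5D dummies; the intercept
    coefficient is the value of one year in full health. The coefficients
    below are made up purely for illustration.
    """
    return T * float(x @ beta)

beta = np.array([1.0, -0.2, -0.3])       # intercept + two dummy decrements
full = np.array([1.0, 0.0, 0.0])         # full health: dummies all zero
impaired = np.array([1.0, 1.0, 1.0])     # a hypothetical impaired state
# QALY weight of the impaired state: (x . beta) / beta_intercept = 0.5
```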

The specified models were estimated using Bayesian Markov chain Monte Carlo (MCMC) methods as implemented in the R package bayesm [25]. Gibbs sampling was used to update \(\mu\) and \(\Sigma\), and a Metropolis–Hastings algorithm was used to update \(\beta_{i}\). A multivariate normal prior (with a mean of zero and a variance of 100·I) was used for \(\mu\), and an inverse Wishart prior (with degrees of freedom \(\nu\) equal to the dimension of \(\Sigma\) plus 3, and location parameter \(\nu I\)) was used for \(\Sigma\). Mean posterior estimates and 95% credible intervals were calculated after thinning the MCMC draws, retaining every fifth of 100,000 iterations. Convergence was established by visual inspection of the chains and by the convergence diagnostics implemented in the R package CODA [26].

To test the hypotheses, the values for health states derived from the DCE-death and DCE-duration approaches were rescaled onto the QALY scale, where death has a value of 0 and full health a value of 1. To rescale the values, we divided the EQ-5D-5L parameters by the absolute value of the parameter for ‘death’ for DCE-death, and by the parameter for ‘duration’ for DCE-duration, for each draw of the posterior distribution of parameters. Next, the hypothesis that efficient design algorithms for the DCE-death and DCE-duration approaches tend to choose health states from a skewed severity range was tested by comparing the distribution of values between designs with and without severity stratification. For the hypothesis regarding the sensitivity of extrapolated health state values to the selection of health states, differences in values for the same health states between the designs with and without severity stratification were examined.
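The draw-by-draw rescaling can be sketched as follows; the column index of the anchor parameter is illustrative, not the actual layout of the estimated parameter vector:

```python
import numpy as np

def rescale_draws(draws, anchor_col, mode):
    """Rescale posterior draws onto the QALY scale, draw by draw.

    draws: (n_draws, K) array of posterior parameter draws; anchor_col
    indexes the 'death' dummy (DCE-death) or the duration main effect
    (DCE-duration). Indices are hypothetical for this sketch.
    """
    draws = np.asarray(draws, dtype=float)
    anchor = draws[:, anchor_col]
    if mode == "death":
        scale = np.abs(anchor)        # divide by |beta_death|
    elif mode == "duration":
        scale = anchor                # divide by beta_duration
    else:
        raise ValueError(f"unknown mode: {mode}")
    return draws / scale[:, None]     # rescaled per draw, then summarized
```

Summary statistics (posterior means, 95% credible intervals) are then taken over the rescaled draws, so the anchoring uncertainty propagates into the reported values.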

As DCEs aim to predict the choice probabilities of alternatives within given choice sets, we compared the predictive performance of estimates from the severity-stratified designs with that of the unstratified designs using mean errors (MEs), that is, the average deviation of the predicted choice probability of a health state from the observed choice probabilities in each study arm. We used MEs to examine the direction of the bias that the estimates of each study arm produced (footnote 3). Specifically, when impaired health states are compared with the death or perfect health alternatives, positive (negative) MEs for the impaired health state suggest that the model of that study arm undervalues (overvalues) the disutility of impaired states, so that it over-predicts (under-predicts) the choice probability of living in the impaired health condition relative to the actual observations. Cross-validation of the MEs was done by applying the valuation function obtained in one study arm to the data of the other study arm of the same DCE format. The posterior predictive choice probability distribution was obtained by simulating mixed logit probabilities for each sample of the parameters in the posterior distribution, from which the distribution of MEs was inferred. Whether MEs differed significantly from zero was determined from the 95% credible intervals of the distribution of MEs.
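The mean signed error used here is simply the average gap between predicted and observed choice probabilities; a minimal sketch with illustrative inputs:

```python
import numpy as np

def mean_error(pred_probs, chosen):
    """Mean signed error: average (predicted - observed) choice probability.

    pred_probs: model-predicted probability of choosing the impaired state
    in each task; chosen: 1 if the respondent actually chose it, else 0.
    A positive result means the model over-predicts choosing the impaired
    state (i.e., it undervalues that state's disutility).
    """
    pred_probs = np.asarray(pred_probs, dtype=float)
    chosen = np.asarray(chosen, dtype=float)
    return float(np.mean(pred_probs - chosen))
```

Computing this over draws from the posterior predictive distribution yields a distribution of MEs, from which the 95% credible interval is read off.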

2.4 Data Collection

The fieldwork was undertaken by Survey Sampling International (SSI) through an online platform during 2 weeks in December 2015. The target sample size was 3000 respondents (i.e., 500 respondents per study arm), representative of the Dutch general population with regard to age, gender, and education. Respondents were recruited from SSI’s online panel, which contains panelists representative of the population aged 15–65 years plus as many panelists aged over 65 years as needed to resemble a nationally representative sample as closely as possible. All respondents who gave consent for participation were asked about their demographics to enable stratification of the sample and were randomly assigned by SSI’s survey management software to one of the six study arms and to one of the eight sub-designs within that arm. After receiving information about the EQ-5D-5L instrument, respondents completed the 21 choice tasks in random order. A total of 693 respondents who did not complete the tasks were excluded from the analysis. The average response time was 27 min, although half of the respondents completed within 10 min.

3 Results

Table 2 shows the background characteristics of respondents. Respondents’ characteristics are comparable to those in the Dutch population, and no significant imbalance in respondents’ characteristics between unstratified and severity-stratified designs was found.

Table 2 Descriptive statistics of respondents

Figure 2 and Table 3 show the distributions of the modeled values for all EQ-5D-5L health states. As shown in Fig. 2, the distribution of states included in the design more closely followed the distribution for all health states when the severity stratification was applied, for all DCE formats. This is most apparent for the DCE-duration format, where the unstratified design has a much more skewed distribution than the severity-stratified design.

Fig. 2

Comparison of distributions of health state values between designs with and without severity-stratification. Distribution of modeled values for all possible EQ-5D health states (red bars) and modeled values for EQ-5D health states included in the designs (black bars). Health state values are on latent utility scales for the standard DCE (a), while they are on QALY scales for DCE-death (b) and DCE-duration (c). DCE discrete choice experiment, QALY quality-adjusted life year

Table 3 Distribution of health states selected for the designs over severity strata

For DCE-duration, the mean and the standard deviation (SD) of the distribution of health state values included in the unstratified design (i.e., the black bars) were 0.31 and 0.44, respectively, whereas those in the severity-stratified design were 0.09 and 0.35. A similar effect was hypothesized to exist in the DCE-death approach, but no strong evidence was found (mean 0.41 and SD 0.30 for the unstratified design; mean 0.40 and SD 0.26 for the severity-stratified design).

Table 4 shows the parameter estimates and corresponding 95% credible intervals for all six study arms. Almost all estimates are statistically significant, and all models resulted in logically consistent parameter estimates in the sense that worse levels of impairment are associated with larger utility decrements.

Table 4 EQ-5D parameter estimates with 95% credible intervals on QALY scales for 6 study arms

For the standard DCE, estimates in Table 4 are expressed on the latent utility scale, and therefore the obtained parameter estimates cannot be directly compared to the ones obtained in the other arms. However, the difference in scale between the unstratified and severity-stratified designs is very small, as can be seen from the values for state 55555 (the worst EQ-5D-5L state) and the fact that the 95% credible intervals overlap for all parameters when comparing the models from both designs. Similar results were observed for the DCE-death estimates on the QALY scale. For DCE-duration, estimated values for state 55555 are different and 95% credible intervals for several parameters (i.e., level 4 of ‘Mobility’ and level 5 of ‘Self-care’ and ‘Anxiety/depression’) do not overlap when comparing the unstratified design with the severity-stratified design.

Figure 3 shows scatter plots for each DCE format, comparing the values obtained by the designs with and without severity stratification. For the standard DCE and DCE-death formats, estimated values based on the severity-stratified design are close to those of the unstratified design. However, for the DCE-duration format, health state values of the severity-stratified design are higher than those of the unstratified design, especially on the range of states that are worse than death. The proportion of health states considered worse than death among 3125 health states was 56.0% for the unstratified design versus 42.8% for the severity-stratified design.

Fig. 3

Comparison of values for all EQ-5D health states between designs with and without severity-stratification. The 45° line is omitted from the graph on the left, which shows the impact of the severity-stratified restriction in the standard DCE choice task, because both sets of values are on a latent scale and adding a 45° line might be misleading as a basis for comparison. DCE discrete choice experiment

Table 5 shows MEs by study arm to compare the in-sample and out-of-sample forecasting accuracy of the severity-stratified design with those of the unstratified design. That is, column 4 shows the ‘unstratified’ model predicting the ‘unstratified’ observed choice probabilities; column 5 shows the ‘severity-stratified’ model predicting the ‘unstratified’ observed choice probabilities; column 6 shows the ‘unstratified’ model predicting the ‘severity-stratified’ observed choice probabilities; column 7 shows the ‘severity-stratified’ model predicting the ‘severity-stratified’ observed choice probabilities.

Table 5 Mean signed errors for predicting choice probability

For DCE-death and DCE-duration, MEs were computed separately by severity range: bad, medium, and good health states, defined as QALY ≤ 0, 0 < QALY ≤ 0.5, and QALY > 0.5, respectively. In addition, the two comparisons within each choice task were reported separately: choice probabilities of impaired health states in A–B comparison tasks and in B–C comparison tasks.

When we computed MEs across all health states, all six study arms produced insignificant MEs that were very close to zero, because positive and negative errors offset each other. However, when we divided health states into severity ranges, some errors were found to be significantly different from zero. An expected result was that the out-of-sample predictions were more likely to show significant errors than the in-sample predictions, regardless of whether the severity stratification was applied. Beyond that, we found few noticeable differences between designs in most cases. However, for the DCE-duration format, the unstratified design produced significant negative errors for bad health states (i.e., column 4, italicized), especially on B–C tasks, while the errors in the severity-stratified design were not significant (i.e., column 7, italicized). Also, for B–C tasks, the out-of-sample predictions produced by the severity-stratified design (0.0028) were much better than those of the unstratified design (−0.0559), suggesting that the latter significantly overestimated the willingness to trade off life years to avoid bad health states. These results suggest that the skewed health state selection in the DCE-duration format introduced a downward bias on the estimated values.

4 Discussion

This paper investigated the effect of imposing severity stratification on Bayesian D-efficient DCE designs created for valuing health. We found that imposing severity stratification on DCE-duration was required to ensure that the selected set of health states covered the severity range well. The model estimates derived from the severity-stratified design also demonstrated better predictive performance than those from unstratified designs, especially regarding the choice probability of bad health states, preventing a downward bias on the values for poor health states. In the other investigated DCE types, we found less evidence of favoritism in the selection of health states, and imposing severity stratification had no substantial effect on values. The results suggest that efficient design algorithms need to be implemented carefully in the context of DCE-duration studies for health valuation.

It is instructive to reflect on the reasons why it matters so much to impose severity stratification on an efficient design algorithm used to construct a DCE with duration for health valuation. The low accuracy of predicted values of poor states based on a pro-mild set of health states reveals an extrapolation issue. Extrapolation per se does not cause a bias; it only does so when the model is misspecified. Hence, our findings indicate that the model was misspecified and that we can mitigate this problem by better spreading the data, thus ensuring that the resulting QALY tariffs are less affected by extrapolation. In particular, the DCE-duration model seems to be sensitive to the assumptions made regarding duration preferences, as immediate death is not included so that the anchor point for the QALY scale is completely defined by extrapolation. The efficient optimization of the DCE design with a fixed (perfect health) comparator has aggravated the extrapolation problem, because it is efficient to include a skewed selection of relatively healthy health states. This reflects the special characteristic of DCE-duration models that utilities are derived using a multiplicative utility function with life years acting as a multiplier of the health state utility. Issues with utility dominance may arise in this context more easily than in standard applications of DCEs.

A limitation of this study is that it was beyond its scope to explore the extent to which our results are specific to the matched pairwise choice format used here. Having full health as a fixed alternative and the relatively long duration (i.e., 10 years) assumed for the impaired health states might have exaggerated the issues that led to skewed selection of health states. Furthermore, we have not considered the merit of efficient designs in this context relative to other design-generating approaches that do not require strategies to enhance the spread of the data. The need to impose severity stratification makes the construction of efficient designs for DCE-duration studies more difficult, and hence may influence the trade-off between the pros and cons of efficient versus other designs. Third, we did not find evidence of health state selection bias in the DCE-death approach, but we do not know whether this result holds when valuing health states derived from other descriptive systems (e.g., disease-specific ones, where the mass of health states may lie at a different location on the full health–dead scale). Fourth, assuming a normal distribution for the parameters may be inappropriate for specifying monotonic attribute-level effects because of its unbounded nature; a more flexible distribution with a fixed bound could be used to avoid potential violations of monotonicity. Last, we measured respondents’ preferences over length of life using both months and years as the temporal unit for the perfect health state and converted months to years in the analysis. However, respondents may not treat a value expressed in months and the equivalent number of years in the same way when valuing health states; caution is therefore warranted in future studies [28].

5 Conclusion

We conclude that differences in how well selected health states span the severity range can explain part of the differences in values across DCE (duration) studies. Imposing ‘severity stratification’ on DCE-duration designs ensures robustness of the results against extrapolation from a misspecified model. Until we know how widespread associated extrapolation issues are in reported value sets, we need to be careful in the use of DCE-derived health state values.

Data Availability Statement

The datasets generated for and/or analyzed during the current study are available from the corresponding author on reasonable request.