Elicitation and modelling of imprecise utility of health states

Utilities of health states are often estimated to support public decisions in health care. People’s preferences may be imprecise, for lack of actual trade-off experience. We show how to elicit the utilities accounting for imprecision (using fuzzy sets), discover the main drivers of imprecision, and compare several approaches to modelling health state utility data in the fuzzy setting. We extended the time trade-off (TTO) questionnaire to elicit utilities of states defined in the EQ-5D-3L descriptive system (health described by five dimensions) in 184 respondents. Our study demonstrates that respondents are capable of assessing their own imprecision and that rigorous mathematical modelling is possible. The imprecision is larger than inferred from the standard TTO method and is larger than the estimation error, even in our smallish sample. Non-trading in TTO often results from imprecision, rather than lexicographic preferences for longevity over quality. People are especially imprecise in assessing the impact of usual activities (one of the dimensions) on utility; also, the internal inconsistency of a health state increases the imprecision. The fuzzy least squares method seems best suited to assign disutilities to individual dimensions, while separately modelling the location of utility and the amount of imprecision seems best to produce value sets. If crisp parameters are estimated, accounting for imprecision changes the results little.


Introduction
Imagine you may live for 10 years in a wheelchair or T years in full health (both followed by sudden and painless death). What T would make you exactly indifferent? Would the choice be obvious if one extra day was added to such T, or a month, or a year?
In ordinary life, people make no explicit choices between health states or between health-related quality and longevity of life. Hence, their preferences may not be precisely established. Still, in thought experiments, trade-offs are used to assign utilities to health states, to subsequently support the appraisal of health technologies (e.g. Xie et al. 2016; Shiroiwa et al. 2016). In the time trade-off (TTO) method, we elicit the time, T, that makes the respondent indifferent between T years in full health and 10 years in a hypothetical state; the first paragraph demonstrated how difficult that may be.
Imprecision of preferences is present in many decision contexts and can explain observed behaviours, e.g. some violations of the standard decision theory (Cubitt et al. 2015; Loomes 2007, 2011). Forcing respondents to answer may increase the dropout rate or reduce data quality (Stieger et al. 2007; Décieux et al. 2015). Additionally, when the actual value is not known precisely, how the question is framed may change the outcome. In TTO, a pre-defined sequence of values of T is used; imprecise preferences may cause the respondent to accept the first in a range of indistinguishably good answers, not the single best T. Thus, the sequence of values used may change the outcome. If it is imprecision that is responsible for this satisficing-like behaviour (see Simon 1956), then simply increasing the motivation of the respondents may not suffice; rather, they must be helped to understand their own preferences, e.g. via focus group discussions prior to the elicitation task. Also, the effect of imprecision may not decrease with sample size: if the sequence of values of T results in the lower end of the range being selected more often (e.g. in a bottom-up approach), then using more respondents will not help. Hence, getting respondents to understand their own preferences better may outweigh getting more respondents.
That satisficing prevails in TTO can be seen from the indifference being found at all: a non-obvious fact, given the limited number of values of T offered and the continuum of possible utility values. Whether the rejected values of T provide useful information (on what was not satisfactory enough) was tested by Ramos-Goñi et al. (2017) using interval regression. Still, the authors use a standard TTO protocol and assume that the disutilities of health state characteristics are regular numbers, implicitly assuming the preferences are precise.
In the present paper, we address the imprecision of health preferences and contribute to the literature in two ways. Firstly, we allow the respondents to directly report their TTO answers as imprecise, by asking them to indicate (1) the range of values they consider as equally plausible as the standard, single-number outcome T and (2) the range of values they consider somewhat plausible (i.e. not entirely precluded). To define health states, we use the EQ-5D-3L descriptive system (Brooks and De Charro 1996): a health state is described by five dimensions (hence, 5D), namely mobility (MO), self-care (SC), usual activities (UA), pain/discomfort (PD), and anxiety/depression (AD), each set on one of three levels (hence, 3L; 1 = no problems, 3 = extreme problems; see Table 1 in Supplementary Material for exact wording). Then, we may see how the characteristics of a health state are associated with the amount of imprecision, how well the imprecision explains the inconsistencies in TTO answers, and how much a path of rejected values of T reveals about the degree of imprecision.
Secondly, we test several modelling approaches to assign disutilities to individual dimensions, accounting for the fact that TTO answers are imprecise, but also that the disutilities should be treated as imprecise numbers. To enable mathematical rigour in modelling, we treat utilities as fuzzy sets (a notion introduced by Zadeh 1965). Our results show, e.g. that UA adds most imprecision to the valuation results, a factor that could be considered when working on a descriptive system.
Representing the imprecision as fuzzy sets/numbers has been used before in the health preference context. Jakubczyk and Kamiński (2017) showed how to treat the willingness to pay/accept for health as fuzzy. Understanding the imprecision of health state utilities could be yet another factor to consider in sensitivity analysis in health technology assessment. Modelling these two forms of imprecision mathematically may make it easier to combine them in a meaningful way.
Regarding health state utilities, Jakubczyk et al. (2018b) estimated the disutilities of the EQ-5D-5L system (five levels per dimension) as fuzzy based on standard discrete choice experiment (DCE) data. In their study, the data came as choices between paired health states with duration; hence, the disutility of a state was not directly observed, contrary to the present study. Jakubczyk and Golicki (2018) estimated the disutilities as imprecise based on a path of answers in standard TTO. In accounting for the path, their study was similar to Ramos-Goñi et al. (2017), but the novelty lay in estimating the disutilities as imprecise. The limited amount of information derived from the path resulted in a very large error of estimates. As we show in the present paper, using a direct assessment of the imprecision from the respondents results in much narrower ranges, even in a smaller sample.
The structure of the paper is as follows. Because multiple analyses were performed, we did not distinguish the methods and results sections; instead, we split the paper by the area of analysis. In the next section, we present the questionnaire design and the characteristics of the sample used. Then (Sect. 3), we discuss the raw results of the questionnaire. In Sect. 4, we analyse the determinants of the amount of imprecision. Section 5 is central to the paper: we discuss how the disutility is shaped by the worsening in individual dimensions in the EQ-5D-3L descriptive system. The main results are further discussed in the last section.

Sample and questionnaire design
The experimental questionnaire consisted of three parts: (1) background questions, (2) own health valuation, (3) extended TTO experiment. In the first part, the respondents stated their gender (56% men), age (22.3 years on average, range 19-28), university and field/year of study (a convenience sample of 184 Polish students was recruited), and past experience with an illness (own or in a family; 51.7% had one).
To familiarize the respondents with assessing imprecision, we asked them to evaluate how convinced they were about selecting a given level in individual dimensions (from 0 to 100) [2]. The respondent also provided a range of values representing answers equally plausible as the single VAS score (henceforth, epas) and a range of somewhat plausible answers (henceforth, spas). Specifically, for epas the respondents were asked for a range of answers that seem just as good as the single one given (i.e. the range within which they cannot single out the most appropriate answer). For spas, the respondents were asked for a range of answers that are not precluded (the range outside of which the answer surely is not located). In both cases, the respondents were required to provide two numbers defining the ends of the range (without any particular elicitation protocol). The detailed results for epas/spas for VAS are in the Supplementary Material.
In the third and main part of the questionnaire, the respondents answered 13 TTO questions. The first three were warm-up tasks (living in a wheelchair, a mild state 22111, and a severe state 33333), to introduce the TTO; hence, no ranges were asked for. A standard procedure was used [3]. If for state Q the procedure ended with T, we inferred that the utility amounts to u(Q) = T/10. If a state was found worse than dead (WTD), then the lead-time version of TTO (LT-TTO) was used (Devlin et al. 2011): 10 years of full health followed by 10 years in Q were compared with T years of full health; then, for the result T we concluded u(Q) = T/10 − 1.
The remaining ten TTO experiments constituted the main part of the study. They differed from the standard TTO: when T was identified, the respondent was asked to provide the ranges of epas and spas, i.e. of values of T that seemed equally or somewhat plausible, respectively. The respondent was asked for ranges also in LT-TTO; however, the ranges were confined to either TTO or LT-TTO. The bounds of the epas/spas ranges can be used to calculate the analogous ranges for the utility, with formulas as in the previous paragraph.
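The two conversion rules, and their application to the bounds of a reported range, can be sketched as follows. This is only a minimal illustration: the function names and the `worse_than_dead` flag are hypothetical, and the 10-year horizon is taken from the description above.

```python
HORIZON = 10.0  # years spent in the valued state, as described in the text


def tto_utility(t_years, worse_than_dead=False):
    """Convert the indifference time T (in years) into a utility.

    Standard TTO:  u(Q) = T / 10
    Lead-time TTO: u(Q) = T / 10 - 1 (state found worse than dead)
    """
    if not 0 <= t_years <= HORIZON:
        raise ValueError("T must lie between 0 and 10 years")
    u = t_years / HORIZON
    return u - 1 if worse_than_dead else u


def range_to_utility(t_low, t_high, worse_than_dead=False):
    """Map the bounds of an epas/spas range of T onto the utility scale.

    Both bounds use the same rule, because a range is confined to
    either TTO or LT-TTO.
    """
    if t_low > t_high:
        raise ValueError("lower bound exceeds upper bound")
    return (tto_utility(t_low, worse_than_dead),
            tto_utility(t_high, worse_than_dead))
```

For instance, a respondent settling at T = 5.5 with an epas range of 5 to 6.5 years would be assigned a point utility of 0.55 and an epas range of roughly 0.5 to 0.65 on the utility scale.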
In the main part, 17 health states were used and grouped in two blocks of 10 states (3 states in both blocks). The first block contained 11112, 11211, 12111, 21121, 11312, 32211, and 32313; the second block contained 11121, 21111, 11113, 11131, 11133, 13311, and 32223; health states 11311, 22222, and 23232 were in both blocks. Exactly half of the interviews used each block.
[2] Other versions were tested: e.g. a Likert scale was used to measure the conviction, but respondents frowned at the option 'I disagree' with the just-selected option.
[3] First using 10 years, then 0 and 5 years, and then 6, 7, … or 4, 3, …. Half a year was added if needed.
In the design, milder states were preferred, due to foreseen difficulties with assigning ranges for WTD states. The health states were selected starting from a set of 17 states suggested by Macran and Kind (2000) (see also Lamers et al. 2006). We removed 33323 and 33333 (the latter was used in warm-up), because milder states were preferred. Instead, 21121 and 11311 were introduced, to test how imprecision is aggregated when health problems in different dimensions are combined (e.g. how the ranges for 11112, 11311, and 11312 relate). All the states were split by trial and error, so that the blocks are similar in average misery index (MI, the sum of levels; 8.3 and 8.2) and every level of every dimension is represented in each block.
Nine interviewers (graduate students) were recruited and first took the test questionnaire themselves. This testing helped to frame the questions and showed that ten states should not be exceeded due to fatigue. A convenience sample of the interviewers' student colleagues was used. Before the interview, the respondent gave written consent to participate. The interviews were face to face in a quiet environment, with no computer assistance, but using printed cards to illustrate the TTO and printed health state descriptions. Going back was not allowed (e.g. changing previous valuations upon seeing a subsequent state), to make the process more tractable.

Data cleaning and raw results
The data cleaning and quality checks are described in the Supplementary Material. What is important for the present study is that in just three cases the respondents explicitly mentioned after the interview that they had found it difficult to provide answers for ranges, and in seven more the interviewers had the impression that the respondents had found it difficult. This problem was no more prevalent than claiming that state descriptions are unrealistic/internally inconsistent or not precise enough to imagine (10 respondents).
The internal consistency was verified using Pareto dominance, i.e. by checking whether a given respondent assigned greater utility to an objectively worse state (having at least one dimension worsened, and no dimension improved). In total, 129 such Pareto violations were identified (out of 184 respondents, all of whom valued 10 states, with a total of 3114 Pareto-comparable pairs of states), i.e. 4.1%. A violation was considered serious when the utilities differed by at least 0.5 (as in the EQ-5D-3L valuation in Poland by Golicki et al. 2010). Only 15 serious violations were present, and only two respondents had two of them (i.e. two pairs of states wrongly ordered). As Golicki et al. (2010) only removed observations if ten or more serious inconsistencies were present (out of 23 states per respondent), we did not remove any data in the present study.
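The Pareto check described above can be sketched in a few lines. The function names and the example utilities are hypothetical; only the dominance rule and the 0.5 threshold for a serious violation come from the text.

```python
def dominates(better, worse):
    """True if `worse` has at least one dimension at a higher (worse) level
    than `better` and no dimension at a lower (better) level."""
    pairs = [(int(b), int(w)) for b, w in zip(better, worse)]
    return all(b <= w for b, w in pairs) and any(b < w for b, w in pairs)


def pareto_violations(utilities, serious_gap=0.5):
    """Find Pareto violations in one respondent's answers.

    `utilities` maps EQ-5D-3L state codes (e.g. "11312") to point utilities.
    Returns tuples (better_state, worse_state, is_serious).
    """
    found = []
    for better, u_better in utilities.items():
        for worse, u_worse in utilities.items():
            if dominates(better, worse) and u_worse > u_better:
                found.append((better, worse, u_worse - u_better >= serious_gap))
    return found


# Hypothetical answers of a single respondent
answers = {"11112": 0.80, "11311": 0.95, "11312": 0.90}
violations = pareto_violations(answers)
```

Here 11312 is objectively worse than 11112 yet received a higher utility, so one (non-serious) violation is reported; 11112 and 11311 are not Pareto comparable and are never flagged.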
In 51 cases, the (non-serious) logical inconsistencies were explained away by the fuzzy approach in the sense that the epas of one state contained the utility of the other. In 65 cases (50.4%), the epas had a non-empty common part. In 83 cases (64.3%), the spas ranges overlapped; hence, there was no Pareto violation in the interval order sense. Notice that due to the methodology, the ranges could not overlap if baseline utilities differed in sign; allowing the epas/spas to cross zero might further reduce the number of inconsistencies. Tables 1 and 2 present the descriptive statistics for the tested health states with respect to the point valuation and epas (see Table 2 in Supplementary Material for spas; abbreviations used in the tables: NT non-traders, WTD worse than dead, CI confidence interval, SD standard deviation). As can be seen in Table 2, the average length of epas is typically two to five times larger than the standard error of the mean (SEM) of the utility. In only about 20% of cases did the epas have zero length, suggesting there usually is some imprecision regarding utility. In about 30% of cases, the ranges (epas and spas) were equal, suggesting that the respondents in most cases were capable of differentiating between the two.
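The three ways in which an apparent violation may be reconciled with imprecise answers (containment of the other utility, overlapping epas, overlapping spas) reduce to simple interval checks. The helper names and the numbers below are hypothetical and serve only as an illustration.

```python
def contains(interval, value):
    """True if `value` lies within the closed interval (low, high)."""
    low, high = interval
    return low <= value <= high


def overlap(a, b):
    """True if two closed intervals share at least one point."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def explained_by_epas(epas_a, u_a, epas_b, u_b):
    """True if the epas of one state contains the point utility of the other."""
    return contains(epas_a, u_b) or contains(epas_b, u_a)


# Hypothetical pair of valuations on the utility scale
epas_better, u_better = (0.60, 0.80), 0.70
epas_worse, u_worse = (0.70, 0.90), 0.75
```

In this example, the worse state was valued higher (0.75 vs 0.70), but its utility lies within the epas of the better state, and the two epas ranges overlap.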
The ranges only included zero in 73 cases (4%) for spas and 35 cases (1.9%) for epas, which suggests that forcing epas/spas on one side of 0 was not a major limitation. More frequently (77 cases), the baseline utility amounted to −1, the minimal value allowed in LT-TTO, possibly leading to censored observations.
Thanks to the pre-defined protocol, we know what values respondents rejected during the TTO task (e.g. if the respondent settled for T = 5.5 years, they must have previously rejected 10, 0, 5, and 6 years). In as many as 56% of the valuations, at least one of the rejected values belonged to the epas range, and in 15.6% of all valuations it was in the interior of epas. The proportions amounted to 74.7% and 42.8% for spas, respectively. Thus, rejecting a value in the course of TTO is not a valid indicator of clear preference. We also checked how certain the non-traders (those not trading off any life years in TTO, i.e. finishing with utility equal to 1) were about assigning the maximal utility. Out of the tasks resulting in non-trading, we measured the percentage of situations where the lower end of spas/epas did differ from 1, i.e. allowed some trading-off: it happened in 34.7% and 46.2%, respectively.
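Checking whether a rejected value falls within a reported range, or strictly inside it, can be sketched as below. The function name is hypothetical, the list of rejected values is taken from the T = 5.5 example in the text, and the epas bounds are made up for illustration.

```python
def rejected_in_range(rejected, low, high):
    """Return (in_range, in_interior) for a list of rejected T values.

    in_range:    some rejected value lies within the closed range [low, high]
    in_interior: some rejected value lies strictly between low and high
    """
    in_range = any(low <= t <= high for t in rejected)
    in_interior = any(low < t < high for t in rejected)
    return in_range, in_interior


# Respondent settled at T = 5.5 after rejecting 10, 0, 5, and 6 years
rejected = [10, 0, 5, 6]
flags = rejected_in_range(rejected, low=5.0, high=6.5)  # hypothetical epas bounds
```

With these hypothetical bounds, both flags are true: 5 lies on the boundary of the range and 6 in its interior, even though both values were rejected during the TTO task.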

What drives the imprecision
The lengths of epas/spas differ between health states (see Table 2 and Table 2 in Supplementary Material). In Supplementary Material, we show that the imprecision is subadditive, i.e. the length of epas/spas for the compound state (e.g. 11133) is smaller than the sum of the lengths of the building states (i.e. 11131 and 11113). Therefore, for epas and spas separately, we built three types of models explaining their length. The first used only the dummy variables for dimensions/levels, as in standard disutility modelling, and served as a benchmark. The second also used five derived variables, to account for subadditivity: MI minus 5 (to measure the departure from 11111), the maximal level across dimensions (maxLvl) minus 1, the number of dimensions at maxLvl (maxCnt) minus 1, the minimal level (minLvl), and the count of dimensions at minLvl (minCnt). The third also used the estimated point disutility, 1 − û (panel random effects model, A1 in Table 4).
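The five derived variables follow directly from the state code. A minimal sketch, with hypothetical key names, could look as follows.

```python
def derived_variables(state):
    """Compute the derived regressors for an EQ-5D-3L state code such as "11133"."""
    levels = [int(ch) for ch in state]
    max_lvl, min_lvl = max(levels), min(levels)
    return {
        "MI_minus_5": sum(levels) - 5,               # departure from 11111
        "maxLvl_minus_1": max_lvl - 1,               # worst level across dimensions
        "maxCnt_minus_1": levels.count(max_lvl) - 1, # dimensions at the worst level
        "minLvl": min_lvl,                           # best level across dimensions
        "minCnt": levels.count(min_lvl),             # dimensions at the best level
    }


example = derived_variables("11133")
```

For 11133 the misery index is 9, so MI minus 5 equals 4; the maximal level is 3 and occurs twice, while the minimal level is 1 and occurs three times.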
In the first type of model, the statistically insignificant variables were removed one by one (including the intercept, if needed). Then, in the other types, the remaining variables were added to the pool of regressors and removed if insignificant. The previously removed dummies for dimensions/levels were tested again, in the widened set of regressors. Panel random effects models were applied; robust variance-covariance matrices were used to determine significance. Table 3 shows the results. For epas, the third type results in 1 − û being insignificant; hence, this model was not constructed. UA2 is present in all models with a positive coefficient, and UA3 in both models using only dimension/level dummies. This fact suggests that imprecision is especially driven by the UA dimension. UA is followed by AD in driving the imprecision in valuation. We find it difficult to interpret the negative sign of PD2. Perhaps this result is an artefact of the selection of states and the subadditivity (this term disappears in models with derived variables).
Increasing the maximal level increases the imprecision; so does adding dimensions at this maximal level: there is more imprecision for severe states. The negative sign of minLvl can be interpreted: an increase in minLvl decreases the spread between the dimensions and apparently makes it easier to assign a utility. At the same time, the positive sign of minCnt confirms this: the more dimensions are at the minimal level, the more internally diverse the health state is; hence, there is larger imprecision. That it is more difficult to precisely value an internally inconsistent state may explain why the imprecision is reduced when 11133 is considered as compared to 11131 and 11113.
Adding û yields a less interpretable model: severity of a health state increases imprecision via maxLvl but decreases it via disutility, 1 − û.
To study how the disutility is shaped by the worsening in individual dimensions, we build three classes of models (in each, several specifications are used). First, in the crisp-crisp models we neglect the imprecision altogether: both the dependent variable and the parameters are regular (crisp) numbers (the independent variables are necessarily crisp in all the approaches). This approach serves as a benchmark.
In the fuzzy-crisp models, we account for the imprecision of the TTO outcome, but estimate crisp model parameters. This approach serves to build a standard value set, making it easier to support decision making. Finally, in the fuzzy-fuzzy models, we estimate the model parameters (i.e. the disutilities of individual dimensions) as imprecise.
The crisp-crisp models use standard econometric methods and are only briefly presented in Sect. 5.1. Introducing the other classes requires some information on how imprecision is modelled with fuzzy sets, as done in Sect. 5.2. The multitude of approaches used serves to see which modelling technique is most suitable for health state utility data.

Crisp-crisp models
As usually done for convenience, we model as the dependent variable the disutility of a health state: 1 − u(Q). We use as independent variables the dummies, $d_{i,j}$, denoting if dimension i is at level j; the corresponding parameters, $\alpha_{i,j}$, are disutilities of worsening dimension i to level j (from level 1). To make interpretation easier, in tables the parameters are denoted with the dimension abbreviation and level number, e.g. MO2 instead of $\alpha_{1,2}$. Also a constant term, $\alpha_0$, is used in some specifications.
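The dummy coding and the additive structure of the disutility can be illustrated with a short sketch. The coefficient values below are hypothetical placeholders, not estimates from any table; the function names are also assumptions made for this example.

```python
DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")


def dummies(state):
    """Dummy variables d_{i,j}: 1 if dimension i is at level j (j = 2, 3)."""
    return {
        f"{dim}{level}": int(int(ch) == level)
        for dim, ch in zip(DIMENSIONS, state)
        for level in (2, 3)
    }


def disutility(state, alpha, alpha0=0.0):
    """Theoretical disutility: alpha_0 plus the sum of alpha_{i,j} * d_{i,j}."""
    d = dummies(state)
    return alpha0 + sum(alpha[name] * d[name] for name in d)


# Hypothetical coefficients: 0.1 for every level 2, 0.3 for every level 3
alpha = {f"{dim}{level}": (0.1 if level == 2 else 0.3)
         for dim in DIMENSIONS for level in (2, 3)}
value = disutility("21131", alpha)  # MO2 + PD3
```

For state 21131 only the MO2 and PD3 dummies equal 1, so with the placeholder coefficients the disutility is 0.1 + 0.3 = 0.4, i.e. a utility of 0.6.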
The results are presented in Table 4. First, we quote the valuation study by Golicki et al. (2010) in the Polish population, to inspect possible differences due to the specific target group in the present study. The remaining models are based on the current sample. In model A1, we used the random effects panel model (with a robust variance-covariance matrix), one of the standard approaches in this type of model. In A2, we estimated an analogous model using ordinary least squares (OLS) to test the impact of neglecting the panel structure, as done in the fuzzy-crisp/fuzzy approaches. In A3, we dropped $\alpha_0$, again to match some fuzzy-fuzzy specifications that assign the imprecision entirely to individual dimensions, for a more convenient interpretation. Finally, in A4 we used median regression (Koenker 2005), for reasons explained in the next subsection.
Firstly, in the present sample, the respondents attach greater disutility to most of the dimensions/levels than the general population (except for PD3, where the estimation error is largest). This may be due to the sample consisting of young, healthy individuals, scared of health problems due to lack of such experience. This is particularly true for UA, suggesting that students value their free time a lot. Fortunately, the representativeness of the sample is not essential in the present study, as we do not focus on the exact values of utilities but rather on the mechanisms present and the methodological approaches to detect them. Still, our results are not drastically different from those of Golicki et al. (2010). Secondly, the difference between panel and OLS regression is negligible from a practical point of view, a convenient fact. Thirdly, the median regression yields lower parameters (and a negative constant term). This is unsurprising, given the negative skewness of health states' utilities (e.g. due to frequent non-trading for several states): the median regression establishes how the middle person's utility changes when dimensions are worsened; as a substantial fraction of the respondents still assigns high utility, the median decreases more slowly than the mean.

Fuzzy-crisp models
In this second class of models, we account for the TTO answers being usually imprecise (i.e. epas/spas ranges having non-zero length). Hence, the elicited (dis)utility is also imprecise. To model it formally, we use fuzzy sets (only a basic introduction is offered; for details, see, e.g. Klir and Yuan 1995).
A fuzzy set $\tilde{X}$ over real numbers (the tilde denotes fuzziness) is represented by a membership function $\mu_{\tilde{X}}(\cdot) : \mathbb{R} \to [0, 1]$ (we drop the subscript in notation, if clear from the context); $\mu_{\tilde{X}}(x)$ is interpreted as the measure of belongingness of $x$ to $\tilde{X}$. If $\mu_{\tilde{X}}(x) = 1$ for a single $x$ and $\mu_{\tilde{X}}(\cdot) = 0$ otherwise, then $\tilde{X}$ is a regular, crisp number (equal to $x$). With fuzzy numbers, more than one value can be perceived by a respondent as a possible (dis)utility of a given state, and some values are less credible but not entirely ruled out [7]. Equivalently, $\tilde{X}$ is defined by α-cuts, $A_{\tilde{X},\alpha} = \{x \in \mathbb{R} : \mu_{\tilde{X}}(x) \ge \alpha\}$ for $\alpha \in (0, 1]$, or by strong α-cuts, $A_{\tilde{X},\alpha+} = \{x \in \mathbb{R} : \mu_{\tilde{X}}(x) > \alpha\}$ for $\alpha \in [0, 1)$. Then, $A_1$ is a (regular) set of values that belong to $\tilde{X}$ with full conviction, and $A_{0+}$ is a set of values that belong thereto with non-zero conviction. In our study, both cuts can be directly inferred from epas and spas. Lastly, we assume the membership function is continuous and linear where possible. In fuzzy set parlance, the disutility is a normal ($\max \mu = 1$) and trapezoidal (shape of $\mu$) fuzzy number (α-cuts are convex).
With our assumptions, fuzzy disutility can be denoted by a four-tuple (l, L, H, h), where [L, H] are the bounds of $A_1$ and [l, h] are the bounds of $A_{0+}$. An example of a fuzzy disutility, (0, 0.2, 0.4, 0.4), is shown in Fig. 1.
[7] In the paper, we do not differentiate between the ontic and epistemic interpretations (Couso and Dubois 2014): respectively, whether the respondent believes there are multiple values of u(Q) (e.g. because of ambiguity about what being in Q truly means), or whether the respondent believes there is a single value u(Q), only not being perfectly accessible through introspection into own preferences.
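The membership function and the α-cuts of a trapezoidal fuzzy number given by a four-tuple can be sketched as follows. The function names are hypothetical; the piecewise-linear shape follows the assumptions stated above, and the example is the four-tuple (0, 0.2, 0.4, 0.4) mentioned in the text.

```python
def membership(x, quad):
    """Membership of x in the trapezoidal fuzzy number quad = (l, L, H, h)."""
    l, L, H, h = quad
    if L <= x <= H:
        return 1.0                    # core: full conviction
    if l < x < L:
        return (x - l) / (L - l)      # rising linear part
    if H < x < h:
        return (h - x) / (h - H)      # falling linear part
    return 0.0                        # outside the support


def alpha_cut(quad, alpha):
    """Closed interval of values with membership of at least alpha, 0 < alpha <= 1."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    l, L, H, h = quad
    return (l + alpha * (L - l), h - alpha * (h - H))


example = (0.0, 0.2, 0.4, 0.4)
```

For the example, a disutility of 0.1 has membership 0.5, any value between 0.2 and 0.4 has membership 1, and values above 0.4 are ruled out entirely because H = h (the membership function necessarily jumps there).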
In all models below, we ignore the panel structure; the observations for individual respondents are pooled and indexed by k. For each k, we have the four-tuple (l, L, H, h) denoting the observed, empirical disutility and, hence, the membership function, μ (we omit index k for clarity).
In fuzzy-crisp models, we assume the model parameters, $\alpha_0$ and $\alpha_{i,j}$, are crisp. Then, the theoretical disutility, 1 − u, is also crisp and given by:
$$1 - u = \alpha_0 + \sum_{i=1}^{5} \sum_{j=2}^{3} \alpha_{i,j} d_{i,j}.$$
Fitting models is done by setting such parameter values to make 1 − u as close to (l, L, H, h) as possible, in a given sense. In B1, we minimize the sum (over k) of squared distances between 1 − u and (L + H)/2. Hence, the imprecision is accounted for by basing the model on the epas range (an alternative model for spas could be built), yet the range is reduced to its middle point.
In B2, the whole four-tuple is used, and we maximize the overall compatibility of the theoretical values given the imprecise preferences, i.e. we maximize the expression $\sum_k \mu(1 - u)$ (see also Celmiņš 1987). This model can be written as a linear optimization problem; still, as the resulting problem is overwhelmingly large, we numerically maximized the non-linear function with the DEoptim package in R. Therefore, the global optimum might not have been found.
In B3, we measure the distance between 1 − u and the range [L, H] and minimize the sum of squared such distances. Finally, B4 uses [l, h] instead of [L, H]. From the technical side, the B3 and B4 approaches can be presented as quadratic programming tasks. The results of the above specifications are collected in Table 5. The results of B1, B3, and B4 are very similar between themselves and to the results of crisp-crisp models. Apparently, the imprecision around the TTO outcome has little impact when the modelling process is set to yield crisp values.
However, the results of B2 are very different: much lower disutilities were found for all dimensions/levels (and the constant term), and we offer the following explanation. The B2 approach strives to set parameters to make the theoretical value fall into the range of as many respondents as possible; in this sense, B2 can be viewed as mode-oriented regression. The disutilities of health states are positively skewed, and for variables with positive skew we often observe mode < median < mean. Analogously in our case, B2 yields lower disutilities than A4 (median regression), and A4 lower than A2.
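The mode-oriented character of the compatibility criterion can be illustrated for a single state with a constant-only model. This is a toy sketch under stated assumptions: the three four-tuples are invented, and a plain grid search stands in for the numerical optimizer used in the study.

```python
def membership(x, quad):
    """Membership of x in the trapezoidal fuzzy number quad = (l, L, H, h)."""
    l, L, H, h = quad
    if L <= x <= H:
        return 1.0
    if l < x < L:
        return (x - l) / (L - l)
    if H < x < h:
        return (h - x) / (h - H)
    return 0.0


def compatibility(value, observations):
    """Sum over observations of the membership of a crisp theoretical value."""
    return sum(membership(value, quad) for quad in observations)


# Hypothetical fuzzy disutilities: two low answers and one high answer
observations = [(0.0, 0.1, 0.2, 0.3), (0.1, 0.15, 0.25, 0.3), (0.6, 0.7, 0.9, 1.0)]

grid = [k / 100 for k in range(101)]
best = max(grid, key=lambda v: compatibility(v, observations))

# Mean of the epas middle points, (L + H) / 2, as a least-squares-type benchmark
mean_mid = sum((q[1] + q[2]) / 2 for q in observations) / len(observations)
```

In this toy data set, the compatibility-maximizing value falls where the two low answers overlap, whereas the mean of the middle points lies between the clusters and is compatible with none of the answers.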
We illustrate the above phenomenon for state 32313 in Fig. 2: point answers and epas for all respondents are presented (spas left out for clarity). The average point utility amounts to −0.122, and the median is greater (between 0.0 and 0.05, due to the even number of observations). To cross the maximal number of ranges with a horizontal line, we need to draw it at the utility of 0.5 (we have 13 crosses then), more than the median. Additionally, this cross-maximizing level can radically change with just a small change of the data. Also, the censoring of the answers at −1 (it is not possible to report lower utility) may lead to the grouping of the answers and impact the results. Thus, the results of the compatibility approach seem challenging to interpret and unstable.

Fuzzy-fuzzy models
In this third class of models, we attribute imprecision to individual dimensions; the results are collected in Table 6. In C1, we use a simplified, naïve approach: we build two separate OLS models for L and H , i.e. estimate two sets of crisp parameters α i, j and α 0 .
In the remaining approaches, we explicitly redefine model parameters to be fuzzy numbers: $\tilde{\alpha}_0$ and $\tilde{\alpha}_{i,j}$. To conform with the definition of the dependent variable, we assume the $\tilde{\alpha}$s are trapezoidal and normal; hence, they can be represented by four-tuples of numbers. Calculating theoretical disutility requires multiplying parameters by dummies and adding the results. Standard fuzzy arithmetic yields: $1 \times (a, b, c, d) = (a, b, c, d)$, $0 \times (a, b, c, d) = (0, 0, 0, 0)$, and $(a, b, c, d) + (e, f, g, h) = (a + e, b + f, c + g, d + h)$. Then, the theoretical disutility is also fuzzy, and is denoted by $(\hat{l}, \hat{L}, \hat{H}, \hat{h})$. The models are estimated by making $(\hat{l}, \hat{L}, \hat{H}, \hat{h})$ as close to the observed (l, L, H, h) (for all k) as possible.
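The fuzzy arithmetic used to build the theoretical disutility amounts to elementwise operations on four-tuples. In the sketch below, the function names and the two fuzzy coefficients are hypothetical and serve only to show the mechanics.

```python
DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")
ZERO = (0.0, 0.0, 0.0, 0.0)


def scale(dummy, quad):
    """Multiply a four-tuple by a 0/1 dummy."""
    return tuple(dummy * x for x in quad)


def add(p, q):
    """Add two four-tuples element by element."""
    return tuple(a + b for a, b in zip(p, q))


def fuzzy_disutility(state, alpha, alpha0=ZERO):
    """Theoretical fuzzy disutility of a state as a four-tuple."""
    total = alpha0
    for dim, ch in zip(DIMENSIONS, state):
        for level in (2, 3):
            dummy = int(int(ch) == level)
            # Coefficients absent from `alpha` are treated as crisp zeros
            total = add(total, scale(dummy, alpha.get(f"{dim}{level}", ZERO)))
    return total


# Hypothetical fuzzy coefficients for two dimension/level pairs
alpha = {"UA2": (0.05, 0.10, 0.20, 0.30), "PD3": (0.30, 0.40, 0.50, 0.60)}
result = fuzzy_disutility("11231", alpha)
```

For state 11231 the result is the elementwise sum of the UA2 and PD3 four-tuples, approximately (0.35, 0.5, 0.7, 0.9), so the imprecision of the state is inherited from the worsened dimensions.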
The fitting of the model requires calculating distances between trapezoidal fuzzy numbers, to minimize the sum over k of squared distances, and we use two methods. In the first one, we use the supremum of the Hausdorff distances between the respective α-cuts of two fuzzy numbers (see Voxman 1998): $d(\tilde{X}, \tilde{Y}) = \sup_{\alpha} d_H(A_{\tilde{X},\alpha}, A_{\tilde{Y},\alpha})$, where $d_H$ is a standard Hausdorff distance. In our case of trapezoidal fuzzy numbers, the distance can be shown to simplify to $\max\{|l - \hat{l}|, |L - \hat{L}|, |H - \hat{H}|, |h - \hat{h}|\}$. We use this distance in C2 and C3, with and without a constant term $\tilde{\alpha}_0$, respectively. A possibly counter-intuitive and unwanted property of $d_H$ is its insensitivity to some changes in the considered intervals. For this reason, we use another distance between four-tuples, $d_I(\cdot, \cdot)$, directly comparing the corresponding elements of two four-tuples (see Chachi and Taheri 2016, for a similar distance for intervals). We use $d_I(\cdot, \cdot)$ in C4 and C5, with and without a constant term $\tilde{\alpha}_0$, respectively. Models C2-C5 can be written as quadratic programming problems.
Note to Table 6: in C1, the results of the L and H modelling are given in parentheses; in other models, the quadruples presented are (l, L, H, h). MO mobility, SC self-care, UA usual activities, PD pain/discomfort, AD anxiety/depression.
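The simplification of the supremum-type distance to a maximum of four absolute differences can be checked numerically, and the same sketch shows the insensitivity mentioned above. The function names and the example four-tuples are hypothetical; the α-cut formula assumes the trapezoidal shape used throughout.

```python
def alpha_cut(quad, alpha):
    """Interval of the trapezoid (l, L, H, h) at level alpha (alpha = 0: support)."""
    l, L, H, h = quad
    return (l + alpha * (L - l), h - alpha * (h - H))


def interval_hausdorff(a, b):
    """Hausdorff distance between two closed intervals."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def sup_distance_numeric(x, y, steps=1000):
    """Approximate the supremum over alpha of the distances between alpha-cuts."""
    return max(
        interval_hausdorff(alpha_cut(x, k / steps), alpha_cut(y, k / steps))
        for k in range(steps + 1)
    )


def sup_distance_closed(x, y):
    """Closed form for trapezoids: largest absolute elementwise difference."""
    return max(abs(a - b) for a, b in zip(x, y))


observed = (0.0, 0.2, 0.4, 0.4)
fitted_a = (0.0, 0.2, 0.4, 0.9)  # differs only in the upper support bound
fitted_b = (0.0, 0.5, 0.6, 0.9)  # additionally shifts the core
```

Both fitted four-tuples are at distance 0.5 from the observed one, although `fitted_b` also misplaces the core: only the largest of the four differences matters, which illustrates the kind of insensitivity that motivates an elementwise alternative.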
A counter-intuitive property of C1 is that some coefficients when modelling the lower bound of the disutility range (L) are greater than for the upper bound (H ); hence, it is impossible to directly interpret the results as the imprecise disutilities of individual dimensions/levels.
In the remaining approaches, this difficulty does not arise. However, some parameters reduce to crisp numbers (e.g. MO3 in C2), especially when the intercept captures most of the imprecision. Models with no constant term (C3 and C5) make it easier to assign imprecision to the individual dimensions. Conforming with the results presented in Table 3, UA followed by PD is responsible for the largest part of the spread in answers. Looking at absolute values, the ranges of disutilities conform with the results of the standard crisp-crisp modelling in Table 4: PD3 causes the largest disutility, followed by UA3 and MO3.

Value sets
Supporting health technology appraisal requires not only understanding the impact of dimensions on disutility but also assigning utility values to all the health states in the descriptive system, i.e. constructing a value set. Fig. 3 presents the value sets constructed for selected types of models: the standard panel model (A1, with health states ordered by decreasing utility in this model), compatibility (B2), fuzzy-crisp least squares (B3), the naïve approach (C1, two lines), and the fuzzy-fuzzy Hausdorff distance model with no intercept (C3, gray area based on the L-H range only). We also present the result of a combined modelling of the epas middle point (B1) and its length (Table 3). For most models, the resulting value sets offer similar utilities. Notably, the compatibility approach used in B2 results in much higher utility values (due to the explanation provided above). Regular crisp-crisp (A1) and fuzzy-crisp (B3) approaches yielded nearly identical values. In this sense, imprecision of preferences used as input does not matter as long as we require a crisp output.
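Constructing a crisp value set means evaluating the fitted model for each of the 3^5 = 243 states of the descriptive system. The sketch below uses hypothetical placeholder coefficients and, as a simplifying assumption, a model with no constant term.

```python
from itertools import product

DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")


def utility(state, alpha):
    """Utility of a state: 1 minus the sum of disutilities of worsened dimensions."""
    return 1 - sum(alpha[f"{dim}{ch}"]
                   for dim, ch in zip(DIMENSIONS, state) if ch != "1")


def value_set(alpha):
    """Assign a utility to every state of the EQ-5D-3L descriptive system."""
    states = ("".join(levels) for levels in product("123", repeat=5))
    return {state: utility(state, alpha) for state in states}


# Hypothetical coefficients: 0.05 for every level 2, 0.2 for every level 3
alpha = {f"{dim}{level}": (0.05 if level == 2 else 0.2)
         for dim in DIMENSIONS for level in (2, 3)}
values = value_set(alpha)
ordered = sorted(values, key=values.get, reverse=True)  # decreasing utility
```

Sorting the states by decreasing utility reproduces the kind of ordering used on the horizontal axis of a value set plot; a fuzzy value set could be obtained analogously by replacing the crisp coefficients with four-tuples.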
However, if we transport the imprecision to the value set, we obtain a band (not a line) of values that all can equally well be interpreted as utilities of consecutive health states. For fuzzy least squares (C3), for some health states there seemingly is no imprecision, and the band narrows to a single point (if only dimensions where disutilities degenerated to crisp numbers are worsened). The naïve approach (C1) and modelling the middle and length of the interval separately yield similar results; hence, the unwanted parameter reversal in C1 does not pose a problem for value set construction.
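How a band of utilities arises from fuzzy disutilities can be sketched as follows, assuming an additive model with no intercept in which trapezoidal disutilities (l, L, H, h) are summed element by element. The parameter values in `DISUTILITY` are hypothetical and are not the estimates from this study; the function name and the dictionary layout are likewise illustrative.

```python
# HYPOTHETICAL fuzzy disutilities (l, L, H, h) per dimension and level;
# these are made-up numbers, not estimates reported in the paper.
DISUTILITY = {
    ("MO", 3): (0.30, 0.30, 0.30, 0.30),  # degenerated to a crisp number
    ("UA", 3): (0.10, 0.15, 0.30, 0.40),
    ("PD", 3): (0.25, 0.30, 0.40, 0.45),
}


def fuzzy_utility(state):
    """Return the utility of a state as a trapezoidal number (l, L, H, h).

    `state` maps a dimension code to its level; level 1 means no problems
    and contributes no disutility. Disutilities are added element-wise,
    and utility = 1 - disutility reverses the order of the bounds.
    """
    total = [0.0, 0.0, 0.0, 0.0]
    for dim, level in state.items():
        if level == 1:
            continue
        for i, value in enumerate(DISUTILITY[(dim, level)]):
            total[i] += value
    l, L, H, h = total
    return (1 - h, 1 - H, 1 - L, 1 - l)


# Only a "crisp" dimension is worsened: the band narrows to a point.
print(fuzzy_utility({"MO": 3, "UA": 1, "PD": 1}))
# Worsening imprecise dimensions widens the band.
print(fuzzy_utility({"MO": 3, "UA": 3, "PD": 3}))
```

Under these assumptions, a state that worsens only dimensions with crisp disutilities gets a single utility value, while any state involving an imprecise dimension gets a range.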

Concluding remarks
In the present paper, we studied the amount of imprecision in the preferences for health states. We showed that the imprecision of the utility values can be elicited by a simple extension of the standard TTO protocol. Preferences are indeed imprecise: there is typically a range of answers the respondent considers equally plausible. Importantly, this imprecision is not stochastic in nature: it will not diminish with increasing sample size (and even for a moderate sample used in the present study, the imprecision is larger than the standard error due to sample randomness). Hence, focusing solely on sample size or number of states used may be insufficient to increase the credibility of the results; it could be worthwhile to add elements that increase respondents' awareness of their own preferences (e.g. a qualitative part in the beginning of a study to discuss what aspects of health one values in life).
Some studies (Ramos-Goñi et al. 2017; Jakubczyk and Golicki 2018) notice the problem of imprecision in preferences and approach it by studying the paths of responses in TTO. As we show, the actual imprecision is often larger than the values rejected in TTO would suggest (as presented in Sect. 3, in > 50% of cases a rejected value was included in the range of plausible answers). Perhaps respondents need some time or another framing of the question to realize that the previously offered T values were acceptable; or perhaps respondents tend to accept only T values closer to the middle of the epas range. That comparing alternatives and valuing are two different tasks, and that their results sometimes diverge, has been observed in the literature (see, e.g. Butler et al. 2014; Grether and Plott 1979).
In particular, when respondents in TTO accept the first offered value of T = 10 years, the implied utility equals 1 (non-trading, present in > 30% of cases for some mild states), and there is no path on which to base the inference (which might explain wide intervals in Jakubczyk and Golicki 2018). Asking for epas/spas directly helps to understand the non-trading behaviour better. As we found, in 34.7% of the non-trading cases, the respondents actually considered trading off some years an equally good representation of their preferences, and in 46.2% they did not rule out trading off completely. This finding shows that observed non-trading may be largely due not to preferences being lexicographic (longevity first), but to the TTO protocol (starting with 10 years). As non-trading poses some technical challenges in modelling utilities (e.g. whether the utilities should be censored at 1, as done in Devlin et al. 2018), our finding suggests that modifications to the protocol may be useful (e.g. using a finer set of answers near 10 years).
That the imprecision can explain some inconsistencies in answers (a dominated state being assigned greater utility by a given respondent) suggests that the standard quality checks may be overly conservative. This is not to say that respondents with inconsistent answers ought not to be removed from the sample (inconsistent answers as input may lead to counter-intuitive results), but perhaps inconsistencies in TTO should not be immediately treated as a sign of poor work by an interviewer or lead to also discarding the DCE data, if both are collected.
We found that the preferences are typically not so imprecise as to make respondents unsure if a given state is better or worse than dead (in only a few cases do the epas/spas ranges contain 0 utility). This is a rather optimistic finding. For example, a discrete choice experiment between health states with no duration (another elicitation procedure, often paired with TTO) allows estimating the utility values only on a latent scale that needs to be subsequently anchored at the 0 value (dead) (see Norman et al. 2016, for information on anchoring), and apparently a direct comparison of health states with dead can be done relatively precisely. Other, novel approaches to estimating health state utilities (e.g. the personal utility function presented in Devlin et al. 2019) also require locating being dead in a set of ranked states; our finding suggests that this task can be done rather credibly.
In the paper, we demonstrated which state characteristics mostly drive the imprecision. Respondents are particularly unclear about the impact of usual activities, followed by pain/discomfort. This may result from usual activities being a wide concept (very different activities might be included, e.g. work, free time, family obligations) and the most subjective one (respondents must decide what activities they include in their considerations). If the health state is internally diversified (some dimensions at level 1, and some at level 3), then this also increases the imprecision in how the state is valued.
The analysis of imprecision may help in developing descriptive systems; obviously, the EQ-5D-3L/5L descriptive system is well established and ought not to be changed due to the present study, but the analysis of imprecision may help when developing other systems or bolt-ons, i.e. additional dimensions added, e.g. for specific patient groups (see, e.g. Yang et al. 2015). In a larger and more diversified sample, one might verify how confident respondents feel with reporting problems in individual dimensions (i.e. re-estimate the model presented in Supplementary Material).
Also, the results presented here suggest what states are best used in valuation studies. As internal consistency increases the precision in assessment, it makes sense to focus on less internally diversified states. Also, the imprecision seems to be subadditive (the overall imprecision is smaller than the sum of its parts when worsenings in two dimensions are compounded). This observation motivates using severe states in valuation studies.
In the paper, we tested several econometric approaches. Our findings suggest that different classes work best for different purposes. To attribute the imprecise disutilities to individual dimensions, the fuzzy-fuzzy least squares models with no intercept (C3 and C5) are probably best suited, because in this approach, we directly attribute the imprecision of utility to parameters for individual dimensions/levels. However, if we want to construct a fuzzy value set, then separately modelling the middle and the length of the α-cuts seems most convenient, as it avoids the problem of possible sub-additivity of imprecision when problems in different dimensions are combined (see Supplementary Material). The naïve models proved similar in producing value sets, but modelling the location and the length separately has the additional advantage that it is possible to select the best class of model (e.g. semi-log) for each element separately.
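The idea of modelling the location and the length separately can be sketched as below. This is only an illustration under stated assumptions: a linear predictor is used for the middle point, a semi-log form is used for the length, and all coefficients (`MID_COEF`, `LOGLEN_COEF`) are hypothetical, not estimates from this study.

```python
from math import exp


def to_mid_len(lower, upper):
    """Re-express an interval [lower, upper] as (middle point, length)."""
    return (lower + upper) / 2, upper - lower


def from_mid_len(mid, length):
    """Recover the interval bounds from the middle point and the length."""
    return mid - length / 2, mid + length / 2


# HYPOTHETICAL coefficients, for illustration only.
MID_COEF = {"const": 0.05, "UA3": 0.20, "PD3": 0.30}     # linear model of the middle
LOGLEN_COEF = {"const": -3.0, "UA3": 0.8, "PD3": 0.5}    # semi-log model of the length


def predict_interval(problems):
    """Predict the disutility interval for a list of problems, e.g. ["UA3"]."""
    mid = MID_COEF["const"] + sum(MID_COEF[p] for p in problems)
    # exp(...) is always positive, so the predicted bounds cannot reverse.
    length = exp(LOGLEN_COEF["const"] + sum(LOGLEN_COEF[p] for p in problems))
    return from_mid_len(mid, length)


print(to_mid_len(0.2, 0.5))
print(predict_interval(["UA3"]))
print(predict_interval(["UA3", "PD3"]))
```

Because the length is predicted on its own scale, the two components can follow different functional forms, and the lower bound of the reconstructed interval never exceeds the upper bound.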
The compatibility approach (B2) seems least suited, as it yields counter-intuitive (and difficult to interpret) results with very low disutilities. On the one hand, that is a disappointment, as the membership function of the fuzzy disutility is not used in other approaches (only the bounds of the two α-cuts). On the other hand, the assumptions regarding the actual shape of the membership function (whether it is trapezoidal) are less important, and the findings are more robust in this sense.
The results of some modelling approaches are probably influenced by the selection of states used in the survey. For example, level 3 in SC was only used in three states and always in combination with some other dimension at level 3; this fact may explain the degenerate result of C3 and C5. On the one hand, if worsening in two dimensions usually coincides, it might be difficult to separate the impact of each on the imprecision. On the other hand, worsening only one of two dimensions to level 3 may make the state hardly imaginable, internally inconsistent, and difficult to assign a utility to. In future studies, a wider pool of states is probably needed. It might also be easier to use the richer EQ-5D-5L descriptive system in future studies, as the levels differ more subtly and more variation in dimensions can be introduced without making the state internally implausible (the 3L version was used here, as we wanted to cover a reasonable fraction of all the states with the design used).
In the present study, we neglected the problem of the estimation error (except for models where standard econometric techniques were used). Still, how the results of various approaches coincide is reassuring for the point estimates obtained. In future research, bootstrap methods could be used, for example, to inspect the impact of sample randomness on the estimated parameters.
The impact of imprecision on parameters is negligible when crisp parameters are required as output (e.g. results of A1 and B1 or B3 are rather similar). This finding is rather fortunate for the whole area of research: if we want to estimate standard (crisp) value sets to support decisions on health technology provision, we may as well neglect preference imprecision. On the one hand, that finding seems a bit unfortunate for the present paper (but was not known before the study). On the other hand, the imprecision of preferences should probably make us think more about the estimand (if not about the estimation process). If preferences are imprecise, what is it that we would like to maximize?
It is a bit paradoxical that we ask the respondents to report their imprecision in the form of (precise) numbers. The imprecision about how to quantify one's imprecision can be modelled formally (see Zadeh 1975, for type-2 fuzzy sets). We believe, however, that there is a trade-off between trying to learn more about the respondent and having to ask questions that may be unintelligible (asking for a range of values describing the lower bound of the epas range). Additionally, we believe that if the respondent is unsure about, say, the lower bound of epas, they can report it more conservatively (i.e. report a lower bound of the range in a way accounting for the whole range of possible lower bounds of epas).
We have to acknowledge, however, that there are probably other forms of imprecision that the respondent cannot access by introspection. How questions are framed or what protocol is used to elicit responses may change the answers, as observed in other experimental contexts (Beattie et al. 1998). This is especially true if we assume the preferences for health states are not available a priori but are only created during the experiment (just like regular markets shape the preferences for goods, see Isoni et al. 2016). Our present study is limited to studying only one type of imprecision. Still, perhaps the amount of imprecision (as defined in the present study) can serve as a proxy for measuring stability of preferences in repeated tasks. Then, future studies could inspect if weighting observations based on imprecision improves the quality of produced models (defined, e.g. by predictive validity, see Jakubczyk et al. 2018a).