1 Introduction

Choices abound in real life, and they are made with varying degrees of conviction. For instance, we may hesitate over which shirt to wear to a job interview, yet easily pick a favourite radio station on the way there. You, dear reader, may just have falteringly decided to keep reading this paper. We aim to verify whether quantifying self-reported conviction in discrete choice experiments (DCE) is possible in a meaningful way, i.e. whether such a measure has intuitive properties and whether it is associated with observed choices in DCE or other preference elicitation tasks.

Understanding people’s preferences is crucial in various contexts. For instance, companies want to learn what their customers value, to help with product development or selecting a marketing or pricing strategy. Public decision makers may strive to understand societal preferences for publicly offered goods, to help with cost-benefit analysis and to align the offering of such goods with what society needs and is willing to pay for. This paper is located in the latter context: we study preferences for health states.

Health preferences are studied with the aim of informing cost-utility analyses (CUA) of health technologies. In CUA, health benefits may stem from increased longevity or from improved health states, and health states are often defined using one of the EQ-5D family of instruments (Kennedy-Martin et al., 2020). These instruments measure health-related quality of life by capturing the severity of health problems across five dimensions (see the next Section for details). In the present paper, we use EQ-5D-3L, in which the degree of health problems is measured using three levels. Based on valuation studies (for instance, Golicki et al., 2010; Versteegh et al., 2016), values are assigned to health states to subsequently quantify the health benefits in CUA.

In valuation of EQ-5D instruments, the elicitation tasks typically encompass the time trade-off method (TTO) and DCE. Usually, the so-called QALY model (QALY = quality-adjusted life years) is assumed as a representation of preferences for health profiles, i.e. for alternatives denoting living in state Q for T years, denoted as (QT) (see Bleichrodt et al., 1997; Miyamoto et al., 1998 for an axiomatization). In this model, the preferences for (QT) can be represented by a von Neumann–Morgenstern utility that decomposes into a product of T and a value representing the utility of Q: \(u(Q)\times T\). The function \(u(\cdot )\) is normalised to assign 0 to being dead and 1 to full health.

In TTO, the respondents are asked to compare health profiles to understand what trade-offs they are willing to accept. TTO is an indifference-based task, i.e. the respondents are interviewed in such a way as to establish their indifference between various health profiles—for example, living in a given state, Q, for 10 years and living in full health for 6 years. In the QALY model, the utility of Q can be elicited as \(u(Q)=\nicefrac {6}{10}\).

DCE is a choice-based task: the respondent faces a series of choice tasks in which they typically choose from a pair of health states. Depending on the specific operationalisation, the alternatives may also be defined to take duration into account (Craig & Rand, 2018) or to disregard it (Stolk et al., 2019). Under random utility theory, the observed proportion of choices is reinterpreted in terms of the utility difference between the alternatives. If only health states without information on duration are used in DCE, the obtained utilities are only measured on a latent scale and the task has to be accompanied by TTO, for instance, to anchor the utilities on the scale used in the QALY model (Norman et al., 2016). For the EQ-5D-5L instrument, the valuation studies use either only TTO data (Versteegh et al., 2016; Pickard et al., 2019), only DCE data (Craig & Rand, 2018), or both types of data (Golicki et al., 2019).

Using DCE is perceived as easier for the respondents, which has the advantage that online surveys, rather than only supervised interviews, can be used to collect data. Also, making binary choices resembles the actual life situations more closely than undergoing the iterative procedure in TTO. However, in the case of DCE, only binary information is collected from each choice task. In consequence, many respondents have to be included and the ability to understand the preferences of each individual is limited.

In this empirical paper, we assess the possibility of expanding a typical DCE by additionally asking for a self-reported conviction for the choice. We foresee several benefits of such an extension. First, the measure of conviction would add cardinal information. Since the respondent has had to make a decision anyway, reporting the level of conviction accompanying it should not represent a large additional effort. Second, acknowledging that choices are made with varying degrees of conviction may help to understand violations of standard axioms for preferences. For instance, if a respondent picked A over B, B over C, and C over A, yet all the choices were made with very low conviction, can the respondent be considered irrational? This issue might be particularly relevant in data quality control, where the observed choices are the basis for respondent exclusion (e.g. in the garbage class approach; Karim et al., 2022; Jonker, 2022). Third, asking respondents for their conviction may improve their engagement with the task: the very fact that such a question is asked signals that some choices are expected to be difficult, so giving a response with low conviction should not discourage respondents from focussing on subsequent tasks.

To us, it seems that there are two key reasons why people may not be fully convinced about their choice. First, the best and second-best alternatives may be perceived as equally attractive. Second, the choice task may be difficult due to the complexity of comparisons required. In the context of comparing health states defined using multiple attributes, this complexity may stem from having to evaluate multiple trade-offs across several dimensions. For complex choices, people simplify their decision-making process by referring to dominance (resulting in a decoy effect) or anchors to compare the alternatives (Ariely & Jones, 2008). In this paper, we explore the importance of the above two reasons for lack of conviction.

However, whether self-reported conviction yields meaningful information cannot be taken for granted: it is not defined in terms of choice; thus, it is not clear what a conviction of, say, 40 means. To show that the information obtained with conviction is meaningful, we study the properties of conviction itself. We show that (i) it changes in an intuitive way when choice tasks are modified (for instance, when one of the offered alternatives is worsened), (ii) it can be used to predict choices (it is associated with whether the choice is sustained in a test–retest experiment), and (iii) it is linked not only with the estimated difference in utility between the offered alternatives but also with the amount of imprecision with which the utility is perceived by the respondent. To study the properties of conviction, we use both DCE and TTO data. The DCE part was extended to collect the additional conviction data; the TTO part was extended to measure respondents' uncertainty about the indifference point (similarly to Jakubczyk & Golicki, 2020).

There are only a few studies that have measured conviction in DCE. Elrod and Chrzan (2007) collected the extent-of-preference information (as per their terminology) in choices between brands of products. They studied how this additional information changed the estimation results. Two important differences from the present paper are that: (1) they measured the conviction as an ordinal variable and (2) they assumed that the conviction for a choice depends only on the difference in utilities of the alternatives. To give a simple example as to why such an approach may be incorrect, imagine choosing between a certain pay-off of 99 dollars or 100 dollars. The utility difference is very small, yet the conviction behind the choice is likely to be very high. The contribution of the present paper is that we attempt to check how, in addition to the utility difference, the imprecision in the perception of the utility is linked with the degree of conviction.

In a recent study, Wranik et al. (2020) collected information on conviction from respondents in a single-scenario DCE on accepting reimbursement (or not) of a hypothetical cancer drug. Conviction was also measured solely on an ordinal scale and not used in the DCE data modelling (it was considered alongside best-worst scaling data, also collected in this study, for presenting descriptive statistics).

Finally, there are studies in which the way the preferences are modelled involves imprecision, whilst no measures of imprecision are directly elicited from the respondents, i.e. the degree of imprecision remains only a theoretical, unobserved entity. Such an approach was used both for DCE (e.g. Jakubczyk et al., 2018b) and for TTO data (e.g. Ramos-Goñi et al., 2018; Jakubczyk and Golicki, 2018).

Regarding the theoretical basis of choice conviction, Gerasimou (2021) proposes a theoretical model of preference intensity that shares some properties with our model. However, our model allows for a more general interpretation of the conviction level. More specifically, it reflects not only the strength of preference, but also indecisiveness due to the complexity of the choice. Our experimental design distinguishes between these two determinants of choice conviction (see hypothesis H3 in Sect. 3.1).

An approach quite similar to the TTO method discussed in Sect. 3.4 is that of Loomes (1997) and Dubourg et al. (1994), who analyse the imprecision of stated preferences for road safety. They use the contingent valuation method, elicit maximum and minimum willingness-to-pay for a specific reduction in the risk of a specific injury, and use the difference as a measure of preference imprecision. We use the QALY model, where life expectancy serves as a continuous scale against which imprecision is measured. On the one hand, the QALY model requires strong assumptions concerning preferences. On the other hand, an increase in life expectancy seems a more natural form of compensation for a highly untradeable change in health than a reduction in willingness-to-pay.

2 Methods

2.1 Data

We collected data in a convenience sample of 83 students from SGH Warsaw School of Economics, Poland, via personal interviews performed by 3 graduate/PhD students (live or online; see Lipman, 2021, for a discussion of an online form). Each respondent gave their consent to participate, answered basic demographic questions, evaluated their own health (with EQ-5D-3L), conducted 17 DCE tasks (pairs or triplets) together with self-reporting their degree of conviction, assessed the overall difficulty of the DCE tasks, and conducted 2 warm-up TTO tasks and 7 actual TTO tasks.

In the DCE part, the respondent was asked to choose in which of the presented states they would prefer living for the next 10 years (after which the health state would return to the current one and progress as expected). In the actual TTO tasks, the respondent also evaluated the plausible ranges for the indifference point, to capture the amount of imprecision when evaluating the attractiveness of health states (similarly to Jakubczyk & Golicki, 2020). The utility elicited with TTO is set on a QALY scale (Bleichrodt et al., 1997), i.e. the utility of full health is normalised to 1, and the utility of being dead is normalised to 0.

Health states were defined using the EQ-5D-3L descriptive system. A health state is defined by five dimensions (attributes): mobility (MO), self-care (SC), usual activities (UA), pain & discomfort (PD), and anxiety & depression (AD). Each dimension can be graded on one of three levels, 1–3, where 1 denotes no problems and the severity increases with level. The states are referred to by representing them with five digits. Hence, state 11111 is the best possible (full health), and 33333 is the worst possible (also known as the pits state). For a detailed presentation see, for instance, Golicki et al. (2010). In this paper, we also refer to states using a, b and c.

To allow testing of the properties of the self-reported conviction, the DCE tasks were chosen to offer a range of possible analysis types, as characterised below. Table 1 summarises our DCE choice tasks (the states that were also used in the TTO part are set in bold). The ordering of choice tasks was random.

Table 1 The choice tasks used in the discrete choice experiment

2.2 Properties of self-reported conviction

First, we analyse the properties of conviction based solely on DCE data. We formulate several properties and test whether they hold in the collected data.

Each choice task corresponds to a subset of states (a menu). We focus on choices from pairs of alternatives, denoted by (ab), so our notation is optimised for this case. For a given respondent and choice task, we summarise the reported choice and conviction level with \(\mu (a,b)\) equal to the conviction if a was chosen and negative conviction if b was chosen. For instance, for a choice task (ab), \(\mu (a,b)=100\) means that a was chosen over b with maximal conviction, \(\mu (a,b)=0\) means that either a or b was chosen with zero conviction (note the equivalence), and \(\mu (a,b)=-100\) means that b was chosen and maximal conviction was reported. By definition, \(\mu (a,b)=-\mu (b,a)\). For brevity, we also use \(\mu _i\) to denote a choice task with a given ID as per Table 1.

Given two states a, b we define \(a+b\) to be a new state in which the health problems of a and b (if any) are combined. We only consider pairs a, b in which at most one of the two states has problems (i.e. a level higher than 1) in any given dimension. In consequence, \(a+b\) is a state in which every dimension is at the worse of the levels indicated in a and b. For example, 11112+11121=11122.
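The combination operation can be sketched as follows, assuming (purely for illustration) that states are encoded as five-digit strings, as in the paper's notation; the function name is hypothetical.

```python
def combine(a: str, b: str) -> str:
    """Combine the health problems of two EQ-5D-3L states.

    Each state is a five-digit string (levels 1-3 per dimension); each
    dimension of the result takes the worse (higher) level of a and b.
    Encoding and function name are illustrative, not from the paper.
    """
    assert len(a) == len(b) == 5
    # The paper restricts a, b so that at most one of the two states
    # has problems (level > 1) in any given dimension.
    assert all(x == "1" or y == "1" for x, y in zip(a, b))
    # Digit characters '1' < '2' < '3' compare correctly as strings.
    return "".join(max(x, y) for x, y in zip(a, b))

print(combine("11112", "11121"))  # → 11122, matching the example in the text
```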

When health problems in several dimensions are combined (hence, the state becomes unambiguously worse), we expect the conviction level to be consistent with dominance, i.e. the combined state should be less preferred than each of the states before combining. This property can be explicitly formulated as hypothesis H1, Dominance: \(\mu (a+b,c)\le \min \{\mu (a,c),\mu (b,c)\}\). To verify H1, we use choice tasks ID=2, 5, and 6, i.e. we test whether \(\mu _5\le \min \{\mu _2,\mu _6\}\).
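For a single respondent, the H1 check reduces to a one-line predicate (a hypothetical helper, not code from the study; the arguments are the respondent's signed convictions for tasks 2, 5 and 6).

```python
def satisfies_h1(mu2: float, mu5: float, mu6: float) -> bool:
    # H1 (dominance) applied to tasks 2, 5 and 6:
    # mu(a+b, c) <= min(mu(a, c), mu(b, c)), i.e. mu5 <= min(mu2, mu6).
    return mu5 <= min(mu2, mu6)
```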

As pointed out in the introduction, \(\mu (\cdot )\) may respond to the increased difficulty of the task, for instance, as measured by the number of trade-offs one has to evaluate. To verify whether this effect prevails, we study hypothesis H2, Strength of preference vs. choice difficulty, stating the following. Consider four states \(a,b,a',b'\) such that there is exactly one trade-off (the alternatives differ in exactly two dimensions, each being better in one dimension and worse in the other) in each of the following pairs: (ab), \((a',b')\), \((a,a')\) and \((b,b')\). The classic monotonicity axiom (Gilboa, 2009, p. 143) of multi-attribute choice would require that if a is chosen from (ab) and \(a'\) is chosen from \((a',b')\), then \(a+a'\) should be chosen from \((a+a',b+b')\).

If \(\mu\) captures the strength of preference only, we would expect

$$\begin{aligned} \mu (a,b)>0,\,\mu (a',b')>0 \Rightarrow \mu (a+a',b+b')\ge \max \left( \mu (a,b),\mu (a',b')\right) , \end{aligned}$$

because we would expect the strengths of preference to add up. However, if \(\mu\) captures only indecisiveness due to choice difficulty, we would expect

$$\begin{aligned} \mu (a,b)>0,\,\mu (a',b')>0 \Rightarrow \mu (a+a',b+b')\le \min \left( \mu (a,b),\mu (a',b')\right) , \end{aligned}$$

because the number of trade-offs for menu \((a+a',b+b')\) is larger than in each of (ab) and \((a',b')\).

In our data, we can examine H2 using \(a=21111\), \(b=11121\), \(a'=12111\) and \(b'=11112\) in choice tasks ID=1, 2 and 3. We consider respondents for whom \(\mu _1\times \mu _2>0\) (i.e. either a and \(a'\) or b and \(b'\) were chosen). Then we verify for how many respondents the strength-of-preference or the difficulty-of-choice interpretation prevails, i.e. whether \(|\mu _3 |\ge \max \left( |\mu _1 |,|\mu _2 |\right)\) or \(|\mu _3 |\le \min \left( |\mu _1 |,|\mu _2 |\right)\).
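The per-respondent classification under H2 can be sketched as follows (an illustrative helper; the function name and return labels are our own, and the arguments are the signed convictions for tasks 1, 2 and 3).

```python
def classify_h2(mu1: float, mu2: float, mu3: float) -> str:
    """Classify a respondent under H2.

    Respondents with mu1 * mu2 <= 0 are excluded (they did not choose
    consistently in tasks 1 and 2); otherwise, compare |mu3| with the
    bounds implied by |mu1| and |mu2|.
    """
    if mu1 * mu2 <= 0:
        return "excluded"
    lo = min(abs(mu1), abs(mu2))
    hi = max(abs(mu1), abs(mu2))
    if abs(mu3) >= hi:
        return "strength of preference"
    if abs(mu3) <= lo:
        return "choice difficulty"
    return "in between"
```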

As a third property, we study whether \(\mu\) can be used to detect inconsistencies in choices. To begin, we verify the hypothesis H3, Transitive–intransitive: intransitive choices are made with lower levels of conviction as compared to the transitive ones. To verify H3, we use two groups of choice tasks: ID=1, 6, 7 and 8, 9, 10, respectively, and we compare the average (per respondent-choice task) minimum of the absolute values of \(\mu (a,b)\), \(\mu (b,c)\) and \(\mu (c,a)\), separately for intransitive and transitive choices.

To proceed, we verify whether \(\mu\) captures the consistency of choice in tasks involving three alternatives. We use Sen’s property \(\alpha\) (Sen, 1971): adding new decision alternatives should either not affect the choice, or one of the added alternatives should be chosen. We operationalise this property in hypothesis H4, Sen’s property \(\alpha\): people for whom the \(\alpha\) property holds make choices with higher conviction than people violating this property. More specifically, we looked into respondents who made transitive choices in H3. We compared the conviction in choice task (abc), depending on whether the selected alternative was also picked in pairwise comparisons. Based on our study design, we used choice tasks ID=15 and 16 (corresponding to pairwise choice tasks, 8–10 and 1, 6, and 7, respectively).

Finally, we address response mode effects (valuation vs. choice) and the resulting intransitivities (Seidl, 2002; Dhar & Nowlis, 2004). In particular, we study whether choice-valuation preference reversals can be attributed to low values of conviction. To this end, we compare the ordering of states from the DCE tasks with that implied by the TTO valuation. We formulate the final hypothesis, H5, TTO-DCE reversal: the average (over all respondents) conviction for DCE choices consistent with the TTO valuation is higher than for choices inconsistent with TTO. To verify H5, we use choice tasks ID=8–11, and 13 (pooling all such tasks).

2.3 Drivers of conviction in pairwise DCE choice tasks

We attempt to explain the drivers of conviction with a simple linear regression. When asked to indicate overall reasons for DCE choice difficulty, respondents mostly pointed to the difficulty in simultaneously considering multiple attributes, the difficulty in imagining a given health state, and that of perceived similar attractiveness (ambiguity of descriptions and ethical issues were indicated as far less important). The above results motivated the selection of variables for the model, as follows.

Let r denote the actual health state of the respondent. Consider a choice task (ab). Let \(a_i\) denote the level of the i-th dimension of state a. We define three variables that attempt to capture the possible reasons for difficulty of choice.

The complexity of the choice, CPX, is captured by the (Manhattan) distance between the alternatives: \(\text {CPX}=\sum _{i=1}^5|a_i-b_i |\). The difficulty in imagining the states, IMG, is captured by the distance of the alternatives from the respondent's current health: \(\text {IMG}=\sum _{i=1}^5\left( |a_i-r_i |+|b_i-r_i |\right)\). The relative attractiveness of a over b, ATV, is captured by simply comparing the sum-scores over all dimensions of the alternatives: \(\text {ATV}=\sum _{i=1}^5 (a_i-b_i)\).
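Assuming, for illustration, that states are encoded as five-digit strings, the three variables can be computed as follows (a minimal sketch; the function name is hypothetical).

```python
def features(a: str, b: str, r: str) -> tuple:
    """Compute CPX, IMG and ATV for a pairwise task (ab), given the
    respondent's own state r. All states are five-digit strings."""
    ai = [int(x) for x in a]
    bi = [int(x) for x in b]
    ri = [int(x) for x in r]
    cpx = sum(abs(x - y) for x, y in zip(ai, bi))   # complexity of the choice
    img = sum(abs(x - z) + abs(y - z)
              for x, y, z in zip(ai, bi, ri))       # distance from own health
    atv = sum(x - y for x, y in zip(ai, bi))        # relative attractiveness
    return cpx, img, atv
```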

To study the reasons for conviction, we estimate the following model (for all pairwise tasks, indices omitted for brevity):

$$\begin{aligned} \mu =\beta _0 + \beta _1 \times \text {CPX} + \beta _2 \times \text {IMG} + \beta _3 \times \text {ATV} + \varepsilon . \end{aligned}$$
(1)
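To make the specification concrete, equation (1) can be estimated by ordinary least squares. The sketch below uses entirely synthetic data (all coefficient values and sample sizes are illustrative, not the study's data).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Synthetic regressors standing in for CPX, IMG and ATV.
cpx = rng.integers(1, 8, n).astype(float)
img = rng.integers(0, 15, n).astype(float)
atv = rng.integers(-5, 6, n).astype(float)

# Synthetic conviction: illustrative coefficients plus noise.
mu = 80 - 6 * cpx - 1.5 * img + 4 * atv + rng.normal(0, 25, n)

# OLS via least squares on the design matrix [1, CPX, IMG, ATV].
X = np.column_stack([np.ones(n), cpx, img, atv])
beta, *_ = np.linalg.lstsq(X, mu, rcond=None)
resid = mu - X @ beta
r2 = 1 - resid.var() / mu.var()
print(beta, r2)
```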

2.4 Conviction and test–retest prediction

To test whether conviction can be used to predict subsequent choices, we used data coming from 20 respondents who repeated the entire interview after 10–18 weeks. Such a time seems long enough to us for the respondents to forget their actual answers, but short enough to assume that there were likely to have been no events to change respondents’ views on health (i.e. no severe illnesses in the meantime). These data allowed us to study how much the conviction in the original choice (and in the new choice) correlated with the choice being repeated. In total, there were 340 repeated choice tasks.

We compared the average conviction (separately for the first-round and second-round choices) between two situations: when the choices in repeated choice tasks were identical and when they differed (independent-samples Student's t-test with unequal variances). We also compared the proportion of identical choices, depending on whether the conviction (again, separately for the first and second round) was greater than the median or not (chi-squared test).
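The two comparisons can be sketched as follows, using synthetic stand-in data (all numbers are illustrative; `scipy` provides both tests).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic convictions: tasks where the repeated choice was identical
# vs. different (means and sample sizes are illustrative only).
conv_same = rng.normal(76, 20, 250).clip(0, 100)
conv_diff = rng.normal(58, 25, 90).clip(0, 100)

# Welch's t-test (unequal variances).
t, p = stats.ttest_ind(conv_same, conv_diff, equal_var=False)

# Chi-squared test on a 2x2 table:
# (above vs. at-or-below median conviction) x (same vs. different choice).
median = np.median(np.concatenate([conv_same, conv_diff]))
table = [[(conv_same > median).sum(), (conv_same <= median).sum()],
         [(conv_diff > median).sum(), (conv_diff <= median).sum()]]
chi2, p2, dof, _ = stats.chi2_contingency(table)
print(p, p2)
```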

2.5 Conviction when comparing two elicitation methods

To study how the conviction is related to the difference in utility of the alternatives, we jointly used the DCE and TTO data, as the TTO method allows us to directly estimate the utility assigned to a given alternative by a respondent.

We used a Bayesian approach with the following assumptions (presented in a non-technical manner). The utility of a state is equal to 1 (i.e. the utility of full health in the QALY model) minus the disutilities of the individual dimensions. The additive impact of dimensions on the utility is a standard assumption in valuation studies (Golicki et al., 2019).

The TTO task was extended as in Jakubczyk and Golicki (2020) to also elicit the ranges for a plausible utility value to measure the imprecision of preferences: after the TTO procedure had ended, the respondent was asked what answers were equally plausible as the one originally provided. In the Bayesian model, it was assumed that the length of the range is proportional to the disutility of a health state with a proportionality parameter \(\theta\). The proportionality reflects the fact that in valuation studies, heteroscedasticity prevails: the dispersion of utility is larger for more severe states, which possibly reflects less precise preferences (more severe states typically entail more dimensions being harmed, hence more attributes to consider). Modelling the imprecision of preferences with \(\theta\), instead of only using the elicited utility range, allowed the imprecision for all health states to be estimated, i.e. also those not present in TTO (but present in DCE).

To account for heterogeneity, the disutilities for dimensions and \(\theta\) were drawn at the individual-respondent level from distributions defined at the population level. The observed TTO utilities and bounds were assumed to be drawn from a normal distribution. Somewhat informative priors were used for the parameters of the model.

In the DCE part, we assumed that the probability of one option being chosen is determined by the difference in utilities via a Cauchy distribution (see Jakubczyk et al., 2018a). The conviction for a choice was modelled in the following way. We assumed the conviction depends on two elements: the difference in utilities and the total imprecision, defined as the sum of the length of the utility ranges for the two states being compared (DCE choice from triplets was omitted in the model). More precisely, the utility difference was scaled by dividing it by \(1+\nu \times \textit{total imprecision}\), where \(\nu\) is a parameter to be estimated. This scaled utility difference impacted the conviction via a logistic function of \(\kappa +\lambda \times \textit{scaled utility difference}\). The logistic function was rescaled to a \(-100\)–100 range to match the data collected.
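One literal reading of this description can be sketched as follows. This is a simplified illustration; the exact likelihood used in the Bayesian model may differ in details (e.g. the treatment of the choice's sign), and the parameter values passed in the test are arbitrary.

```python
import math

def expected_conviction(du: float, imprecision: float,
                        kappa: float, lam: float, nu: float) -> float:
    """Expected conviction for a pairwise choice, per the text's description.

    du:          utility difference between the two states
    imprecision: summed lengths of the two states' utility ranges
    kappa, lam, nu: model parameters (illustrative)
    """
    # Scale the utility difference by the total imprecision.
    scaled = du / (1 + nu * imprecision)
    # Logistic link, rescaled from (0, 1) to the (-100, 100) range
    # of the collected conviction data.
    logistic = 1 / (1 + math.exp(-(kappa + lam * scaled)))
    return 200 * logistic - 100
```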

The above three parameters, \(\nu\), \(\lambda\) and \(\kappa\), are central to the present paper. The interpretation is as follows. The value of \(\lambda >0\) implies that the reported conviction is associated with the difference in utilities of health states. For \(\nu >0\), we can infer that the conviction depends not only on the difference in utility but also on the precision with which the utility is perceived. The parameter \(\kappa\) is related to what degree of conviction is expected when the utility difference is zero. For the conviction to also be close to zero, strongly negative \(\kappa\) is required. Non-informative normal priors were used for the three parameters.

3 Results

Conviction levels differ both across respondents and across choice tasks, as shown in Table 2. The variation of the averages (as measured by the squared standard deviation, STD\(^2\)) within respondents is only slightly smaller than that within choice tasks: whilst individuals differed in their understanding of the scale (an individual effect), there were also important differences between questions captured by the conviction level. The variation of the averages within choice tasks accounted for only 22.1% of the overall variance of \(\mu\) over all subject–task pairs.
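The between-task share of variance can be sketched as follows, assuming (for illustration) a complete respondents-by-tasks matrix of signed convictions; the paper reports a share of 22.1%.

```python
import numpy as np

def between_task_share(mu: np.ndarray) -> float:
    """Share of the total variance of mu explained by per-task averages.

    mu is a (respondents x tasks) matrix; the layout is an assumption
    made for this sketch.
    """
    task_means = mu.mean(axis=0)          # average conviction per task
    return task_means.var() / mu.var()    # between-task / total variance
```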

Table 2 Conviction statistics across respondents and choice tasks

3.1 Properties H1–H5

Regarding H1, 57 of the 83 respondents (68.7%) satisfied it. On the aggregate level, \(\overline{\mu _5} = -64.5\le \overline{\min (\mu _2,\mu _6)}=-61.6\) (the bar indicates the average). Hence, H1 cannot be refuted using our data.

Regarding H2, there were 54 respondents satisfying \(\mu _1\times \mu _2>0\). Within this group, 30 subjects (55.6%) exhibited \(|\mu _3 |\le \min \left( |\mu _1 |,|\mu _2 |\right)\), thus yielding support for \(\mu\) reflecting indecisiveness due to choice difficulty, 18 subjects (33.3%) exhibited \(|\mu _3 |\ge \max \left( |\mu _1 |,|\mu _2 |\right)\), thus supporting the strength of preference interpretation of \(\mu\), and 6 subjects (11.1%) had \(|\mu _3 |\) in between these 2 bounds. We conclude that more subjects are consistent with \(\mu\) capturing indecisiveness due to choice difficulty rather than strength of preference.

Regarding H3, transitivity is violated if \(\mu _8\), \(\mu _9\) and \(\mu _{10}\) are all positive or all negative. This occurred for three respondents. The average of \(\min \left( |\mu _8 |,|\mu _9 |,|\mu _{10} |\right)\) for intransitive choices was almost halved in comparison to the average for transitive choices (23.3 compared to 42.9). However, the difference was not statistically significant (Wilcoxon test, p-value\(=0.23\)). H3 was also verified using tasks 1, 6 and 7. Transitivity is violated if \(\mu _1\), \(-\mu _6\) and \(-\mu _7\) are all positive or all negative. This occurred for ten respondents. The average of \(\min \left( |\mu _1 |,|\mu _6 |,|\mu _7 |\right)\) for intransitive choices was lower than for transitive ones (41.5 as compared with 53.9), again with no statistical significance (p-value\(=0.12\)). Overall, even though the conviction level for intransitive choices was found to be lower in our sample, the support for hypothesis H3 is moderate, due to the small number of intransitive cases.

Regarding H4, we first used pairwise tasks 8, 9, 10 and the triple task 15. The Condorcet winner is 11113 if \(\mu _8>0\) and \(\mu _{10}<0\), it is 22222 if \(\mu _8<0\) and \(\mu _9>0\), and it is 32211 if \(\mu _{10}>0\) and \(\mu _9<0\). The Condorcet winner was selected in choice task 15 (in agreement with Sen’s property \(\alpha\)) by \(68.7\%\) of respondents. Their average conviction amounted to 68.5, compared to 53.9 for those violating property \(\alpha\). The difference was statistically significant (p-value\(=0.01\)). We separately tested H4 using tasks 1, 6, 7 and 16. The Condorcet winner is 21111 if \(\mu _1>0\) and \(\mu _6>0\), it is 11121 if \(\mu _1<0\) and \(\mu _7<0\), and it is 11112 if \(\mu _6<0\) and \(\mu _7>0\). The Condorcet winner was selected in choice task 16 by \(59\%\) of the respondents. Their average conviction level amounted to 81 compared with 72.9 for those violating property \(\alpha\). The difference was not statistically significant (p-value\(=0.12\)). Overall, we find moderate support for hypothesis H4.

Regarding H5, the proportion of subjects whose chosen alternative from DCE choice tasks received the highest TTO valuation (the ordering implied by choice and valuation task was the same) varied between 66.3% and 81.9%. For each of the four DCE choice tasks, the average conviction level was substantially higher for consistent choices. The aggregate (over all four choice tasks) mean conviction level for consistent choices amounted to 62.4 (309 subject–task pairs), compared with 47.6 (106 subject–task pairs) for inconsistent ones. The difference was statistically significant (p-value\(<0.001\)). We conclude that conviction is capable of capturing the choice-valuation (DCE-TTO) reversals.

To sum up, we have found strong support for hypotheses H1 and H5, and moderate support for H3 and H4, yet we expect the support for the latter hypotheses to be stronger for larger samples. The analysis of H2 indicates that for the majority of subjects, the difficulty of choice (or choice complexity) is a key driver of the conviction level. In the following section, we try to elaborate on this further by discussing a model for choice conviction.

3.2 Choice complexity and the status quo effect

The estimation results of the model specified in equation (1) are presented in Table 3. All three variables are highly significant predictors, and the direction of association is intuitive: the attractiveness is positively correlated with the level of conviction, whilst the difficulty and complexity are negatively correlated.

We also present the standardised coefficients and incremental \(R^2\) from the hierarchical regression model, which allows us to infer the relative importance of independent variables. Complexity turns out to be significantly more important than the other two regressors.

Overall, the model captures around 7% of the variability in conviction levels. Accounting for individual subject effects would likely increase \(R^2\). Indeed, the variation of the averages within choice tasks accounted for only 22.1% of the overall variance of \(\mu\) over all subject–task pairs: see Table 2.

Table 3 Regression model explaining the conviction in discrete choice experiment data, separately for original and standardised variables

3.3 Test–retest prediction

The average first-round conviction amounted to 76.5 when the choice in the second round was identical, and to 58 otherwise (p-value \(<0.001\)). For the second-round conviction, the corresponding values were 78.4 and 62.3 (p-value \(<0.001\)). The median conviction (across both rounds) amounted to 80. The proportion of identical choices amounted to 68.1% when the first-round conviction was below or at the median, and to 89.5% otherwise (p-value \(<0.001\)). Using the second-round conviction, the corresponding proportions were 69.8% and 87.4% (p-value \(<0.001\)). A test–retest reliability of about 75% is typical of other studies (e.g. Gamper et al., 2018).

3.4 Linking DCE and TTO

In Table 4, we present the estimation results (medians, and the 2.5th and 97.5th percentiles of the posterior distribution). Regarding the mean disutilities of dimensions, surprisingly large values were obtained. The average relative imprecision in TTO tasks (\(\theta\)) was estimated at around 21% of the disutility value. The three parameters central to the present paper, i.e. \(\lambda\), \(\nu\) and \(\kappa\), were all significantly positive, showing that (1) the disutility difference in DCE is associated with conviction, (2) the conviction also depends on the precision of the utility assessment, and (3) even when there is no difference in utility, respondents will claim some degree of conviction for their choice.

Table 4 Estimation results for the Bayesian model linking discrete choice experiment and time trade-off (TTO) data

4 Discussion

In this paper, we studied the properties of self-reported conviction for choices in DCE. The analysis comprised three areas: the properties of the measure in relation to the choice task, the association with test–retest choice stability, and the behaviour of conviction in a joint model linking DCE and TTO data.

First, we found that the key determinants of the conviction level for binary choice in our data are the task complexity (measured by the sum of absolute coordinate-wise differences between choice alternatives), the relative attractiveness of the choice alternatives (measured by the sum of coordinate-wise differences between the two states), and the status quo effect (measured by the sum of the Manhattan distances from each of the alternative states to the current state). The fulfilment of consistency properties, such as Pareto efficiency (dominance), the absence of preference cycles, or consistency across measurements, can be related to the conviction measure. Nevertheless, it is important to ensure that such comparisons are not confounded by the key determinants of the conviction level outlined above.
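The three determinants listed above can be computed directly from the EQ-5D-3L level vectors. In the sketch below, states are assumed to be coded as 5-tuples of levels 1–3, and the sign convention for relative attractiveness (first alternative minus second) is our assumption for illustration only.

```python
def choice_task_features(a, b, current):
    """Feature sketch for a binary EQ-5D-3L choice task.
    a, b, current: tuples of 5 levels (1-3), one per dimension."""
    # task complexity: sum of absolute coordinate-wise differences
    complexity = sum(abs(x - y) for x, y in zip(a, b))
    # relative attractiveness: signed sum of coordinate-wise differences (a - b)
    attractiveness = sum(x - y for x, y in zip(a, b))
    # status quo effect: Manhattan distances from each alternative to the current state
    status_quo = (sum(abs(x - c) for x, c in zip(a, current))
                  + sum(abs(x - c) for x, c in zip(b, current)))
    return complexity, attractiveness, status_quo
```

For example, comparing states (1,2,3,1,1) and (2,1,1,1,1) from the full-health status quo (1,1,1,1,1) gives a complexity of 4, a signed attractiveness of 2, and a status quo distance of 4.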

Second, we showed that the degree of conviction is associated with the stability of choices when the same choice situation is repeated after some time. Third, we showed that the conviction levels reported in DCE are linked to the utility difference and utility imprecision, both measured outside of DCE.

Our results show that choices made with greater conviction are more reliable (for instance, they are more likely to recur when a choice situation is repeated) and should be given more weight in the preference modelling. This weighting could be operationalised by assuming the error term is a function of the conviction, with the parameters specifying the function being subject to estimation. More extensive studies than ours would probably be needed to see whether (and to what extent) such weighting would enhance the predictive power. Our findings also suggest that some choice patterns that violate standard axioms of rationality (e.g. intransitive revealed preferences) are more prevalent when the strength of the reported conviction is low. We believe that this observation can be used in data quality control in the following sense. A violation of transitivity with low conviction may simply mean that a respondent has no clear preference between the offered alternatives. Hence, there is no reason to discredit other answers by this respondent. Meanwhile, a violation with high conviction may either mean that the respondent’s preferences are irrational (and consequently difficult to model using utility representation) or that the respondent did not focus adequately on the task. As a result, other responses may seem less credible as well.
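One possible operationalisation of such weighting is a heteroskedastic logit in which the noise scale shrinks as conviction grows. The sketch below is purely illustrative: the functional form \(\sigma=\exp(\alpha-\beta\,\mu/100)\) and all variable names are our assumptions, not the model estimated in this paper.

```python
import numpy as np

def loglik(params, du, choice, conviction):
    """Heteroskedastic logit log-likelihood: larger conviction implies a
    smaller noise scale, i.e. choices more tightly tied to the utility
    difference.  params = (alpha, beta); du = utility difference favouring
    option A; choice = 1 if A was chosen; conviction on a 0-100 scale."""
    alpha, beta = params
    scale = np.exp(alpha - beta * conviction / 100.0)  # per-response noise scale
    p_a = 1.0 / (1.0 + np.exp(-du / scale))            # probability of choosing A
    p = np.where(choice == 1, p_a, 1.0 - p_a)
    return np.sum(np.log(np.clip(p, 1e-12, None)))
```

Under this sketch, \(\beta\) would be estimated from the data; \(\beta>0\) means that high-conviction choices are treated as more informative about the underlying utilities.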

We found that positive conviction is also reported when the estimated utility difference between the alternatives is zero (see the \(\kappa\) parameter in our model). Perhaps this reflects respondents' post hoc rationalisation of their choices, i.e. finding reasons for a choice after it has been made. Alternatively, the preferences revealed in the DCE part of the experiment may differ from those revealed in a TTO context. In DCE, two imperfect health states are compared, and perhaps respondents resort to heuristics beyond the standard QALY model to make their choice. In TTO, a comparison against full health is used, and perhaps in this case no (or different) heuristics are invoked.

Our treatment of conviction in this paper is based on the idea of an informal cardinal measure that provides additional information on the robustness and meaningfulness of the binary choice data. We admit that the conviction measure, as operationalised in the present study, lacks firm theoretical foundations. Only the extreme upper anchor, 100, is assigned a clear interpretation in the interview: "I surely prefer the chosen option" to other options in the menu. A conviction level of 0 may mean either (1) that the subject is indifferent (preferentially equivalent) between the chosen option and some of the unchosen ones from the menu, or (2) that the subject refuses or is unable to tell which one is superior (note that we do not allow subjects to abstain from choice). Part A of the Appendix clarifies our informal, empirical discussion of indecisiveness vs. indifference using the formal notion of preference.

The ambiguity in interpreting the values of \(\mu\) is even larger for intermediate values. There is no clear meaning behind a conviction level of, say, 60. Respondents may use the available scale differently, and different parts of the scale may also be used differently (i.e. an increase in conviction from 10 to 20 is in no sense equivalent to an increase from 50 to 60). Nevertheless, such theoretically unfounded numerical scales are commonly used in preference research. In the health context, EQ VAS, a visual analogue scale allowing 0–100 values, is often used to measure health-related quality of life or to elicit preferences (for instance, Webb et al., 2020, used EQ VAS to anchor the utilities estimated in DCE onto the QALY scale).

Admittedly, the very notion of conviction can be interpreted in various ways. For instance, some respondents may understand it as the difference in attractiveness between the offered alternatives or as absolute attractiveness (the strength of preference), whilst others may focus on the ease or difficulty with which the choice was made: such indecisiveness might stem from task complexity, difficulty in imagining the states, lack of sufficient information, etc. Our findings suggest that sufficiently many respondents fit the latter interpretation: the estimated imprecision with which the utilities of health states were reported in TTO also mattered for the strength of conviction (on top of the utility difference alone). Of course, measuring difficulty on a 0–100 scale has no theoretical foundation either. Qualitative studies may help identify the possible interpretations, their frequency, and their link to the choices made.

We see the following, somewhat bizarre, way in which measuring the conviction could be operationalised in further studies to establish cardinal properties of the scale. The respondent could be asked: 'Imagine we wipe out your short-term memory and ask you to choose again. What do you think the probability is that you would make the same choice?' Then the answer of 50% would correspond to a conviction of 0 (i.e. practically indifference), and the answer of 100% would correspond to a conviction of 100 (i.e. certainty). In our interpretation of this question, answers \(<50\%\) should not occur (if they do, qualitative research would probably be needed). Admittedly, the answers to this question do not correspond to any choice, but at least they have a clear interpretation (rather close to the test–retest context studied in the present paper) and make changes along the scale comparable.
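Under the additional assumption of a linear mapping between the two anchors (the question itself fixes only 50% \(\to\) 0 and 100% \(\to\) 100), the rescaling could be sketched as follows; the function name and the handling of sub-50% answers are our illustrative choices.

```python
def conviction_from_repeat_probability(p_percent):
    """Map a stated repeat-choice probability (50-100%) onto the 0-100
    conviction scale.  Linearity between the anchors is an assumption;
    the source fixes only the endpoints 50 -> 0 and 100 -> 100."""
    if p_percent < 50:
        # answers below 50% are not expected; flag them for qualitative review
        raise ValueError("answers below 50% are not expected; flag for review")
    return 2.0 * (p_percent - 50.0)
```

For example, a stated 60% repeat probability would map to a conviction of 20 under this linear assumption.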

A relatively small dataset is undoubtedly a limitation of our work. In our analysis, we aimed to see how the two types of data (DCE and TTO) are related. Whilst DCE data can be collected via unsupervised surveys (hence relatively cheaply), conducting TTO-based surveys requires interviewers. The sample size reflects the logistic and budgetary constraints. Nevertheless, even in the current sample we managed to establish statistically significant results. A larger dataset would allow a more direct demonstration of how to employ the conviction measure for out-of-sample predictions. However, predicting DCE choices for health states has been shown to be no easy task, requiring departures from the standard QALY model; see Jakubczyk et al. (2018a) for a comparison of various approaches in a predictive competition. Consequently, a rather large dataset might be needed, probably limited specifically to DCE data.

5 Conclusion

People are capable of reporting their own degree of conviction underlying a choice in DCE in a meaningful way: the collected levels of conviction change in an intuitive way, depending on whether standard properties of choice hold (for instance, transitivity) or whether preferences remain stable over time. Self-reported convictions can be used when modelling stated choice data in order to elicit preferences. The strength of such convictions is not only related to the perceived difference in utilities, but also depends on the imprecision of preferences.