Key Points for Decision Makers

This study found that error variance was reduced when an animated storyline was used to inform respondents about the disease area and the intervention before they completed a discrete-choice experiment.

As reduced error variance is related to choice consistency, the results suggest respondents were better able to complete the elicitation tasks, although the format of the survey materials did not affect the stated preferences themselves.

Having engaged and informed respondents is beneficial in all stated-preference studies, but the advantages may be particularly pronounced in research relating to complex healthcare interventions, in harder-to-reach populations (e.g., those with lower literacy), or when the research question requires a complex experiment.

1 Introduction

Stated-preference methods are a group of approaches used to elicit and quantify individuals’ preferences for health states, goods, or services [1, 2]. They are particularly popular in areas where markets are imperfect and consumer behavior cannot easily be observed (such as the environment or healthcare) and for forecasting demand for new technologies [3, 4]. Most stated-preference methods involve rating, ranking, or choosing between hypothetical options presented as questions in a survey [5]. In healthcare, stated-preference methods such as time trade-off, standard gamble, and contingent valuation have been used to understand people’s time preferences, risk tolerance, and willingness to pay, respectively. However, the hypothetical nature of stated-preference methods has attracted some criticism [6, 7].

Discrete-choice experiments (DCEs) are an increasingly popular type of stated-preference method [4]. In a DCE, respondents select their preferred alternative from a set in a series of hypothetical choices in a survey. Respondents are expected to make trade-offs between different attributes of the good or service to make their decision. In healthcare, where market data rarely exist, the quantification of preferences through DCEs allows decision makers to understand which aspects of an intervention provide the most benefit. Interest has also increased in using the results of DCEs to inform regulatory decisions [8, 9]. However, for DCEs to be used in decision making, they must be robust and produce data that minimize bias from either their hypothetical nature or other sources [10].

To ensure the hypothetical choices reflect real-life behavior, respondents to stated-preference surveys usually receive information and explanations in the form of “training materials” before completing the valuation tasks. The importance of training materials has received little attention in the DCE literature; they are rarely described or presented in published articles or made available through online appendices. Although guidelines exist for general best practice [11, 12] and, more specifically, for the identification of attributes, experimental design, and econometric analysis [13–16], guidance on how to design and frame the survey training materials presented before the choice sets is lacking. However, the psychology literature on choice making in health behavior emphasizes the importance of individuals’ “capability” [17], defined as the psychological capacity to engage in the necessary thought processes to make a choice or change behavior. In the context of a healthcare DCE, this could relate to respondents’ understanding of the disease and the treatment forming the basis of the valuation exercise, their ability to retain the information presented, and their ability to make decisions or choices based on this information, all of which can, and should, be addressed in the training materials at the start of a DCE.

Communicating large volumes of complex information is notoriously difficult, as people struggle to retain the information or to stay engaged enough to read all of it [18]. Louviere [19] highlighted “information acceleration methods,” a concept developed in the 1990s, as a way to rapidly inform individuals about new technologies and their associated benefits and harms. The information acceleration literature was developed in marketing and management to improve strategic management decisions involving a new alternative not currently in the market [20]. Much of this literature was produced in the 1990s and focuses on visual materials using videos rather than interactive materials [21]. More recently, “serious games,” defined as “a game in which education (in its various forms) is the primary goal, rather than entertainment” [22] (p. 17), have been developed to help train and/or motivate individuals to learn about new, often complex and abstract, concepts [18]. The rationale for using a serious game is to achieve better learning outcomes by immersing the participant in an educational and enjoyable environment that is intrinsically motivating, through the use of interactive technology. Serious games have been used in a variety of applications, including educating individuals about genetics and improving mathematical achievement [23, 24]. A published systematic review and meta-analysis found that serious games improved learning compared with conventional text-based approaches [18]. It has been argued that, in addition to supporting learning, serious games can keep participants interested and engaged in a task [25], improving completion rates and the quality of data collected.

Recent systematic reviews of healthcare DCEs have shown a large increase in the number of these surveys administered online: between 2001 and 2008, only 11% of DCEs were web surveys, whereas between 2013 and 2017, some 57% of healthcare DCEs were online [4, 26]. This digitization of DCE surveys provides scope to incorporate animated or interactive training materials. However, the extent to which training materials affect respondents’ choices or heuristics in DCEs remains unclear. This study aimed to investigate whether, and how, the format in which training materials are presented influences the choice data collected in an example DCE. Preferences for a new prescribing algorithm to guide the treatment of rheumatoid arthritis (RA) with a first-line biologic (a “biologic calculator”) were used as the case study, an example of a complex topic requiring substantive training materials.

2 Methods

A DCE designed to elicit preferences for a “biologic calculator” compared with conventional prescribing practice was used as the basis for this study. Respondents were randomized to complete the survey with training materials presented either as plain text or as an animated storyline. Approval for the study was obtained from The University of Manchester’s Research Ethics Committee.

2.1 Discrete-Choice Experiment Design

The DCE was designed and is reported in line with published recommendations [11, 12]. In brief, an iterative process involving clinical experts and patient representatives, supported by systematic reviews, was used to identify the relevant attributes (five) and their plausible levels (four per attribute), described in Appendix A. Extensive piloting involving qualitative and quantitative methods resulted in the choice set shown in Fig. 1, which presented the choice question in an unlabelled format with two alternatives (versions of the biologic calculator) and an opt out (representing current prescribing practice). The choice sets (four blocks of five) were selected using the software Ngene to minimize D-error [27]. An internal validity check for monotonic preferences was added, so each respondent completed six choice sets.
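For context, the D-error criterion typically minimized by such software is (a sketch of the standard definition rather than the study’s specific Ngene specification):

$$D\text{-error} = \left[ \det \Omega \left( X,\beta \right) \right]^{1/K},$$

where \(\Omega(X,\beta)\) is the asymptotic variance–covariance matrix of the parameter estimates implied by the design \(X\) and the prior parameter values \(\beta\), and \(K\) is the number of parameters; designs with a smaller D-error yield more precise estimates for a given sample size.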

Fig. 1 Example choice set. NHS national health service

2.2 The Survey

The survey was uploaded online using Sawtooth SSiWeb [28]. The final survey comprised the training materials explaining the purpose of the DCE, followed by the six choice-set questions and questions about the individual (including sociodemographics and questions to ascertain their level of understanding).

2.3 Training Materials

Two formats for presenting the training materials were developed: plain text and an animated storyline. The content was identical in both formats and was developed in consultation with three clinical experts in stratified medicine.

The plain-text version of the training materials (see Appendix B) was presented on 15 separate webpages to avoid the need for scrolling on a standard computer screen. The text was supported by icon arrays illustrating probabilities. Respondents were required to click through and read each page of text.

The animated storyline was developed with assistance from a company (MindBytes, http://www.mindbytes.be [29]) that applied its theory-driven, evidence-based approach to developing interactive educational tools such as serious games. As advised by Reeve [30], a storyline was created and a narrative developed for an avatar (a figure representing a person). The framework aims to ensure that the narratives and animations (game mechanics) enhance the educational objectives without creating bias. Easy-to-follow stories are proposed to be useful motivators that keep respondents engaged with a subject and help them process and make sense of information [30]. Although personification is important [30], evidence also suggests that people identify most with avatars like themselves [31]; to minimize this bias, a green, genderless, ageless stick-figure avatar with a gender-neutral name (Alex) was designed. The setting was also dynamic, with different backgrounds using archetypical visuals and a simple design to indicate the location (e.g., “in hospital” or “at home”), allowing these concepts to be conveyed while avoiding both information overload and bias [32, 33]. A “linear traditional narrative” [34], the most simplistic structure, was used, in which the central character (Alex) was followed along a pathway that started with a description of RA. The story then explained how first-line treatments may fail, requiring a switch to a biologic, and that the choice of biologic and the relevant dose is made by a clinician, who may decide to use a biologic calculator to guide this decision. The last elements of the story explained the attributes that describe the biologic calculator and that some trade-offs must be made when choosing a prescribing approach. Each attribute was explained in the storyline with the help of graphics and visuals using the learning mechanic–game mechanic approach [35], ensuring that these visuals explicitly addressed the learning objectives without introducing potentially bias-inducing elements.

2.4 Background Questions

In the last section of the survey, respondents were asked to complete a series of background questions about themselves, including quality of life (the EuroQol Five-Dimension, Five-Level instrument [EQ-5D-5L] [36]) and sociodemographics. Although “quiz” questions were considered as a measure of respondents’ understanding of the training materials, the authors decided against this approach out of concern that some respondents might exit the survey if they could not provide a correct answer or felt they were being tested or examined. Such dropout could induce a selection bias, whereby only informed respondents (regardless of the training materials) would proceed to the choice sets. Instead, respondents were asked: “On a scale of 1–5, how confident are you that you would make the same choices if faced with the situations in real life?” and “On a scale of 1–5, how easy or difficult did you find making choices between the alternatives?” These questions were included to understand whether the training materials made any difference to respondents’ choice-making ability. Self-reported attribute non-attendance (ANA) was also collected. Respondents were first presented with a screening question: “Did you find yourself making choices based on one or two characteristics rather than the option as a whole?”; those who answered “yes” then saw five follow-up questions regarding their attention to each of the attributes.

2.5 Study Sample

The relevant study population for this survey was defined as members of the public aged ≥ 18 years. Respondents were recruited through an internet panel provider, ResearchNow®, and were sent a link to the online survey; individuals were randomly allocated to one of the two training-material formats (plain text or animated storyline) upon clicking the link to enter the survey.

2.6 Data Analysis

Descriptive statistics for the answers to the background questions were produced and used to summarize the respondents who completed the DCE. In addition, a logistic regression model was estimated to confirm that randomization to the training materials was successful, by testing whether any sociodemographic variables predicted the survey version received (the dependent variable).
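For illustration, a minimal sketch of such a randomization check in Python; the data file, variable names (`animated`, `age`, `female`, `degree_educated`, `eq5d_index`), and use of statsmodels are assumptions for the example rather than details of the study’s analysis.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical respondent-level data: one row per respondent.
# 'animated' = 1 if the respondent received the animated storyline, 0 if plain text.
df = pd.read_csv("respondents.csv")

covariates = ["age", "female", "degree_educated", "eq5d_index"]  # assumed variable names
X = sm.add_constant(df[covariates])
y = df["animated"]

# If no covariate significantly predicts the version received,
# this supports the conclusion that randomization was successful.
fit = sm.Logit(y, X).fit()
print(fit.summary())
```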

Choice data were analysed within a random utility maximization framework [37], in which individuals are assumed to choose the alternative that provides them with the most utility. An individual’s (n) utility (\(U_{nj}\)) for an alternative (j) comprises an observable component (\(V_{nj}\)) and a random component (\(\varepsilon_{nj}\)). In this study, panel data methods (fixed-effects conditional logit models) were used to account for the same individual making multiple choices. These models assume that the random component of utility follows a type I extreme value (Gumbel) distribution [38].

In this DCE, all attributes were continuous variables (time, cost, probabilities) and therefore entered the utility function as separate linear terms in the preliminary analysis (Eq. 1):

$$\begin{aligned} U_{njt} &= \beta_{0t} + \beta_{1t} {\text{Delay}}_{nj} + \beta_{2t} {\text{PPV}}_{nj} + \beta_{3t} {\text{NPV}}_{nj}\\&\quad + \beta_{4t} {\text{Risk}}_{nj} + \beta_{5t} {\text{Cost}}_{nj} + \varepsilon_{njt} , \end{aligned}$$
(1)

where \(\beta_{1}\)–\(\beta_{5}\) are the parameters associated with each of the attributes for each version of the training materials, t. The functional form of preferences was also investigated by introducing squared terms for each variable [39]. The term \(\beta_{0t}\) is an alternative-specific constant (ASC) for the opt out, which captures differences in the mean of the distribution of the unobserved effects in the random component, \(\varepsilon_{njt}\), between the opt out (conventional approach) and the other alternatives (biologic calculators). Equation (1) was estimated separately for respondents who received the plain-text version and those who received the animated storyline.
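A minimal sketch of how the conditional logit likelihood implied by Eq. (1) could be coded in Python; the long-format data layout and column names are hypothetical, and the sketch ignores the panel structure (standard errors would need to be clustered by respondent), so it illustrates the model rather than reproducing the study’s estimation.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.special import logsumexp

# Hypothetical long-format choice data: one row per alternative per choice set.
# 'chosen' = 1 for the selected alternative; 'optout' = 1 for the opt-out alternative.
df = pd.read_csv("choices_long.csv")
attrs = ["delay", "ppv", "npv", "risk", "cost"]   # assumed column names
X = df[["optout"] + attrs].to_numpy()             # the opt-out dummy plays the role of the ASC
y = df["chosen"].to_numpy()
task = df["task_id"].to_numpy()                   # identifies each choice set

def neg_loglik(beta):
    v = pd.Series(X @ beta)                       # deterministic utility V_njt
    # log-sum-exp over the alternatives in each choice set gives the logit denominator
    denom = v.groupby(task).transform(logsumexp)
    return -np.sum(y * (v - denom))

res = minimize(neg_loglik, x0=np.zeros(X.shape[1]), method="BFGS")
print(dict(zip(["asc_optout"] + attrs, np.round(res.x, 4))))
```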

In the standard conditional logit model, the estimated coefficients reflect both preference weights and the variance of the unobservable element of utility (the variance of \(\varepsilon_{nj}\)). This means that differences in estimated coefficients may be due to differences in preferences or differences in the relative variance of the error term (differences in “scale”). The variance of the error term can be interpreted as a measure of the randomness (or consistency) in choices. In this example, the consistency in choices may also depend on the training material received. To understand whether the choice behavior of respondents who received the plain-text version and those who received the animated storyline differed, a heteroskedastic conditional logit (HCL) model [40] was also estimated using the pooled data from both groups:

$$\begin{aligned} U_{nj}& = \lambda_{n} \beta_{0} + \lambda_{n} \beta_{1} {\text{Delay}}_{nj} + \lambda_{n} \beta_{2} {\text{PPV}}_{nj} + \lambda_{n} \beta_{3} {\text{NPV}}_{nj}\\&\quad + \lambda_{n} \beta_{4} {\text{Risk}}_{nj} + \lambda_{n} \beta_{5} {\text{Cost}}_{nj} + \varepsilon_{nj} . \end{aligned}$$
(2)

In the HCL model, the scale parameter, \(\lambda\), which is inversely related to the error variance, was permitted to vary by the training materials received and is modelled as follows:

$$\lambda_{n} = \exp \left( {\gamma {\text{TEXT}}_{n} } \right),$$
(3)

where \({\text{TEXT}}_{n}\) is equal to one when respondent n received the plain-text materials. Testing the significance of the parameter \(\gamma\) is therefore a test of whether the training materials affected choice consistency (the scale parameter, \(\lambda\)) [41].
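A minimal sketch of how the scale term in Eqs. (2) and (3) enters the likelihood; as before, the data layout and column names are hypothetical, and this is an illustration of the model rather than the study’s own code.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.special import logsumexp

df = pd.read_csv("choices_long.csv")                        # hypothetical long-format data
cols = ["optout", "delay", "ppv", "npv", "risk", "cost"]    # assumed column names
X = df[cols].to_numpy()
y = df["chosen"].to_numpy()
task = df["task_id"].to_numpy()
text = df["plain_text"].to_numpy()                          # 1 if the respondent saw plain text

def neg_loglik(theta):
    beta, gamma = theta[:-1], theta[-1]
    lam = np.exp(gamma * text)            # Eq. (3): scale fixed at 1 for the animated group
    v = pd.Series(lam * (X @ beta))       # Eq. (2): scaled deterministic utility
    denom = v.groupby(task).transform(logsumexp)
    return -np.sum(y * (v - denom))

res = minimize(neg_loglik, x0=np.zeros(X.shape[1] + 1), method="BFGS")
gamma_hat = res.x[-1]
print("scale term:", round(gamma_hat, 3),
      "-> scale for plain text:", round(np.exp(gamma_hat), 3))
```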

The results of the conditional logit models were used to estimate marginal rates of substitution, which reveal the amount of one attribute (e.g., time to starting treatment) individuals are, on average, willing to exchange for another (e.g., predictive value). The associated confidence intervals (CIs) for the marginal rates of substitution were estimated using the delta method [42]. Even if the HCL model suggests significant scale heterogeneity, marginal rates of substitution (ratios of coefficients) are unaffected by heteroskedasticity in the error term.
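As a worked illustration of a marginal rate of substitution and its delta-method confidence interval, using placeholder numbers rather than the study’s estimates:

```python
import numpy as np

# Placeholder estimates (illustrative only): coefficient on delay (weeks) and on PPV
# (percentage points), plus the relevant entries of the variance-covariance matrix.
b_delay, b_ppv = -0.05, 0.03
var_delay, var_ppv, cov_dp = 1.0e-4, 4.0e-5, 1.0e-6

# MRS: additional weeks of delay respondents would, on average, accept in exchange
# for a one percentage-point improvement in PPV.
mrs = -b_ppv / b_delay

# Delta method: variance of g(b) = -b_ppv / b_delay is grad' * Cov * grad.
grad = np.array([b_ppv / b_delay**2, -1.0 / b_delay])   # d(g)/d(b_delay), d(g)/d(b_ppv)
cov = np.array([[var_delay, cov_dp], [cov_dp, var_ppv]])
se = np.sqrt(grad @ cov @ grad)
print(f"MRS = {mrs:.2f} weeks, 95% CI = [{mrs - 1.96*se:.2f}, {mrs + 1.96*se:.2f}]")
```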

3 Results

In total, 300 members of the public completed the DCE: 158 received the training materials as plain text and 142 received the animated storyline. Table 1 shows the sample characteristics for a number of key variables for all respondents and the two subsamples. Appendix C shows the results of a logistic regression model suggesting that no observable characteristics predicted the training materials respondents received, indicating that randomization worked. A total of 37 respondents clicked on the link and consented to the survey but did not complete the questions, and 28 participants left during the training materials; most of these [n = 23 (82.1%)] had been randomized to receive the animated storyline.

Table 1 Summary of study sample characteristics

In this study, 16.7% (n = 50) of the total sample failed the internal test for monotonicity. Slightly fewer failures were observed for respondents who received the animated storyline [n = 22 (15.5%)] than for those who received the plain text [n = 28 (17.7%)], but the difference was not statistically significant (p = 0.605). All respondents, whether they failed or passed the monotonicity test, were included in the final analyses of the choice data. Appendix D contains the results of a split sample analysis for each version of training materials with (1) respondents who “passed” the internal validity test and (2) all respondents.

Table 1 also shows the responses to self-reported task difficulty and confidence in choices. When asked to rate their confidence on a scale of one to five, respondents who received the plain-text version of the training materials reported a lower average confidence score (mean 2.50) than those who received the animated storyline (mean 2.59), although this difference was not statistically significant (p = 0.404). Similarly, when asked to self-report their ease of choice making on a scale of one to five, those who received the plain-text version reported a lower average ease score (mean 2.68) than those who received the animated storyline, although, again, this difference was not statistically significant (p = 0.353).

Respondents who were randomized to the animated storyline spent an average of 50.1 s (95% CI 41.3–59.0) reading the training materials. Appendix E shows kernel density estimates for average time spent reading and clicking through the training materials for respondents who received the animated storyline. The average (mean) time spent on each choice set in the DCE, regardless of randomization, was 2.82 s (95% CI 1.65–3.99). Respondents randomized to the animated storyline completed the choice sets slightly more quickly (2.08 s; 95% CI 1.80–2.37) than those who received plain-text training materials (2.79 s; 95% CI 1.97–3.61), but this difference was not statistically significant (p = 0.124). Appendix F shows kernel density estimates for average time spent completing a choice set by training materials received.

Figure 2 shows the difference in rates of self-reported ANA for each attribute. For all attributes except cost, respondents who received the plain-text training materials were more likely to report ANA. Only 15 people (19.0%) who received the animated storyline, compared with 26 (42.9%) who received the plain-text materials, reported ANA to the risk attribute.

Fig. 2 Proportion of respondents self-reporting attribute non-attendance for each attribute by training materials received. NPV negative predictive value, PPV positive predictive value

3.1 Results of the Discrete-Choice Models

The estimated coefficients for all attributes had signs consistent with a priori expectations about the direction of their impact on preferences (Table 2). Respondents disliked increases in the delay to the start of treatment and in the risk of infections but liked increases in positive predictive values (PPVs), negative predictive values (NPVs), and cost savings to the healthcare system. The ASC was large, negative, and statistically significant, suggesting that individuals derived utility from the “biologic calculator” over and above that derived from its attributes. Alternative specifications of the utility function were investigated by including quadratic terms for each attribute, but no quadratic terms were statistically significant (p > 0.01).

Table 2 Pooled and split-sample estimates of discrete-choice data using different model specifications

The presence of scale heterogeneity was confirmed by the estimated HCL model (Table 2). The estimated scale term of − 0.216 was statistically significant (p < 0.01), suggesting that the error variance differed between the two groups. The negative sign indicates that the scale parameter was smaller for the sample who received plain text, implying that this group had a larger error variance and therefore made, on average, less consistent choices than those who received the animated storyline. The corresponding scale parameter (the exponential of the scale term of the HCL model) was estimated at 0.805.

To test whether the format of the training materials affected the choices and estimated preferences from the DCE, a likelihood ratio test was used to compare the conditional logit models estimated on each sample with the HCL model estimated on the pooled data. The test suggested that, conditional on there being differences in scale, the hypothesis of preference homogeneity could not be rejected (p = 0.282), meaning there were no statistically significant differences in average preferences between the two groups. This is also illustrated by the marginal rates of substitution presented in Table 3. Further analysis of the sample who received the animated storyline found that respondents who spent longer reading the materials also made more consistent choices (see Appendix G).
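For reference, this comparison takes the standard form of a likelihood-ratio test of pooling (a sketch of the generic statistic, not a restatement of the study’s exact calculation):

$$\text{LR} = - 2\left[ {LL}_{\text{pooled HCL}} - \left( {LL}_{\text{text}} + {LL}_{\text{animation}} \right) \right],$$

which is compared with a \(\chi^{2}\) distribution with degrees of freedom equal to the difference in the number of estimated parameters between the two separate models combined and the pooled HCL model.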

Table 3 Marginal rates of substitution

4 Discussion

The aim of this study was to investigate whether, and how, the format of training information affected respondents’ ability to complete DCE choice sets, rather than whether training materials could be used to prime or bias respondents. The preferences estimated from the choice data collected in the DCE were, on average, unaffected by the format of the training materials. Importantly, there were statistically significant differences in error variance between the respondents randomized to receive plain-text training materials and those randomized to the animated storyline. In other words, the choices of respondents who received the animated storyline were less affected by unobservable factors (such as omitted attributes) and therefore appeared statistically significantly less random than those of respondents who received the plain-text training materials. A potential interpretation of this result is that respondents were better informed when the information in the training materials was conveyed as an animated storyline.

The observed difference in error variance identified in this research has implications for researchers conducting DCEs in complex areas that require many attributes, alternatives, or choice sets or those with a small population of interest. Reduced error variance could be a signal of reduced cognitive burden, implying that those who received the interactive training materials could answer more, or more difficult, choice sets.

The rate of self-reported ANA was lower for all attributes except cost among respondents who received the animated storyline. The difference was most notable for the more complex attributes involving an element of risk, such as NPV and PPV. A wealth of literature suggests that, even in well-educated populations, people find risks and probabilities hard to comprehend [43–45]. Evidence also suggests that risk is not always communicated effectively in healthcare DCEs [46]. The finding that fewer respondents reported ANA could suggest an increased understanding of these attributes when the training materials were presented as an animated storyline.

This study focused on training materials that were used to inform a DCE, but the findings may also be relevant to researchers using other stated-preference methods (such as time trade-off, standard gamble, and contingent valuation) in either health or non-health settings, such as studies eliciting preferences for environmental goods or services [47]. Stated-preference surveys are increasingly digitized [4], suggesting that scope exists for interactive or more engaging materials. Using an animated storyline or a more sophisticated serious game may be useful when eliciting preferences for complex interventions or when current practice is difficult to explain. A recent systematic review of DCEs showed a rise in the number of studies conducted in lower-income or developing countries [4], where literacy rates might be lower or the subject matter less familiar to respondents, particularly if access to healthcare is low. Furthermore, researchers are also using stated-preference methods to elicit utilities for health states from challenging samples such as children [48] or those with conditions related to cognitive impairment [49]. Using interactive materials that do not overwhelm the respondent with text but also do not change preferences may be a way to improve survey respondents’ understanding and thus the confidence of researchers and policy makers in the derived valuations. This study adds to the growing body of evidence [50] on the value of using this theory-driven, evidence-based approach to developing health-related educational applications that are able to realize the desired outcomes.

Researchers have also used other methods to reduce hypothetical bias in healthcare DCEs, including “cheap talk” and “time to think.” In “cheap talk,” respondents are led through a script explaining hypothetical bias and its consequences in economic valuation [51]. In “time to think,” respondents are encouraged to deliberate before stating their choices [52]. The ability of “cheap talk” and “time to think” methods to reduce hypothetical bias is debatable [53–58], but these approaches could be used alongside serious games to improve the validity of the elicited stated preferences.

This study represents a preliminary investigation into the potential influence of the format of training materials on respondents’ choices and response efficiency. Future research may wish to consider more complex econometric models to understand how personality traits or attitudes moderate the influence of training materials. However, it has been noted that considering preferences and attitudes simultaneously introduces an endogeneity issue: because both are latent, they may be jointly correlated with unobservable factors [59]. Researchers employing serious games or interactive survey materials may also wish to consider allowing respondents to choose the format of the information they receive.

This study used an online panel to recruit members of the public to complete the DCE. This sampling approach meant it was not feasible to collect qualitative insights into respondents’ views about the format of the training materials [60, 61]. Given the observed dropout of respondents randomized to receive the animated storyline, it is possible that the animated storyline acted as a filter that removed less serious or inattentive respondents. If only more attentive respondents watched the storyline materials, the difference in error variance could be attributable to sample selection rather than to improvements in communication. The mechanism by which the animated storyline improved choice consistency therefore requires further research. Future work may wish to use an alternative recruitment strategy that enables the collection of qualitative data to illuminate why an animated storyline specifically, or serious games more generally, appear to influence response efficiency but not observed preferences.

The largest limitation of this study relates to the sample size, which limited our ability to understand two key aspects: self-reported difficulty/confidence and speed of completion. Although differences in self-reported difficulty/confidence and in failure of the internal validity test were identified, these differences were not statistically significant in our sample. Respondents who received the animated storyline also answered the choice questions slightly more quickly than those who received the plain text, but this difference was not statistically significant, which may also be an artefact of the sample size. It should also be noted that response times were recorded automatically using the page timer in the Sawtooth software and may reflect differences in browsers or computing power; they should therefore not be compared with those of other studies using different recording methods. As effect sizes were unknown in advance of the research, no power calculation was conducted to detect differences between the survey versions; instead, the sample size of 150 participants was based on the requirements for estimating the preference coefficients. Further research with a larger sample is warranted to understand whether there is a statistical difference and whether the speed of completion reflects better acquisition or retention of the information in the survey. Future research may also seek to compare investing in training materials with investing in other aspects of the study design, e.g., increasing the sample size or pre-testing to obtain priors.

5 Conclusion

This study found that providing animated information about the disease area and the intervention being valued had a positive influence on the quality of the choice data collected in a DCE, in terms of the variance of the error term. The results may have particular relevance for researchers conducting surveys about complex issues or surveys completed by small samples. Stated-preference researchers should pay close attention to, and carefully develop, training materials to ensure respondents can make informed decisions when presented with the subsequent valuation exercise, such as a choice set. Researchers reporting the results of stated-preference studies should provide their survey materials in online appendices so readers may consider them alongside the study findings. Further research is required to establish the generalizability of these results in larger samples, in other settings, using alternative stated-preference methods, and for specific subgroups of respondents.