Key Points for Decision Makers

Comparing case 2 best-worst scaling and discrete choice experiment outcomes within patients with neuromuscular diseases showed differences in relative importance scores but also comparable preference classes between the two methods.

Careful consideration when selecting either a discrete choice experiment or case 2 best-worst scaling to elicit patient preferences is necessary as these preferences may differ and the method should match the decision context.

1 Introduction

There is an emerging consensus that patient preferences should be incorporated within decisions in the medical product lifecycle [1,2,3,4]. These preferences have become more important over time for the companies that develop new medical products and for the authorities that assess, regulate, and decide which products are effective, safe, well tolerated, and cost effective [5]. Yet, there are still outstanding questions about which preference methods are best suited for each decision context, and many different methods can be used to gain insights into preferences. Studies by, for example, the Medical Device Innovation Consortium [6] and Soekhai et al. [7] provide an overview of several stated preference methods to elicit these preferences within the medical product lifecycle context.

One of the stated preference methods that has become increasingly popular to elicit patient preferences is best-worst scaling (BWS) [8, 9]. Best-worst scaling was introduced to obtain more preference information than a discrete choice experiment (DCE) by asking individuals not only to select their best but also their worst option, without a large increase in the cognitive burden of the elicitation task [8]. The literature distinguishes between three types of BWS: object case (case 1 BWS) where attributes (characteristics) are selected as best and worst, profile case (case 2 BWS) where attribute levels (values of characteristics) are selected as best and worst, and multi-profile case (case 3 BWS) where profiles are selected as best and worst [10]. For more details regarding BWS, see Louviere et al. [10]. Case 2 BWS (hereafter: BWS-2) has received much attention in the preference literature, as this method is able to uncover attribute level importance, might reduce the cognitive burden of the elicitation task by focusing on one profile at a time, and is relatively easy to design [11, 12].

Although BWS-2 is being used more frequently in health preference research, it cannot yet match the years of experience and the resulting body of work of DCEs in health preference research [13, 14]. In DCEs, respondents are presented with multiple-choice tasks including two or more hypothetical alternatives. These alternatives consist of a fixed set of attributes with varying attribute levels between the alternatives and choice tasks. Respondents are then asked to select their preferred alternative in each choice task. For more information about DCEs, see Hensher et al. [15] and Train [16].

There are few studies investigating differences between DCE and BWS-2 preference study outcomes. Studies from van Dijk et al. (hip replacement surgery) [11], Potoglou et al. (social care preferences) [17], and Severin et al. (priority setting for genetic testing) [18] are examples in which DCE and BWS-2 preferences have been compared. The aim of this study is to compare preference weights and relative importance scores obtained from both methods. In this study, we focused on treatment preferences for patients with neuromuscular diseases (NMD), which are rare diseases and often affect the central nervous system leading to impaired or reduced cognitive functioning [19,20,21,22]. General cognitive deficits have been described in over 60–70% of patients and the prevalence and severity depend on the age at onset of the disease. With an earlier onset of disease, the cognitive limitations are generally more severe than observed for adult phenotypes, which are classified as those with symptoms first diagnosed ≥ 20 years of age [23]. Comparing DCE and BWS outcomes in this study context is of interest, as DCEs generally require larger sample sizes, which is challenging for rare disease applications, and patients with NMD may have reduced cognitive functioning and the perception is that BWS-2 presents a lower cognitive burden for patients [24]. The latter is related to the fact that previous research showed that BWS-2 requires that all attributes are framed either positively or negatively (i.e., mixing benefits and risks leads to identification problems) [25], while in DCEs combining positive and negative attributes within one choice task is possible, making it cognitively more demanding.
One of the aims of our study was to compare a DCE to a BWS case that is able to uncover attribute level importance (as the aim was to compare with DCE results), while at the same time reducing cognitive burden by focusing on one profile at a time (as lowering the cognitive burden for patients with NMD is important) and is relatively easy to design because no specific software is needed (important in clinical settings when eliciting preferences).

2 Methods

2.1 Study Population

A sample of adult patients with NMD was selected between May and December 2020. Respondents were mostly recruited through patient organizations and patient registries in the UK, USA, Canada, Australia, and New Zealand via e-mail, advertisements, and newsletters. Informed consent was obtained before the start of the survey. Respondents were included if they were 18 years of age or older, were self-reported as diagnosed with NMD with late onset (established diagnosis or first reported symptoms on or after 20 years of age), and had an active e-mail account to register. Respondents were excluded if they were unable to provide informed consent, complete the online survey, or with a reported history of encephalopathy or dementia (as these may have an impact on cognitive skills and ability to complete the survey). This study was approved by the Newcastle University Ethics Committee (Reference: 8840/2018).

2.2 Attributes and Attribute Levels

Potentially relevant attributes and attribute levels for a hypothetical medicinal treatment for patients with NMD were selected using a qualitative study for both DCE and BWS-2. The qualitative study included 52 participants who completed in-person semi-structured interviews or participated in focus group discussions. When designing the survey instruments, additional evidence such as a literature review and experience-based opinions from the key members of the team (patients, clinical experts, and methodological experts) were also considered. More details regarding these qualitative findings were reported elsewhere [26, 27]. These findings showed that 11 attributes were eventually narrowed down to six final attributes that were included in the DCE and BWS-2 as minimizing the cognitive burden was key: muscle strength, energy endurance, balance, cognition, chance of (temporary) blurry vision, and chance of (permanent) liver damage. Table 1 presents the attributes and attribute levels for DCE and BWS-2.

Table 1 Attributes and levels for eliciting preferences with discrete choice experiment and case 2 best-worst scaling (including priors for discrete choice experiment design)

2.3 Design of DCE Choice Tasks

A Bayesian D-efficient design was generated for the DCE, in which the D-efficiency was maximized using Ngene software (Ngene, version 1.2.1) [28]. Pilot data from the first 51 respondents were used to update priors and their specific distribution (Footnote 1; see Table 1) as well as for further optimization of the design [28, 29]. The final DCE design used for the survey included 24 unique choice tasks, which were blocked into two blocks with 12 choice tasks each to reduce cognitive burden for respondents. The alternatives in each choice task were unlabeled and the attribute order was kept constant across all tasks [30].

After we collected data from 51 patients, we estimated a multinomial logit (MNL) model using the DCE data to update our priors and generate a more efficient design. We used a dummy specification, and our analysis showed that for the attributes muscle strength, energy endurance, balance, and liver damage, the attribute-level estimates had the expected size and sign and were statistically significant. For these attributes, we generated a new experimental design using normally distributed priors with the estimated coefficient as the mean and a standard deviation equal to the estimated coefficient divided by 1.96 to account for preference heterogeneity. For the attributes cognition and blurry vision, the estimates were not as expected and we therefore decided to use the original experimental design choices for these two attributes.
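The prior-updating rule described here can be sketched in a few lines. This is only an illustration of the arithmetic, not the Ngene syntax used for the actual design: the coefficient values and level names below are hypothetical placeholders rather than estimates from the pilot, and taking the absolute value so that the standard deviation stays positive for negative coefficients is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class NormalPrior:
    mean: float
    sd: float


def prior_from_pilot(coef: float) -> NormalPrior:
    """Normal prior with mean = pilot estimate and sd = |estimate| / 1.96.

    The absolute value is an assumption of this sketch: it keeps the
    standard deviation positive when the pilot estimate is negative.
    """
    return NormalPrior(mean=coef, sd=abs(coef) / 1.96)


# Hypothetical pilot MNL estimates (placeholders, NOT values from the study)
pilot_estimates = {
    "muscle_strength_level": 0.80,
    "liver_damage_level": -1.20,
}

priors = {name: prior_from_pilot(b) for name, b in pilot_estimates.items()}
for name, p in priors.items():
    print(f"{name}: mean = {p.mean:.2f}, sd = {p.sd:.3f}")
```

Because the mean lies 1.96 standard deviations from zero, roughly 97.5% of each prior's mass falls on the same side of zero as the pilot estimate, so the prior allows for heterogeneity while largely preserving the expected sign.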

2.4 Design of BWS Choice Tasks

For designing the BWS-2 choice tasks, an orthogonal main effect plan experimental design was used. An orthogonal main effect plan enables the independent estimation of preference weights for each attribute level [10]. Based on the number of attributes and levels, the orthogonal main effect plan indicated 18 choice tasks to be included in the experiment [31]. As the combination of negative and positive attributes in BWS-2 choice tasks can lead to identification problems, negative attributes (i.e., chance of blurry vision and chance of liver damage) were framed positively [25]. This means that for these attributes, attribute levels in Table 1 for BWS-2 included a 70%, 85%, and 99% chance of not experiencing blurry vision or liver damage. Attribute order was kept constant across all tasks.
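The positive reframing of the two risk attributes amounts to taking the complement of each risk level. A minimal sketch follows; the function name is hypothetical, and the risk levels of 30%, 15%, and 1% are those implied by the stated BWS-2 levels of 70%, 85%, and 99%.

```python
def reframe_risk_positively(risk_pct: int) -> int:
    """Convert 'chance of the side effect' into 'chance of not experiencing it'."""
    if not 0 <= risk_pct <= 100:
        raise ValueError("risk must be a percentage between 0 and 100")
    return 100 - risk_pct


# Risk levels implied by the BWS-2 levels reported in the text
risk_levels = [30, 15, 1]
bws2_levels = [reframe_risk_positively(r) for r in risk_levels]
print(bws2_levels)  # [70, 85, 99]
```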

2.5 Survey Design

The survey consisted of several sections. At T = 1 (first measurement with first part of the survey), this included (1) background questions such as demographics (age, sex, school or work situation, country of origin), recruitment platform, clinical characteristics (diagnosis and age of diagnosis), disease status, and a list of 18 activities along with questions about whether these were possible for the patient; (2) a short video introducing the preference task with an explanation of all attributes and attribute levels; (3) either BWS-2 (18 choice tasks) or a DCE (12 choice tasks) [randomly allocated]; and (4) evaluation questions about the ease of understanding and answering, and the usefulness of the video instructions. At T = 2 (second measurement with second part of the survey), a short video introduced the other preference method and follow-up questions were also included [26]. To minimize the cognitive burden, the first set of choice tasks (either a DCE or BWS-2) and the second set of choice tasks were administered at different timepoints, with a 2-week period in between. In BWS-2, respondents had to select their best and worst attribute level, while in the DCE, respondents were asked about their preferences by choosing between two alternatives. The survey was designed using Lighthouse Studio (Sawtooth Software, version 9.8.1X). Examples of DCE and BWS-2 choice tasks are shown in Figure 1.

Figure 1 Example of case 2 best-worst scaling and discrete choice experiment choice tasks

2.6 Statistical Analysis

Statistical analyses were performed using data from respondents who completed both BWS-2 and DCE tasks (including respondents from pilot). Following guidance from the literature, as well as our interest in investigating preference heterogeneity, identifying different respondent groups, and model fit, a latent class (LC) model was estimated to analyze choice data for both DCE and BWS-2 [10, 15]. While the standard multinomial logit (MNL) model, used as a starting point within this study, assumes that all respondents have identical preferences, the LC model deals with preference heterogeneity by assuming, based on the choices respondents made, that there are a fixed number of different groups of respondents (i.e., LCs) [16]. Within each group in a traditional LC model, each individual has identical preferences.
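For orientation, the standard LC formulation can be written out as follows. This is the textbook form and is given here as an assumption, since the exact likelihood used in the estimation is not spelled out in the text. With \(C\) classes, class shares \({\pi }_{c}\) that sum to one, and \({T}_{n}\) choice tasks completed by respondent \(n\), the probability of that respondent's observed sequence of choices \({y}_{n}\) is

$$P\left({y}_{n}\right)= \sum_{c=1}^{C}{\pi }_{c}\prod_{t=1}^{{T}_{n}}P\left({y}_{nt} \mid c\right),$$

where \(P\left({y}_{nt} \mid c\right)\) is the MNL probability of the choice made in task \(t\) evaluated at the preference weights of class \(c\). The MNL model is the special case \(C = 1\).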

With the LC model, the utility (U) of an alternative for each LC in both the DCE and BWS-2 can be modeled as a linear function of the specific attributes and levels, with

$$U= \sum_{k=1}^{A}\sum_{j=1}^{{J}_{k}}{\beta }_{k,j}{X}_{k,j}+ \varepsilon ,$$

where there are A attributes with attribute k having \({J}_{k}\) attribute levels, \({X}_{k,j}\) equals one if level j of attribute k is present in the presented profile (and zero otherwise), \({\beta }_{k,j}\) is the utility parameter for the jth level of attribute k, and \(\varepsilon\) is the random error term representing the unexplained part of utility. The LC model was programmed using R version 4.0.0 (Apollo package, version 0.0.1) to estimate the utilities for both the DCE and BWS-2 data, as well as for the ex-post descriptive analyses to characterize the LCs [32, 33]. For the DCE and BWS-2, “muscle strength stays the same” was included as the reference level (fixed at zero). The DCE also required a reference level within each specific attribute. To create a clear interpretation of attribute levels for the attributes muscle strength, energy endurance, balance, and cognition, the least attractive attribute levels were used as the reference level. For the other attributes, the most attractive attribute levels were selected as the reference level. This means that for muscle strength, energy endurance, balance, and cognition, preference weights increase when the attribute level value increases, while for the chance of blurry vision and the chance of liver damage the preference weights decrease with increasing attribute levels. To facilitate the comparison between the DCE and BWS-2, the utility levels relative to the corresponding attribute reference level were also estimated for BWS-2. Relative importance scores (RIS) of attributes were calculated (based on an MNL estimation) by taking the maximum utility difference between the best and worst attribute levels within each specific attribute and comparing those between the DCE and BWS-2. Outcomes from the evaluation questions for both methods were also analyzed.
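The utility function and the RIS calculation can be illustrated with a short sketch. It is not the Apollo code used for estimation. The preference weights and level labels below are hypothetical placeholders, the logit probabilities assume the usual independent extreme-value error term of the MNL model, and rescaling the utility ranges so that the scores sum to one is an assumption about how the RIS values were normalized.

```python
import math


def utility(betas: dict, profile: dict) -> float:
    """Deterministic utility: sum of the weights of the levels shown in a profile."""
    return sum(betas[attr][level] for attr, level in profile.items())


def mnl_probabilities(utilities: list) -> list:
    """Logit choice probabilities for the alternatives in one choice task."""
    m = max(utilities)  # subtract the maximum for numerical stability
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def relative_importance(betas: dict) -> dict:
    """RIS: utility range (best minus worst level) per attribute, rescaled to sum to one."""
    ranges = {a: max(w.values()) - min(w.values()) for a, w in betas.items()}
    total = sum(ranges.values())
    return {a: r / total for a, r in ranges.items()}


# Hypothetical weights and level labels (NOT estimates from the study);
# the reference level of each attribute is fixed at zero.
betas = {
    "muscle_strength": {"low": 0.0, "mid": 0.5, "high": 1.0},
    "liver_damage": {"1%": 0.0, "15%": -0.8, "30%": -1.5},
}

profile_a = {"muscle_strength": "high", "liver_damage": "30%"}
profile_b = {"muscle_strength": "low", "liver_damage": "1%"}

probs = mnl_probabilities([utility(betas, profile_a), utility(betas, profile_b)])
ris = relative_importance(betas)
print(probs)
print(ris)
```

With these placeholder weights the utility ranges are 1.0 for muscle strength and 1.5 for liver damage, so liver damage receives the larger share of relative importance.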

3 Results

A total of 140 patients completed both the DCE and BWS-2 part of the survey. Responding patients were mostly female (65%) and the median age was 54 years (with a range of 23–76 years). The majority of patients completed a higher (45%) or vocational (34%) education. Most patients reported that they were able to walk without an assistive device (36%), followed by 26% of the patients reporting being able to walk but relying on an assistive device. A relatively large group of patients (23%) also reported being able to walk and run without an assistive device (Table 2).

Table 2 Sample characteristics

Figure 2 shows the overall (based on MNL) RIS calculations for both DCE and BWS-2. For DCE, (avoiding) liver damage had the highest relative importance, followed by muscle strength, energy endurance, balance, cognition, and (avoiding) blurry vision. For BWS-2, a different pattern was observed. Muscle strength had the highest RIS value, followed by energy endurance, balance, liver damage, cognition, and blurry vision. Preferences for improving the typical impairments of NMD were similar across methods, with generally a high preference to improve muscle strength, energy, and (to a somewhat lesser extent) balance. Accounting for preference heterogeneity with LC, Figure 3a, b illustrate the relative importance of each attribute for each LC. Given the sample size, statistical measures of fit and aiming for a meaningful interpretation of the LCs, a three-class model was superior for both the DCE and BWS-2. The DCE LCs in Figure 3a reveal a group of patients in whom avoiding liver damage is by far the most important attribute, while there are also patient groups where improvement of balance and energy endurance are most important. For BWS-2, there is a patient group in which muscle strength is most important, while there is—similar to DCE—a patient group in which liver damage is considered the most important attribute (Figure 3b).

Figure 2 Overall relative importance score of attributes for discrete choice experiment (DCE) and case 2 best-worst scaling (BWS-2)

Figure 3 Relative importance of attributes for the discrete choice experiment (a) and case 2 best-worst scaling (b)

Table 3 presents the estimated LC preference weights for both preference methods. Focusing on the magnitude of these weights, for DCE overall, the more attractive levels were preferred above the less attractive levels with most attribute levels being statistically significant. This is however not the case in DCE class 2, in which most attribute levels are not statistically significant and where the utility of a 15% chance of liver damage was larger than the utility of a 1% chance of liver damage. The largest patient class (47%) was the class of patients in which liver damage was the most important attribute (class 3). For BWS-2, Table 3 shows that most attribute levels were statistically significant. Additionally, all the more attractive attribute levels were preferred above the less attractive attribute levels. The largest classes of patients were the classes in which energy endurance (42%, class 1) and liver damage (41%, class 3) were the most important attributes.

Table 3 Latent class analysis results for DCE and BWS-2

To characterize patients in the three different DCE and BWS-2 LCs, ex-post analyses were conducted (Table 4) by making use of the sample characteristics in Table 2 because extending our LC model with a class membership model failed to converge owing to the relatively small sample. These results show that DCE LCs differed in terms of the level of highest education, sex, and age: DCE LC 2 included the highest percentage of female patients (72%), who were the youngest (median age 47 years) and who had the highest level of education (96% completed vocational or higher education). For BWS-2, LC 2 was also different compared with other classes: this class included the highest percentage of female patients (74%), who were the oldest (median age 58 years) and who were relatively less impaired by their disease (74% indicated that they were able to walk). The ex-post analyses in Table 4 also highlighted that there was a high level of concordance between patients in a specific DCE class and patients in the same BWS-2 class. More specifically, patients in the DCE class in which balance was the most important attribute (class 1) and in which liver damage was the most important attribute (class 3), had the highest probability to also be in BWS-2 LC 1 (energy endurance most important) and LC 3 (liver damage most important), respectively. This was however not the case for LC 2.

Table 4 Ex-post analyses of latent class analysis of DCE and BWS-2

Table 5 presents the results from the evaluation questions regarding DCE and BWS-2. The results show that there are no statistically significant differences between methods for evaluation questions about help with the survey, difficulty of answering questions, and whether the descriptions were sufficient. However, a statistically significant difference (chi-squared test, p = 0.04) was found between the DCE and BWS-2 in the difficulty of understanding the questions. The percentage of patients who found DCE choice tasks easier to understand (74%) was greater than the percentage of patients who found BWS-2 choice tasks easier to understand (62%). To gain more specific insight into how well patients understood the DCE and BWS-2 questions, we also performed an individual patient-level analysis. For this analysis, data were used from the same patients who completed both the DCE and BWS-2 (in either order) and both sets of evaluation questions about understanding the questions (see Table 6). Table 6 shows that overall patients who completed either DCE first (44% + 50% = 94%) or BWS-2 first (26% + 47% = 73%) both evaluated DCE more often as being very easy or easy, compared to BWS-2 (31% + 31% = 62% when DCE was the first method and 16% + 47% = 63% when BWS-2 was the first method).

Table 5 Evaluation questions for DCE and BWS-2
Table 6 Individual-level evaluation questions for DCE and BWS-2

4 Discussion

In this study, preference weights and other outcomes (e.g., RIS) between DCE and BWS-2 were compared within patients with NMD. We conclude that the two methods lead to different preference weights as well as RIS values. However, accounting for preference heterogeneity, LC outcomes showed that patient classes look more similar, with a clear class of patients who both in DCE and BWS-2 indicated that liver damage was the most important attribute (class 3). For both preference methods, this class was among the largest classes of patients. Additionally, patients who identified liver damage as most important (class 3) in the DCE also had the highest probability of being in the same class in BWS-2. The ex-post analyses also showed that for both preference methods class 2 differed (which might be related to the small class size) in terms of descriptives (i.e., sex, age, education, disease status) compared with class 1 and class 3. Contrary to initial expectations, the proportion of patients who found DCE easier to understand was greater than the proportion of patients who found BWS-2 easier to understand.

One of the main findings of this study was that the DCE and BWS-2 led to different outcomes. There are several stated preference studies comparing outcomes between these two methods. Studies by Van Dijk et al. [11], Potoglou et al. [17], and Severin et al. [18] showed similar outcomes between DCE and BWS-2. Differences between these studies and our study might first be related to differences in the health decision context. Working with different types of respondents and dealing with different types of decisions (e.g., treatment choice, priority setting) might lead to different behavior, different choices, and therefore different outcomes. Second, in our study, we explicitly framed negative attributes (i.e., blurry vision and liver damage) positively in BWS-2 choice tasks in order to avoid comparisons of positive and negative attributes within a BWS-2 choice task as this could lead to identification problems [25]. This was not the case in the previous studies. Additionally, there might also have been a framing effect in our study with regard to the attribute liver damage, as the word “permanent” was included in the choice task, which might be a reason why this attribute was considered important in both DCE and BWS-2. For the other negative attribute in the DCE, risk of blurry vision, it was stated that problems would disappear once (hypothetical) medication was stopped. Indeed, this temporary negative side effect appeared to be far less important in patient decision making. However, although our study differs from some of the prior research studies comparing the two methods, our study outcomes are in line with a study by Whitty et al. [34] in which the authors also reported differences in relative preference weights and preference orderings between DCE and BWS-2 in a priority setting context.

In our study, the same patient sample (n = 140) completed both 12 DCE and 18 BWS-2 choice tasks. Preference weights from LCs in Table 3 showed that especially in DCE LC 2, most attribute levels were not statistically significant (i.e., smaller t-values) compared with BWS-2. Furthermore, attribute levels in the DCE overall had smaller t-values compared with BWS-2. This can be an indication that given the same (small) sample size, BWS-2 might be the preferred method of choice when statistical power is important for decision making. It should be noted here that this can however only be concluded by assuming that the cognitive burden of the 12 DCE and 18 BWS-2 choice tasks is comparable. Our results also suggest a smaller utility scale for DCE, which suggests the need for larger sample sizes in a DCE compared with BWS-2, as also mentioned in previous work [24].

The BWS-2 literature states that one of the reasons BWS-2 could be an interesting preference method compared to a DCE is its lower cognitive burden [11, 12]. However, this study indicated that the proportion of patients who found the DCE easier to understand was greater than the proportion of patients who found BWS-2 easier to understand. It should be noted here that the number of choice tasks between DCE (12) and BWS-2 (18) was different and the lead-ins for DCE and BWS-2 tasks also differed because the pilot study showed that patients needed more guidance regarding the BWS-2 tasks, which may both have influenced the evaluation of the methods by patients. The findings in this study follow the trend described in a study by Himmler et al. [35] in which the authors found that DCE choice tasks were less cognitively burdensome than BWS-2 choice tasks. Whitty et al. [12] also reported that in their study the majority of respondents found it more difficult to complete BWS-2 compared with a DCE and most respondents preferred a DCE over BWS-2. The individual-level analysis also indicated that a DCE was more often evaluated as very easy or easy compared with BWS-2. However, these results should be interpreted with caution, as the sample (n = 35) of patients used for this analysis was very small. The estimates are therefore expected to be noisy because of the low number of observations.

A strength of this study is that it is the first study focusing on differences in outcomes between DCE and BWS-2 with regard to a sample possibly hampered by cognitive limitations. As mentioned in the introduction, several studies have focused on differences between DCE and BWS-2 outcomes. However, to our knowledge, no such studies have been conducted specifically within the context of a sample with cognitive limitations. This study is also important because NMD are considered rare diseases, which often translates into relatively small sample sizes when eliciting preferences. This study provides useful insights into how BWS-2 and DCE performed with a relatively small sample size.

At the same time, the relatively small sample size is a limitation of this study. In general, this will not be a problem when estimating choice models not accounting for preference heterogeneity (MNL). However, when estimating more sophisticated models, such as the LC model in this study, such small sample sizes could potentially lead to estimation problems. In this study, we were able to estimate an LC model, but the extension with a class membership model failed to converge. Therefore, descriptive ex-post analyses were conducted to characterize the different latent patient classes. Future studies should however focus on larger samples of patients with cognitive limitations to investigate preference heterogeneity more thoroughly. A further limitation of this study is the fact that no information about the exact cognitive limitations of patients was identified, analyzed, and accounted for. To get a better understanding of the cognitive burden of a DCE or BWS-2, future studies should identify the cognitive limitations of patients. Another limitation is the fact that a different number of choice tasks for each patient was used in the DCE (12) and BWS-2 (18). This may have influenced the evaluation of the methods by patients. However, pilot testing showed that 18 choice tasks for BWS-2 was manageable and given the number of attributes and levels, we were not able to create an experimental design in which the number of choice tasks between methods was equal. Future studies comparing these two methods should focus on an experimental design with an equal number of choice tasks for both methods.

5 Conclusions

This study showed that using either a DCE or BWS-2 leads to different preference weights as well as relative importance values. A potential reason lies in the way risks were framed (i.e., positively) in BWS-2, which was different from the framing in the DCE. Patients indicated that DCE choice tasks were easier to understand than BWS-2 tasks. Accounting for preference heterogeneity, the LC analysis indicated comparable LCs in both the DCE and BWS-2, especially the class of patients that indicated that liver damage was the most important attribute. Hence, we advise careful consideration when selecting either BWS-2 or a DCE to elicit preferences, as the results of this specific study suggest that BWS-2 is the preferred method of choice when dealing with small samples, while DCEs may be preferred when minimizing the cognitive burden is key and choice tasks include both benefits and risks. It will therefore be important that the method matches the size and characteristics of the patient population. Proper pilot testing in the target population will also be important. To support medical decision making, it is key to keep the research and decision context in mind.