Key Points

Although they elicit preferences and risk tolerance differently, discrete choice experiments and the probabilistic threshold technique lead to similar conclusions with regard to the general magnitude of maximum acceptable risk for serious adverse events.

Discrete choice experiments seem better suited in studies that aim to estimate the maximum acceptable risk for multiple risk attributes of differing severity.

The probabilistic threshold technique is likely better suited to measuring heterogeneity in the maximum acceptable risk estimates across a relatively small population, particularly for serious adverse events.

More respondents perceived the choice tasks of a discrete choice experiment to be easy or very easy to understand compared with those who rated the choice tasks of a probabilistic threshold technique the same way.

1 Introduction

There is increasing interest in the use of information about the benefit-risk trade-offs that people or patients make when considering treatment among decision makers at all stages of medicine development, regulatory approval, and reimbursement [1,2,3]. In patient-centered outcomes research, risk is the chance of a negative outcome, such as developing rheumatoid arthritis (RA). Treatments may reduce some risks (i.e., benefits) and induce other risks, such as side effects. The maximum level of risk that people are willing to accept in exchange for a given increase in benefit (also known as the maximum acceptable risk [MAR]) is one of the measures that can be used by pharmaceutical companies and regulatory agencies to support, for instance, target product profile development, endpoint selection, benefit-risk assessment, and regulatory approval.

Different methods can be used to estimate MAR for a given medicinal product, with discrete choice experiments (DCEs) being the most widely applied method [4]. Discrete choice experiments ask respondents to evaluate full treatment profiles against each other. It is assumed that respondents consider all treatment attributes included in these profiles and select the profile that provides them with the highest utility. Discrete choice experiments have been applied broadly throughout healthcare settings [5] and the first insights into external validity in healthcare have been published [6,7,8,9,10]. However, DCEs are relatively resource intensive to administer and the cognitive burden on respondents is considered by some to be relatively high [11, 12]. This method further requires complex experimental designs [13], statistical modeling techniques [14], and relatively large sample sizes [15]. Finally, MAR is a secondary outcome measure in DCE studies, which is indirectly calculated from estimated model output; as a result, the ability to identify covariates that are associated with heterogeneity in MAR is expected to be limited [16].

Other preference methodologies can be used to estimate MAR, including the probabilistic threshold technique (PTT) [16]. A recent review identified over 43 applications of this method across the healthcare sector [16] and additional applications of the PTT have been used to support regulatory decisions [17, 18]. The PTT directly measures MAR by asking respondents to choose between two treatment profiles, of which one represents an improved benefit over the other, but also imposes additional risk. Subsequently, the level of risk in the profile providing the beneficial treatment is varied until the respondent switches their choice to the alternative profile. In contrast to the DCE, this method estimates the threshold value directly and at an individual level, does not require a complex experimental design, and is typically applied in smaller samples [16].

Although both methods are able to produce MAR estimates, research comparing the extent to which results from a DCE and a PTT differ when applied to the same research question is lacking. Because both methods can be used to determine MAR, applying them to a similar research question should ideally result in MAR estimates with overlapping confidence intervals. To date, there is no evidence-based guidance on what points to consider when selecting one of these methods over the other for the purpose of eliciting preferences and estimating MAR. Therefore, the current study included DCE and PTT exercises in ways these methods have been applied previously to answer the same clinical research question. The study aimed to empirically compare the results of the DCE and the PTT for estimating MAR.

2 Methods

2.1 Case Study and Ethical Approval

Part of a study assessing preferences of the general public and first-degree relatives of patients with rheumatoid arthritis (RA) for the preventative treatment of RA was used as a case study for this research. As stated before, the DCE and PTT choice tasks in this study were designed, conducted, and analyzed individually, with the DCE following good research practices [14, 19,20,21,22] and the PTT following the approach used in a recent empirical study [23]. That is, the DCE was used to estimate an indirect utility function and then calculate MAR using the resulting estimates while the PTT was used to estimate MAR directly. Even though one method is applied in utility space while the other is applied in MAR space, they are both ultimately used to estimate the same measure of risk tolerance when applied to the same research question and both have been used to inform previous regulatory decisions. The study is described in detail elsewhere [24,25,26].

2.2 Attribute and Level Selection

A literature review [27] followed by a qualitative study using focus groups and individual interviews including a ranking exercise were conducted to inform attribute selection [24,25,26]. Selection of final attributes was agreed by an international multidisciplinary team of clinical researchers, patient preference experts, and patient research partners. Attribute levels were selected based on current clinical evidence and expert consultation. Table 1 includes a full overview of all attributes and levels included in the study.

Table 1 Attributes and levels selected for the DCE and the PTT

2.3 Experimental Design DCE

Ngene software (ChoiceMetrics, Sydney, NSW, Australia) was used to construct a Bayesian D-efficient design [28]. For the pilot design, prior information on the attribute-level parameters was based on the previous literature [27] and best guesses; for the main survey, priors were based on the outcomes of an initial analysis (conditional logit) of the pilot data (n = 100). For both the pilot and the final design, 60 unique choice tasks were generated, which were divided into four blocks. In the DCE, participants were randomized to one of the blocks, so each participant answered only 15 choice tasks. The choice tasks each consisted of two hypothetical treatment scenarios as generated in the experimental design and a no-treatment option. Within each block, choice tasks were shown in a fixed order to respondents, but the order of the treatment alternatives (i.e., being the left or right alternative) was randomized. The no-treatment option was always presented on the far-right side of the choice task. The design was restricted to exclude the following combinations of attribute levels in treatment profiles: taking a pill every month or every 6 months, having an injection daily, or having a drip daily or weekly. Additionally, interactions between effectiveness and the chance of a serious infection and between effectiveness and the chance of serious side effects were included in the design. The full experimental DCE design can be found in the Electronic Supplementary Material (ESM).
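The blocking and presentation rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the task identifiers, the consecutive split into blocks, the function names, and the random seed are all hypothetical, and the actual allocation of tasks to blocks was produced by the design software rather than by this rule.

```python
import random

N_TASKS = 60   # unique choice tasks in the design (from the text)
N_BLOCKS = 4   # blocks, giving 15 tasks per respondent (from the text)

def make_blocks(n_tasks=N_TASKS, n_blocks=N_BLOCKS):
    """Split task ids into equally sized blocks.
    Assumption: consecutive ids form a block (illustrative only)."""
    size = n_tasks // n_blocks
    return [list(range(b * size, (b + 1) * size)) for b in range(n_blocks)]

def present_block(block, rng):
    """Keep the task order fixed, randomize the left/right position of the
    two treatment profiles, and keep 'no treatment' on the far right."""
    screens = []
    for task_id in block:
        profiles = ["A", "B"]
        rng.shuffle(profiles)
        screens.append((task_id, profiles[0], profiles[1], "no treatment"))
    return screens

rng = random.Random(0)                  # hypothetical seed
blocks = make_blocks()
block_index = rng.randrange(N_BLOCKS)   # respondent randomized to one block
screens = present_block(blocks[block_index], rng)
```

Each respondent therefore sees 15 screens, always in the same task order within a block, with only the left/right placement of the two treatment profiles varying.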

2.4 Design of the PTT

The PTT has a similar choice task format as the DCE and was developed based on the final DCE design [26]. In the first question, referred to as the baseline question, participants were asked to choose between no treatment (a 60% chance of developing RA and no increased safety risks) or treatment (fixed benefit and varying levels of increased risks). The fixed benefit in the PTT was a 40-percentage point reduction in the chance of developing RA, which corresponded to the difference in the highest and lowest levels of this attribute in the DCE. Table 1 shows the starting values of each of the risk attributes for the PTT. After answering the baseline question, participants were asked a series of follow-up questions. In each follow-up question, the ‘no treatment’ alternative remained fixed. In the treatment alternative, the level of benefit remained fixed, and the level of one of the risk attributes (i.e., mild side effects, serious infection, or serious side effects) was varied systematically while the other two risks were fixed at the baseline level. For the risk that varied, the subsequent levels were either lower or higher than the baseline risk in the treatment alternative depending on the baseline response. A maximum of two follow-up questions were asked after the baseline for each included risk attribute. If respondents indicated they would accept the highest level of risk included in the PTT exercise, an open-ended question was included to ask respondents to indicate the highest risk they were willing to accept. Whenever respondents did not accept the lowest risk included in the PTT exercise, a follow-up question asked if they were (yes or no) willing to accept any risk at all. This procedure was repeated for each risk attribute to define the risk intervals for each participant for each risk presented in the survey. The order in which respondents completed the PTT questions for the three risk attributes was randomized. 
See the ESM for an example of the design of the PTT.
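The follow-up logic for a single risk attribute can be illustrated with a short Python sketch. It assumes a simple stepping rule (one level up after acceptance, one level down after rejection) and uses hypothetical risk levels and a hypothetical respondent; the actual levels and skip pattern are those in Table 1 and the ESM.

```python
def elicit_interval(levels, start_index, accepts, max_followups=2):
    """Return (highest accepted level, lowest rejected level) for one risk.

    levels        -- ascending risk levels (hypothetical values here)
    start_index   -- position of the baseline risk level
    accepts       -- callable: True if the respondent chooses treatment
    None in the result means that bound was not observed in the ladder.
    """
    lower, upper = None, None
    idx = start_index
    for _ in range(1 + max_followups):       # baseline + follow-up questions
        if accepts(levels[idx]):
            lower, step = levels[idx], 1     # accepted: try a higher risk
        else:
            upper, step = levels[idx], -1    # rejected: try a lower risk
        nxt = idx + step
        bracketed = lower is not None and upper is not None
        if bracketed or nxt < 0 or nxt >= len(levels):
            break
        idx = nxt
    return lower, upper

# Hypothetical ladder (percent) and a respondent whose true MAR is 12%.
levels = [2, 5, 10, 20, 30]
interval = elicit_interval(levels, start_index=2, accepts=lambda r: r <= 12)
# interval == (10, 20)
```

A result of `(30, None)` in this sketch corresponds to the case in which the open-ended question is asked, and `(None, 2)` to the case in which the respondent is asked whether any risk at all is acceptable.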

2.5 Survey

The online survey, programmed in SurveyEngine GmbH software, consisted of six sections and was developed in close collaboration with patient research partners to enhance content and accessibility. First, respondents completed informed consent and a demographic questionnaire. Second, they were asked to read a description of RA and risk factors for RA. This information was followed by comprehension questions to test the participant’s understanding of the information presented. Participants were then asked to imagine they had started to develop joint pain and had received test results that indicated they had a 60% chance of developing RA in the following 2 years. Subsequently, attributes and levels were explained in detail, and examples of choice tasks were given, including a walk-through example. In the third part, either the DCE or the PTT choice task questions (participants were randomly assigned to either receive the DCE or the PTT first) were presented. The effectiveness attribute and all risk attributes were displayed using percentages to increase comparability. In addition, frequencies were added to these choice tasks to enhance interpretation and participants could view the explanation of the attribute and the levels (including icon arrays for the risk-related levels) using pop-up windows during the completion of the choice task. Fourth, participants were asked to complete the Single Item Literacy Screener (SILS [29]) and the 3-Item Version of the Subjective Numeracy Scale (SNS-3 [30]). Fifth, participants were shown a walk-through for the choice tasks of the second method, followed by the actual choice tasks. Sixth, participants were asked to complete further measures of psychological constructs, including their family history of RA.

After completing each preference method, participants were asked to complete two feedback questions related to how easy it was to understand and answer the choice questions. A copy of the full survey is included in the ESM.

2.6 Pre-Pilot and Pilot Testing

The survey was pre-tested in a convenience sample (N = 15) of members from the general public, first-degree relatives of patients with RA, and patient research partners in the UK using qualitative think-aloud interviews and written feedback forms, to test the online survey system, remove software bugs, change the wording if necessary, and ensure adequate interpretation of risk information presented in the choice tasks. These participants were paid £20 (i.e., approximately US$27) in shopping vouchers. To optimize the statistical efficiency of the final DCE design, a survey pilot was conducted with 100 members of the general public in the UK.

2.7 Participant Recruitment

Participants were recruited via an online survey panel of members of the general population in the UK provided by SurveyEngine GmbH. Eligible participants were 18 years of age or older, without a diagnosis of RA [24]. Potential participants received an e-mail invitation to take part in the survey with a unique password-protected link to the online survey itself. After completing the survey, panel members were credited with panel points (equivalent to approximately £2 [i.e., approximately US$2.75] for a 30-minute online survey) and received additional sources of information and support about RA and risk factors for RA. Recruitment was continued until 1000 completed surveys were returned [24].

2.8 Data Analysis

Only completed surveys were included in the analysis. Discrete choice experiment analyses were conducted in Nlogit 6 (Econometric Software Inc., Plainview, New York, USA) and the PTT analyses were conducted in R Statistical Software 3.6.1 (R Core Team, Vienna, Austria). All surveys that were completed in less than 5 minutes were excluded. The data from all participants who took between 5 and 9 minutes to answer the survey AND answered two out of the three comprehension questions incorrectly AND showed flat-lining behavior in the DCE tasks were excluded because of suspicion of low-quality data. Results were considered statistically significant if p < 0.05.
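The exclusion rules can be written as a single predicate. The sketch below is illustrative: the function and argument names are hypothetical, "between 5 and 9 minutes" is read as inclusive, and "two out of the three comprehension questions incorrectly" is read as at least two incorrect answers; both readings are assumptions.

```python
def should_exclude(minutes, n_comprehension_wrong, flat_lined):
    """Return True if a completed survey is dropped from the analysis.

    minutes               -- survey completion time in minutes
    n_comprehension_wrong -- incorrect comprehension answers (out of 3)
    flat_lined            -- True if the respondent flat-lined in the DCE
    """
    if minutes < 5:
        return True  # speeders are always excluded
    # Assumption: 5-9 minutes inclusive, and "two of three" means >= 2.
    return 5 <= minutes <= 9 and n_comprehension_wrong >= 2 and flat_lined
```

Note that, within the 5 to 9 minute band, all three conditions must hold together; a fast respondent who answered the comprehension questions correctly is retained.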

2.8.1 Respondent Feedback Questions

Frequencies were calculated and chi-square tests were conducted to compare respondents’ ratings of the perceived difficulty to understand and answer the DCE and PTT questions.
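As an illustration of the test, a Pearson chi-square statistic for a 2 × 2 table (method by easy/not easy) can be computed with the Python standard library alone. The counts below are hypothetical and are not the study data; the sketch applies no continuity correction and treats the two sets of ratings as independent, both of which are assumptions.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the table [[a, b], [c, d]].
    Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x/2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: rows = method, columns = (easy, not easy)
stat, p_value = chi_square_2x2(70, 30, 50, 50)
significant = p_value < 0.05
```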

2.8.2 DCE Analysis

Panel random parameter logit (RPL) models were constructed to adjust for the multi-level structure of the data in order to be able to correct for preference heterogeneity [14]. The linearity of all non-categorical attributes was assessed visually by plotting the coefficients and based on a dummy variable in a spline function. Variables were included as either linear or categorical (using effects coding [31, 32]) in the utility function, except for effectiveness, for which linearity was assumed to facilitate MAR calculations. The contributions to model goodness of fit of interaction terms that were pre-specified in the experimental design were tested (i.e., effectiveness-serious infection and effectiveness-serious side effects). Interaction terms that significantly contributed to the model fit (log-likelihood ratio test) were included. To test whether the order in which participants answered the choice tasks (i.e., DCE first or PTT first) influenced the outcomes, a variable that identified that order was included in the model as an interaction term with the attribute levels.
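For readers unfamiliar with effects coding, the following minimal sketch shows how a three-level categorical attribute is turned into two model columns. The choice of "pill" as the reference level is an assumption made for illustration (it is the level without its own coefficient in the utility function), and the function name is hypothetical.

```python
def effects_code(level, levels):
    """Effects-code one observation of a categorical attribute.

    levels[0] is the reference level and is coded -1 on every column;
    each other level gets 1 on its own column and 0 elsewhere.
    """
    if level not in levels:
        raise ValueError(f"unknown level: {level}")
    if level == levels[0]:
        return [-1] * (len(levels) - 1)
    return [1 if level == other else 0 for other in levels[1:]]

MODES = ["pill", "injection", "drip"]   # assumed reference: pill
coded = {mode: effects_code(mode, MODES) for mode in MODES}
# coded == {"pill": [-1, -1], "injection": [1, 0], "drip": [0, 1]}
```

With this coding, the implied estimate for the reference level is the negative sum of the estimated coefficients of the other levels.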

All attributes were included as random parameters. Based on a model fit test, it was determined what distribution should be assumed for those parameters (normal, triangular, or lognormal). The systematic utility component (V), which describes the measurable utility of a specific treatment based on the attributes included in the DCE, was tested using the equation below. The β0 represents the alternative specific constant for the opt-out alternative, and other estimates are the attribute-level estimates that indicate the relative importance of each attribute level.

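$$V_{\text{alternative A|B}} = \beta_{1i} \cdot \text{chance of developing RA} + \beta_{2i} \cdot \text{mode of administration injection} + \beta_{3i} \cdot \text{mode of administration drip} + \beta_{4i} \cdot \text{frequency of administration} + \beta_{5i} \cdot \text{chance of mild side effect 5\%} + \beta_{6i} \cdot \text{chance of mild side effect 10\%} + \beta_{7i} \cdot \text{chance of serious infection} + \beta_{8i} \cdot \text{chance of serious side effects 20 in 100,000} + \beta_{9i} \cdot \text{chance of serious side effects 1 in 100,000} + \beta_{10i} \cdot \text{effectiveness} \cdot \text{chance of serious infection}$$

$$V_{\text{opt-out}} = \beta_{0i}$$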

Maximum acceptable risk was calculated in all instances where the risk attribute was linearly coded following the equation below. When a risk attribute was effects coded, linearity was assumed between the two highest risk levels; this was further extrapolated beyond the levels included in the DCE experiment when necessary. Calculations were made both based on aggregate average attribute-level estimates as well as individual attribute-level estimates retrieved from the RPL model.

$$MAR = -\frac{\beta_{\mathrm{Chance\,of\,developing\,RA}}}{\beta_{k = \mathrm{risk\,attribute}}}$$
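A small numerical sketch may help to make the ratio concrete. The coefficients below are hypothetical and are not estimates from this study. Scaling the ratio by the size of the benefit change (here a 40-percentage point reduction, entered as −40) and the sign convention are assumptions about how the per-unit ratio is applied to a specific benefit.

```python
def max_acceptable_risk(beta_benefit, beta_risk, benefit_change=-40.0):
    """MAR for a linearly coded risk attribute.

    beta_benefit   -- utility per percentage point of chance of developing RA
    beta_risk      -- utility per percentage point of the risk attribute
    benefit_change -- change in chance of developing RA (assumption: -40
                      for a 40-percentage point reduction)
    """
    if beta_risk == 0:
        raise ValueError("risk coefficient must be non-zero")
    return -(beta_benefit * benefit_change) / beta_risk

# Hypothetical coefficients: both attributes reduce utility as they increase.
mar = max_acceptable_risk(beta_benefit=-0.05, beta_risk=-0.10)
# mar is approximately 20 percentage points of additional risk
```

In words, the utility gained from the benefit (0.05 × 40 = 2.0) is divided by the utility lost per percentage point of risk (0.10), giving the increase in risk that exactly offsets the benefit.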

2.8.3 PTT Analysis

For the PTT, the series of threshold technique questions resulted in a threshold interval representing the risk level each participant was willing to accept in exchange for the benefit (reduction in the chance of developing RA). These data were explored using descriptive statistics. The proportion of respondents selecting each of the risk levels as the MAR were displayed. Whenever respondents indicated a MAR beyond the pre-set range in the open-ended follow-up question, these risks were grouped into intervals of 10 percentage points (meaning that, e.g., a response of 23% would fall in the 20–30% interval). The formal analyses were conducted using interval regression models, in which the data were interval censored because the threshold falls within an interval with fixed endpoints. An interval regression model was fitted using a Tobit model to account for the fact that the interval has both a fixed upper bound (being the highest acceptable risk as indicated by respondents in the open-ended risk question), resulting in left-censored data, and a fixed lower bound, resulting in right-censored data [33]. For each risk attribute, \(\left(r\in [mild\, side \,effects, \,serious \,infection, \,serious \,side \,effects]\right)\), the risk interval threshold, \(Threshold,\) was regressed on the variable that indicated whether the participants completed PTT or the DCE first, \(Order\). Because each participant saw a single level of benefit in each threshold series (a 40-percentage point reduction [from 60% to 20%] in the chance of developing RA), the relationship between the level of benefit and the risk threshold was estimated cross-sectionally.
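The grouping of open-ended responses into 10-percentage point intervals can be sketched as follows. The function name is hypothetical, and the handling of responses that fall exactly on a boundary (assigned to the higher interval, except for 100%) is an assumption, as the text does not specify it.

```python
def ten_point_interval(risk_percent):
    """Map an open-ended MAR response (in percent) to a 10-point interval."""
    if not 0 <= risk_percent <= 100:
        raise ValueError("risk must be between 0 and 100 percent")
    # Assumption: boundaries go to the higher interval; 100% stays in 90-100.
    lower = min(int(risk_percent // 10) * 10, 90)
    return lower, lower + 10

example = ten_point_interval(23)   # (20, 30), as in the example in the text
```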

The base interval regression was specified as:

$$Threshold_{r} = \alpha_{r} + \beta_{1} Order_{i} + \varepsilon_{r}$$

where the intercept \(\alpha_{r}\) represents the MAR for risk r for the 40-percentage point reduction in the chance of developing RA and \(\varepsilon_{r}\) is an independent and identically normally distributed random error term with a mean of zero and variance \(\sigma^{2}\).
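To illustrate what an interval regression of this form maximizes, the sketch below evaluates the log-likelihood of interval-censored data under a normal error term. It is not the code used in the study (the analyses were run in R); the function names and data are hypothetical, and open-ended bounds are represented by `None` as an assumption of this sketch.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_loglik(alpha, beta, sigma, data):
    """Log-likelihood of Threshold = alpha + beta * Order + error.

    data -- iterable of (lower, upper, order); a bound of None means the
            interval is open on that side (left- or right-censored).
    """
    total = 0.0
    for lower, upper, order in data:
        mu = alpha + beta * order
        hi = 1.0 if upper is None else norm_cdf((upper - mu) / sigma)
        lo = 0.0 if lower is None else norm_cdf((lower - mu) / sigma)
        # Probability that the latent threshold lies inside the interval;
        # floored to avoid log(0) for intervals far from the mean.
        total += math.log(max(hi - lo, 1e-300))
    return total

# Hypothetical respondents: (lower bound, upper bound, DCE-first indicator)
data = [(10, 20, 0), (10, 20, 1), (5, 10, 0), (20, None, 1)]
ll_near = interval_loglik(alpha=14.0, beta=0.0, sigma=6.0, data=data)
ll_far = interval_loglik(alpha=60.0, beta=0.0, sigma=6.0, data=data)
```

An estimation routine would search over `alpha`, `beta`, and `sigma` for the values with the highest log-likelihood; here `ll_near` exceeds `ll_far` because an intercept of 14 is far more compatible with the hypothetical intervals than an intercept of 60.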

2.8.4 Heterogeneity in MAR

An interval regression was conducted for the PTT data, including clinically relevant covariates (age, education, health literacy, subjective numeracy, reported family history of RA) to test whether they were associated with MAR values. For each covariate that was significantly associated with the MAR in the interval regression for the PTT, separate subgroup analyses were conducted on the DCE data using the RPL modeling procedure described above.

2.8.5 Sensitivity Analyses

Several differences exist in how data from a DCE and PTT are modeled. To explore whether differences in the results appeared to be due to fundamental differences in the modeling approach or whether the differences in results likely can be explained by assumptions used in the applications of the methods in this study, several sensitivity analyses were conducted. These analyses should be interpreted with caution as these are post-hoc analyses meant to enhance comparison of the results of the two methods by controlling for differences in the applications of the methods. By doing so, we make the analyses less in line with the formal recommendations on applying the (particularly DCE) analyses [14, 21]. The following sensitivity analyses were conducted:

1. In the primary interval regression for the PTT, the upper bound for the interval of greatest magnitude of each risk was set at the value of the MAR expressed by the respondent if the respondent indicated that they would accept the maximum level of risk included in the PTT questions. A separate interval regression was conducted in which the upper bound of this interval was set at 100%.

2. In the primary RPL model for the DCE, an interaction between ‘Chance of developing RA’ and ‘Risk of serious infection’ was included. A separate RPL model was estimated for the DCE data in which this interaction was excluded so that the DCE and PTT specifications were similar.

3. In the primary RPL model for the DCE, risk was modeled as categorical and the disutility of risk was allowed to be nonlinear. Because the benefit in the PTT analyses is fixed, there is no need to assume linearity or nonlinearity in the disutility of risk. To determine if assumptions about the linearity of the disutility of risk impacted the comparison between the DCE and PTT results, a separate RPL model was estimated for the DCE data in which all risk attributes were assumed to be linear.

4. A latent class analysis model was used to analyze the DCE data. Each covariate that was significantly associated with the MAR in the interval regression for the PTT was included in the class membership probability function in this analysis to test whether the covariates that explained differences in risk tolerance in the PTT data also explained differences in risk tolerance in the DCE data.

3 Results

The survey data of 982 (332 male, 650 female) individuals were included in the analyses. A full overview of the demographics including age, education, and health literacy can be found in Table 2.

Table 2 Demographic variables and survey feedback responses for the total sample (N = 982)

3.1 Response to Feedback Questions

The majority of the participants reported that the choice tasks of the DCE and PTT were easy or very easy both to understand (71.7% and 64.7%, respectively) and to answer (59.8% and 55.6%, respectively) [Table 2]. However, there was a statistically significant (p < 0.05) 7-percentage point difference between the DCE and the PTT in the proportion of respondents indicating that the exercise was easy or very easy to understand, with more respondents indicating that the DCE was (very) easy to understand.

3.2 Avoiding Risks for DCE and PTT

Based on responses to the DCE, 4.8% of participants (n = 47) always chose to opt out (i.e., would avoid any risk of side effects by not taking a treatment). Similarly, for the PTT, some participants indicated they were not willing to accept any chance of getting mild side effects (5.2%), serious infection (6.1%), or serious side effects (4.6%). In the PTT, in total, 2.7% of participants (n = 27) indicated that they were unwilling to take any risks. Of these 27 participants unwilling to take any risk in the PTT, 12 (44%) always opted out in the DCE as well.

3.3 DCE Outcomes: Preferences

Table 3 shows the outcomes of the DCE analysis (using the RPL model), which were in line with a priori expectations. Participants preferred treatment with increasing effectiveness; they preferred a pill over an injection or drip, lower frequencies of administration over higher frequencies of administration, and lower chances of mild side effects, serious infections, and serious side effects over higher chances of getting these side effects and infections.

Table 3 Random parameter logit model output based on discrete choice experiment data

3.4 PTT Outcomes: Frequencies

Figure 1 shows the choice frequency for MAR intervals of risk attributes for a 40-percentage point decrease in the chance of developing RA. Some participants were willing to accept a 100% chance of mild side effects (2.2%) or serious infection (0.9%), while none of the participants was willing to accept a 100% chance of serious side effects. Most participants were willing to accept a mild side effect risk between 10 and 30%, a serious infection risk between 5 and 20%, and a serious side effect risk between 0.1 and 1% (42.5%, 42.6%, and 52.6%, respectively).

Fig. 1 Frequencies of thresholds of maximum acceptable risk intervals for a 40-percentage point reduction in the chance of developing rheumatoid arthritis (RA) in the upcoming 2 years based on a probabilistic threshold technique analysis, separately shown for (A) serious infection, (B) mild side effects, and (C) serious side effects

3.5 MAR

Table 4 shows the MAR calculated based on aggregate mean estimates from the DCE and individual-level estimates from the DCE and PTT. For a 40-percentage point reduction in the chance of developing RA in the upcoming 2 years, participants were willing to accept a 50.8-percentage point increase in the chance of getting mild side effects, a 14.0-percentage point increase in the chance of getting a serious infection, and a 1.2-percentage point increase in the chance of serious side effects based on a DCE analysis using aggregate mean estimates. Based on the individual-level estimates, MAR values were comparable, with overlapping confidence intervals. Based on the PTT analysis, these values were 18.1 percentage points, 10.2 percentage points, and 1.4 percentage points, respectively. Confidence intervals for the MAR based on the DCE and PTT overlapped for serious infection and serious side effects but not for mild side effects.

Table 4 MAR values calculated based on DCE average estimates, individual estimates, and PTT

3.6 Heterogeneity in MAR

Significant preference heterogeneity was found based on health literacy, numeracy, and family history in the PTT analysis (Table 5). When conducting a subgroup analysis using RPL models for the DCE data, no significant differences were found as all 95% confidence intervals between subgroups overlapped (Table 5).

Table 5 MAR values from the PTT based on interval regression including relevant covariates and MAR values from the DCE based on a subgroup analysis for significant covariates from the PTT interval regression

3.7 Sensitivity Analyses

Changes in the modeling of both the PTT and DCE resulted in slight changes in the numeric values for MAR (see ESM for the complete model output). Confidence intervals of the MAR estimates still overlapped for serious side effects and still did not overlap for mild side effects. Results related to the MAR estimates for serious infections were inconsistent. In the DCE model that excluded interactions or assumed the disutility of risk to be linear, the confidence intervals on the MAR estimates did not overlap with the confidence intervals of the MAR calculated based on the original PTT analysis. In contrast, the confidence intervals on the MAR estimated for serious infections for the original DCE overlapped with the confidence intervals of the MAR estimates from the ‘bounded’ PTT analysis. Application of a latent class analysis model to identify heterogeneity in preferences and MAR estimates resulted in a five-class model (see ESM for the complete model output) with a class assignment model in which only subjective numeracy significantly impacted preferences and thus subsequently MAR. This is in contrast to the original PTT model in which health literacy and family history also impacted MAR estimates.

4 Discussion

This is the first paper that compares MAR estimates derived from DCE and PTT exercises. Findings showed that confidence intervals surrounding the MAR estimates from the two methods overlapped for two out of the three risk attributes, namely ‘chance to develop serious infection’ and ‘chance to develop serious side effects’. However, for the attribute ‘chance to develop mild side effects’, the MAR calculated based on the DCE data was significantly higher compared with the value calculated based on the PTT data.

The functional form of the attributes included in the MAR calculations likely influenced the outcomes. González Sepúlveda and Boeri showed that MAR outcomes were higher (though not statistically significantly so) when applying linear as compared with categorical specifications of the attributes used for MAR calculations in a DCE [34]. In the current study, the DCE-based MAR calculation for two out of the three attributes assumed that the levels were categorical, and that utility was linear between the levels. However, for only one attribute did this result in a substantial and significant difference from the outcomes of the PTT (where linearity was assumed).

In the current study, the MAR calculated for the PTT may significantly underestimate actual preferences because (1) linearity was assumed in the PTT-based MAR calculation and (2) all key stakeholders (i.e., representatives from pharmaceutical industry, clinicians, and patients [24, 25]) included as research partners determined the appropriateness of the assumed functional forms in the DCE. The underestimation of the PTT-based MAR compared with the DCE-based MAR might be partially explained by the multi-attribute approach of a DCE, where respondents are forced to trade off multiple attributes at a time, which reveals the importance of each attribute relative to all the other attributes included in the experiment [14].

In the PTT, all risks were investigated separately, thereby not constraining the importance of a single attribute relative to all other attributes, potentially resulting in lower MAR for the ‘less important’ risks such as the chance of mild side effects. Additionally, the responses to the PTT might have been subject to anchoring effects [16]. Although respondents were able to indicate they would accept a relatively high risk or a certainty of side effects (i.e., 100% risk of side effects), their answers to the choice questions in the PTT were likely influenced by the level of risk provided in the initial choice tasks (although the risk level used was based on clinical insight into the likely risk of mild side effects and the range of risk levels was equal to that in the DCE). If the actual choice decision has a natural starting point that is known with certainty, then any anchoring effect in the study may simply reflect reality. However, when the starting point is known but uncertain or unknown, anchoring effects might introduce bias in results.

Future studies should explore: (1) if and to what extent results derived using the PTT are subject to anchoring effects; (2) whether this might be influenced by the importance of the risk at stake (e.g., whether this effect is more pronounced for relatively less important risk attributes); and (3) to what extent anchoring effects might be acceptable in cases where the risk levels included in the initial choice task are informed by clinical evidence for which there are different levels of certainty.

A potential advantage of the PTT relative to the DCE is the ability to identify heterogeneity in MAR. A number of covariates that significantly impact MAR were identified using the interval regression on the PTT data (health literacy, numeracy, family history of RA), but these could not be identified using a subgroup RPL analysis on the DCE data. This shows the advantage of the PTT having MAR as a primary outcome measure calculated based on individual-level data. At the same time, this demonstrates the relative inefficiency of a subgroup analysis in DCE studies, even in relatively large samples such as in this case study. Although DCE studies have alternative strategies to investigate preference heterogeneity, for example, latent class analysis [14], such methods generally require large sample sizes.

More respondents indicated that the DCE choice tasks were easy or very easy to understand compared with the number of respondents who rated the PTT choice tasks the same way (a 7-percentage point difference that was statistically significant). Previous studies among similar populations without cognitive limitations reported similar findings [35,36,37,38,39]. This raises the question of whether DCE choice tasks really are as difficult as researchers and clinicians have previously assumed; respondents’ capabilities to accurately complete such questions might have been underestimated, which would call for a partial change in perspective toward this method as being (too) complex and time consuming to complete. However, it is essential to remember that this study used evidence-based guidelines for developing the DCE [19], whereas such guidelines do not yet exist for PTT studies. It is therefore possible that researchers in health economics are generally better trained in accurately designing and conducting DCE studies compared with PTT studies, as well as effectively incorporating the preferences of respondents related to the layout and presentation of information and choice tasks. With applications of PTT methodology being relatively novel and unexplored, further investigation into how best to design such experiments, when the aim is to measure MAR for multiple risks within one experiment, is warranted.

This study has considerable strengths, including a large sample size and the involvement of international multidisciplinary experts. However, this study also has some limitations. First, this is the first study to compare these methods. Therefore, definitive recommendations on the use of DCE versus PTT cannot be drawn from this study alone. However, although this analysis focuses on a sample of data collected in the UK, the study was also carried out in Germany and Romania [24], resulting in similar findings with respect to respondent feedback and MAR differences (see ESM), which supports the theoretical validity of this study. Second, this study was conducted among a sample of the general population who were members of a survey panel and did not include respondents with impaired cognitive skills. In addition, the current sample consisted of a relatively large proportion of respondents with a high educational level and/or a high health literacy level. Therefore, the current findings cannot be generalized across all (patient) populations. Third, several assumptions had to be made for MAR estimations. These assumptions might to some extent affect the outcomes generated. This is especially true for the functional form of the attributes included [34]. Further studies should explore the impact of study assumptions related to method selection, functional form of attributes, and the data analysis strategy on MAR study outcomes and MPLC decision making. Fourth, to determine the maximum risk participants were willing to accept within the PTT method, an open-ended question was added to the survey in case participants indicated that they were willing to accept the highest presented risks in the preceding threshold questions. This procedure was used to limit the number of threshold questions in the survey. However, the literature suggests that responses to open-ended questions might provide a different threshold than one derived from bounded choice questions (e.g., [40, 41]). Because respondents whose MAR was greater than the highest level provided in the PTT exercise were allowed to state their MAR, this might have impacted the study results. Future research should be conducted to determine the best balance between estimating exact thresholds and determining upper bounds for acceptable risk levels in the PTT. Fifth, because a priori sample size calculations for DCE and PTT are complex, no formal power calculations were conducted. As a standard DCE includes about 100–300 respondents and a standard PTT includes <100 respondents, this study oversampled and recruited 1000 respondents to ensure that the responses provide enough information to identify preferences and allow comparisons across methods with acceptable precision.

5 Conclusions

Maximum acceptable risk estimates based on a DCE and a PTT were numerically similar with overlapping confidence intervals for two out of the three included risk attributes. More respondents rated the DCE choice tasks as easy or very easy to understand than rated the PTT choice tasks the same way (a 7-percentage point difference). This may suggest that the DCE is better suited to studies estimating MAR for multiple risk attributes of differing severity, while the PTT may be better suited when measuring heterogeneity in MAR estimates or when investigating one or more serious adverse events. Further studies should explore whether this conclusion holds in different disease areas and respondent samples to provide further evidence on method selection guidance for estimating MAR to support stakeholder decision making on medical products.