1 Introduction

The ability to estimate risk preferences at the individual level is of utmost importance for decision analysis and policy evaluation. Accordingly, a number of methods to measure risk attitudes have been proposed, going back at least to Binswanger (1980); see Holt and Laury (2014) and Mata et al. (2018) for recent reviews. A popular strand of the literature employs elicitation procedures in which subjects make repeated choices between two risky options; the data obtained in this way consist of a finite number of binary choices, which can then be used to partially recover a subject’s preference. The most influential among these procedures is that of Holt and Laury (2002, 2005) (HL), which derives risk-parameter intervals from a series of ordered binary lottery choices. However, it has been argued that this method might be too complex and too difficult to understand (Charness & Gneezy, 2010), especially for non-student populations (Yu et al., 2019).

If a method is perceived as too complex, estimates become inconsistent and reliability might be questionable (Dave et al., 2010; Charness et al., 2018). As a consequence, other methods have tried to simplify the elicitation procedure. The procedure of Eckel and Grossman (2002, 2008) (EG), designed as a direct alternative to HL, involves a single choice among six gambles, all with a 0.5 probability of winning the higher prize. Yet another popular alternative is the Investment Task (INV) of Gneezy and Potters (1997) (see also Charness & Gneezy, 2010; Charness et al., 2013), which simply asks individuals to allocate money between a safe option and a risky one. This latter method belongs to a growing strand employing choice tasks over a fixed budget set, either in the form of an explicit allocation of a monetary budget or as a direct choice from, say, a linear budget set (e.g., Gneezy & Potters, 1997; Choi et al., 2007a, b, 2014; Ahn et al., 2014; Hey & Pace, 2014; Castillo et al., 2017; Halevy et al., 2018; Kurtz-David et al., 2019; Polisson et al., 2020; Daniel et al., 2022, among others). These tasks are also often used to test the consistency of subjects’ choices with the Generalized Axiom of Revealed Preference (GARP) (see, among many others, Drichoutis & Nayga, 2020). In these tasks, which I will refer to as budget-choice tasks, subjects choose a preferred option from an effectively infinite set of alternatives.

However, “risk elicitation is a risky business” (Friedman et al., 2014), and while budget-choice tasks are becoming more popular than binary-choice procedures, their properties remain largely untested. In this work, I tackle two specific questions in this direction. First, it is unclear whether empirical implementations of portfolio-choice tasks have greater predictive validity than binary-choice tasks. By predictive validity, I mean the ability to actually predict risky choices out of sample. Second, even though such methods often aim to reduce complexity compared to binary-choice tasks, they involve large choice sets, and hence it is reasonable to ask to what extent subjects indeed fully understand the procedures involved.

To answer the first question, this work empirically compares the out-of-sample predictive ability of HL, INV, and EG, the most commonly used risk-elicitation procedures in economics. To this end, I conducted an experiment including HL, INV, EG, and a separate block of 36 lottery choices, hence providing a clear metric to judge the out-of-sample predictive ability of these methods. Although these tasks are clearly different, it is an empirical question which one is better at predicting subsequent choices (and hence at eliciting risk attitudes). As HL, INV, and EG are often used interchangeably in the literature, it is further important to understand their properties and to have an objective criterion guiding the choice of which method to implement in a given experiment. To answer the second question, regarding subjects’ understanding of these tasks, the experiment included three further budget-choice tasks constructed in such a way that any risk-averse participant should have selected (the same) corner solutions; other choices are therefore indicative of confusion or lack of understanding. Finally, I explore potential heterogeneity in behavior in these tasks by eliciting demographic characteristics as well as participants’ numerical literacy and cognitive reflection.

The incentivized experiments relied on a general-population sample (\(N=403\) and \(N=400\)). Results show that the out-of-sample predictive ability of HL and INV is indistinguishable, and that overall performance is rather modest. The performance of EG was, under some specifications, worse than that of the other methods, which is not surprising given that it only allows categorizing decision makers into five risk categories and hence has mechanically less predictive power. Further, I find limited effects of individuals’ characteristics on the predictive power of the risk-elicitation tasks. Strikingly, in the additional budget-choice tasks, a large majority of subjects failed to report the normatively predicted corner solutions. Moreover, optimal choices in this context seem to depend on numerical literacy as well as cognitive abilities. These results cast doubt on the suitability of general budget-choice tasks for empirical applications in non-student populations or as a general method to test rationality.

The results in this manuscript go beyond the well-known observation that measurements of risk preferences are unreliable (Friedman et al., 2014) and often exhibit limited correlation with real-world behavior (see Charness et al., 2020, for a recent example). First, and in contrast with the literature, I concentrate on (out-of-sample) predictive ability as a well-defined criterion to evaluate measurement methods. Second, this work is part of the more recent but scarce literature investigating why risk-preference measurements are unreliable (Crosetto & Filippin, 2016; Holzmeister & Stefan, 2020). Specifically, the results described here suggest that lack of comprehension might be one of the leading explanations. Finally, this paper is also related to a different branch of the literature, namely the one that extensively uses budget-choice allocation tasks to test the consistency of subjects’ choices with GARP (Choi et al., 2007a; Choi et al., 2014; Kurtz-David et al., 2019; Polisson et al., 2020; Drichoutis & Nayga, 2020; Daniel et al., 2022). The results described here should be seen as a caveat regarding the robustness of tests built on budget-choice allocations. The widespread lack of understanding of this type of task in the general population suggests that systematic attention and comprehension checks should be implemented to increase the reliability of the data in this field.

The paper is structured as follows. Section 2 discusses the experimental design and procedures. Section 3 presents the results on predictive performance and correlation among measures. Section 4 reports the behavior in budget-choice tasks where the optima are corner solutions. Section 5 concludes.

2 Experimental design

The two experiments involved 403 and 400 individuals and used Prolific (Palan & Schitter, 2018), an online platform which allows recruiting from the general population. The heterogeneous composition of the sample is confirmed, e.g., by the distribution of age and employment status. Subjects were on average 33 years old (SD 11.546, minimum 18, maximum 82; for the second experiment, average age was 38, SD 12.93, minimum 20, maximum 87). Among participants, 49.95% (56.25%) were fully employed, 19.92% (23.00%) worked part-time, 11.10% (14.00%) were housekeepers, and 9.07% (3.75%) were unemployed. 68% (60%) of the sample was female.

Subjects were paid based on their answers for one randomly-sampled decision. Average earnings were GBP 5.47 (4.45) including 1.25 for completing the experiment (SD \(= 6.58\), min \(=1.25\), max \(=22.25\); for the second experiment SD \(= 5.90\), min \(=1.25\), max \(=23.25\)).

The experiments were programmed in Qualtrics. The first experiment consisted of five parts in the following order: three budget-choice slider tasks, 36 binary lottery choices, an implementation of HL, an implementation of INV, and a repetition of the three budget-choice slider tasks with increased incentives (\(5\times\)). At the end of the experiment, a (self-reported, non-incentivised) Qualitative Risk Assessment measure (QRA) (Dohmen et al., 2011; Falk et al., 2018) was implemented to investigate its correlation with HL and INV. The second experiment encompassed the first while adding the EG task, the Cognitive Reflection Task (Frederick, 2005), and a measure of numeracy ability (Lipkus et al., 2001).

Discussion of the budget-choice tasks is relegated to Section 4 below, which also describes their implementation. Implementation of the other methods was kept as close as possible to the originals, with payoffs scaled to ensure comparability across tasks and to guarantee the expected earnings prescribed by Prolific. HL was implemented using an ordered list of 10 binary choices, such that subjects should start by choosing the safer option (presented on the left) and then indicate a preference for the right option as they proceed along the list, with the switching point indicating their risk attitudes (see Csermely & Rabas, 2016, for an illustration of the different implementations of this task). INV allowed participants to invest part of their endowment in a lottery that paid 2.5 times the amount invested with a 50% chance and GBP 0 otherwise, while keeping the part of their budget that was not invested. For QRA, subjects were directly asked to state their willingness to take risks on a 0–10 scale. EG was implemented as a choice among six different lotteries, following the standard in the literature (Dave et al., 2010). Further details on the implementation of the tasks are given in Appendix A, and instructions are presented in Appendix B.

Four of the 36 lottery choices involved a dominance relation. These choices were implemented as a check of participants’ attention and comprehension. The remaining 32 lottery choices were used to assess the out-of-sample predictive performance of the different methods (see Appendix B for the list of lotteries). To ensure an unbiased selection, the set of lotteries used in this phase was constructed following optimal design theory (Silvey, 1980) in the context of nonlinear (binary) models (Ford et al., 1992; Atkinson, 1996); see also Moffatt (2015) for a detailed explanation of the procedure.

3 Comparison of methods

This section presents the results of the experiment. Subsection 3.1 gives an overview of estimated risk attitudes; Subsection 3.2 compares the predictive performance of HL, EG, and INV; Subsection 3.3 compares HL, EG, and INV with a structural econometric estimation using the block of 32 binary choices; and Subsection 3.4 explores the potential role of heterogeneity in influencing the predictive power of the different measures. The results of the two experiments are qualitatively identical, and they are presented together.

3.1 Descriptive results

Following influential contributions in the estimation of risk attitudes (e.g., Andersen et al., 2008; Wakker, 2008; Dohmen et al., 2011; Gillen et al., 2019), and in agreement with standard analyses of HL and INV, in this paper I adopt the constant relative risk aversion (CRRA) specification for all incentivized procedures, defined by:

$$U(x) = \begin{cases} \dfrac{x^{1-r}}{1-r}, &{}\text{ if } r \ne 1,\\ \ln x, &{}\text{ if } r=1. \end{cases}$$

In Appendix A, I repeat the analyses reported in the main text assuming a different functional form for the utility function (CARA instead of CRRA); the results are qualitatively unchanged. According to the assumed utility function, the vast majority of subjects are classified as risk averse, as commonly found in the literature (Gneezy & Potters, 1997; Holt & Laury, 2002; Harrison et al., 2007). In particular, according to HL only 27.30% (exp 2: 14.50%) of subjects are classified as risk seeking, while EG classifies 11.25% of participants in this category. INV does not distinguish between risk-seeking and risk-neutral subjects, since it does not allow for negative values of the relative risk aversion coefficient.
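
For concreteness, the specification above can be written in a few lines of code. The following Python sketch (with illustrative names, not the code used for the analyses) evaluates CRRA utility and the expected utility of a lottery:

```python
import numpy as np

def crra_utility(x, r):
    """CRRA utility: x^(1-r)/(1-r) for r != 1, with the ln(x) case at r = 1."""
    x = np.asarray(x, dtype=float)
    if np.isclose(r, 1.0):
        return np.log(x)
    return x ** (1.0 - r) / (1.0 - r)

def expected_utility(probs, outcomes, r):
    """Expected CRRA utility of a lottery [p_1, x_1; ...; p_n, x_n]."""
    return float(np.sum(np.asarray(probs) * crra_utility(outcomes, r)))

# Example: a moderately risk-averse agent (r = 0.5) evaluating a 50/50 lottery
print(expected_utility([0.5, 0.5], [4.0, 1.0], r=0.5))
```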

The average estimated risk attitude using HL is 0.309 (SD 0.605; exp 2: 0.598, SD 0.656); for EG it is 1.397 (SD 1.204), while for INV it is 6.343 (SD 34.996; exp 2: 2.978, SD 22.36). This very large difference is striking. Examination of the data shows that the discrepancy is due to the fact that, in this sample, 34.49% of participants (38.00%) gave “focal-point answers,” investing amounts of exactly \(0\%\) (with an implied \(r\ge 223.1\)), exactly \(100\%\) (with an implied \(r\le 0\)), or exactly \(50\%\) (with an implied \(r\simeq 0.65\)). This observation already suggests that budget-choice tasks might be mechanically biased due to subjects’ lack of comprehension or attention. Excluding the 40 (43) subjects who reported corner solutions (\(0\%\) or \(100\%\)), the average estimated risk attitude using INV is 0.840 (SD 1.939; exp 2: 1.444, SD 11.925). Excluding all 139 (152) subjects reporting \(0\%\), \(100\%\), or \(50\%\), the average is 0.913 (SD 2.270; exp 2: 1.796, SD 14.302). In the subsequent analysis, no subjects are excluded, but results are qualitatively unchanged when restricting the sample to those subjects not reporting corner solutions in the INV task.

In HL, \(31.51\%\) (28.00%) of subjects switched between the left and the right option multiple times. Following the literature, instead of excluding these subjects, subsequent analyses consider the total number of “safe” (left) choices as an indicator of risk aversion (Holt & Laury, 2002, 2005). This yields a comparable sample size across the different measures of risk attitudes. However, the results are not affected if I use only “consistent” subjects (those who switched only once).

The literature has typically found that risk attitudes estimated through different elicitation methods are often uncorrelated (Friedman et al., 2014; Charness et al., 2020). This is also partially true in the datasets at hand: HL and INV are not significantly correlated (Pearson’s \(r=0.01\), \(p=0.877\); exp 2: \(r=-0.05\), \(p=0.283\)), while EG is correlated with HL but not with INV (\(r=0.184\), \(p=0.001\); \(r=-0.080\), \(p=0.111\)). However, the three measures are not significantly correlated when restricting to subjects who behaved consistently in the HL task (HL with INV: \(r=-0.030\), \(p=0.771\); exp 2: \(r=-0.074\), \(p=0.435\); HL with EG: \(r=0.145\), \(p=0.125\); INV with EG: \(r=-0.175\), \(p=0.065\)) or to those reporting interior solutions in the INV task (HL with INV: \(r=0.010\), \(p=0.911\); exp 2: \(r=-0.068\), \(p=0.406\); HL with EG: \(r=0.154\), \(p=0.058\); INV with EG: \(r=-0.047\), \(p=0.565\)). There is a positive, although small, correlation between the self-reported, non-incentivised willingness to take risks (QRA) and HL (\(r=0.106\), \(p=0.034\); exp 2: \(r=0.231\), \(p=0.001\)), as well as between QRA and EG (\(r=0.217\), \(p=0.001\)).

3.2 Predictive performance

Figure 1 presents the out-of-sample predictive performance of INV, HL, and EG in the two experiments. In particular, I report the distribution (violin plot) of the proportions of correctly predicted choices at the individual level, using the individually estimated risk attitudes.

Fig. 1: Distribution of the average proportions of out-of-sample predicted choices for the elicitation methods in the two experiments (first on the left, second on the right). Violin plots show the median, the interquartile range, and the 95% confidence intervals, as well as rotated kernel density plots on each side. The red horizontal line indicates the predicted behavior of an expected-value maximiser

There is no significant difference in the proportion of correctly predicted choices between INV and HL (INV: 66.51%, HL: 67.18%; Wilcoxon signed-rank (WSR) test, \(N=403\), \(z=0.485\), \(p=0.628\); exp 2: INV: 69.88%, HL: 68.67%; WSR test, \(N=400\), \(z=1.145\), \(p=0.252\)), and performance is modest in both cases. However, both measures perform better than EG (60.90%; INV: WSR test, \(N=400\), \(z=5.863\), \(p<0.001\); HL: WSR test, \(N=400\), \(z=5.461\), \(p<0.001\)). These levels of performance for HL and INV are only slightly above the predictive ability of simply assuming expected-value maximization (i.e., risk neutrality), which is 64.74% (65.07%), while EG performs statistically significantly worse than this benchmark (for INV: WSR test, \(N=403\), \(z=2.691\), \(p=0.007\); exp 2: \(N=400\), \(z=3.739\), \(p<0.001\); for HL: WSR test, \(N=403\), \(z=4.451\), \(p<0.001\); exp 2: \(N=403\), \(z=4.932\), \(p<0.001\); EG: WSR test, \(N=400\), \(z=-3.319\), \(p=0.001\)).
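
To make the performance metric explicit: given an elicited parameter r, each binary choice is predicted as the option with the higher expected CRRA utility, and performance is the share of observed choices matched. A minimal sketch, reusing the `expected_utility` helper above and assuming, purely for illustration, that each lottery is stored as a `(probs, outcomes)` pair and choices are coded 0/1:

```python
def predictive_accuracy(lottery_pairs, observed, r):
    """Share of observed binary choices (0 = option A, 1 = option B)
    matched by an expected-CRRA-utility prediction with parameter r."""
    hits = 0
    for (lottery_a, lottery_b), choice in zip(lottery_pairs, observed):
        eu_a = expected_utility(*lottery_a, r)
        eu_b = expected_utility(*lottery_b, r)
        predicted = 0 if eu_a >= eu_b else 1
        hits += int(predicted == choice)
    return hits / len(observed)

# Per-subject accuracies under two methods can then be compared with a
# paired test, e.g. scipy.stats.wilcoxon(accuracies_inv, accuracies_hl).
```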

At the individual level, INV predicts out-of-sample choices significantly better than HL for 23.57% (32.79%) of individuals. Conversely, HL predicts significantly better than INV for 23.08% (33.53%) of individuals. Therefore, there is no clear difference in predictive power between these two methods.

A fraction of the subjects (17.82%, 11.00%) chose a dominated option at least once. The comparison of out-of-sample predictive performance is qualitatively unchanged if these potentially confused subjects are excluded from the analysis. In particular, there are no significant differences in the percentage of correctly predicted choices (INV: 68.97%; HL: 69.10%; WSR test, \(N=332\), \(z=0.757\), \(p=0.449\); exp 2: INV: 70.73%; HL: 69.69%; WSR test, \(N=356\), \(z=1.008\), \(p=0.313\)). Moreover, both still perform better than EG (61.42%; WSR test, \(N=356\), \(z=5.317\), \(p<0.001\); \(N=356\), \(z=5.713\), \(p<0.001\)). Since not all subjects behaved consistently in the HL task (68.49%; 72.00% did), one could argue that the predictive performance of this method should be evaluated only on the sub-sample displaying a unique switching point. However, there are also no differences when focusing on this subset of participants (HL: 69.19%; INV: 67.75%; WSR test, \(N=276\), \(z=0.017\), \(p=0.987\); exp 2: HL: 69.47%; INV: 70.59%; WSR test, \(N=288\), \(z=-0.905\), \(p=0.366\)). Results are also unchanged when restricting the analysis to participants reporting interior solutions in the INV task (HL: 66.75%, INV: 67.69%; WSR test, \(N=363\), \(z=-1.222\), \(p=0.222\); exp 2: HL: 68.86%, INV: 70.10%; WSR test, \(N=357\), \(z=-1.331\), \(p=0.183\)).

3.3 Comparison with structurally-estimated risk attitudes

As an alternative way to compare the predictive performance of the elicitation methods, I used a maximum likelihood (ML) procedure to estimate each subject’s risk attitudes (e.g., Harrison et al., 2005, 2007; Harrison & Rutström, 2008; Harrison et al., 2019) from their decisions in the set of 32 lottery choices. The procedure followed the approach described in Moffatt (2015, Chapter 13) and implemented well-established techniques used in many recent contributions (Gaudecker et al., 2011; Conte et al., 2011; Moffatt, 2015; Alós-Ferrer & Garagnani, 2021, 2022). I estimated an additive random utility model, which considers a given utility function plus an additive noise component (e.g., Thurstone, 1927; Luce, 1959; McFadden, 2001). Specifically, I assumed CRRA utility and normally distributed errors. To account for individual heterogeneity, I further assumed that the risk parameter is normally distributed over the population, estimated the parameters of this distribution, and derived individual risk attitudes by updating from the population-level prior so obtained (e.g., Harless & Camerer, 1994; Moffatt, 2005; Harrison & Rutström, 2008; Bellemare et al., 2008; Gaudecker et al., 2011; Conte et al., 2011; Moffatt, 2015).
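
As a rough illustration of the individual-level building block of this procedure, the sketch below writes the likelihood of one subject’s binary choices under CRRA utility with additive, normally distributed (Fechner) noise and maximizes it numerically. The hierarchical population-level step described above is omitted, `expected_utility` is the helper sketched earlier, and all names and starting values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(params, lottery_pairs, observed):
    """Negative log-likelihood of binary choices under an additive random
    utility model: option A is chosen with prob. Phi((EU_A - EU_B)/sigma)."""
    r, log_sigma = params
    sigma = np.exp(log_sigma)  # log-parametrization keeps sigma > 0
    ll = 0.0
    for (lottery_a, lottery_b), choice in zip(lottery_pairs, observed):
        diff = expected_utility(*lottery_a, r) - expected_utility(*lottery_b, r)
        p_a = np.clip(norm.cdf(diff / sigma), 1e-10, 1 - 1e-10)
        ll += np.log(p_a if choice == 0 else 1.0 - p_a)
    return -ll

# Illustrative call for one subject's 32 choices:
# fit = minimize(neg_log_likelihood, x0=[0.5, 0.0],
#                args=(lottery_pairs, observed), method="Nelder-Mead")
```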

The average estimated risk attitude following this method is 0.418 (SD 0.321; exp 2: 0.397, SD 0.342). I then compared the results to the estimated risk parameters from HL, INV, and EG. There is a positive correlation between HL and ML (\(r=0.225\), \(p<0.001\); \(r=0.272\), \(p<0.001\)), and between EG and ML (\(r=0.166\), \(p=0.001\)) but there is no significant correlation between INV and ML (\(r=0.070\), \(p=0.160\); \(r=0.075\), \(p=0.132\)).

3.4 Heterogeneity and predictive performance

This subsection reports an explorative analysis of whether the predictive performance of the different risk-elicitation tasks depends on participants’ characteristics. To this end, in the second experiment I collected several demographic variables as well as answers to the Cognitive Reflection Task (CRT) (Frederick, 2005) and the Numeracy Scale (Lipkus et al., 2001).

One possibility is that the predictive performance of the risk-elicitation tasks is poor because the sample differs from the standard (well-educated) student population of laboratory experiments. In the selected sample, 2.50% of the participants reported having a doctorate degree, while 73.50% had at least an undergraduate degree. However, there are no systematic differences in the predictive performance of the tasks with respect to the highest educational level achieved. In particular, in none of the tasks do those with doctoral degrees have more correctly predicted choices than everybody else (INV: 69.91% vs. 68.61%, Wilcoxon rank-sum (WRS) test, \(N=400\), \(z=-0.667\), \(p=0.504\); HL: 68.51% vs. 75.00%, WRS, \(N=400\), \(z=-1.242\), \(p=0.214\); EG: 60.86% vs. 62.50%, WRS, \(N=400\), \(z=-0.410\), \(p=0.682\)). There are also no significant differences in predictive performance between those who have at least an undergraduate degree and those who do not. There is hence no evidence of an effect of education on the predictive performance of the different risk-elicitation tasks.

Another argument involves subjects’ numeracy skills. Because all the risk-elicitation tasks involve numbers and probabilities, it is reasonable to expect differential performance depending on subjects’ numeracy skills, especially for those tasks which have been argued to be more complex, e.g., HL (Dave et al., 2010; Charness et al., 2018). This might be true even for educated individuals, as there is evidence that they are often unable to convert a percentage into a frequency (Lipkus et al., 2001). I measure participants’ numeracy skills with a well-established procedure, the Numeracy Scale of Lipkus et al. (2001), which has been shown to correlate with real-life outcomes such as wealth (Estrada-Mejia et al., 2016). No participant answered all three of the standard questions correctly. The average number of correct answers is 1.623 with a median of 2, and 7.50% of subjects answered no question correctly. Out of the three risk-elicitation tasks, there is a significant difference in predictive performance only for INV, with participants answering more than one question of the numeracy scale correctly presenting a higher proportion of correctly predicted choices (71.03%) than the others (67.24%; WRS, \(N=400\), \(z=2.433\), \(p=0.015\)). For the other two tasks there are no statistically significant effects (HL: 69.69% vs. 66.32%; WRS, \(N=400\), \(z=1.276\), \(p=0.202\); EG: 62.16% vs. 58.01%; WRS, \(N=400\), \(z=1.646\), \(p=0.100\)).

Lastly, I implement the CRT of Frederick (2005) in the version proposed by Alós-Ferrer et al. (2016) in order to avoid recognition effects, as the CRT in its classical form often appears in the popular press. Alós-Ferrer and Hügelschäfer (2016) and Haita-Falah (2017) find that higher CRT scores are correlated with a lower incidence of certain economic biases, e.g., the conjunction fallacy, conservatism, and the sunk-cost fallacy. Moreover, as argued by Toplak et al. (2011), low CRT scores might indicate a tendency to act on impulse and give an intuitive response. 30.75% of participants answered all three standard questions correctly. The average number of correct answers is 1.615 with a median of 2, and 22.75% of subjects answered no question correctly. For INV and HL there is a significant difference in predictive performance, with participants answering more than one CRT question correctly presenting a higher proportion of correctly predicted choices than the others (INV: 71.99% vs. 64.86%; WRS, \(N=400\), \(z=3.857\), \(p<0.001\); HL: 71.15% vs. 68.43%; WRS, \(N=400\), \(z=2.006\), \(p=0.045\)). For EG there are no statistically significant effects (61.37% vs. 60.36%; WRS, \(N=400\), \(z=0.397\), \(p=0.692\)).

These results offer limited insight into potential heterogeneity in the predictive performance of the different risk-elicitation tasks. Even in a general, but educated, population sample, the different tasks do not seem to be difficult to comprehend, as exemplified by the lack of significant differences in their predictive power between groups of individuals with different education or numerical literacy. However, there are some indicative results pointing to a relation between impulsivity, as measured by cognitive reflection, and the reliability of most risk-elicitation tasks. These results should, however, be taken with a grain of salt, and more research on this question is needed.

4 Slider budget-choice tasks

In each of the three additional budget-choice tasks, participants had to select their preferred lottery from a linear set. The tasks were implemented in the form of sliders, as often done in the literature (e.g., Gneezy & Potters, 1997; Kurtz-David et al., 2019; Gillen et al., 2019, among others). In particular, participants were asked to indicate which option they preferred by moving a slider, with values changing in real time.

The possible values of the sliders were constrained. To avoid losses, all monetary outcomes were positive (larger than or equal to one penny). Further, probabilities belonged to the interval [0.05, 0.95], to avoid confounds due to focal points or heuristics (e.g., the certainty effect). Moreover, in order to show that the results do not depend on the particular probabilities or outcomes chosen by the experimenter, the range of possible values for the second and third slider depended on the subjects’ choices in the first slider task. Hence, they potentially assumed different values for each participant. However, the sliders were designed such that, independently of individual differences between subjects, the optimum of the underlying maximization problems was the same corner solution for every (risk-averse) participant.

The set of sliders was presented twice, with different levels of incentives. Specifically, a first version was presented at the beginning of the experiment, and a version with incentives multiplied by five was presented at the end. This aimed to test whether stake sizes influence the stability of risk preferences (Bandyopadhyay et al., 2021).

4.1 Design of the sliders

The first slider described the set of lotteries \(\left\{ [p,q; \, 1-p,0]\; \left| \; \; pq=K\right. \right\}\). That is, all lotteries in the slider have the same expected value K, with \(K=4\) (GBP) for the first three sliders and \(K=20\) for the high-incentives version. The slider changed p and q simultaneously, preserving the expected value. Thus, the underlying maximization problem was

$$\begin{array}{rl} \max_{p,q} & p\cdot u(q)\\ \text{s.t.} & pq=K,\; p\in [0.05,0.95], \end{array}$$

or, equivalently,

$$\max _{p\in [0.05,0.95]} p\cdot u\left( \frac{K}{p}\right) .$$

A simple computation (see Appendix C for details) shows that the objective function in the last problem is strictly increasing for any twice-differentiable utility function with \(u''(\cdot )<0\). Hence, in normative terms risk-averse participants should report the corner solution \(\hat{p}=0.95\).
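
This monotonicity is easy to verify numerically; a minimal check under an illustrative CRRA parametrization (the formal computation is in Appendix C):

```python
import numpy as np

# Slider 1: the objective p * u(K/p) on p in [0.05, 0.95]. Under CRRA it
# equals K^(1-r) * p^r / (1-r), which is increasing in p for 0 < r < 1.
K, r = 4.0, 0.5  # illustrative values
p = np.linspace(0.05, 0.95, 500)
objective = p * (K / p) ** (1.0 - r) / (1.0 - r)
assert np.all(np.diff(objective) > 0)  # strictly increasing: corner at p = 0.95
```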

Let \(p^*\) be the participant’s actual answer to the first slider, and let \(q^*= K/p^*\). The next two sliders depend on the chosen values \(p^*\) and \(q^*\).

The second slider described the set of lotteries \(\left\{ [p^*,q+z;1-p^*,z]\;\left| \; \; p^*q+z=K\right. \right\}\). That is, again all lotteries in the slider have the same expected value K. The slider moves q and z simultaneously preserving the expected value. The maximization problem is

$$\begin{array}{rl} \max_{q,z} & p^*\cdot u(q+z)+ (1-p^*)\cdot u(z)\\ \text{s.t.} & p^*q+z=K,\; q\ge 0.01,\; z\ge 0, \end{array}$$

or, equivalently,

$$\max _{q\in [0.01,q^*]} p^*\cdot u\left( K+(1-p^*)q\right) + (1-p^*) \cdot u\left( K-p^*q\right) .$$

A direct computation shows that the objective function in this problem is strictly decreasing whenever \(u''(\cdot )<0\). Hence, risk-averse participants should report the corner solution \(\hat{q}=0.01\).
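
The analogous numerical check for the second slider, again with illustrative values of \(p^*\), K, and r:

```python
import numpy as np

# Slider 2: increasing q spreads the lottery around the constant mean K,
# so any concave u dislikes it and the optimum is at the minimal q.
p_star, K, r = 0.8, 4.0, 0.5  # illustrative values
q_star = K / p_star
u = lambda x: x ** (1.0 - r) / (1.0 - r)
q = np.linspace(0.01, q_star, 500)
objective = p_star * u(K + (1 - p_star) * q) + (1 - p_star) * u(K - p_star * q)
assert np.all(np.diff(objective) < 0)  # strictly decreasing: corner at q = 0.01
```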

The third slider described the set of lotteries \(\left\{ [p,q^*+z; \, 1-p,z]\;\left| \; \; pq^*+z=K\right. \right\}\). As in the previous cases, all lotteries in this slider have the same expected value K. The slider moves p and z simultaneously preserving the expected value. The maximization problem is

$$\begin{array}{rl} \max_{p,z} & p\cdot u(q^*+z)+ (1-p)\cdot u(z)\\ \text{s.t.} & pq^*+z=K,\; p\ge 0.05,\; z\ge 0, \end{array}$$

or, equivalently,

$$\max _{p\in [0.05,p^*]} p\cdot u\left( K+(1-p)q^*\right) + (1-p) \cdot u\left( K-pq^*\right) .$$

Even replacing \(u(\cdot )\) with a CRRA functional form, the objective function in this problem is not analytically tractable. Numerical results, however, show that the problem has a corner solution at the lower extreme \((p=0.05)\) for all subjects with moderate risk aversion \((0<r<1)\). Specifically, 247 (exp 2: 249) participants were classified as moderately risk averse according to HL, resulting in four different possible values of r (but different ranges of p depending on \(p^*\)). The numerical solution of the optimization problems using a CRRA utility function with those possible risk parameters has a corner solution at \(p=0.05\) for all 247 (249) moderately risk-averse participants. An additional 46 (46) subjects were classified with \(r>1\) according to HL. Of those, 4 (2) had a corner solution at \(p=0.05\), 16 (0) had a corner solution at \(p=0.95\), and 90 (44) had interior optima. A further 110 (110) subjects were classified with \(r<0\) according to HL. Of those, 2 (4) had a corner solution at \(p=0.05\) and 44 (90) had interior optima. Therefore, all subjects with moderate risk aversion \((0<r<1)\) had optima at the corner solution \(p=0.05\) in the third slider in both experiments.
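
The corner result for the third slider can be reproduced by direct numerical maximization on a grid; a sketch with illustrative values of \(p^*\) and r:

```python
import numpy as np

# Slider 3: maximize over p in [0.05, p_star] for given p_star and q_star.
p_star, K, r = 0.8, 4.0, 0.5  # illustrative values; 0 < r < 1
q_star = K / p_star
u = lambda x: x ** (1.0 - r) / (1.0 - r)
p = np.linspace(0.05, p_star, 2000)
objective = p * u(K + (1 - p) * q_star) + (1 - p) * u(K - p * q_star)
print("optimum at p =", p[np.argmax(objective)])  # prints 0.05: lower corner
```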

4.2 Behavior in budget-choice tasks

To account for imprecision and noise in the use of the interface, and since participants could only select increments of 0.01, I conservatively define a choice to be a corner solution when the chosen value is within the lower (higher) 10% of possible values. For the first slider, this means that an answer was classified as the upper corner solution if it was larger than or equal to \(0.95-0.09=0.86\). For the second and third sliders, the range of possible values depended on subjects’ previous choice.
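
In code, this classification rule reads as follows (a minimal sketch; the function name is illustrative, the 10% band is the one described above):

```python
def classify_answer(value, lo, hi, band_frac=0.10):
    """Classify a slider answer as a lower/upper corner solution if it lies
    within the lowest/highest 10% of the admissible range, else interior."""
    band = band_frac * (hi - lo)
    if value <= lo + band:
        return "lower corner"
    if value >= hi - band:
        return "upper corner"
    return "interior"

# First slider: range [0.05, 0.95], so answers >= 0.95 - 0.09 = 0.86
# count as the upper corner solution.
assert classify_answer(0.86, 0.05, 0.95) == "upper corner"
```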

Fig. 2: Distribution of answers in the first (top), second (middle), and third slider (bottom) in the first repetition (left panels) and with higher incentives (right panels). The equivalent figures for the second experiment are reported in Appendix E

Behavior and results in the slider tasks are very similar in the two experiments; I report the results of the first experiment in the main text and those of the second in Appendix E. Figure 2 shows that behavior in the additional budget-choice tasks was far from the normative optima. Choices for the low-incentive (\(K=4\)) slider tasks are displayed in the left-hand panels. Only \(4.47\%\) (18 subjects) reported all correct corner solutions (of whom 13 were risk averse according to HL and 5 had \(r<0\)), and only 49.13% (198) reported at least one of the three correct corner solutions (of whom 147 were risk averse according to HL).

The extraordinarily high levels of suboptimal behavior in the slider tasks are at odds with what would be expected due to lack of comprehension or attention in other choice tasks. As a comparison, and as reported above, only 17.82% (\(N=72\)) of subjects made one or more dominated choices in the binary choice task. This further rules out that the poor performance in the budget-choice tasks was due to general inattention to the experiment.

For the first slider, only 80 (19.85%) of the 403 subjects reported the correct (upper) corner solution. Of the 293 participants classified as risk averse according to HL, only 61 (20.82%) reported that corner solution (recall that INV cannot classify participants as risk seeking). For the second slider, only 131 (32.51%) of the 403 participants reported the correct (in this case, lower) corner solution, including only 100 (34.13%) of the 293 participants classified as risk averse according to HL. In the third slider, the proportion of subjects reporting the correct (lower) corner solution was 20.35% (82 of 403), including only 80 (32.39%) of the 247 participants classified as moderately risk averse (\(0<r<1\)) according to HL.

Needless to say, given that the sliders should have elicited corner solutions for risk-averse individuals, these numbers are very low, suggesting low levels of understanding of the budget-choice tasks. It is hence natural to ask whether understanding would increase with higher incentives. At the end of the experiment, the sliders were presented again, but with incentives multiplied by 5 (\(K=20\) instead of \(K=4\)), which also changed all involved outcomes. The right-hand panels in Figure 2 display the choice histograms for these versions of the sliders and illustrate that there is only mixed evidence that subjects choose the correct corner solutions more frequently under increased incentives.

For the first slider, 129 (32.01%) of the 403 subjects reported the correct (upper) corner solution (99, or 33.79%, of the 293 risk-averse ones according to HL). This is a significant increase with respect to the 19.85% under low incentives (test of proportions, \(N=403\), \(z=3.938\), \(p<0.001\)). For the second slider, 120 (29.78%) of the 403 participants reported the correct (lower) corner solution (92, or 31.40%, of the 293 risk-averse ones according to HL). This is not significantly different from the 32.51% under low incentives (test of proportions, \(N=403\), \(z=0.837\), \(p=0.403\)). In the third slider, 97 (24.07%) of the 403 subjects reported the correct (lower) corner solution (59, or 23.89%, of the 247 participants classified as moderately risk averse according to HL). Again, this is not significantly different from the 20.35% under low incentives (test of proportions, \(N=403\), \(z=-1.271\), \(p=0.204\)).
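
These comparisons are two-sample tests of proportions; for instance, the first slider’s low- versus high-incentive comparison can be reproduced with `statsmodels` (an illustration using the counts reported above, assuming a standard pooled two-proportion z-test):

```python
from statsmodels.stats.proportion import proportions_ztest

# First slider: 80/403 correct corner answers under low incentives
# versus 129/403 under high incentives.
z, p = proportions_ztest(count=[129, 80], nobs=[403, 403])
print(round(z, 3), round(p, 4))  # z close to 3.94, in line with the text
```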

4.3 Heterogeneity in the budget-choice tasks

This subsection reports an explorative analysis of whether behavior in the budget-choice tasks depends on participants’ characteristics such as their numerical literacy, education, and cognitive reflection.

The argument made above, linking participants’ highest educational level to the predictive power of the risk-elicitation tasks, can also be made for behavior in the budget-choice tasks. The intuition is that more educated subjects might behave more optimally because they understand the tasks better. However, there are no systematic differences in this respect. Out of the six sliders (three tasks at two incentive levels), only one, the second slider under high incentives, shows a statistically significant difference in the proportion of optimal behavior between subjects with at least an undergraduate degree and all others (26.42% vs. 36.05%, test of proportions, \(z=-1.803\), \(p=0.036\)).

In contrast to education, numerical literacy, as measured by the Numeracy Scale, captures systematic heterogeneity in subjects’ behavior in the budget-choice tasks. For 5 out of the 6 sliders there are significant differences, with higher numerical literacy corresponding to more optimal behavior (slider 1: 31.54% vs. 24.79%, test of proportions, \(z=1.359\), \(p=0.087\); slider 2: 34.05% vs. 19.83%, \(z=2.857\), \(p=0.002\); slider 3: 24.37% vs. 14.05%, \(z=2.318\), \(p=0.010\); slider 4: 40.86% vs. 24.79%, \(z=3.075\), \(p=0.001\); slider 5: 40.86% vs. 16.53%, \(z=4.736\), \(p<0.001\); slider 6: 28.67% vs. 19.83%, \(z=1.851\), \(p=0.032\)). These results indicate that there is indeed a relation between numeracy and optimal behavior in this type of task.

Lastly, I investigate the relation between cognitive reflection, as measured by the CRT, and the proportion of optimal choices in the budget-choice tasks. The results resemble those for numeracy skills, with 4 out of the 6 sliders presenting significant differences. In particular, higher cognitive reflection corresponds to more optimal behavior (slider 1: 37.85% vs. 19.89%, test of proportions, \(z=3.928\), \(p<0.001\); slider 2: 31.78% vs. 27.42%, \(z=0.951\), \(p=0.171\); slider 3: 22.90% vs. 19.35%, \(z=0.864\), \(p=0.194\); slider 4: 41.59% vs. 29.57%, \(z=2.498\), \(p=0.006\); slider 5: 42.52% vs. 23.12%, \(z=4.101\), \(p<0.001\); slider 6: 32.24% vs. 18.82%, \(z=3.053\), \(p=0.001\)).

The exploration of heterogeneity in optimal behavior in the budget-choice tasks suggests that, even in a general but educated population sample, higher numeracy skills and cognitive reflection correlate positively with optimal behavior in this type of task. This again suggests that budget-choice tasks are hard to comprehend for the general population.

5 Discussion and conclusion

In two experiments with a general-population sample, I evaluate the predictive validity of three of the most common risk-attitude elicitation procedures: the choice-list procedure of Holt and Laury (2002), the investment task of Gneezy and Potters (1997), and the multi-alternative procedure of Eckel and Grossman (2002). Two of these tasks are indistinguishable in their ability to predict out-of-sample choices, and their performance is moderate at best; that is, it is only slightly better than that obtained by ignoring all individual information and predicting on the basis of expected-value maximization alone. The procedure introduced by Eckel and Grossman (2002), however, is outperformed by the other two and by simple expected-value maximization.

The experiment also included budget-choice tasks in which participants selected their preferred lotteries from linear budget sets, constructed such that risk-averse participants should have selected corner solutions. However, the vast majority of participants failed to do so. This strongly suggests that decision makers in non-student samples may have a limited understanding of budget-choice tasks, and hence that data collected using such tasks might not adequately reflect attitudes toward risk in the general population. This is a relevant observation, since such methods are widespread.

Finally, I investigate potentially heterogeneous factors influencing comprehension and performance in these tasks. While education, numerical literacy, and cognitive reflection have limited explanatory power for the predictive performance of the risk-elicitation tasks, they confirm that budget-choice tasks are hard to understand for the general population: participants with higher cognitive skills and numerical literacy perform significantly better in this type of task.

The results of this paper align with the limited external validity of measures of risk attitudes found in other studies (Dohmen et al., 2011; Charness et al., 2020). In light of these results, laboratory measures should not be expected to exhibit high levels of correlation with behavior in the field. In particular, measures based on budget-choice or portfolio-choice tasks should not be expected to deliver robust findings.

Furthermore, the results also have implications for other contexts in which tasks similar to budget-choice allocations are implemented. For example, these methods are often used to test the consistency of subjects’ choices with GARP (e.g., Choi et al., 2007a, 2014; Kurtz-David et al., 2019; Polisson et al., 2020; Drichoutis & Nayga, 2020). However, if a majority of decision makers have a limited understanding of these tasks, such tests cannot be expected to be reliable. Therefore, the results speak in favor of implementing systematic understanding checks to increase the reliability of the data and the robustness of the results when using budget-allocation tasks. One possibility is to include additional tasks designed in such a way that any risk-averse participant should give the same answer, as in the slider tasks reported in this work.