1 Introduction

Multiple-choice tests (MCTs) are one of the most widespread mechanisms for evaluating human capital (e.g., the Scholastic Aptitude Test, medical residency exams or driving license tests). There are different mechanisms for scoring MCTs. The “number right guessing” method awards points for correct answers and assigns zero points for omitted or wrong answers. With this scoring system, test takers have incentives to answer all questions regardless of whether they know the answer or not. Thus, the score includes an error component coming from those questions in which a student gets the correct answer by chance. To minimize this problem, examiners often penalize wrong answers.

MCT evaluation systems using penalties are widely employed around the world.Footnote 1 When wrong answers are penalized, test takers can avoid risk-taking by skipping items. Thus, under this scoring method, MCTs provide accessible and vast data on real life risk-taking decisions. In the present paper, we exploit MCTs to analyze framing effects on risk taking. By doing so, we provide field evidence showing that framing manipulations affect willingness to take risks in a real stakes context. At the same time, we derive some implications for test design.

In penalized MCTs, correct answers are typically announced as gains while wrong answers are announced as losses. Prospect Theory (Kahneman and Tversky 1979) predicts that individuals are loss-averse, i.e., they value losses relatively more than gains. In lab studies, the differences between loss and gain framings have been found to be especially relevant for risk-taking decisions (Tversky and Kahneman 1981). Our paper contributes to the field studies literature on framing effects (Ganzach and Karsahi 1995; Gächter et al. 2009; Arceneaux and Nickerson 2010; Bertrand et al. 2010; Fryer et al. 2012; Hossain and List 2012; Levitt et al. 2016; Hoffmann and Thommes 2020). Field studies on this topic have focused on whether the effectiveness of persuasive communication or incentives changes when they are framed as a loss or as a gain. By contrast, our field experiment focuses on the effects of framing on the willingness to accept risks. Previous field studies on this issue did not document framing effects (Krawczyk 2011, 2012; Espinosa and Gardeazabal 2013), with the sole exception of Wagner (2016), who found framing effects but in a non-incentivized setting. This scarcity of field evidence on risk-taking decisions is surprising considering the central attention that Kahneman and Tversky (1979) and Tversky and Kahneman (1981) devoted to this issue. In their seminal articles, they specifically consider framing effects on risk-taking decisions in a (non-incentivized) laboratory setting. Our paper contributes to the experimental literature that followed their articles by providing field evidence of framing effects on risk-taking.

We ran a field experiment using real stakes MCTs in higher education. Our intervention consisted of modifying the framing of rewards and penalties in an MCT that accounted for between 20% and 33% of students’ course grade. All the courses included in the experiment involved 6 credits, which is equivalent to 150 hours of students’ work according to the European Credit Transfer and Accumulation System. Although it is difficult to establish a quantitative measure of the size of the incentive, higher education students generally take their exams very seriously. Test scores have important consequences for undergraduate students in terms of costly effort if they fail the exam (studying for retakes), higher tuition fees (if they fail the course) and career prospects (the academic record is relevant for future jobs and fellowships).Footnote 2

To emphasize that the typical way of announcing grading in penalized MCTs is by mixing scores in the gain and in the loss domains, we refer to it as the Mixed-framing. Under Mixed-framing, correct answers will result in a 1 (normalized) point gain, wrong answers in a loss of \(\rho \in (0,1]\) points, and non-responses will receive zero points (neither a gain nor a loss). We propose a Loss-framing, where students are told that they will start the exam with the maximum possible grade; correct answers neither subtract nor add points, wrong answers will result in a loss of \(1+\rho\) points, and non-responses in a 1-point loss. The two scoring rules are mathematically equivalent. Thus, a rational test taker should provide the same response pattern under the two rules. However, we consider a model built on Prospect Theory (Kahneman and Tversky 1979) which predicts that students’ non-response will differ in the two framings. According to the model, loss-averse and risk-averse (in the gain domain) individuals will be more willing to provide a response under Loss-framing than under Mixed-framing. Given the prevalence of risk- and loss-aversion among the general population (Andersen et al. 2008; Booij and Van de Kuilen 2009; Gaechter et al. 2010; Dohmen et al. 2010; Von Gaudecker et al. 2011; Schleich et al. 2019), we expect the loss treatment to decrease students’ non-response rate (Hypothesis 1).

Penalties in the exams covered by our intervention are computed to guarantee that the expected value of random guessing is non-negative. Consequently, a decrease in non-response arising from random guessing is not expected to decrease test scores while a decrease in non-response coming from an educated guess (e.g., being able to disregard one of the alternatives) is expected to increase test scores. Thus, if our first hypothesis holds true, then we also expect test scores to be higher under Loss-framing than under Mixed-framing (Hypothesis 2).

Consistent with our theoretical results, subjects omit fewer questions under Loss-framing than under Mixed-framing. In particular, under the Loss-framing, the number of omitted items falls by 18%-20%, supporting Hypothesis 1. Thus, our experiment shows that being exposed to a Loss-framing matters for risk-taking decisions in a real stakes context. By contrast, the test scores and number of correct answers are not significantly affected by this reduction in non-response. Thus, we do not find evidence for Hypothesis 2. By exploiting question-level information, we show that the failure of Hypothesis 2 is driven by students under Loss-framing performing worse overall and not only in those additional questions answered as a response to the treatment.

In the last part of the paper, we try to disentangle risk attitude and loss attitude as drivers of non-response. To do so, we collected measures of risk-aversion and loss-aversion for a sub-sample of students participating in the field experiment. Despite the small sample size, this analysis suggests risk-aversion as the main channel through which the treatment operates.

Our results have direct implications for test design. Guessing adds noise to test scores and, hence, reduces their accuracy as a measure of knowledge. Penalties for wrong answers mitigate this problem by discouraging guessing but add potential biases in test scores: answering correctly no longer depends only on the level of knowledge but also on other traits such as risk- and loss-aversion. Recent literature documented a gender gap in guessing in MCTs and associated it with gender differences in risk-aversion (Baldiga 2013; Akyol et al. 2016; Iriberri and Rey-Biel 2021). According to our theoretical model, the Loss-framing can reduce some of these biases by reducing the influence of risk and loss attitude on non-response. This is partially confirmed by the fact that non-response is reduced under the Loss-framing condition. However, a change in the framing is ineffective in significantly reducing the gender gap in non-response. More strikingly, a change in the framing may have unintended consequences in terms of impaired performance that should be taken into consideration when designing tests.

2 Literature review

Seminal works by Kahneman and Tversky (1979) and Tversky and Kahneman (1981) challenged the paradigm of rational decision making. A prominent violation of rationality is the framing effect. Given a fixed set of alternatives, the final choice may change, depending on how information is presented. A clear illustration of this effect is the Asian disease problem (Tversky and Kahneman 1981), where decision makers prefer to take more risk when identical information is presented in terms of lives lost rather than in terms of lives saved. Many lab experiments followed Tversky and Kahneman (1981) to investigate the effect of framing on decision making in different contexts (see among others, Sonnemans et al. 1998; Lévy-Garboua et al. 2012; Loomes and Pogrebna 2014; Grolleau et al. 2016; Essl and Jaussi 2017; Charness et al. 2019).

Levin et al. (1998) proposed a typology for framing interventions. They divided them into i) risky choice framing à la Tversky and Kahneman (1981), ii) goal framing, which affects the effectiveness of persuasive messages, and iii) attribute framing, which affects the assessment of the characteristics of events or objects. Field studies on framing have notably focused on attribute and goal framing, finding mixed results. In consumer choice and marketing messages, Ganzach and Karsahi (1995) and Bertrand et al. (2010) found positive evidence of framing effects. More recently, Hossain and List (2012), Fryer et al. (2012) and Levitt et al. (2016) showed that framing monetary attributes as losses improves worker productivity, teacher performance and student test scores, respectively. By contrast, Hoffmann and Thommes (2020) found that Loss-framing backfires in motivating energy-efficient driving, List and Samek (2015) found no effect in fostering healthy food choices and Arceneaux and Nickerson (2010) found no framing effects in the context of political advertising. Gächter et al. (2009) found that only junior participants reacted to framing when early registration prices for a conference were presented either as a loss or as a gain. In contrast to these works, we study framing in the domain of risky choices which, as explained above, has been widely investigated in the lab but not in the field.

Studies in psychometrics have claimed the existence of framing effects on test-taking behavior (Bereby-Meyer et al. 2002, 2003). Critical differences exist between these studies and ours. Firstly, in contrast to our study, all these experiments except experiment 1 in Bereby-Meyer et al. (2003) compared non-equivalent scoring rules. Therefore, framing is not the only change operating between these methods and cannot be identified as being responsible for differences in non-response. Secondly, our results arise from a field experiment with real academic consequences, while theirs were obtained from lab experiments with students performing general knowledge tests where the reward is only given to top performers.

The closest papers to ours are Krawczyk (2011), Krawczyk (2012), Espinosa and Gardeazabal (2013), and Wagner (2016), which analyze framing effects by comparing score equivalent methods in field experiments. However, none of these compare the Mixed-framing to the Loss-framing.Footnote 3 On the one hand, Krawczyk (2012) and Espinosa and Gardeazabal (2013) reframed a Mixed-framing under a gain domain, finding no treatment effect. On the other hand, Krawczyk (2011) and Wagner (2016) compared framing manipulations under a gain and a loss domain. In this case, evidence is mixed: while Krawczyk (2011) did not find a framing effect on non-response, Wagner (2016) did find it. A remarkable difference between Wagner (2016) and the rest of these papers, including ours, is that the exams in his experiment did not entail academic consequences for test takers.

Another strand of literature focuses on analyzing the gender differences in test-taking. Females have been found to be negatively affected by the presence of penalties for wrong answers (Ramos and Lambating 1996; Baldiga 2013; Pekkarinen 2015; Akyol et al. 2016; Coffman and Klinowski 2020) or rewards for omitted answers (Iriberri and Rey-Biel 2019). This finding has been related to gender differences in self-confidence and risk-aversion.Footnote 4 A third explanation for gender differences in non-response in penalized MCTs could be differences in loss-aversion. Crosetto and Filippin (2013) found females to be more loss-averse than males but equally risk-averse, which can explain gender differences in non-response under a Mixed-framing. As the expected score from guessing tends to be positive, individuals with more risk aversion and less self-confidence are more negatively affected by the presence of penalties for wrong answers. Only Funk and Perrone (2016) found that females perform relatively better with penalties. The recent work by Espinosa and Gardeazabal (2020) is particularly related to our study. They specifically analyzed the effects of framing manipulation on gender differences in non-response and performance in college MCTs. When they compared a Mixed-framing scenario to a gain-framing scenario, as in Espinosa and Gardeazabal (2013), they did not observe a framing effect on aggregate non-response but did observe a framing effect on gender differences in non-response and performance.

As we make explicit in our model, both risk- and loss-aversion may induce non-response in an MCT. Karle et al. (2019) disentangled the effect of risk- and loss-aversion in MCTs by matching data from subjects’ exams and the results of classroom experiments to measure subjects’ risk and loss preferences. They found that subjects’ omission patterns in MCTs correlated with loss-aversion but not with risk-aversion. We conducted an incentivized on-line questionnaire and interacted measures of loss- and risk-aversion with the framing. In our case, only risk-aversion seems to drive our treatment effect. However, it should be noted that we only used a small sub-sample to conduct this analysis. Also, the on-line nature of our data might provide lower quality measures than the ones obtained by Karle et al. (2019) in the classroom.

When evaluating test scores, our results support the possibility that performance is impaired under Loss-framing. Although this possibility contrasts with other field studies that found that performance increases when bonuses are framed under loss domains, other authors found results similar to ours. In an educational setting, Bies-Hernandez (2012) and Apostolova-Mihaylova et al. (2015) looked at the effects of modifying the way students receive the overall course evaluation. Under this setting, Bies-Hernandez (2012) found that the Loss-framing decreased students’ performance compared to a control treatment framed as gains. Apostolova-Mihaylova et al. (2015) did not observe overall differences in grades but found gender biases in the response to the Loss-framing, with this treatment benefiting males and impairing females.

3 Theoretical framework

A rational test taker must be unaffected by framing manipulations in exam instructions. The theoretical model proposed by Espinosa and Gardeazabal (2013) confirms this is the case by showing that two score equivalent rules must always result in the same response pattern. By contrast, we consider a model based on Prospect Theory where test takers’ reference points depend on the framing of the scoring rule. The framings proposed in our intervention are summarized in Table 1 (see the Experimental Design section for further details).

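Table 1 Framings

Framing          Correct answer   Non-response   Wrong answer
Mixed-framing    \(1\)            \(0\)          \(-\rho\)
Loss-framing     \(0\)            \(-1\)         \(-(1+\rho)\)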

Let \(U_i(x_j)\) denote the utility function of student i when receiving outcome \(x_{j}\) from prospect j (item j). Without loss of generality, fix prospect j as the prospect being evaluated by the decision-maker. So, from now on, we refrain from using subscript j in the notation. Prospect Theory (Kahneman and Tversky 1979) posits that decision-makers “perceive outcomes as gains and losses, rather than as final states” and “the location of the reference point, and the consequent coding of outcomes as gains or losses, can be affected by the formulation of the offered prospects” (Kahneman and Tversky 1979). According to these ideas, in our model students perceive each item as a potential gain or loss. In other words, their reference point depends on the assigned framing and corresponds to the expected score excluding the evaluated prospect.Footnote 5 According to this formulation, the argument of \(U_i(x)\) under each framing corresponds to the values presented in Table 1. Similar models have been considered in a testing context by Budescu and Bo (2015) and Karle et al. (2019).

For \(x\ge 0\), we let \(U_i(x)=u_i(x)\) where \(u_i:\mathbb {R}_+ \rightarrow \mathbb {R}\) is twice differentiable with \(u_i(0)=0\), \(u_i'(x)>0\) and \(u_i''(x)\le 0\). Following the widespread formulation by Kahneman and Tversky (1979), for any \(x<0\) let \(U_i(x)= - \lambda _i u_i(-x)\) where \(\lambda _i\ge 0\) is the loss-aversion parameter. A student is loss-averse if and only if \(\lambda _i> 1\). This formulation implies that concavity in the gain domain becomes convexity in the loss domain (Kahneman and Tversky (1979) call this phenomenon the reflection effect). Throughout the paper, we measure concavity according to the Arrow-Pratt measure \(r_i(x)=-\frac{u_i''(x)}{u_i'(x)}\).

Let \(\tilde{p}_{i}(k_{i}, z_i)\) be student i’s perceived probability of choosing the correct answer with \(k_{i}\) denoting student i’s knowledge of the topic evaluated and \(z_i\) accounting for other characteristics, such as self-confidence, that may influence student i’s perceived probability of answering correctly. We assume the perceived probability \(\tilde{p}_{i}(k_{i}, z_i)\) to be independent of the particular scoring rule. To ease the exposition, we refrain from using the arguments determining the perceived probability and henceforth refer to it as \(\tilde{p}_i\).

In Prospect Theory, probabilities are evaluated according to decision weights, which can differ from actual probabilities by overweighting small probabilities and underweighting moderate and large probabilities. Let \(\pi ^c_i(\tilde{p}_i)\) and \(\pi ^w_i(\tilde{p}_i)\) be the functions mapping student i’s perceived probability \((\tilde{p}_i)\) into the decision weights of correct and incorrect answers, respectively (i.e., \(\pi ^x_i:\tilde{p}_i\in [0,1]\rightarrow [0,1]\), \(x\in \{c,w\}\)). According to Prospect Theory (Kahneman and Tversky 1979), decision weights are assumed to satisfy: i) \(\pi ^c_i(0)=0\) and \(\pi ^c_i(1)=1\), ii) \(\pi ^w_i(0)=1\) and \(\pi ^w_i(1)=0\), iii) \(\pi ^c_i(\tilde{p}_i)\) is increasing in \(\tilde{p}_i\), iv) \(\pi ^w_i(\tilde{p}_i)\) is decreasing in \(\tilde{p}_i\) (i.e., increasing in \(1-\tilde{p}_i\)) and, v) \(\pi ^c_i(\tilde{p}_i)+\pi ^w_i(\tilde{p}_i)\le 1\). The last assumption implies that the perceived probability of correct (\(\tilde{p}_i\)) and incorrect (\(1-\tilde{p}_i\)) answers can be simultaneously underweighted (i.e., \(\pi ^c_i(\tilde{p}_i)< \tilde{p}_i\) and \(\pi ^w_i(\tilde{p}_i)< 1-\tilde{p}_i\)) but only one of the two can be overweighted (i.e., either \(\pi ^c_i(\tilde{p}_i)> \tilde{p}_i\) or \(\pi ^w_i(\tilde{p}_i)> 1-\tilde{p}_i\)).

Under the Mixed-framing, correct answers will result in a gain of 1 (normalized) point, wrong answers in a loss of \(\rho \in (0,1]\) points, and non-responses will receive zero points. So, a student is expected to provide an answer under the Mixed-framing if:

$$\begin{aligned} \pi _i^c (\tilde{p}_i) U_i (1)+ \pi _i^w(\tilde{p}_i) U_i (-\rho ) \ge U_i (0)\iff \pi _i^c (\tilde{p}_i) u_i(1)-\pi _i^w (\tilde{p}_i) \lambda _i u_i(\rho ) \ge 0 \end{aligned} \tag{1}$$

Since \(\pi _i^c(\tilde{p}_i)\) and \(\pi _i^w(\tilde{p}_i)\) are increasing and decreasing in \(\tilde{p}_i\), respectively, the left hand side of the latter inequality is increasing in \(\tilde{p}_i\). Thus, we can define \(\bar{p}_i^{Mix}\) as the minimum value for which the above inequality holds (i.e., the unique value of \(\tilde{p}_i\in [0,1]\) solving equation (1) with equality). Thus, \(\bar{p}_i^{Mix}\) represents the cut-off probability at which student i chooses to provide an answer under the Mixed-framing. Running comparative statics on \(\bar{p}_i^{Mix}\) we obtain the results in Lemma 1.

Lemma 1

Let \(U_i(x)= u_i(x)\) for \(x\ge 0\) and \(U_i(x)= - \lambda _i u_i(-x)\) for \(x <0\), where \(u_i(0)=0\), \(u_i'(x)\ge 0\) and \(\lambda _i> 0\). Under the Mixed-framing, non-response is increasing in the loss attitude parameter (\(\lambda _i\)) and in the concavity of \(u_i(.)\).

The proof of the lemma is in Appendix A. As \(\bar{p}_i^{Mix}\) is increasing in \(\lambda _i\) and in the concavity of \(u_i(.)\), loss-aversion, risk-aversion (in the positive domain), or both might be causing non-response in the Mixed-framing.
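As a hedged illustration rather than part of the formal results, suppose that decision weights coincide with perceived probabilities, i.e., \(\pi_i^c(\tilde{p}_i)=\tilde{p}_i\) and \(\pi_i^w(\tilde{p}_i)=1-\tilde{p}_i\). This linearity is an assumption made here only to obtain a closed form; Lemma 1 does not require it. Setting the response condition under the Mixed-framing to equality then gives:

$$\begin{aligned} \bar{p}_i^{Mix}\, u_i(1)-\left( 1-\bar{p}_i^{Mix}\right) \lambda_i u_i(\rho )=0 \iff \bar{p}_i^{Mix}=\frac{\lambda_i u_i(\rho )}{u_i(1)+\lambda_i u_i(\rho )} \end{aligned}$$

With linear utility (\(u_i(x)=x\)), \(\lambda_i=1\) and \(\rho =0.25\), the cut-off is \(0.25/1.25=0.2\). Raising \(\lambda_i\) to 2 moves it to \(0.5/1.5\approx 0.33\), so under these assumptions a more loss-averse student needs to be more confident before answering.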

Under the Loss-framing, students are told they will start the exam with the maximum grade. Correct answers will result in no points loss, wrong answers in a loss of \(1+\rho\) points and non-responses in a loss of 1 point. A student is expected to provide an answer if:

$$\begin{aligned} \pi _i^c(\tilde{p}_i) U_i (0)+ \pi _i^w(\tilde{p}_i) U_i (-1-\rho ) \ge U_i (-1)\iff - \pi _i^w(\tilde{p}_i) \lambda _i u_i(1+\rho ) \ge -\lambda _i u_i(1) \end{aligned}$$

Similarly as before, we can define \(\bar{p}_i^{Loss}\) as the cut-off probability at which student i chooses to provide an answer under the Loss-framing.

Lemma 2

Let \(U_i(x)= u_i(x)\) for \(x\ge 0\) and \(U_i(x)= - \lambda _i u_i(-x)\) for \(x <0\), where \(u_i(0)=0\), \(u_i'(x)\ge 0\) and \(\lambda _i> 0\). Under the Loss-framing, non-response is independent of the loss attitude (\(\lambda _i\)) and decreasing in the concavity of \(u_i(.)\).

The proof of the lemma is in Appendix A. In contrast to the Mixed-framing, under Loss-framing, non-response is unaffected by loss attitude (\(\lambda _i\)). Loss-framing eliminates the asymmetry between gains and losses that exists under Mixed-framing. As a consequence, the loss attitude does not affect non-response under Loss-framing.

At first glance, the second part of Lemma 2 might be surprising, as concavity is generally associated with a higher level of risk-aversion. However, according to Prospect Theory, this is only so in the gain domain. The reflection effect implies that “risk aversion in the positive domain is accompanied by risk seeking in the negative domain” (Kahneman and Tversky 1979). This implies that more risk-averse students in the gain domain, who are risk-seekers in the negative domain, should display lower levels of non-response under the Loss-framing.
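To make this concrete, the following sketch again assumes, purely for illustration, that the decision weight of a wrong answer equals its perceived probability, \(\pi_i^w(\tilde{p}_i)=1-\tilde{p}_i\). Dividing the response condition under the Loss-framing by \(-\lambda_i<0\) and setting it to equality gives:

$$\begin{aligned} \left( 1-\bar{p}_i^{Loss}\right) u_i(1+\rho )=u_i(1) \iff \bar{p}_i^{Loss}=1-\frac{u_i(1)}{u_i(1+\rho )} \end{aligned}$$

The parameter \(\lambda_i\) cancels out, in line with the first part of Lemma 2. With linear utility and \(\rho =0.25\) the cut-off is \(1-1/1.25=0.2\). With the exponential utility \(u_i(x)=1-e^{-x}\) (absolute risk aversion equal to 1) it falls to roughly 0.11, so a more concave \(u_i(.)\) lowers the confidence needed to answer.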

Next, we compare the level of non-response under the two framings. As \(\bar{p}_i^{f}\) represents the cut-off probability at which a student chooses to provide an answer under each framing \(f\in \{Mix, \, Loss\}\), a higher value indicates greater non-response, all else equal. By comparing the two cut-offs, we can obtain the following result.

Proposition 1

Let \(U_i(x)= u_i(x)\) for \(x\ge 0\) and \(U_i(x)= - \lambda _i u_i(-x)\) for \(x <0\), where \(u_i(0)=0\), \(u_i'(x)\ge 0\) and \(\lambda _i> 0\). The Loss-framing induces lower non-response if

$$\begin{aligned} \lambda _i > \frac{ u_i(1+\rho )-u_i(1)}{ u_i(\rho )} \end{aligned}$$

The proof is in Appendix A. Proposition 1 provides a sufficient condition for observing a reduction in non-response under the Loss-framing.

The left hand side of the expression in Proposition 1 is increasing in loss-aversion, while the right hand side is decreasing in the concavity of \(u_i(.)\).Footnote 6 Thus, both loss-aversion and the concavity of \(u_i(.)\) can contribute to fewer omitted questions under the Loss-framing. The first effect is a consequence of canceling out the effect of loss-aversion under the Loss-framing documented in Lemma 2. The second effect arises from the reflection effect, which makes individuals more willing to take risks when confronted with the Loss-framing. Moreover, for any degree of concavity of \(u_i(.)\), it is always possible to find a degree of loss-aversion that induces fewer omitted questions under the Loss-framing (see Figure 1 for a graphic illustration).

Proposition 1 implies that mild conditions are sufficient for the Loss-framing to induce lower non-response than the Mixed-framing, as highlighted in the next corollary.

Fig. 1

Graphical illustration of Proposition 1 according to an exponential utility function (\(u_i(x)=\frac{1-e^{-r\cdot x}}{r}\) for \(r\ne 0\), \(u_i(x)=x\) for \(r=0\)) with \(\rho =0.25\) and \(\pi _i^w(\tilde{p}_i)=1-\pi _i^c(\tilde{p}_i)\). The X-axis represents the degree of (absolute) risk aversion r. The Y-axis represents the degree of loss-aversion \(\lambda\). The blue line shows the combinations of risk and loss attitudes for which \(\bar{p}_i^{Mix}=\bar{p}_i^{Loss}\). The area shaded in light gray shows the combinations of parameters making \(\bar{p}_i^{Mix}<\bar{p}_i^{Loss}\) and the area shaded in dark gray the combinations making \(\bar{p}_i^{Mix}>\bar{p}_i^{Loss}\).
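The comparison illustrated in the figure can be approximated numerically. The sketch below is illustrative only: it uses the exponential utility and additive weights named in the caption, additionally assumes that the decision weight of a correct answer equals the perceived probability, and relies on hypothetical function names that do not come from the paper.

```python
import math


def u(x, r):
    """Exponential utility on gains; linear when r == 0 (as in the caption)."""
    if r == 0:
        return x
    return (1.0 - math.exp(-r * x)) / r


def cutoff_mixed(r, lam, rho):
    # Solves p*u(1) - (1 - p)*lam*u(rho) = 0 for p (assumes linear weights).
    return lam * u(rho, r) / (u(1.0, r) + lam * u(rho, r))


def cutoff_loss(r, rho):
    # Solves (1 - p)*u(1 + rho) = u(1) for p; lam cancels out.
    return 1.0 - u(1.0, r) / u(1.0 + rho, r)


def lambda_threshold(r, rho):
    # Right-hand side of the condition in Proposition 1.
    return (u(1.0 + rho, r) - u(1.0, r)) / u(rho, r)


rho = 0.25
for r in (0.0, 0.5, 1.0, 2.0):
    for lam in (1.0, 1.5, 2.0):
        mix, loss = cutoff_mixed(r, lam, rho), cutoff_loss(r, rho)
        print(f"r={r:.1f} lam={lam:.1f} mixed={mix:.3f} loss={loss:.3f} "
              f"threshold={lambda_threshold(r, rho):.3f}")
```

Under these assumptions the two cut-offs coincide exactly when the loss-aversion parameter equals the value returned by `lambda_threshold`, which corresponds to the boundary line described in the caption.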

Corollary 1

Concavity of \(u_i(.)\) together with loss-aversion is sufficient to observe lower non-response under the Loss-framing than under the Mixed-framing.

The proof of the corollary is in Appendix A. Corollary 1 establishes a sufficient (but not necessary) condition for finding a positive treatment effect on response rates. This sufficient condition is illustrated in Figure 1 where \(\bar{p}_i^{Mix}> \bar{p}_i^{Loss}\) always holds for any combination \(\lambda _i>1\) (loss-aversion) and \(r_i>0\) (concavity of \(u_i(.)\)). Previous studies have shown that, although heterogeneous, the population displays both concavity in u(.) and loss-averse attitudes (Fishburn and Kochenberger 1979; Abdellaoui 2000; Abdellaoui et al. 2007, 2008; Andersen et al. 2008; Booij and Van de Kuilen 2009; Harrison and Rutström 2009; Gaechter et al. 2010; Von Gaudecker et al. 2011), so we can expect the condition in Proposition 1 to hold more frequently than the opposite. These observations provide the theoretical background for our main hypothesis.

Hypothesis 1

Average non-response will be lower under the Loss-framing than under the Mixed-framing.

It also follows from Lemmas 1 and 2 that the reduction in non-response under the Loss-framing would be greater the more risk- and loss-averse the decision-maker is.Footnote 7 This implies that women, who have been found to be more risk-averse (e.g., Eckel and Grossman 2002; Fehr-Duda et al. 2006) and more loss-averse than men (e.g., Schmidt and Traub 2002; Booij et al. 2010; Rau 2014), might exhibit a larger decrease in non-response under the Loss-framing.

Next, we address the consequences of Hypothesis 1 on test performance. Let \(p_i(k_i)\) be student i’s actual probability of answering a specific item correctly.Footnote 8 If Hypothesis 1 is confirmed, the ratio of correct answers over the total number of questions must increase for any \(p_i(k_i)>0\).

The condition for observing an increase in test scores is more demanding due to the penalties for wrong answers \(\rho\). Additional answers increase the score in expectation if and only if \(p_i(k_i)\ge \frac{\rho }{1+\rho }=\underline{p}\). Let \(A>1\) be the number of alternatives in a test item. For all the MCTs considered in our intervention \(\rho \in \left\{ \frac{1}{A}, \frac{1}{A-1}\right\}\). Substituting the highest of these values, \(\rho =\frac{1}{A-1}\), into the expression for \(\underline{p}\), we get \(\underline{p}=\frac{1}{A}\). Note that \(\frac{1}{A}\) is the probability of answering correctly by choosing a random alternative. Thus, if Hypothesis 1 holds, a sufficient condition for an increase in test scores under the Loss-framing is that the probability that the additional answers are correct is greater than if choosing randomly. If these conditions hold, Hypothesis 2 automatically follows:

Hypothesis 2

Average scores will be higher under the Loss-framing than under the Mixed-framing.

Finally, note that an increase in the proportion of correct answers is necessary but not sufficient for observing an increase in test scores.
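The break-even arithmetic behind Hypothesis 2 can be checked with exact fractions. This is a minimal sketch with hypothetical function names; it only restates the threshold \(\frac{\rho }{1+\rho }\) and the expected value of a purely random guess for the two penalty sizes mentioned in the text.

```python
from fractions import Fraction


def break_even_probability(rho):
    """Smallest success probability at which answering does not lower the expected score."""
    return rho / (1 + rho)


def expected_gain_random_guess(n_alternatives, rho):
    """Expected points from picking one of n_alternatives uniformly at random."""
    p = Fraction(1, n_alternatives)
    return p * 1 - (1 - p) * rho


A = 5  # number of alternatives, as in the example exam instructions
for rho in (Fraction(1, A), Fraction(1, A - 1)):
    print("rho =", rho,
          "| break-even p =", break_even_probability(rho),
          "| expected gain of random guess =", expected_gain_random_guess(A, rho))
```

With five alternatives, the larger penalty makes a random guess exactly neutral in expectation, while the smaller penalty leaves it slightly positive, which is consistent with the non-negative expected value of random guessing noted in the Introduction.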

4 Experimental design

We conducted a field experiment with 554 students from the University of the Balearic Islands (Spain). All participants took a penalized MCT as part of a course evaluation. The exams involved substantial stakes, accounting for between 20% and 33% of their final course score. Test scores have important consequences for undergraduate students in terms of career prospects, grants, costly effort and tuition fees. Students’ attendance at the exams was almost 100%, which confirms their importance for students.

The experiment consisted of modifying the framing of the MCT instructions. The design of the experiment was approved by the Ethics Committee of the University of the Balearic Islands under registration number 99CER19.

4.1 Treatments

The experiment consisted of modifying the framing of the exam instructions according to the score equivalent rules in Table 1. The treatments only varied in the instructions, where two framings were used to describe the scoring rule:

  • Mixed-framing (control): Typical framing for a penalized MCT where each correct answer adds points to the score, omitted answers do not add or subtract points and wrong answers are penalized. Example:Footnote 9

    The exam is a multiple-choice test with 20 questions and 5 possible answers for each question. Only one of the 5 potential answers is correct. The maximum grade is 100 points. Correct answers give you 5 points. Each incorrect answer subtracts 1.25 points and finally each unanswered (omitted) question does not subtract or add points. For instance, a student who answered 16 questions correctly, left 3 unanswered questions and answered 1 question incorrectly, would have a final score of 78.75 over 100 (16*5- 3*0 - 1*1.25 = 78.75).

  • Loss-framing (treatment): We proposed a score equivalent manipulation of the Mixed-framing. Students were informed that they would start the test with the highest score. Correct answers would not add to or subtract anything from the initial score. Each wrong or omitted answer would decrease this initial maximum score by an amount equivalent to the one under the Mixed-framing. Example:

    The exam is a multiple-choice test with 20 questions and 5 possible answers for each question. Only one of the 5 potential answers is correct. The maximum grade is 100 points. You start the exam with a grade equal to this maximum score. The correct answers do not subtract anything. Each incorrect answer will subtract 6.25 points and finally, each unanswered (omitted) question will subtract 5 points. For instance, a student who answered 16 questions correctly, left 3 unanswered questions and answered 1 question incorrectly, would have a final score of 78.75 over 100 (100-16*0- 3*5 - 1*6.25 = 78.75).
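The score equivalence of the two sets of instructions can be verified directly. The sketch below is an illustration with hypothetical function names; the default values (5 points per item, 1.25 points of penalty) are taken from the example instructions above.

```python
def score_mixed(correct, omitted, wrong, points=5.0, penalty=1.25):
    """Mixed-framing: start at zero, add for correct, subtract for wrong."""
    return correct * points - wrong * penalty  # omitted items contribute nothing


def score_loss(correct, omitted, wrong, points=5.0, penalty=1.25):
    """Loss-framing: start at the maximum, subtract for omitted and wrong."""
    max_score = (correct + omitted + wrong) * points
    return max_score - omitted * points - wrong * (points + penalty)


# Worked example from the instructions: 16 correct, 3 omitted, 1 wrong.
print(score_mixed(16, 3, 1), score_loss(16, 3, 1))

# Check every possible response pattern of a 20-item exam.
n_items = 20
equivalent = all(
    score_mixed(c, o, n_items - c - o) == score_loss(c, o, n_items - c - o)
    for c in range(n_items + 1)
    for o in range(n_items + 1 - c)
)
print("equivalent for all patterns:", equivalent)
```

Both functions return the same score for every response pattern, so the two framings differ only in how the same rule is described to students.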

4.2 Implementation details

We conducted the field experiment in 14 different sessions. Each session related to a different exam. Within each session, half of the students were randomly assigned to the Mixed-framing and the other half to the Loss-framing. All the exams took place during the 2018-2019 academic year.Footnote 10

Table 2 presents the main features for each of the sessions. All exams in our study were part of the official evaluation of three different courses (Introduction to Business, Human Resource Management, and Business) taught by eight different members of the Department of Business Economics.Footnote 11 The exams lasted between 30 minutes and 1 hour. Stakes, penalty size, number of items and number of alternatives in each item varied slightly between exams and courses. Importantly, all were midterm exams accounting for between 20% and 33% of the final grade. None of these MCTs had a cut-off score or released material for the final exam. Thus, as in the model presented above, students should have been aiming to maximize their final scores.Footnote 12

All the students knew in advance that the exam was an MCT but they did not know the specific scoring rules. More importantly, students were not aware of the existence of different framings while doing the exam.Footnote 13 Each student participated in only one session and was only exposed to one of the two treatments.Footnote 14

Randomization was implemented in three different ways depending on organizational features of the exams. For computer-based exams, the on-line platform automatically and randomly assigned students to one of the framing conditions. In paper-based exams, hard copies of the grading instructions were delivered in such a way that immediate neighbors were assigned a different framing. This ensured that the framings were spread over the entire classroom, guarding against the possibility that students’ seating was not random. Finally, in one of the courses, the treatments were assigned according to surnames in alphabetical order. Alphabetical order can be considered quasi-random. Since this course involved several sessions, to prevent surname effects, the mixed condition was implemented for the first half in alphabetical order in some sessions and for the second half in the remaining sessions.
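The three assignment schemes can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the computer-based scheme is approximated by a shuffle followed by an even split (the platform's actual procedure is not described), the seating scheme alternates along a single row only, and Python's default string ordering stands in for alphabetical order.

```python
import random


def assign_random(student_ids, seed=None):
    """Computer-based exams: shuffle, then split evenly (assumed procedure)."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {s: ("Mixed" if i < half else "Loss") for i, s in enumerate(shuffled)}


def assign_alternating(seats_in_row):
    """Paper-based exams: immediate neighbors in a row get different framings."""
    return {seat: ("Mixed" if i % 2 == 0 else "Loss")
            for i, seat in enumerate(seats_in_row)}


def assign_by_surname(surnames, mixed_first=True):
    """Surname-based: one half of the alphabetical list per framing.

    `mixed_first` is flipped across sessions so that neither framing is
    always tied to the same part of the alphabet. Surnames are assumed unique.
    """
    ordered = sorted(surnames)  # plain string order, not locale-aware
    half = len(ordered) // 2
    first, second = ("Mixed", "Loss") if mixed_first else ("Loss", "Mixed")
    return {s: (first if i < half else second) for i, s in enumerate(ordered)}


# Hypothetical surnames, for illustration only.
print(assign_by_surname(["Vidal", "Amengual", "Pons", "Ferrer"]))
print(assign_by_surname(["Vidal", "Amengual", "Pons", "Ferrer"], mixed_first=False))
```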

For computer and surname-based randomization, whenever more than one classroom was available, students under the Mixed and Loss-framings took the exam in separate rooms. Students in these groups were assigned ex-ante (by the computer or their surname) to Mixed or Loss-framings and directed to take the exam in a particular room where all the other students were under the same treatment. Our aim was to avoid spillover effects. When the exam was taken in a single classroom, an extra proctor was assigned to prevent spillovers between the different experimental conditions. Before starting the exam, students had 5 minutes to read the instructions (containing our treatments) and to privately ask any questions that they may have had regarding the evaluation method. After these 5 minutes, the exam started.

We also carried out a pilot study with 184 subjects from another course. In each exam, there were two shifts corresponding to different groups taking the course. The treatment was assigned at a group level. Despite the treatment being randomly assigned to each group, the group formation itself may not have been random. Therefore the observations from this pilot study are not included in our main results.Footnote 15

Finally, to gain better insights into the specific mechanisms driving the framing effect, we invited students to participate in an incentivized on-line survey. A total of 166 subjects who participated in the main study (30.9% of the total sample) filled in this survey. Participants were asked to complete 5 different incentivized tasks designed to measure their risk and loss preferences (see Appendix D for more information on the specific tasks). We present the survey and its results in Section 6.

Table 2 Description of the sessions/exams

4.3 Data and descriptive statistics

Our main sample consisted of 537 students.Footnote 16 Of these, 266 students (49.53%) were assigned to the Mixed-framing and 271 (50.47%) to the Loss-framing. We observed their score in the test (Score), their total number of omitted questions (NR), the total number of correct answers (Correct), and the corresponding proportions (%NR and %Correct). We were also granted access to administrative data from the University of the Balearic Islands, including students’ academic record on a 0 to 10 scale (Acad. Rec.) and gender (Female). All data used in this study were duly anonymized by the IT services of the university.Footnote 17 To further check that the randomization worked correctly, we also retrieved information on test takers’ non-response from different computer-based MCTs other than the ones in the experiment (Non-Intervention %NR). These data were obtained from other exams performed during the 2018-2019 academic year and were available for 513 out of the 537 participating students.Footnote 18 We also constructed a pre-intervention non-response measure but in this case, we could only gather data for 427 students (80% of our sample).

Table 3 shows the overall average of our main variables (column 1) and the average for the Mixed-framing and Loss-framing (columns 2 and 3). It also shows the difference between treatments (column 4), standard errors (column 5), and the p-value for the two-sample t-test on means equality (column 6). Overall, Panel A in Table 3 shows no difference in gender composition or academic record between the students exposed to the Mixed and Loss-framings. More importantly, groups are also balanced in terms of non-response in tests outside the intervention, which can be considered a placebo test of our treatment (a proper placebo test is provided in Table B5 of Appendix B). Table B1 in Appendix B reports descriptive statistics by session. Though a few exceptions arise, treatment and control were balanced according to most of the observables at the session level. When presenting our results, we show they are robust when excluding sessions where any of the observables were not balanced between control and treatment. Taking all this together, we find support for our claim that randomization worked properly and that both groups are comparable ex-ante.
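A balance check of the kind reported in Table 3 can be sketched as below. The function name and the data are hypothetical. The sketch uses an unequal-variance standard error and a normal approximation to the t distribution for the p-value, which is an assumption made to keep the example self-contained. It may differ from the exact test used for Table 3, and the approximation is only reasonable for groups as large as those in the main sample.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def balance_test(group_a, group_b):
    """Return the difference in means, its standard error, and an
    approximate two-sided p-value (normal approximation)."""
    diff = mean(group_a) - mean(group_b)
    se = sqrt(variance(group_a) / len(group_a)
              + variance(group_b) / len(group_b))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, se, p_value


# Hypothetical academic records (0 to 10 scale), for illustration only.
mixed = [6.1, 7.0, 5.8, 6.5, 7.2, 6.0]
loss = [6.3, 6.9, 5.9, 6.4, 7.1, 6.2]
diff, se, p_value = balance_test(mixed, loss)
print(f"difference={diff:.3f}, se={se:.3f}, p={p_value:.3f}")
```

A large p-value on a pre-treatment covariate is what one would expect if randomization produced comparable groups.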

Table 3 Descriptive Statistics

Panel B in Table 3 presents the comparison between the Mixed and the Loss-framings for our main outcome variables. Raw averages show that non-response is significantly lower under the Loss than under the Mixed-framing both in total number (p-value=0.002) and as a percentage of the total number of questions in the exam (p-value=0.006). In other words, students under the Loss-framing answered more questions on average than students under the Mixed-framing. This finding is in line with our Hypothesis 1. By contrast, we find no evidence in favor of Hypothesis 2. When looking at the variable Score, we observe that the difference, although not significant, has the opposite sign to that predicted in Hypothesis 2. The same happens with the number and the proportion of correct answers.

In the next section, we present ordinary least squares (OLS) estimates of the treatment effect to provide a more accurate analysis by adding session-fixed effects and students’ controls. In what follows, results will be presented in terms of the non-response rate (% NR) but results are qualitatively the same when using the total number of omitted items.Footnote 19

5 Results

Firstly, we focus on the framing effects on risk-taking decisions by using the non-response rate as a (negative) measure of risk-taking. Then, we analyze the framing effects on performance (test scores and proportion of correct answers).

5.1 Treatment effect on non-response

Table 4 reports the effects of the intervention on the non-response rate estimated by OLS. Changing from the Mixed-framing to the Loss-framing reduces the non-response rate. Column 1 does not control for group fixed effects. Without controlling for the specifics of each session, we found that non-response reduces by 2.47 percentage points under the intervention. In relative terms, changing the framing reduces non-response by 18.28%.

Table 4 OLS estimation of treatment effects on non-response (% NR)

In subsequent columns, we add controls, session-fixed effects, and clustered standard errors at the exam level. By adding session-fixed effects, we are also controlling for language of the test, lecturers, degree, and subject. We consider this to be the most suitable specification for our model. Standard errors were corrected for heteroskedasticity and clustered at the session level to account for potential intra-group correlation.Footnote 20 Considering the fractional nature of our dependent variable, as a robustness check we replicated the above results following the method proposed by Papke and Wooldridge (1996). Results remain the same (see Table B4 in Appendix B).
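The role of session-fixed effects can be illustrated with a within-session (demeaning) estimator, which gives the same treatment coefficient as OLS with a full set of session dummies. This is a minimal sketch under stated assumptions: the function name and the data are hypothetical, and the sketch omits the student controls and the clustered standard errors used in Table 4.

```python
from collections import defaultdict


def within_estimate(rows):
    """rows: iterable of (session, treated, outcome).

    Returns the OLS coefficient on `treated` from a regression that includes
    session fixed effects, computed by demeaning within each session.
    """
    by_session = defaultdict(list)
    for session, treated, outcome in rows:
        by_session[session].append((treated, outcome))

    num = den = 0.0
    for obs in by_session.values():
        t_bar = sum(t for t, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        for t, y in obs:
            num += (t - t_bar) * (y - y_bar)
            den += (t - t_bar) ** 2
    return num / den


# Hypothetical non-response rates (%) in two sessions with different levels.
rows = [
    ("A", 1, 10.0), ("A", 1, 12.0), ("A", 0, 14.0), ("A", 0, 16.0),
    ("B", 1, 20.0), ("B", 1, 22.0), ("B", 0, 22.0), ("B", 0, 24.0),
]
print(within_estimate(rows))  # -3.0
```

In this toy data the within-session differences are -4 and -2 points, and the estimator averages them. Differences in the overall level of non-response between sessions do not contaminate the estimate.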

The size of the treatment effect and statistical significance remains comparable when adding group fixed effects, gender, and academic record controls (column 2). Column 3 provides an estimate which is robust to outliers and slightly reduces the size of the treatment effect.Footnote 21 Columns 4 and 5 add the data obtained in the two pilot sessions and in session 14 (the potentially contaminated session), respectively. Finally, Column 6 excludes the groups for which we found any statistically significant difference (10% level) in the balancing tests displayed in Table B1 in Appendix B. The result holds for all specifications.

As an additional robustness check, we conducted a placebo test considering session-homogeneous measures for out-of-intervention non-response (see Table B5 in Appendix B). This placebo test confirms that students under the Mixed and Loss-framing were comparable in out-of-intervention non-response.

In line with previous literature, we also observe that women tend to skip slightly more questions than men. Despite the subtle change in the instructions, in our sample, the change induced in non-response is larger than the highly studied gender differences in non-response. Finally, non-response is lower for students with better academic records. In terms of our model, this may be explained if the perceived probability of providing a correct answer increases with knowledge, which could be proxied by academic record.

5.1.1 Heterogeneous effects on non-response

Table 5 OLS estimation of heterogeneous treatment effects on non-response (% NR)

Now we explore the heterogeneous treatment effects for different groups of students. The size of the treatment effect is two times larger for women (in Table 5, Column 1 is restricted to men and Column 2 to women). However, by interacting the gender and treatment dummies in Column 3, we did not find sufficiently strong evidence to claim that framing induces differential effects across genders. Nevertheless, gender effects may be attenuated by the highly unbalanced composition of some sessions (STEM degrees).

Columns 4-7 divide our sample according to students’ academic record.Footnote 22 The treatment effect is similar and significant across the different tiers of academic record, with the exception of the highest level. Non-response is already very small for students at the highest level of academic record (notice the negative coefficient for academic record in all our specifications in Table 4), which may explain their lower reaction to the treatment. Also, the group with the highest academic record is the smallest, so it might also be a matter of power.

In Columns 8-10, we report separate estimates for each of the courses evaluated in our sample. An interesting pattern emerges. The biggest effect arises from “Human Resource Management” (Column 9). We find the smallest one for “Business”, a course that was taught to engineers. Engineers seem to be unaffected by the treatment. Finally, Column 10 does not display statistically significant effects for the course “Introduction to Business” taught to students in the Business and Tourism schools. However, this non-significance seems to be driven by session 10, in which the control group displayed statistically significantly (5%) lower non-response before the treatment (see Table B1). The framing effect becomes statistically significant for “Introduction to Business” when that group is dropped.

5.2 Treatment effect on performance

Hypothesis 2 predicts that test scores increase under the Loss-framing. This seems especially likely to hold given that the treatment increases students’ response rate.

Table 6 OLS estimation of treatment effects on correct answers (% Correct)

Table 6 contains the same specifications as Table 4 but using correct answers as the dependent variable. Recall that an increase in the proportion of correct answers is a necessary but not sufficient condition for an increase in test scores. Table 6 rejects Hypothesis 2. The treatment does not have a positive effect on the proportion of correct answers, so it cannot increase test scores (see Table B6 in the Appendix for the results on test scores). Even more strikingly, despite not being statistically significant, the treatment coefficient has the opposite sign to the one expected.

This result is surprising because, as omitted items are surely not correct, increasing the response rate has a positive mechanical effect on correct answers. This mechanical effect can be defined as:

\(ME= \bar{p}^{Loss}* (-\Delta \% NR)\)

Where \(\bar{p}^{Loss}\) is the average probability of answering correctly in marginal responses and \(\Delta \% NR\) is the framing effect on non-response. We know from Table 4 that \(\Delta \% NR<0\), while by definition \(\bar{p}^{Loss}\ge 0\).

The mechanical effect implies that if the Loss-framing only affects performance through the change induced in non-response, then we cannot observe a negative effect on correct answers and indeed we might observe a positive effect if \(\bar{p}^{Loss}\ne 0\). These implications are at odds with the results in Table 6.

Indeed, if the Loss-framing only affects performance through the change induced in non-response, the results in Tables 4 and 6 can only be reconciled if \(\bar{p}^{Loss}\) is negative.Footnote 23 Despite not being statistically significant, the negative coefficients are infeasible under this assumption and imply that the change in framing affected performance through a channel other than non-response. In other words, students under the Loss-framing seem to experience worse overall performance.
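The logic of the mechanical effect can be made concrete with a small numerical sketch. The function names are hypothetical. The value of 2.47 percentage points is the Column 1 estimate from Table 4, while the values chosen for the marginal probability of a correct answer and for the change in correct answers are assumptions for illustration only (0.2 would correspond to random guessing among 5 alternatives).

```python
def mechanical_effect(p_marginal, delta_nr):
    """ME = p * (-delta_nr), with delta_nr the framing effect on % NR."""
    return p_marginal * (-delta_nr)


def implied_p(delta_correct, delta_nr):
    """Marginal success probability needed for non-response alone
    to explain an observed change in % Correct."""
    return delta_correct / (-delta_nr)


delta_nr = -2.47  # percentage points (Table 4, Column 1)

# With any admissible probability (0 <= p <= 1) the mechanical effect
# on correct answers is non-negative.
for p in (0.0, 0.2, 0.5):
    print(p, mechanical_effect(p, delta_nr))

# A hypothetical negative change in % Correct would require a negative
# probability, which is impossible; hence another channel must be at work.
print(implied_p(delta_correct=-1.0, delta_nr=delta_nr) < 0)  # True
```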

The main difficulty in analyzing the possibility of impaired performance lies in the existence of the mechanical effect described above. The mechanical effect and impaired performance work in opposite directions. Thus, the two effects may cancel each other out and result in a non-statistically significant effect on correct answers as in Table 6. However, by exploiting question-level data, we can partial out the mechanical effect to further explore the possibility of impaired performance. To do so, we focus on those questions where the change induced in non-response by the treatment is small and, consequently, the mechanical effect is shut down or, at least, substantially reduced. These items offer the possibility of analyzing impaired performance after partialling out the mechanical effect.Footnote 24 The results of this analysis are presented in Table 7.

Table 7 OLS estimation for the Question Level Analysis

Columns 1 and 2 in Table 7 replicate the above results on framing effects using question-level data. Column 1 confirms that the Loss-framing reduces non-response by 2.4 percentage points while column 2 shows that it has a negative but not significant effect on correct answers. In columns 3, 4, and 5 we use the percentage of correct answers as the dependent variable and add explanatory variables intended to capture the mechanical effect and their interaction with the treatment dummy. Consequently, the uninteracted treatment dummy provides the coefficient of interest: the framing effect on the items where the mechanical effect is more likely to be inactive.

We use three different approaches to identify items where the mechanical effect is weaker. In columns 3 and 4, we exploit a natural cap on the mechanical effect. For items where non-response is close to zero, changing to the Loss-framing cannot further reduce non-response. Following this logic, in these two columns, we add the non-response rate as a regressor and its interaction with the treatment dummy. In column 3, the non-response rate was calculated using all subjects, while in column 4 it was calculated using only the control group (Mixed-framing).Footnote 25 Given that we are controlling for the proportion of non-response and its interaction with the treatment, the (uninteracted) treatment dummy provides an estimate on the framing effect for the questions where non-response was close to zero. In the two cases, this coefficient of interest is negative and statistically significant, thereby providing evidence of impaired performance on those items where the mechanical effect is inactive. In column 5, instead of using an exogenous cap, we directly consider the observed difference in non-response (\(\Delta \% NR_j=\%NR^{Loss}_j-\%NR^{Mix}_j\)) for each test item j. The result is very similar to the ones in columns 3 and 4. The coefficients of the (uninteracted) treatment dummies are negative and statistically significant, showing evidence of impaired performance on those items where the mechanical effect is capped.
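The interpretation of the uninteracted treatment dummy can be illustrated with a small sketch. It relies on a standard property: a regression that includes a binary dummy, a continuous regressor, and their interaction is equivalent to fitting a separate line within each group, so the dummy coefficient equals the difference in intercepts, that is, the treatment effect where the continuous regressor is zero. The function names and the item-level data are hypothetical, and the sketch omits the fixed effects and standard errors of Table 7.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of a one-regressor OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def interacted_coefficients(items):
    """items: iterable of (treated, nr_rate, pct_correct).

    Returns the coefficients of
        pct_correct = const + treat*T + nr*NR + treat_x_nr*(T*NR)
    obtained from separate fits for T = 0 and T = 1.
    """
    ctrl = [(nr, c) for t, nr, c in items if t == 0]
    trt = [(nr, c) for t, nr, c in items if t == 1]
    a0, b0 = simple_ols(*zip(*ctrl))
    a1, b1 = simple_ols(*zip(*trt))
    return {"const": a0, "nr": b0, "treat": a1 - a0, "treat_x_nr": b1 - b0}


# Hypothetical items: (treated, % NR of the item, % correct on the item).
items = [
    (0, 0.0, 80.0), (0, 10.0, 70.0), (0, 20.0, 60.0),
    (1, 0.0, 77.0), (1, 10.0, 69.0), (1, 20.0, 61.0),
]
coefs = interacted_coefficients(items)
print(coefs["treat"])       # effect on % correct for items with zero non-response
print(coefs["treat_x_nr"])  # how the effect changes as item non-response grows
```

In this toy data the treated group scores lower on items nobody skips (a negative `treat`), while the gap narrows on items with higher non-response (a positive `treat_x_nr`), which is the pattern one would expect if a mechanical effect partly offsets impaired performance.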

Impaired performance explains why in Table 6 we found that, despite answering more items, students under the Loss-framing did not get a higher percentage of correct answers and why we get a negative but not significant result: Students provide more answers under the Loss-framing but all answers, including the ones to the items that would have been answered even in the absence of the treatment, are of poorer quality.

6 Risk-aversion vs loss-aversion

Table 8 Treatment effects interacted with risk and loss attitude (% NR)

To gain better insights into the relative importance of risk and loss-aversion, we administered an incentivized survey. In this survey, students had to choose between different gambles that were specifically designed to measure their risk and loss attitudes (see Appendix D for a detailed description of each measure). Incentives were introduced by means of a lottery, where the winner effectively participated in the gamble and was paid according to his/her choices. Survey participation was voluntary. Therefore, unfortunately, our sample reduces to 166 subjects (30.9% of the total sample) when these measures are taken into account. This restriction poses a challenge in terms of the representativeness and power of this part of the study. Finally, we must recognize that obtaining separate measures for risk and loss-aversion can be problematic. These difficulties call for some caution when considering these results.

We collected 4 measures for risk-aversion, one for loss-aversion and one trying to capture reflection. We combined all 4 measures for risk-aversion into one factor by using principal component analysis; this factor accounts for 41% of the variance. All these variables are coded such that greater values indicate greater risk or loss-aversion. Table 8 analyzes the effects of each of the measures on the treatment effect on non-response.

Firstly, none of the measures have a statistically significant effect on non-response (except for the self-reported measure and the factor that combines all four measures of risk). However, the sign of the coefficients is consistent with more risk-averse and/or loss-averse students omitting more questions under the Mixed-framing. Interestingly, we obtain statistically significant results for the interaction between the treatment (Loss-framing) and the risk-aversion measures but not for loss-aversion or reflection effect. In particular, all interaction terms with risk-aversion measures (three out of five being significant) present a negative point estimate, implying that the Loss-framing is more effective in reducing non-response among those students who are more risk-averse.

7 Conclusions

We ran a field experiment to analyze framing effects in penalized MCTs. Our intervention consisted of modifying the framing of rewards and penalties in real stakes MCTs that account for between 20% and 33% of students’ course grade. Under the Mixed-framing, the scoring rule was presented in a mixed gain and loss domain, while under the Loss-framing, the scoring rule was presented in the loss domain. Consistent with our theoretical predictions, we showed that non-response is greater under the Mixed than under the Loss-framing. By contrast, we did not find a positive effect on test scores or correct answers. We show that it is very plausible that students’ performance was indeed impaired under the Loss-framing.

Our paper contributes to generalizing framing effects on risk-taking from the lab to the field. However, the question of whether this result can be extended to other population groups remains open. Subjects participating in our experiment were higher education students performing a high stakes task. If we consider that high skills and stakes make decision-making more likely to be rational, then we can expect similar effects to hold in the more general population. However, this is of course an open question that can only be answered by conducting more experiments of this type.

Although our experiment cannot identify the specific mechanism driving impaired performance, several previously documented mechanisms could be behind this finding. Higher education tests may have important and sometimes non-reversible consequences for the test taker. Students facing loss conditions may be exposed to higher levels of anxiety when they encounter unexpected evaluation methods. The link between loss framings and physical responses that indicate arousal or anxiety is well documented (Sokol-Hessner et al. 2009; Hochman and Yechiam 2011; Hartley and Phelps 2012), and higher anxiety levels appear to impair academic performance (Cassady and Johnson 2002; Chapell et al. 2005). In addition, loss-averse subjects might perceive a greater importance of performing well under the Loss than under the Mixed-framing. If so, loss-averse subjects may choke under the extra pressure imposed by the Loss-framing, lowering their performance (Baumeister 1984; Chib et al. 2012).Footnote 26 Another plausible explanation is that by altering the instructions under the Loss-framing treatment, subjects may have suffered the effects of a higher cognitive load (Sweller et al. 1998), thereby limiting their working memory and consequently impairing their performance (Baddeley 1992; Carpenter et al. 2013; Deck and Jahedi 2015). All these explanations are especially appealing when considering that the task performed by subjects is a one-shot cognitively demanding task where cognitive aspects, rather than effort and/or motivation, are key when it comes to determining performance. By contrast, these explanations might be irrelevant for non-cognitive or routine tasks. A limitation of the present study is its inability to find the exact mechanism that causes impaired performance. Indeed, this effect was unexpected, and our experiment was not designed to find the exact mechanism that drives it.Footnote 27

We conclude by listing the implications of our study in terms of MCT design. Loss framing in the instructions of a penalized MCT increases test takers’ response rate by reducing the influence of non-cognitive traits such as risk- or loss-aversion. Thus, it may provide a more accurate measure of knowledge on the evaluated topic. However, loss framing may also have unintended consequences on students’ performance. This possibility calls for some caution in scoring rule modifications. Further research on this topic might provide better insights into the reasons behind these negative effects.