Goals and guesses as reference points: a field experiment on student performance

In this paper, we study overconfidence and goal-setting in academic performance, with and without monetary incentives. Students enrolled in a microeconomics course were offered the possibility of setting their own target grade before taking part in the final exam. They were also asked to guess their grade immediately after they had taken the exam (“post-diction”). In general, students overestimated their performance, both at the goal-setting and at the post-diction stages. Controlling for several sources of this bias (cognitive abilities, academic record and self-reported academic confidence), we find that the use of monetary rewards mitigates the overestimation of potential achievements and eliminates overestimation of actual achievements through the improvement of actual performance. Our results suggest that monetary incentives do not cause subjects to put more effort into correct guesses but make them put more effort into academic performance. Using students’ academic records to measure overall skill, we find a strong Dunning–Kruger bias which is intensified in the presence of monetary rewards.


Introduction
As has been established by prospect theory (Kahneman & Tversky, 1979; Tversky & Kahneman, 1992), human behavior strongly depends on reference points which are used to assess whether an outcome is perceived as a gain or as a loss. Besides the reference points which may be dictated by social norms, individual reference points may relate to rational expectations (Koszegi & Rabin, 2006, 2007, 2009) and goals (Diecidue & Van De Ven, 2008). Using the properties of the prospect theory value function, Locke and Latham (2002) and Heath et al. (1999) suggest that goal-related reference points affect individual intrinsic motivation, improving actual performance. 1 Wu et al. (2008) present a model in which subjects' performance is improved by exogenously set goals. This finding was empirically supported in Allen et al. (2016) and Markle et al. (2018), with a sample of marathon runners. In the field of education, Meng (2019) tested the effect of grade aspiration-driven reference points on student performance.
In this paper, we study 2 types of reference points, depending on the timing of elicitation: self-chosen goals, elicited ex ante, by asking students to set their own target grade in a forthcoming exam and post-dictions, elicited immediately after the exam, by asking students to forecast their grade, given their perceived performance. 2 Following Fryer and Elliot (2008), self-chosen goals are empowered and proactive, creating commitment and acceptance. Thus, as personal bests, they act as reference points by inducing effort when current performance might otherwise be insufficient (Anderson & Green, 2018). 3 When goals are associated with monetary incentives, the performance improvement is even higher. 4 Whereas goals can be considered as target-based reference points, post-dictions can be used as actual behavior-based reference points. This is so, given that "the expected value of an outcome is an easily available integrated mechanism that could be used as a (…) reference point" (Hack & von Bieberstein, 2015). Generally, the literature suggests that students' post-dictions of performance are more accurate than any type of prediction or target. 5 This is because, "whereas predictions are made prospectively and are based on what students think they know, post-dictions are made retrospectively and reflect the student's experience of the test" (Hacker et al., 2008).
In this vein, we are interested in analyzing the effectiveness of monetary incentives to encourage students to make a more thoughtful assessment of their potential and actual academic performance. We hypothesize that monetary incentives could reduce students' overestimation bias by improving their guesses. With the objective of testing this hypothesis, we conducted a randomized field experiment to elicit students' reference points, well before and immediately after the exam of a microeconomics course. In addition, we control for 2 potential factors driving students' reference points: their skill (potential or actual) and their self-reported academic confidence. Contrary to our hypothesis, we find that monetary incentives do not improve students' guesses but rather their academic performance, causing a significant decrease in students' overconfidence.
1 Their main properties are reference point dependence, loss aversion, and diminishing sensitivity.
2 Experimental evidence suggests that goals and expectations are crucial in the formation of reference points (Hack et al., 2016).
3 In the words of van Lent and Souverijn (2020), "Economic theory (…) have shown how (…) goals can be used as reference points in order to increase performance for loss averse agents or hyperbolic discounters."
4 See Goerg and Kube (2012), Dalton et al. (2015), Corgnet et al. (2015), Brookins et al. (2017) and Herranz-Zarzoso and Sabater-Grande (2018).
5 See Glenberg and Epstein (1987), Maki (1998), Winnie and Jamieson-Noel (2002), and Serra and DeMarree (2016).

Literature review
This paper contributes to 3 lines of research: (1) monetary incentives and academic performance, (2) monetary incentives and overconfidence and, (3) overconfidence and cognitive ability.
First, we deal with the effect of monetary incentives on academic performance. The empirical evidence available 6 offers mixed conclusions on the potential of these interventions. Depending on the context, the students' age or implementation details, impacts have been generally modest or even null. As the point of departure for the present study, in Herranz-Zarzoso and Sabater-Grande (2018), we conducted a randomized field experiment aiming at improving academic performance in a different course (Introductory microeconomics). In that study, we found that monetary incentives offered on the basis of self-chosen goals were effective in significantly increasing students' grades.
Second, we focus on overconfidence bias. It has been generally observed that a person's subjective confidence in their own performance is greater than the actual performance, resulting in the well-known overconfidence phenomenon. 7 Moore and Healy (2008) distinguish among 3 types of overconfidence: over-precision, a tendency of individuals to be excessively certain about the accuracy of their beliefs; overestimation, a tendency of subjects to exaggerate their absolute actual achievement or skill; and over-placement, which occurs when people perceive their performance in a group as better than it actually is. 8 Given the 2 types of reference point elicited, we observe 2 types of overestimation: students' overestimation of their targeted academic performance, defined as the difference between the grade obtained and the goal chosen before the exam, and students' overestimation of their actual academic performance, interpreted as the difference between the actual grade and the grade forecasted immediately after the exam.
Previous research has shown that students consistently overestimate their performance on academic exams, 9 especially when their grades are low. Specifically, Hacker et al. (2008) find that good students are usually more accurate, with a tendency to underestimate, while bad students usually overestimate their performance. Recently, Santos-Pinto and de la Rosa (2020) reviewed empirical research on overconfidence and its effect on economic choices. Psychology evidence suggests that overconfidence depends on personal traits and environmental factors. Economic evidence shows that overconfidence can persist even under monetary incentives and feedback.
6 In the field of education, Lavecchia et al. (2016) offer a survey about the effectiveness of monetary incentives in improving academic performance.
7 Overconfidence has been called "the most significant of the cognitive biases" by Kahneman (2011); Bazerman and Moore (2012) refer to it as "a mother of all biases" and Meloy et al. (2006) state that it "is one of the most robust findings in the decision and judgment literature".
8 In Hoezl and Rustichini (2005) subjects estimate whether their skills are better than the median.
9 See, for example, Chen (2003) and Foster et al. (2017), where it is concluded that "even after a great deal of practice (with over 10 exams), students on average did not improve the accuracy of their predictions".
Third, we deal with the relationship between subjects' cognitive ability and their overconfidence. Defining overconfidence as the difference between the guess and the proportion of correct answers, 10 a positive correlation between cognitive ability and confidence is found in Wolfe and Grosch (1990) 11 and Bruine de Bruin et al. (2007). 12 In contrast, Stanovich and West (1998) reported a negative correlation of different cognitive ability tests with subjects' overconfidence. Later, Hoppe and Kusterer (2011) found that subjects with higher cognitive reflection test (CRT) scores had a significantly more precise self-assessment of their performance. However, the effects of cognitive abilities reported in earlier studies could depend on the type of overconfidence analyzed. In this vein, Duttle (2016) showed that, although overestimation 13 is not affected by cognitive abilities (as measured by a CRT), the CRT score was associated with a significant decrease in over-placement and over-precision. Nevertheless, Bialek and Domurat (2018) showed that the relationship between cognitive abilities and overconfidence disappeared after addressing 2 critiques: (1) the CRT does not measure cognitive abilities but, rather, the analytic cognitive style, and (2) overconfidence and cognitive ability are artificially correlated since the RPM test (which served as a basis for estimating overconfidence) is also a measure of cognitive abilities.
We are interested in testing for the so-called "Dunning-Kruger effect". 14 This well-known cognitive bias implies that when people are objectively unskilled in a given area, they tend to largely overestimate their knowledge. As Dunning (2011) states, this effect has been observed in multiple domains of skill 15 and knowledge, including academics. The empirical evidence available on the relation between skill and overconfidence is mixed, depending on the type of overconfidence analyzed and the methods used to measure the 2 variables. 16 Specifically, when the same task is used to measure overconfidence and skill, the empirical evidence obtained can be distorted by 'regression to the mean' effects, that is, individuals with higher skill are more likely to show less overconfidence. To avoid this effect, we use different tasks to measure overconfidence and skill, evaluated through a cognitive ability test or by means of the students' academic record.
10 The problem with this overconfidence measure, as Moore and Healy (2008) suggest, is that being excessively sure of having answered correctly reflects both overestimation of your performance and excessive confidence in the precision of your knowledge.
11 Using verbal and mathematical Scholastic aptitude test (SAT) scores to measure cognitive ability and a trivial quiz where subjects report their confidence in having answered correctly.
12 Using Raven's standard progressive matrices for nonverbal cognitive ability and form G of the Nelson-Denny reading comprehension subtest for verbal ability.
13 Defined as the difference between a subject's estimated number of correct test items in the Raven progressive matrices (RPM) test and his/her actual performance in the test.
14 Kruger and Dunning (1999) find that the more poorly participants actually performed in tests of humor, grammar, and logic, the more grossly they overestimated their own performance.
15 Park and Santos-Pinto (2010) find support for the Dunning-Kruger hypothesis in both poker and chess tournaments.
16 See Feld et al. (2017).
In addition, we examine the relationship between perceived skill, measured by means of reported academic self-confidence, and overconfidence. Following Sander and Sanders (2009), we use the academic behavioral confidence scale 17 (ABC, hereafter) as a global measure of perceived academic confidence. Using a group test of general mental ability by Tandon (1971) and a self-confidence inventory by Agnihotri (1986), Dhall and Thukral (2009) investigated the relationship among intelligence, self-confidence, and academic achievement in schools in Pakistan. They found that intelligence was positively correlated with both self-confidence and academic achievement. However, Saenz et al. (2019) found that attendance, study habits/preparation, and/or prior performance did not offer a strong or robust explanation of students' grade predictions.
The rest of the paper is structured as follows: first, we introduce the design of the experiment implemented and the hypotheses; second, we present the methods used; third, we analyze the empirical evidence collected and present our results; last, we discuss them.

Experimental design
A randomized field experiment was conducted to analyze students' overconfidence using monetary incentives as the treatment variable. At the beginning of the semester, 154 students enrolled in a microeconomics course at a Spanish university in 2018. 18 They were offered the possibility of setting their reference points well before 19 and right after completion of the final exam. In the call, students were informed that those responding affirmatively would be immediately randomly assigned 20 to one of two groups: one without monetary incentives (NMI) and one with monetary incentives (MI). Additionally, we notified students that participants would receive information about their corresponding group before they were invited to choose a goal for their final exam grade. Of the 138 volunteers, 64 were randomly assigned to the NMI condition and 74 to the MI one. However, only 42 (16 females and 26 males) in the NMI treatment and 58 (26 females and 32 males) in the MI treatment finally decided to take the exam. 21 Furthermore, 16 students not responding affirmatively to our call were included in a baseline condition (non-participants group, NP group hereafter) to compare non-participants' final grades with those obtained by participating students. In doing so, we can check for potential self-selection bias. Table 1 summarizes the characteristics of the 3 groups.
17 This scale encompasses cognitive judgments regarding study-related behaviors, such as confidence in independent study, attaining grades, discussing course material and questions, and academic engagement through attending formal sessions.
18 The experiment was approved in the framework of an educational innovative project promoted by the university and approved by the faculty of law and economics.
19 After a midterm exam (3 months before the final exam), self-chosen goals were elicited. Moreover, although we allowed their revision until 1 day before examination, no student submitted a new goal after their initial decisions. The midterm (final) exam represented 30% (70%) of the course grade.
20 A pure random draw using a die was used to allocate subjects individually into groups.
21 A Mann-Whitney test shows no gender differences between treatments (p value: 0.5029).
In the call, students were instructed that, if assigned to the MI group, their monetary reward (R) would depend on the chosen reference point (RP) and the grade (GR) they obtained in the exam, according to a quadratic scoring rule. This is a more conservative test of overconfidence bias than a grade forecast question because the reward depends on a chosen reference point instead of the obtained grade. This is so because, as in Park and Santos-Pinto (2010), 22 a risk neutral player who overestimates her performance by a given amount incurs a larger loss (0 earnings and loss equal to the square of the RP) than if she underestimates it by the same amount (loss equal to the square of the difference between GR and RP). Thus, the optimal reference point of a risk neutral player should be smaller than her optimal grade forecast. To counterbalance the abrupt fall in earnings from a grade below the reference point, the quadratic scoring rule chosen here gives a good incentive for choosing a higher reference point because earnings increase marginally more for higher (successful) bets.
Moreover, students were instructed that only 1 of the 2 elicited reference points (participants' self-chosen goal or post-diction), chosen at random, would be used to determine their rewards. The actual average payment received by successful subjects in the experiment was €16 out of a maximum potential reward of €49 (given that the top score in the final exam is 7 points). To distinguish between the 2 types of overconfidence, we define students' potential overconfidence (POC) as the self-chosen goal minus the grade obtained, and students' actual overconfidence (AOC) as the difference between the post-diction and the actual grade.
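The incentive argument and the two overconfidence measures can be illustrated with a short sketch. It assumes that the reward equals the square of the chosen reference point when the grade reaches it and is zero otherwise; this form is an assumption, chosen because it is consistent with zero earnings after a missed reference point and with the €49 maximum at a top score of 7. The function names and the belief distribution are hypothetical.

```python
def reward(rp, grade):
    """Assumed rule: earn RP squared (in euros) if the grade reaches RP, else nothing."""
    return rp ** 2 if grade >= rp else 0.0


def overconfidence(reference_point, grade):
    """POC (with a goal) or AOC (with a post-diction): reference point minus grade."""
    return reference_point - grade


def best_reference_point(beliefs, candidates):
    """Candidate reference point with the highest expected reward for a risk neutral student."""
    def expected_reward(rp):
        return sum(prob * reward(rp, grade) for grade, prob in beliefs.items())
    return max(candidates, key=expected_reward)


# Hypothetical subjective beliefs over grades on the 0-7 exam scale.
beliefs = {3.0: 0.2, 4.0: 0.3, 5.0: 0.3, 6.0: 0.2}
forecast = sum(grade * prob for grade, prob in beliefs.items())  # expected grade
best_rp = best_reference_point(beliefs, [3.0, 4.0, 5.0, 6.0])
```

With these hypothetical beliefs the expected grade is about 4.5, whereas the reward-maximizing reference point is 4, which matches the claim that a risk neutral student should choose a reference point below her grade forecast.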
In addition, participants' cognitive ability was measured by means of the abstract reasoning part of the differential aptitude test (DAT-AR for PCA, Bennett et al., 1974). Moreover, we elicited scores on the ABC scale. 23 The research by Nicholson et al. (2013) suggested that undergraduates' confidence in their ability is related to academic performance. Specifically, this study showed that students who, at the beginning of the semester, were confident about their grades also perform better in their end-of-semester marks. Given the type of course considered in this study, we are especially interested in two factors of the ABC scale: grades and study beliefs.
Using the aforementioned experimental design, we propose four hypotheses. Because monetary incentives induce individuals' more thoughtful guesses, they should help to bring goals and expectations closer to actual academic performance. In consequence,
Hypothesis 1 (H1) Introducing monetary incentives to elicit students' reference points should soften both potential and actual overconfidence by improving guesses.
22 They use a similar quadratic scoring rule as well as bets to incentivize poker and chess players' forecasts of relative performance in tournaments. The exam grade performance studied here resembles more the chess rather than the poker treatment, because students have a rather good idea of their abilities and would not be expected to treat their exam grades as purely random.
23 See Sander and Sanders (2009).
In addition, it is expected that monetary incentives will lead to more effort, producing a higher academic performance. Hence,
Hypothesis 2 (H2) Implementation of monetary incentives will increase actual grades.
Moreover, assuming that both cognitive ability and academic record are good predictors of academic aspiration, performance and lower overconfidence, we hypothesize that:
Hypothesis 3 (H3) Students with higher potential and actual skills will choose higher goals, achieve a better academic performance and show a lower level of overconfidence.
Finally, since self-reported studying and grade confidence should be related to students' aspirations,
Hypothesis 4 (H4) Students self-reporting a higher academic confidence should choose higher goals.
Summing up, by controlling for potential driving factors, such as skill and reported academic self-confidence, our experimental design aims at analyzing whether both overconfidence and Dunning-Kruger bias can be mitigated using monetary incentives to elicit students' reference points.

Methods
In this section, we offer detailed information on the measures used to elicit students' cognitive abilities and self-reported academic confidence, respectively.
1. The abstract reasoning part of the differential aptitude test for personnel and career assessment. The abstract reasoning (AR) scale of the DAT used in this experiment is included in the Spanish adaptation of DAT-5 by the publisher TEA (Cordero & Corral, 2006). This test is used as a non-verbal measure of reasoning ability and involves the capacity to think logically and to perceive relationships in figures made up of abstract patterns. It is considered as a marker of fluid intelligence (Colom et al., 2007) and the component of intelligence most related to general intelligence or g factor (McGrew, 2009). The advantage of this test is that it can be administered quickly, containing 40 multiple-choice items within a 20-minute time limit.
2. The academic behavioral confidence scale (Sander & Sanders, 2009). The ABC scale used in this research was the 24-statement version. These statements elicit the student's expectation of achieving good grades in assessments (grades), engaging in independent study (studying), attending lectures, tutorials and other taught sessions (attendance), and discussing material with tutors, lecturers, and peers (verbalizing). However, all analyses presented in this paper were computed only for 2 ABC subscales: grades and studying. In Sander et al. (2011), the ABC scale shows cross-cultural validity when translated into Spanish and administered to over 2,000 Spanish psychology students.

Sample self-selection
Because both the MI and the NMI groups consist of students who volunteered for the experiment and were then randomly assigned, potential self-selection problems do not affect differences between MI and NMI. Nevertheless, the baseline group allows us to test for self-selection bias by comparing prior midterm grades of participating and non-participating subjects. A Mann-Whitney test shows that differences between participant and non-participant groups are not statistically significant (p value: 0.1517).
Table 2 shows the sample split into NMI and MI and presents descriptive statistics, normalized to a scale from 0 to 10, for (1) subjects' self-chosen goals, (2) post-dictions, (3) grades and (4) POC and AOC. Moreover, we display additional descriptive statistics corresponding to: (1) score in the DAT-AR test, (2) (…). To check whether the lack of significance for some differences was due to low statistical power, we conducted an ex-post power analysis using Stata with power set at 0.80 and the significance level at 0.05. The last column of Table 2 shows the minimum sample size needed to find statistically significant differences. We observe that although self-chosen goals and post-dictions are similar in the NMI and the MI groups, both types of overconfidence are significantly higher when subjects' reference points are elicited without incentives (NMI group). Specifically, a POC of almost 2 points is reduced to a quarter when subjects' self-chosen goals are elicited using money. In the same manner, an AOC of 1.3 points disappears when subjects' post-dictions are obtained under monetary incentives. It is worth mentioning that this reduction of the students' overconfidence is not caused by a difference in their expectations but, rather, by an improvement in their performance. Moreover, this reduction in both POC and AOC is not influenced by cognitive ability or academic confidence/records, since these variables do not present significant differences across groups.
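The logic of a minimum-sample-size calculation can be sketched with the textbook normal-approximation formula for comparing two group means. This is an illustration under stated assumptions, not the Stata routine used in the paper: the function name, the difference in means and the standard deviation below are hypothetical, and only the 0.80 power and the 0.05 level come from the text.

```python
from math import ceil
from statistics import NormalDist


def min_n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Smallest n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta) ** 2,
    where delta is the difference in means to detect and sd the common
    standard deviation (both hypothetical inputs here).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Hypothetical example: detect a 1-point difference when grades have sd = 2.
n_needed = min_n_per_group(delta=1.0, sd=2.0)
```

Because the required size grows with the inverse square of the difference, halving the difference to be detected roughly quadruples the minimum sample, which is why small between-group differences can remain non-significant in samples of this size.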

Descriptive statistics and tests
In Fig. 1, means are presented and statistical differences are tested through a Wilcoxon test. Specifically, in the first row, we display means of self-chosen goals, post-dictions, and grades, split between NMI and MI.
For both groups, we find that the median of students' self-chosen goals is significantly higher than the median of their post-dictions. However, only NMI subjects obtain a median grade significantly lower than their median post-diction, since MI subjects' post-dictions of their grade are accurate. The 2nd row of Fig. 1 presents the mean values of POC and AOC for both groups of subjects, showing that real monetary incentives significantly reduce both POC and AOC medians, the latter being close to 0.
Figure 2 displays the confidence measured using self-chosen goals and post-dictions against the grade obtained in the midterm exams. The 45° line provides a benchmark, given that points above the line would represent overconfidence whereas points below the line would represent under-confidence. The dashed line is the minimum grade required to pass an exam. In the right panel of this figure, we can observe that most of the NMI subjects show both POC and AOC. Thus, this right panel reproduces the effect of overconfidence in general terms. This pattern is not observed for MI subjects (left panel). In fact, they show overconfidence for low grades and under-confidence for high grades, as can be expected when a regression to the mean effect is present.
Now, we analyze the relationship between subjects' confidence and cognitive ability. Figure 3 displays both potential and actual confidence for subjects included in both the NMI and the MI group against cognitive ability. For both groups, NMI and MI subjects, Fig. 3 shows no pattern relating confidence and cognitive ability. 25 Moreover, for both MI and NMI samples, subjects were divided into 2 groups according to their reasoning ability. A subject was classified as "high (low) reasoning" if her score was higher (lower) than the median score in the DAT-AR test. Using a Mann-Whitney test 26 we find that high reasoning incentivized students choose higher goals than low reasoning ones. However, there are no significant differences in post-dictions, grades and overconfidence between high reasoning and low reasoning participants in either sample, incentivized or non-incentivized.
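The median split followed by a Mann-Whitney comparison can be sketched as follows. This is a minimal illustration, not the procedure actually run for the paper: the p value uses the large-sample normal approximation without tie or continuity correction, the handling of scores exactly at the median is an assumption, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, median


def median_split(scores):
    """Label each score 'high' or 'low' relative to the sample median.

    Scores exactly at the median get None (an assumption; the text does not
    say how such cases were treated).
    """
    m = median(scores)
    return ["high" if s > m else ("low" if s < m else None) for s in scores]


def mann_whitney_u(x, y):
    """U statistic for x and a two-sided p value from the normal approximation."""
    pooled = sorted(list(x) + list(y))
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j for tied values
        i = j
    n1, n2 = len(x), len(y)
    u1 = sum(rank[v] for v in x) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p_value
```

In use, the labels returned by `median_split` on the DAT-AR scores would define the two groups whose goals, post-dictions, grades or overconfidence are then passed to `mann_whitney_u`.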
The same procedure was implemented using academic record, and studying and grades confidence variables. Regarding academic record, we find that incentivized students with better academic records do not set more ambitious goals or higher post-dictions but obtain better grades, resulting in a significant decrease in both potential and actual overconfidence. 27 This decrease is not significant 28 for the non-incentivized sample. Finally, in relation to self-reported confidence, we obtained no significant 29 differences between reference points, grades and overconfidence chosen by incentivized subjects with a higher studying/grade confidence and those with a lower one. However, non-incentivized highly self-confident students obtain significantly 30 better grades than those with a lower self-confidence, resulting in a significantly 31 lower overconfidence.
27 Mann-Whitney test p values obtained comparing incentivized high academic record students and low academic record ones corresponding to goals, post-dictions, grades, potential and actual overconfidence are 0.2756, 0.9144, 0.0101, 0.0242 and 0.0192 respectively.
28 Mann-Whitney test p values, comparing potential and actual overconfidence between non-incentivized high academic record students and low academic record ones, are 0.1170 and 0.1682 respectively.

Reference points
In this section, potential explanatory factors of self-chosen goals and post-dictions are explored. OLS models are estimated to explain both reference points elicited. 32 The potential driving factors used are: (1) cognitive ability, (2) academic record, (3) self-reported grades confidence and studying confidence, and (4) gender (1 if the student is a woman, 0 if the student is a man).
The first important result is related to cognitive ability. The direct relationship between cognitive ability and self-chosen goals only occurs when participants are monetarily rewarded. Moreover, cognitive ability does not play a role in explaining post-dictions independently of the type of incentives offered. Subjects with a better academic record report higher self-chosen goals and post-dictions only when they are not monetarily incentivized. In addition, a higher reported academic confidence does not lead to higher goals or post-dictions. Regarding the importance of the type of incentives used to elicit students' reference points, we find that the MI dummy is not significant: self-chosen goals and post-dictions by MI subjects are, on average, similar to those of NMI subjects. Finally, the reference points elicited do not differ across genders.
Therefore, contrary to H1, we can state that:
Result 1: Reference points are not affected by monetary incentives.
In addition, H3 is only partially confirmed in relation to subjects' skills and H4 is rejected:
Ancillary result 1: Under monetary incentives, we find a direct relationship between cognitive ability and subjects' ambition choosing their goals. However, reference points are not affected by students' academic record and reported academic confidence.
29 Mann-Whitney test p values, comparing incentivized high studying (grades) confidence subjects and low confidence subjects regarding goals, post-dictions, grades, potential and actual overconfidence, are 0.8705 (0.5051), 0.9385 (0.9730), 0.5562 (0.1767), 0.6438 (0.2102) and 0.6882 (0.2358) respectively.
30 Mann-Whitney test p value comparing grades between non-incentivized high studying (grades) confidence subjects and low confidence subjects is 0.0101 (0.0175).
31 Mann-Whitney test p values, comparing potential and actual overconfidence between non-incentivized high studying (grades) confidence subjects and low confidence subjects, are 0.0175 (0.1084) and 0.0903 (0.0747) respectively.
32 Our dependent variables are limited and, as Park and Santos-Pinto (2010) point out, this produces heteroskedasticity, which invalidates the usual statistical inference. Thus, we use robust standard errors in the OLS regressions. Additionally, following Park and Santos-Pinto (2010), we perform a logit transformation of the dependent variables to check the robustness of our results (see Appendix A2).
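The two safeguards described in footnote 32 can be illustrated with a small sketch. It is not the paper's specification: it uses a single regressor, the HC0 (White) form of the robust variance, and a logit transformation whose boundary adjustment `eps` is an assumption; all function names and data are hypothetical.

```python
from math import log, sqrt


def logit_transform(y, upper=10.0, eps=0.05):
    """Map a score bounded in [0, upper] to the real line.

    The share y / upper is clipped to [eps, 1 - eps] so that boundary scores
    stay finite (the clipping value is an assumption).
    """
    share = min(max(y / upper, eps), 1 - eps)
    return log(share / (1 - share))


def ols_slope_robust(x, y):
    """Slope of y on x (with intercept) and its HC0 robust standard error."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # HC0: weight each squared residual by its own squared deviation in x.
    variance = sum((xi - x_bar) ** 2 * e ** 2 for xi, e in zip(x, residuals)) / sxx ** 2
    return slope, sqrt(variance)


# Hypothetical data: x is an MI dummy, y a reference point on the 0-10 scale.
mi_dummy = [0, 0, 1, 1]
goals = [1.0, 3.0, 4.0, 8.0]
slope, robust_se = ols_slope_robust(mi_dummy, goals)
slope_logit, se_logit = ols_slope_robust(mi_dummy, [logit_transform(g) for g in goals])
```

With a dummy regressor the slope is simply the difference in group means, so the sign of `slope` and of `slope_logit` can be compared directly as a robustness check of the kind the footnote describes.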

Grades
In this section, determinants of students' grades are analyzed. Below we present OLS models explaining students' grades. 33 The first model shows that the student's academic record plays a crucial role in the grades obtained. The last 2 models indicate that this predictive power depends crucially on monetary incentives. In addition, we find that subjects' confidence in their study method positively affects grades, especially in the absence of monetary incentives. Moreover, we find that a higher cognitive ability predicts a better academic performance only for the NMI group. Regarding the size of the reference points elicited, only students' post-dictions were related to their grades, but the predictive power disappeared when they were monetarily incentivized. MI subjects obtain, on average, 1.08 points more than NMI subjects. Thus, real monetary incentives are effective as a means of significantly improving students' academic performance.
Thus, confirming H2, we can state that:
Result 2: Students' academic performance is improved by monetary incentives.
Moreover, partially confirming H3 regarding actual skills, we find that:
Ancillary result 2: Especially under monetary incentives, we find a direct relationship between academic record and grades.

Overconfidence
In this section, 4 OLS models are estimated to shed light on the determinants of subjects' overconfidence.
Regressions indicate that monetary incentives significantly reduce both POC and AOC. Specifically, in the presence of monetary incentives POC and AOC are on average 1.12 points lower than in the absence of monetary incentives. In conclusion: Result 3: By improving grades, monetary incentives reduce overconfidence in goal-setting and make it disappear in post-diction of grades.
In addition, from the OLS models, we obtain that students' academic records have a negative effect on both POC and AOC. That is, subjects with a better academic record show less overconfidence, especially when they are incentivized with money. The same pattern is found regarding the studying confidence scale, but only in the absence of monetary incentives. Therefore, subjects with more confidence in their study methods show lower overconfidence. In contrast, cognitive ability and gender do not explain POC or AOC. 34
Concerning the existence of a Dunning-Kruger bias, we can summarize our findings as follows. Our results confirm this phenomenon using cognitive ability as the measure of potential skill only when no incentives were offered. However, this effect disappears when monetary incentives are used. Specifically, although students with higher cognitive abilities choose higher goals, we do not find any relationship between cognitive ability and potential overconfidence when self-chosen goals are elicited using monetary incentives. In addition, fluid intelligence of MI subjects is not related to their post-dictions or to their actual overconfidence. Thus, we can state that, under monetary incentives and using cognitive ability as measure of subjects' (potential) skill, we do not find any evidence of the Dunning-Kruger effect.
33 Because the dependent variable has a limited variation, robust standard errors are used on OLS regressions. Furthermore, the results are robust to the transformation of the dependent variable (see appendix A2).
34 With one exception: a significant negative relation between cognitive abilities and AOC is found when subjects are not monetarily incentivized.
However, when we use students' academic record to measure their (actual) skill, we obtain that monetary incentives reinforce the presence of the Dunning-Kruger effect 35 observed to a lesser extent in the absence of monetary rewards. Therefore, we partly confirm H3 regarding overconfidence: Ancillary result 3: Using monetary incentives to elicit students' reference points, we (do not) find an inverse relationship between actual (potential) skills and overconfidence.
Lastly, we reject H4: Ancillary result 4: When reference points are elicited without monetary incentives, although there is no direct relationship between studying confidence and goals, there is a direct (inverse) one between studying confidence and grades (overconfidence). Moreover, the self-reported confidence in their grades does not correlate with their reference points, grades or overconfidence in any case.

Discussion
To the best of our knowledge, this paper is the first to study students' overconfidence using monetary incentives to analyze self-chosen goals and post-dictions as reference points. We consider money an effective incentive because it is a non-satiable good and, as Croson (2005) points out, "everyone values it, in contrast with extra-credit points or other grade-related rewards which may be valued only by students who are grade-conscious and/or whose grade may be affected by the outcome". 36 Both students' self-chosen goals and post-dictions are often elicited in the literature with non-incentive-compatible methods. Following Murstein (1965), multiple survey data confirming overconfidence have been collected using no incentives. 37 However, to motivate students in their task of forecasting, some contributions in the past decade have introduced two types of incentives: extra grade points and money. Miller and Geraci (2011), Magnus and Peresetsky (2018) and Caplan et al. (2018) use bonus points to encourage students to reveal their honest guesses about grades. In general, this type of incentive fails to show improvement in students' expectations or performance.
Monetary incentives were introduced in Feld et al. (2017) and Saenz et al. (2019). 38 Ehrlinger et al. (2008) and Gutiérrez and Schraw (2015) are among the few papers analyzing the role of monetary incentives in students' predictions, with mixed results. Specifically, Ehrlinger et al. (2008) find that even offering $100 to college students who are exactly correct in their prediction does not lead to more accurate estimates of the number of questions answered correctly. Gutiérrez and Schraw (2015) use a monetary reward of US$10, contingent upon meeting or exceeding the test performance criterion at posttest. They find that incentives improve calibration accuracy only when combined with a training strategy.
Our results contrast with this previous literature, suggesting the importance of monetary incentives as a means of reducing students' overconfidence through the improvement of performance. We find that monetary incentives do not cause students to put more effort into correct guesses but, rather, into exam performance. In particular, our results suggest that setting goals is more motivating in the presence of monetary incentives, making goals and actual performance converge by enhancing the latter.
Specifically, taking into account individual characteristics like skill and reported academic confidence, we obtain that students' overestimation of their potential achievements is significantly reduced when money is used to elicit their self-chosen goals. Rather than revising their goals down to match a lower skill, subjects brought their performance up to meet their aspirations. This effect is stronger in the case of subjects' actual overestimation of their grades, eliminating the bias.
Our study has some strengths and, undoubtedly, several limitations. Among the major strengths: (1) we use non-negligible monetary incentives to elicit students' self-chosen goals and post-dictions; (2) we control for potential driving factors like cognitive ability, academic record and self-reported academic confidence; and (3) we double-check potential self-selection effects in our sample. Our study also has some limitations, including: (1) limited sample size; (2) incentive effects dependent on the quadratic reward function introduced in our experimental design; and (3) uncontrolled factors which may partially be responsible for some of the differences reported here. With reference to the first limitation, using an ex-post power analysis, we check whether the lack of significance of some differences between treatment groups was due to a small sample size. The results show that the sample in each treatment group would have to increase to at least 942 subjects to find statistically significant differences at the 5% level in self-chosen goals, and to 157,344 subjects in post-dictions. In relation to the second limitation, the scoring rule used here is a conservative test for overconfidence, so the mitigation effects of monetary incentives could be smaller if subjects were rewarded by alternative scoring rules. Regarding the last limitation, other uncontrolled factors may partially account for some of the effects reported here. Thus, our results have to be interpreted cautiously when establishing causal relationships between monetary incentives and academic performance.
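To illustrate how an ex-post power calculation of this kind can produce very large required samples when the observed difference between treatment groups is small, the following is a minimal sketch in Python. It is not the procedure used in this paper, whose details are not reported here: it assumes a two-sided, two-sample comparison of means with equal group sizes, a common standard deviation, a normal approximation and a target power of 80%. The function name `required_n_per_group` and all input values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(mean_diff, sd, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect `mean_diff`.

    Assumptions (not taken from the paper): two-sided test, two independent
    groups of equal size, common standard deviation `sd`, normal
    approximation to the test statistic.
    """
    if mean_diff == 0 or sd <= 0:
        raise ValueError("mean_diff must be non-zero and sd must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    effect_size = abs(mean_diff) / sd            # standardized difference (Cohen's d)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: differences of 0.5 and 0.05 grade points, sd of 1 point.
    for diff in (0.5, 0.05):
        print(diff, required_n_per_group(diff, sd=1.0))
```

Because the required sample size grows with the inverse square of the standardized difference, a tenfold smaller difference requires roughly a hundredfold larger sample, which is consistent with the much larger figure reported for post-dictions than for self-chosen goals.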
Although this paper reveals the positive effect of monetary incentives on academic performance, policymakers might be concerned by (1) crowding-out effects (Gneezy et al., 2011) and (2) the financial resources needed. Regarding the argument that monetary incentives could crowd out students' intrinsic motivations, List et al. (2018) found only a limited temporary effect: one year later, non-incentivized tests were not negatively affected. With respect to the second concern, Herranz-Zarzoso and Sabater-Grande (2018) showed that less cost-intensive mechanisms, such as a rank-order tournament scheme, can be as effective as a piece-rate payment mechanism like the one adopted here.
Further research is needed to explore factors like personality traits to better explain subjects' overconfidence bias, and new tools reinforcing monetary incentives to soften this overconfidence, like experience and feedback on previous academic tasks.

Call instructions
The LEE Research Team of the University Jaume I is conducting a study to evaluate which factors contribute to good performance on this course. The team will use your responses, grades from this course and your academic records if you consent to participate. All the information will be anonymously associated with the findings of this research. You have the right to withdraw from the study at any time during the semester. If you withdraw there will be no consequences for you; your academic standing and record will not be affected.
If you consent to participate in this study, you will immediately be randomly assigned to one of two groups: group 1 or group 2. If you are assigned to group 1, you will be paid according to your decisions; if you are assigned to group 2, you will not be rewarded for them. You will soon receive an email informing you of your group, before you have to make any decision. Once you have been informed, you will have to set a goal for your final exam grade in this course. Remember that the final exam is worth 70% of your final course grade (7 points is the maximum grade to be achieved). You will be allowed to revise your goal until one day before the examination. All communications should occur over email.
Moreover, you will be asked to guess the grade you think you will get in the final exam immediately after completing it. Only one of the two decisions, chosen at random, will be used to determine your reward if you have been randomly assigned to group 1. In this case, your reward, in euros, will be equal to:
In addition, if you have declared your willingness to participate, you will be asked to perform two additional tasks (a 20-minute abstract reasoning test and a short question test).
Please respond to this mail if you consent to participate in this study.

Instructions for group 1 participants
You have been randomly assigned to group 1. Thus, your decisions can be monetarily rewarded. The first decision you need to make is choosing a goal for your final course grade (remember that your communication must be by email and the goal can be revised until one day before the examination). Please think carefully before setting your goal. Remember that the second decision you must make is to guess the grade you think you will get in the final exam, immediately after completing it (you will be asked to write your guess on the final page of the exam). Only one of the two decisions, chosen at random, will be used to determine your reward. For transparency, we will use one number of the Lotería Nacional raffle of May 26, 2018 for each class group (A, B, C, D, E and F). For each class group, we will use your first decision to reward you if the corresponding raffle number is 1, 2, 3 or 4, and we will use your second decision to reward you if the raffle number is 5, 6, 7, 8 or 9.
Remember that, in both decisions, you have to guess your grade out of 7 and, if you equal or exceed your guess, you will receive a monetary payoff in euros according to the following function:

Instructions for group 2 participants
You have been randomly assigned to group 2. Thus, your decisions will not be monetarily rewarded. The first decision you need to make is choosing a goal for your final course grade (remember that your goal can be revised until one day before the examination). The second decision you must make is to guess the grade you think you will get in the final exam immediately after completing it.

Logit transformation
Since reference points and grades are limited variables taking normalized values between 0 and 10, the results obtained by OLS can suffer from heteroskedasticity which invalidates statistical inference. To avoid this shortcoming, on the one hand, the standard errors reported in OLS regressions are robust to the presence of heteroskedasticity. On the other hand, to check if the results obtained are robust, we perform a logit transformation of our dependent variables based on reference points and grades. The logit transformation of self-chosen goals (SCG), post-dictions (POST), grades (G), potential overconfidence (POC) and actual overconfidence (AOC) are given by:

The regression is split into monetarily incentivized and non-monetarily incentivized subjects. Robust standard errors in parentheses. ***p<0.01, **p<0.05, *p<0.1.

Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work has been supported by the Spanish Ministerio de Ciencia, Innovación y Universidades (RTI2018-096927-B-100), Generalitat Valenciana (AICO/2018/036), Universitat Jaume I (UJI-B2018-76/77), and Ministerio de Educación Cultura y Deporte (FPU14/02966). N Georgantzis received support by the Agence Nationale de la Recherche (grant ANR-18-CE26-0018).
Availability of data and material Data are available and provided as supplementary information.

Robust standard errors in parentheses. ***p<0.01, **p<0.05, *p<0.1. Logit transformation implemented in the dependent variables.