We now report our findings under each of the two research questions set out above.
How are students’ confidence ratings related to their mean facility, age, gender and socioeconomic disadvantage, and how do these variables interact?
The correlation matrix given in Table 2 shows that there was a positive association between the mean facility (the proportion of questions correct) for each student, and their mean confidence (rs = .504, p < .001). In general, confidence was higher for boys (rs = .134, with female coded 0 and male coded 1) and for more socioeconomically advantaged students (rs = .094, with socioeconomically disadvantaged coded 0 and socioeconomically advantaged coded 1) but decreased with age (rs = −.140). Facility was higher for advantaged students (rs = .141) and decreased a little with age (rs = −.053, all ps < .001). The small p values in Table 2 for all of the correlations just mentioned, even those correlations that are small in absolute size, mean that the correlations are statistically significantly different from zero.
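To make the rank-based correlations with the binary-coded variables concrete, the following is a minimal sketch, assuming that rs denotes Spearman's rank correlation. The helper names and the toy data are hypothetical and are not taken from the study.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rs: the Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up toy data: gender coded 0 = female, 1 = male, as in Table 2.
gender = [0, 0, 0, 1, 1, 1]
confidence = [40.0, 55.0, 60.0, 50.0, 70.0, 80.0]
rs_toy = spearman(gender, confidence)
```

With this coding, a positive value means that the group coded 1 tends to have higher ranks on the other variable, which is how the signs for Gender and Advantaged in the text are read.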
Table 2 Correlation matrix for the 5 variables
We fitted two linear regression models, with standardised (Footnote 5) mean confidence as the dependent variable, and, among the predictors, we standardised Facility and Age, but not the two binary categorical predictors, Gender and Advantaged. The first model contained just the four predictors (Table 3); the second model included all of the 2-way interactions (Table 4).
Table 3 Multiple regression with 4 predictors of confidence
Table 4 Multiple regression with 4 predictors and all 2-way interactions
The regression coefficients for Facility and Age in this multiple regression model (Table 3) are consistent with the correlation matrix (Table 2). For Gender and Advantaged, it is not possible to directly compare the βs in Table 3 with the rs values in Table 2, since those variables were not standardised. However, by running models with single predictors of Confidence, we calculated that the regression coefficients for Gender and Advantaged (in separate single-predictor models) were 0.257 [95% CI = 0.212 to 0.303] and 0.245 [95% CI = 0.189 to 0.301], respectively. Comparing these with the values in Table 3, we can see that the regression coefficient for Gender is very similar, but the coefficient for Advantaged has decreased considerably (0.245 to 0.081), suggesting that the effect of Advantaged may be partially mediated through the other predictors. Below, we conduct a post hoc mediation analysis to investigate this.
It is clear that Facility is the dominant predictor, so it is important to consider the possible interaction of other predictors with Facility. Table 4 shows results from the multiple regression model which includes all 2-way interactions. Including the interaction terms does not appreciably affect the regression coefficients for Facility, Age and Gender, but, again, the coefficient for Advantaged has dropped a little further, from 0.081 to 0.056, and is no longer statistically significant. This suggests that the effects of Advantaged are now fully mediated through some or all of the other predictors. The only significant 2-way interaction is between Facility and Gender, and the regression coefficient for this is small (−0.052).
We now present more detailed analysis relating to each predictor.
Facility
Facility was by far the strongest predictor of Confidence (β = 0.522, p < .001), and the positive association between mean facility and mean confidence for each student was rs = .504 (p < .001). This is close to Foster’s (2016) previously reported correlation of r = .546 between facility and mean confidence for 11–14-year-old students in the topic of directed numbers (N = 345), meaning students’ level of calibration (see Fischhoff et al., 1977, p. 552) in the current study is comparable to this. However, it is clear from Fig. 3 and Table 5 that there are students at every combination of facility and confidence.
Table 5 The percentage of correct answers by confidence
To explore students’ calibration in more detail, we calculated each student’s mean confidence on the questions on which they were correct and their mean confidence on the questions on which they were incorrect (see scatterplot in Fig. 4a). To guard against extreme responses, in this analysis we only included students who had provided at least 50 answers, at least 5 of which were correct, and at least 5 of which were incorrect, giving a dataset of 115,437 answers from 1033 students. The fact that most of the points are above the diagonal line in Fig. 4a shows that students tended to show greater confidence on correct questions than on incorrect questions, but the strong positive correlation (rs(1031) = .889, p < .001) indicates that students who gave higher confidence scores tended to do so both for questions on which they were correct and for those on which they were incorrect. The histogram of differences in mean confidence score (Fig. 4b) is positively skewed, with a bulge near zero, indicating a large number of students who gave the same confidence level, whether or not their answer was correct. A Wilcoxon signed-rank test indicated that, on average, students were more confident with questions on which they were correct (Mdn = 82.8) than with those on which they were incorrect (Mdn = 68.8, Z = 11,255, rs = .956, p < .001). The percentage of correct answers increased markedly with confidence (Table 5). A Kruskal-Wallis test found that the number of correct attempts differed between answers given at different confidence levels (H(4) = 20,487.1, p < .001).
A similar analysis but by question (Fig. 5) showed a positive but weaker correlation (rs(1,137) = .477, p < .001). Similar to before, for this analysis, we only included questions with at least 50 answers, at least 5 of which were correct and at least 5 of which were incorrect, giving 110,283 responses across 1139 questions. For each student and each question, we calculated the standard deviation of the confidence. A Mann-Whitney U test indicated that the standard deviation of the confidence was greater for questions (Mdn = 30.8) than for students (Mdn = 22.2, U = 284,839.5, p < .001). This indicated that the confidence rating given was more strongly associated with the student than with the question.
Age
The correlation matrix given in Table 2 shows that mean confidence decreased with student age (rs = −.140, p < .001), and facility also decreased a little with age (rs = −.053, p < .001). To explore this further, we restricted the dataset to responses during one academic year: between September 2019 and May 2020 (N = 5382). We did this so that each student could be assigned to a single school year. We grouped the students into bands according to their school year: key stage 2 (ages 7–11), key stage 3 (ages 11–14) and key stage 4 (ages 14–16). Kruskal-Wallis tests comparing the parameter distributions between the groups found statistically significant differences among the key stages for the number of questions answered, the mean facility and also the mean confidence (see Table 6). For almost all facility levels, there is a clear decrease in confidence as age increases, and the drop in confidence from key stage 2 to key stage 3 appears to be generally larger than the drop from key stage 3 to key stage 4 (see Fig. 6).
Table 6 Mean confidence and facility by key stage
To investigate the possibility that the decrease in confidence with increasing age might be mediated by the difficulty of the mathematics, we conducted a post hoc mediation analysis using the statsmodels mediation package in Python (Seabold & Perktold, 2010) to compute 95% confidence intervals (95% CI) over 1000 simulations to test for significant indirect effects (Fig. 7). Age displayed a significant direct effect on Confidence (β = −0.117, 95% CI = −0.136 to −0.097, p < .001) and a significant indirect effect on Confidence, with Facility as mediator (β = −0.027, 95% CI = −0.046 to −0.008, p = .008). Age was associated with Confidence, but only 18.7% (95% CI = 6.2% to 29.4%, p = .008) of this relationship was mediated by decreased Facility.
Gender
Table 7 indicates that mean confidence was higher for boys than for girls (rs = .134, p < .001) and, when analysed by decade of facility (see Fig. 8), the same pattern is striking across all levels of facility. Mann-Whitney U tests found no significant difference between boys and girls on the number of questions answered or on the number of questions answered correctly (Table 7), suggesting that boys’ higher confidence constitutes overconfidence. We found a significant interaction between Facility and Gender (β = −0.052, p = .011), meaning that confidence increases more slowly with Facility for boys than it does for girls. This means that the overconfidence of boys is more marked with lower-attaining students.
Table 7 Mean confidence and facility by gender
Advantage
As noted before, both Confidence and Facility were higher for more socioeconomically advantaged students (rs = .094 and rs = .141, respectively, both ps < .001). When we included all 2-way interactions in our multiple regression model, the coefficient for Advantaged became nonsignificant, suggesting that the effects of Advantaged were fully mediated through some or all of the other predictors. To investigate this, we again conducted a post hoc mediation analysis using the statsmodels mediation package in Python (Seabold & Perktold, 2010), this time on the full dataset (N = 7302), to compute 95% confidence intervals (95% CI) over 1000 simulations (Fig. 9). Advantage displayed a significant direct effect on Confidence (β = 0.082, 95% CI = 0.030 to 0.128, p < .001) and a significant indirect effect on Confidence with Facility as mediator (β = 0.166, 95% CI = 0.136 to 0.197, p < .001). Advantage was associated with Confidence, and 67.0% (95% CI = 55.0% to 84.4%, p < .001) of this relationship was mediated by increased Facility.
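As a rough arithmetic check on the two mediation analyses (Age and Advantage), the proportion mediated can be approximated as the indirect effect divided by the total effect (direct plus indirect). The sketch below uses the rounded coefficients quoted in the text; the function name is hypothetical, and because the reported percentages come from 1000 simulations, the ratio of rounded point estimates is only expected to agree approximately.

```python
def proportion_mediated(direct, indirect):
    """Share of the total effect (direct + indirect) carried by the mediator."""
    total = direct + indirect
    return indirect / total


# Rounded coefficients as quoted in the text, with Facility as the mediator.
age_share = proportion_mediated(direct=-0.117, indirect=-0.027)
advantage_share = proportion_mediated(direct=0.082, indirect=0.166)

print(f"Age: {age_share:.1%} of the total effect is indirect")
print(f"Advantage: {advantage_share:.1%} of the total effect is indirect")
```

Both ratios land close to the reported 18.7% and 67.0%. For Advantage, the implied total effect (0.082 + 0.166 = 0.248) is also close to the single-predictor coefficient of 0.245 reported earlier.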
Figure 8 presents the distribution of students (advantaged and disadvantaged) by decade of facility, showing a peak for the disadvantaged students in the 40–50 facility interval, whereas for the advantaged students, the peak is in the 60–70 facility interval. Two-sided Mann-Whitney U tests found that disadvantaged students answered fewer questions than advantaged students (rs = −.086, p < .001), answered fewer questions correctly (rs = −.199, p < .001), and the mean confidence for disadvantaged students was significantly lower than for advantaged students (rs = −.133, p < .001) (Table 8). For most (but not all) decades of facility, confidence was lower for disadvantaged students (see Fig. 8).
Table 8 Mean confidence and facility by disadvantage
Table 9 Students attempting an analogous question, following an incorrect answer, and the proportion of these which were correct
Is there evidence for the hypercorrection effect in students’ responses to a second set of questions (quiz B) administered 3 weeks after the first (quiz A)?
The hypercorrection effect (Butterfield & Metcalfe, 2001) predicts that a student who answers a question incorrectly but with high confidence will be more likely to be successful with the same or a similar question subsequently. To test this in our data, we analysed the 86,144 answers from a total of N = 7002 students within quiz A sessions where the answer was incorrect, a confidence rating was given and an analogous question was subsequently assigned 3 weeks later in quiz B (see Section 3.1). Table 9 and Fig. 10 show that the percentage of students making second attempts increased with the confidence level expressed on the first attempt (Footnote 6). We see a clear increase in facility with confidence, which might seem to demonstrate the hypercorrection effect: the students who were more confident about their original incorrect answer were more likely to answer correctly 3 weeks later in quiz B. However, it is important to distinguish hypercorrection from regression to the mean (Baguley, 2012), where facility is a confounder. Students expressing high confidence in quiz A, despite being incorrect on that occasion, are likely on average to be higher-facility students, since facility and confidence are correlated. Consequently, even without a hypercorrection effect, they would be expected to be more likely to succeed on quiz B anyway. This means that, in order to tease out any hypercorrection effect of confidence over and above an “ability effect”, we need to carry out a logistic regression.
The data consisted of 44,524 incorrect answers by 3838 students who had attempted analogous questions 3 weeks later in quiz B (see Table 10). Of these answers, 19,885 (44.7%) were followed by a correct answer to the quiz B question and 24,639 (55.3%) by an incorrect one. The hypercorrection hypothesis is that the probability that the quiz B question is answered correctly is higher when the student’s confidence in their original mistake was higher, after controlling for facility.
Table 10 Distribution of data for logistic regression
In order to make as few modelling assumptions as possible, we fitted a five-predictor logistic model, using the Logit method in the statsmodels package, version 0.10.1 (Seabold & Perktold, 2010), so as to allow effects of any of these predictors to be accounted for. The model may be expressed as:
$$ \mathrm{logit}(Y)={\beta}_0+{\beta}_1{X}_1+{\beta}_2{X}_2+{\beta}_3{X}_3+{\beta}_4{X}_4+{\beta}_5{X}_5 $$
where the outcome variable Y is whether the quiz B question was answered correctly (0 = incorrect, 1 = correct), X1 is the mean student facility on quiz A (0–100), X2 is the student’s confidence in their original incorrect response on quiz A (0–100), X3 is the student’s age (6–16), X4 indicates whether the student was advantaged (0 = disadvantaged, 1 = advantaged) and X5 is the student’s gender (0 = female, 1 = male).
The result was:
$$ \begin{aligned} \mathrm{Predicted}\ \mathrm{logit}\ \mathrm{of}\ \left(\mathrm{Correct}\ \mathrm{Quiz}\ \mathrm{B}\ \mathrm{Answer}\right) = {} & -1.4578 \\ & +0.0235\times \mathrm{Student}\ \mathrm{Facility} \\ & +0.0028\times \mathrm{Confidence}\ \mathrm{in}\ \mathrm{Quiz}\ \mathrm{A}\ \mathrm{Mistake} \\ & -0.0188\times \mathrm{Age} \\ & +0.0402\times \mathrm{Is}\ \mathrm{Advantaged} \\ & -0.0461\times \mathrm{Is}\ \mathrm{Male} \end{aligned} $$
According to the model (see Table 11), the log odds of a student answering correctly was, as expected, positively related to their overall facility (p < .001). However, and confirming our hypercorrection hypothesis, it was also positively related to their confidence in their quiz A mistake (p < .001). The log odds of a student answering correctly was negatively related to their age (p < .001) and to gender (p = .019). Whether they were advantaged was not statistically significant (p = .105).
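To show what the fitted equation implies on the probability scale, the sketch below applies the inverse logit to the reported coefficients. The function name and the example student (facility 60, age 12, advantaged, female) are hypothetical values chosen purely for illustration.

```python
import math

# Coefficients as reported in the fitted equation above.
COEFFICIENTS = {
    "intercept": -1.4578,
    "facility": 0.0235,     # mean student facility on quiz A (0-100)
    "confidence": 0.0028,   # confidence in the quiz A mistake (0-100)
    "age": -0.0188,         # age in years
    "advantaged": 0.0402,   # 0 = disadvantaged, 1 = advantaged
    "male": -0.0461,        # 0 = female, 1 = male
}


def predicted_probability(facility, confidence, age, advantaged, male):
    """Predicted probability of a correct quiz B answer (inverse logit)."""
    logit = (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["facility"] * facility
        + COEFFICIENTS["confidence"] * confidence
        + COEFFICIENTS["age"] * age
        + COEFFICIENTS["advantaged"] * advantaged
        + COEFFICIENTS["male"] * male
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical student, differing only in confidence in the quiz A mistake.
p_low = predicted_probability(facility=60, confidence=25, age=12, advantaged=1, male=0)
p_high = predicted_probability(facility=60, confidence=100, age=12, advantaged=1, male=0)
print(f"confidence 25: {p_low:.3f}, confidence 100: {p_high:.3f}")
```

Holding the other predictors fixed, the predicted probability rises with confidence in the original mistake, which is the direction the hypercorrection hypothesis predicts.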
Table 11 Logistic regression analysis of 44,524 students’ answers to questions 3 weeks after answering a similar question incorrectly
In other words, the higher the student’s confidence in their quiz A mistake, the more likely it was that the student answered the quiz B question correctly, even after accounting for overall facility. The odds ratio when increasing from one emoji (e.g., 25) to the next (e.g., 50) was \( {e}^{25{\beta}_2}=1.07 \). In this dataset, the hypercorrection effect appeared to be stronger for younger students than for older students, and stronger for girls than for boys.
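The odds ratio quoted above follows directly from the confidence coefficient. A minimal check, assuming adjacent emoji levels are 25 points apart on the 0–100 confidence scale (the variable names are illustrative):

```python
import math

beta_confidence = 0.0028  # reported coefficient for confidence in the quiz A mistake
emoji_step = 25           # assumed spacing between adjacent confidence levels

# Multiplicative change in the odds of a correct quiz B answer.
odds_ratio_per_step = math.exp(emoji_step * beta_confidence)
odds_ratio_full_range = math.exp(100 * beta_confidence)

print(f"one emoji step: {odds_ratio_per_step:.2f}")
print(f"lowest to highest confidence: {odds_ratio_full_range:.2f}")
```

Because odds ratios multiply, moving across the whole 0–100 scale corresponds to the one-step ratio raised to the fourth power.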
Finally, we note that, as Fig. 11 suggests, confidence in quiz A is positively correlated with facility in quiz A, rs(1,654) = .509, p < .001, and the relationship for quiz B is almost identical, rs(1,654) = .502, p < .001, indicating that students were similarly well calibrated in both quizzes. Their calibration did not measurably change across the intervening 3-week period.