The thought probes used in Experiments 1 and 2 have two major shortcomings. First, each experiment had only a small number of probed trials. To be consistent with the existing literature on task-unrelated thoughts, we chose only to probe participants about their thoughts on a small subset of trials. However, because we only probed a small percentage of trials, we could not take full advantage of the trial-by-trial resolution afforded by the whole-report working memory measure. Second, we could not objectively measure the accuracy of subjects’ meta-awareness of internal states. Instead, we had to take participants’ ratings of their internal states at face value.
In Experiments 3a and 3b, we instead had observers report subjective confidence for each item that they reported. By collecting both confidence ratings and accuracy for every item and every trial, we had more power to examine trial-by-trial relationships between accuracy and confidence. Further, because subjective ratings were on the same scale as accuracy (number of items), we could directly measure bias in metacognition. Because participants had some number of working memory failures even when they reported being fully attentive, we predicted that participants would have a positive bias in confidence ratings, particularly for failure trials. We further predicted that individuals with poor working memory performance would suffer the “dual burden” of poor metacognitive insight (Kruger & Dunning, 1999).
In Experiment 3a, we repeated the same challenging set size (six items) for a large number of trials (300). We collected both accuracy and confidence ratings for each item in order to examine trial-by-trial fluctuations in working memory performance. Once again, participants could report the items in any order they chose. In Experiment 3b, we replicated the manipulation in Experiment 3a and also added a control condition in which the computer randomly determined the order in which participants must report the items. This random-response order condition allowed us to estimate and control for the effects of output interference in Experiment 3a.
Materials and method
There were 45 participants in Experiment 3a and 38 in Experiment 3b. One subject was excluded from Experiment 3a for failure to comply with task instructions, leaving 44 participants for analysis. Four participants were excluded from Experiment 3b for the following reasons: failing to complete both tasks (one subject), chance-level performance (one subject), or failure to comply with task instructions (two subjects). Some aspects of the data from Experiment 3a have been previously reported (Adam et al., 2015, Experiment 1b), but all analyses presented here are novel. Participants in both experiments also completed a color change detection task at the end of the experiment (results not reported in this study).
Stimuli and timing parameters were identical to those in Experiment 1. In the random response-order condition of Experiment 3b, the to-be-reported square was indicated by a light gray box drawn around the response pad (RGB = 170 170 170).
Procedures for Experiment 3a
Participants completed 10 blocks of 30 trials (300 trials total); all arrays were Set Size 6, and colors were chosen without replacement from the set of nine possible colors. By using arrays that were only one set size, we could examine fluctuations in performance that were disentangled from differences in difficulty from trial to trial. At test, participants could report the items in any order they chose. While responding, participants were instructed to report their confidence in each response by using the left and right mouse buttons. Participants were instructed to click their color choice with the left mouse button if they felt they had any information in mind about the color of the item. Likewise, they were instructed to click their color choice with the right mouse button if they felt they had no information in mind about the color of the item.
Procedures for Experiment 3b
Participants completed two conditions of the whole-report task (60 trials per condition): free response order and random response order. The order of the two conditions was counterbalanced across participants. As in Experiment 3a, all arrays were Set Size 6, and colors were chosen without replacement from the set of nine possible colors. The free response-order condition was identical to Experiment 1a; participants were allowed to report the six items in any order they wished. In the random response-order condition, participants instead had to report the items in an order dictated by the computer. At the beginning of the response period, the computer indicated which item must be reported by drawing a light gray frame around the item. After the participant responded to the probed item, the computer moved the frame to the next to-be-reported item. This process was repeated until the subject had made a response for every item. In both conditions, participants reported confidence in each item using the left and right mouse buttons as in Experiment 3a.
Participants correctly identified an average of 2.88 items (SD = .49) and reported being confident about 3.04 items (SD = .52) out of six possible items. There was no significant difference between the mean number of correct items and the mean number of confident items, t(43) = 1.64, p = .11, 95% CI [−.04, .36]. However, looking at the full distribution of responses reveals some systematic differences in the underlying distribution of confident responses relative to correct responses (see Fig. 5a). Specifically, participants appear to have overreported their modal performance outcome (three items).
In addition to looking at total trial performance, we can look at confidence and accuracy for each individual response within the trial. All trials were Set Size 6, so participants made six responses total. Figure 5b shows proportion correct and confident as a function of response number for all trials. As participants were free to report the items in any order they chose, performance and confidence were initially high (for the first three responses) and then dropped precipitously at Response 4. On lapse trials (zero or one correct), however, there was a stronger disconnect between performance and confidence. Here, accuracy was above chance for the first response but quickly fell to below-chance levels for later responses. Despite this pattern of performance, participants still reported that they were confident in the first three responses.
Next, we wanted to more formally test the predictions that (1) there is a reliable trial-by-trial relationship between accuracy and confidence and (2) despite this reliable relationship, participants underestimate failures (zero or one correct). For each individual subject, we calculated the correlation coefficient between the number of correct responses and the number of confident responses. The average correlation value was r = .34 (SD = .16, average p < .05), and 40 out of 44 participants had statistically reliable within-subject correlations (p < .05). To quantify awareness of failure trials, we calculated a lapse sensitivity measure (lapses detected / total number of lapses). That is, of all the trials on which participants got zero or one items correct, what proportion of the time did they report being confident about zero or one items? Average sensitivity was only .28 (SD = .19), indicating that participants accurately caught extreme failures only about a quarter of the time. Although d-prime is a more commonly used measure of discriminability, we could not use it here because several participants had hit rates or false alarm rates of zero (yielding d-prime values of ± infinity). Average hit rate in Experiment 3a was 27.5% (SD = 19.0%), and average false alarm rate was 3.4% (SD = 4.3%).
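The two per-subject metrics described above can be sketched in a few lines. This is a minimal illustration, not the study's analysis code, and the arrays below are simulated for demonstration only:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Illustrative per-trial data for one subject (300 trials, set size 6):
# number of correct responses and number of confident responses per trial.
n_correct = rng.integers(0, 7, size=300)
n_confident = np.clip(n_correct + rng.integers(-1, 3, size=300), 0, 6)

# Trial-by-trial resolution: within-subject correlation between the
# number of correct and the number of confident responses.
r, p = pearsonr(n_correct, n_confident)

# Lapse sensitivity: of all trials with <= 1 correct item, on what
# proportion did the subject also report <= 1 confident item?
lapse = n_correct <= 1
lapse_sensitivity = np.mean(n_confident[lapse] <= 1)
```

Both quantities would then be computed per subject and summarized across the sample.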
Next, we asked whether there were systematic differences in the accuracy of metacognition as a function of overall performance. To do so, we divided participants into quartiles and examined actual performance (correct items) versus perceived performance (confident items). We ran two mixed ANOVA models using Metaknowledge (actual vs. perceived) as a within-subjects factor and Quartile as a between-subjects factor to predict (1) mean number correct and (2) lapse rate.
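As a rough sketch of this analysis pipeline (using simulated numbers for illustration, not the study's data), the quartile split and the per-quartile discrepancy between perceived and actual performance can be computed as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects = 44

# Illustrative per-subject summaries: actual mean performance, and
# perceived performance with a simulated positive bias.
actual = rng.normal(2.9, 0.5, n_subjects)
perceived = actual + rng.normal(0.2, 0.3, n_subjects)

# Quartile split on actual performance (between-subjects factor):
# 0 = lowest quartile, 3 = highest quartile.
boundaries = np.quantile(actual, [0.25, 0.5, 0.75])
quartile = np.searchsorted(boundaries, actual)

# Metacognitive discrepancy (perceived - actual) within each quartile;
# the Dunning-Kruger pattern predicts the largest gap in quartile 0.
gaps = [np.mean(perceived[quartile == q] - actual[quartile == q])
        for q in range(4)]
```

A 2 × 4 mixed ANOVA (e.g., via a package such as pingouin's `mixed_anova`) would then test the Metaknowledge × Quartile interaction on these data.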
Consistent with the Dunning-Kruger effect, poor performers showed a larger discrepancy between perceived and actual performance (see Fig. 6). There was a significant main effect of Quartile on lapse rate, F(3, 40) = 42.6, p < .001, ηp² = .76. There was a significant main effect of Metaknowledge, indicating that reported lapse rates were significantly lower than actual lapse rates, F(1, 40) = 34.2, p < .001, ηp² = .46. Critically, there was an interaction between Metaknowledge and Quartile, indicating that the difference between perceived performance and true performance was larger for poor performers relative to good performers, F(3, 40) = 8.13, p < .001, ηp² = .38.
We found the same effects for mean performance as for lapse rate. There was a significant main effect of Quartile on mean performance, F(3, 40) = 15.0, p < .001, ηp² = .53. There was a significant main effect of Metaknowledge, indicating that reported mean performance was significantly higher than actual mean performance, F(1, 40) = 6.4, p = .016, ηp² = .14. Finally, there was an interaction between Metaknowledge and Quartile, indicating that the difference between perceived performance and true performance was larger for poor performers relative to good performers, F(3, 40) = 6.47, p = .001, ηp² = .33.
We used a quartile-split method to investigate the Dunning-Kruger effect because that is the prevailing standard in the literature. To supplement and strengthen this analysis, we computed the correlation coefficient between average performance (mean number correct) and the metaknowledge metrics summarized above. There was a significant negative correlation between lapse awareness (actual lapse rate – perceived lapse rate) and overall performance, r = −.67, p < .001, 95% CI [−.81, −.47], indicating that lower-performing participants were more overconfident during lapses. There was also a significant correlation between mean performance awareness (mean number correct – mean number confident) and overall performance, r = .59, p < .001, 95% CI [.35, .75]. We also examined our metaknowledge correlation metric (the strength of the single-trial confidence–accuracy correlation) and our lapse sensitivity metric (percentage of lapses caught). There was a significant relationship between the metaknowledge correlation metric and average performance, r = .47, p = .001, 95% CI [.20, .67], but no relationship between lapse sensitivity and average performance, r = .13, p = .39, 95% CI [−.17, .41].
Participants typically reported confidence in the first three reported items, and we interpreted this as evidence that participants had metaknowledge of item quality. That is, they chose to report their best-remembered items first. An alternative explanation, however, is that late responses have low accuracy only because of output interference. On this account, participants may have reported being confident early in the trial without regard to the quality of the remembered items. To disentangle item-level metaknowledge from output interference, we had a new group of participants complete a free response-order condition (replicating Experiment 3a) as well as a random response-order condition, in which the computer randomly chose the order in which participants had to respond to the items.
Average performance was slightly higher during the free response-order condition (M = 2.96, SD = .44) than during the random response condition (M = 2.58, SD = .61), t(33) = 4.98, p < .001, 95% CI [.22, .53] (see Fig. 7a). The difference in accuracy for the first three responses versus the last three responses was strongly attenuated in the random response-order condition (see Fig. 7b). In the free response-order condition, participants had a mean accuracy of 78.3% (SD = 9.5%) on the first three responses and 20.2% (SD = 8.0%) on the last three responses. The average difference was 58% (SD = 9.6%), t(33) = 35.4, p < .001, 95% CI [55%, 61%]. On the other hand, the average difference between the first three and last three responses in the randomized order was only 7.6% (SD = 7.4%), t(33) = 5.95, p < .001, 95% CI [5%, 10%]. These results suggest that the decline in accuracy across responses in the free response-order condition was not due solely to output interference. Instead, this pattern of results suggests that subjects successfully stored the same number of items as in the free-recall procedure (e.g., three), but the random probing procedure distributed these accurate responses across all response positions.
Figure 8 shows performance and confidence at the trial level and at the response level in the free response-order condition. On average, participants reported that they were confident for 3.4 items (SD = .93) in the free-response condition, and this was significantly higher than the number of accurate items, t(33) = 2.70, p = .01, 95% CI [.11, .80]. As in Experiment 3a, participants underreported low-performance trials and overreported modal trials (three correct) and high-performance trials (six correct). When looking at responses for all trials (see Fig. 8b), confident and correct responses were both predominately early in the trial (first three responses). Likewise, on failure trials (see Fig. 8c), participants were likely to report that they were confident on the first three responses.
Figure 9 shows performance and confidence at the trial level and at the response level in the random response-order condition. On average, participants reported that they were confident about 3.1 items (SD = .74) in the random-response condition, and this was significantly higher than the number of correct items, t(33) = 3.61, p = .001, 95% CI [.22, .80]. At the trial level (see Fig. 9a), we once again replicated the general pattern that participants overreported modal trials and underreported poor performance trials. On the other hand, we observed that participants’ confident responses were spread more evenly among response position, both for all trials (see Fig. 9b) and for lapse trials (Fig. 9c). We once again saw that participants were vastly overconfident on lapse trials (Fig. 9c), but this was not due to a response bias whereby participants always reported they were confident on the early responses. Instead, participants were confident on a specific subset of items, and the random probing procedure spread confident responses more equally across early and late responses.
We again quantified subject metaknowledge using within-subject correlations between the number confident and the number correct for each trial. In the free response-order condition, the average correlation coefficient was .29 (SD = .24, average p = .16); 20 out of 34 participants had significant correlation coefficients. In the random response-order condition, the average correlation coefficient was .38 (SD = .24, average p = .09); 28 out of 34 participants had significant individual correlation coefficients. Note that these correlation values are numerically similar to those from Experiment 3a. However, because only 60 trials were used to construct each correlation (as opposed to 300), relatively fewer individual participants reached the significance threshold. Combining both conditions (120 trials total), we found an average correlation coefficient of .35 (SD = .23, average p = .07), and 29 out of 34 participants had a significant within-subject correlation between the number of confident responses and the number of correct responses. We also quantified lapse sensitivity in both conditions. In the free response-order condition, participants had an average lapse sensitivity of .22 (SD = .29); in the random response-order condition, it was .31 (SD = .29). Combined across both order conditions, lapse sensitivity was .28 (SD = .27). Once again, participants had poor metaknowledge for extreme failure trials, detecting on average little more than a quarter of them.
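The dependence of the significance threshold on trial count can be made concrete: for a two-tailed Pearson correlation test at α = .05, the smallest correlation that reaches significance shrinks as the number of trials grows. A minimal sketch, derived from the standard t-test for a correlation coefficient (the function name is ours):

```python
import numpy as np
from scipy.stats import t

def critical_r(n, alpha=0.05):
    """Smallest |r| reaching two-tailed significance with n observations,
    from t = r * sqrt((n - 2) / (1 - r**2)) solved for r."""
    t_crit = t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / np.sqrt(n - 2 + t_crit ** 2)

# With 60 trials a correlation must be roughly twice as strong to reach
# significance as with 300 trials.
r60, r300 = critical_r(60), critical_r(300)
```

This is why correlations of similar average magnitude yielded fewer individually significant participants with 60 trials per condition than with 300.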
Finally, we examined whether low performers again showed a deficit in metacognitive awareness. For this analysis, we combined trials from the free and random response-order conditions and examined metacognitive bias (perceived vs. actual performance) as a function of overall performance. We again ran mixed ANOVA models with the within-subjects factor Metaknowledge (perceived performance vs. actual performance) and the between-subjects factor Quartile to examine metacognitive bias for lapse rate and mean performance.
Despite fewer trials (120 in Experiment 3b vs. 300 in Experiment 3a), we replicated the overall pattern of results from Experiment 3a (see Fig. 10). First, we used lapse rate as our performance metric. There was a significant main effect of Quartile on lapse rate, F(3, 30) = 27.8, p < .001, ηp² = .74. There was also a significant main effect of Metaknowledge, indicating that perceived lapse rates were significantly lower than actual lapse rates, F(1, 30) = 50.9, p < .001, ηp² = .63. Critically, there was an interaction between Metaknowledge and Quartile, indicating that the difference between perceived performance and true performance was larger for poor performers relative to good performers, F(3, 30) = 6.03, p = .002, ηp² = .38. Second, we used mean performance as our performance metric. There was a significant main effect of Quartile on mean performance, F(3, 30) = 7.4, p = .001, ηp² = .43. There was a significant main effect of Metaknowledge, indicating that perceived mean performance was higher than actual performance, F(1, 30) = 16.2, p < .001, ηp² = .35. The interaction between Metaknowledge and Quartile was numerically similar to that observed in Experiment 3a but did not reach significance, F(3, 30) = 1.8, p = .17, ηp² = .15.
We again computed the correlation coefficient between average performance (mean number correct) and each metaknowledge metric. There was once again a significant negative correlation between lapse awareness (actual lapse rate – perceived lapse rate) and overall performance, r = −.72, p < .001, 95% CI [−.85, −.50], indicating that lower-performing participants were more overconfident on lapse trials. Likewise, there was a significant correlation between overall performance awareness (mean number correct – mean number confident) and overall performance, r = .47, p = .005, 95% CI [.15, .70]. We also examined our metaknowledge correlation metric (the strength of the single-trial confidence–accuracy correlation) and our lapse sensitivity metric (percentage of lapses caught). There was no significant relationship between the metaknowledge correlation metric and average performance, r = .21, p = .23, 95% CI [−.14, .51], or between lapse sensitivity and average performance, r = .22, p = .22, 95% CI [−.13, .52].
Individual differences combined across Experiments 3a and 3b
We combined data across Experiments 3a and 3b to further illustrate individual differences in performance awareness (see Supplementary Figures S6–S9). We found a significant correlation between lapse awareness (actual lapse rate – perceived lapse rate) and overall performance, r = −.82, p < .001, and a significant correlation between mean performance awareness (mean number correct – mean number confident) and overall performance, r = −.54, p < .001. In addition, our correlation metric predicted overall performance, r = .33, p = .003, but our lapse sensitivity metric did not, r = .17, p = .14. To examine the robustness of these effects, we also computed the split-half reliability of each metric. Split-half reliability was very high for lapse awareness (perceived – actual, r = .90), mean performance awareness (perceived – actual, r = .98), and confidence–accuracy correlation strength (r = .75). On the other hand, split-half reliability was rather poor for the lapse sensitivity metric (r = .48), suggesting that individual differences in this particular metric would be difficult to interpret.
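The split-half procedure can be sketched as follows. The data are simulated for illustration, and the odd/even trial split plus Spearman-Brown correction are common conventions that we assume here; the text does not specify the exact splitting scheme:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_subjects, n_trials = 78, 120

# Illustrative per-subject, per-trial overestimation scores (confident
# minus correct items), with a stable subject-level bias component.
overestimation = (rng.normal(0.3, 1.0, (n_subjects, n_trials))
                  + rng.normal(0, 0.5, (n_subjects, 1)))

# Split-half reliability: compute the metric separately on odd- and
# even-numbered trials, then correlate the two halves across subjects.
odd_half = overestimation[:, 0::2].mean(axis=1)
even_half = overestimation[:, 1::2].mean(axis=1)
r_half, _ = pearsonr(odd_half, even_half)

# Optional Spearman-Brown correction to estimate full-length reliability.
r_sb = 2 * r_half / (1 + r_half)
```

The same procedure, applied with each metric in place of the mean overestimation score, yields the reliability estimates reported above.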
Using a whole-report measure of working memory confidence, we found that observers had reliable knowledge of the number of items stored on a given working memory trial. Confidence ratings, like accuracy, fluctuated from trial to trial. Overall, participants had excellent insight into the number of items stored in working memory. The number of correct items consistently correlated with the number of confident items on a trial-by-trial basis. However, resolution (correlation) and bias (over- or underconfidence) are dissociable aspects of metacognition (Koriat, 2007). While confidence and accuracy correlated, participants were particularly likely to underreport failure trials. On average, participants only correctly identified about 28% of lapse trials.
Importantly, observers’ reliable metaknowledge was not an artifact of response order or temporal delay. In Experiment 3a, observers were allowed to report the items in any order they chose. Consequently, both the correct items and confident items were the first items reported in the trial. As such, observers could simply report that they were confident about the early items without having awareness of item-by-item accuracy. In Experiment 3b, we replicated this pattern for freely ordered responses, and we also added a condition where participants had to respond to the items in a randomized order. In the random order condition, response order was far less predictive of accuracy. We once again found a reliable relationship between the number of confident items and the number of correct items, although now the confident responses were distributed more equally among responses due to the random probing procedure. The random response-order condition revealed that output interference did not account for the precipitous decline in accuracy across responses in the free response-order condition. Rather, participants were aware of and chose to report their best remembered items first. When the computer forced participants to report items in a randomized fashion, the decline in performance was much less severe (7% relative to 58% from the first three to the last three responses).
Finally, we examined individual differences in the discrepancy between perceived performance (confidence) and actual performance (accuracy). Previous work has shown that low-performing individuals have particularly inflated estimates of how their own performance compares to others’ (i.e., the Dunning-Kruger effect; Kruger & Dunning, 1999), and that they also overestimate their raw performance (Ehrlinger, Johnson, Banner, Dunning, & Kruger, 2008). Here, we replicate the finding that low-performing individuals overestimate their raw performance relative to high-performing individuals. There was a significant interaction between participants’ quartile and misestimation of lapse rates (Experiments 3a and 3b) and mean performance (Experiment 3a only). This result was not an artifact of an extreme-groups split; underestimation of lapse rate also significantly correlated with average performance in both samples. In sum, all subjects were poor at identifying working memory failures, but those with the worst performance were doubly burdened with especially poor metacognitive awareness.
We feel it is important to point out criticisms of work related to the Dunning-Kruger effect and how those criticisms may or may not apply to our own conclusions. The main criticism of the Dunning-Kruger effect has focused on the general tendency for subjects to rate themselves as above average relative to others (Burson, Larrick, & Klayman, 2006), and how this positive bias in combination with regression to the mean could potentially explain the wider self-perception gap for low-performing individuals (Krueger & Mueller, 2002; but for counterargument, see Ehrlinger et al., 2008). Importantly, these criticisms are aimed at a particular aspect of the Dunning-Kruger model—whether metacognition truly accounts for inaccuracy of self-perception. In fact, critics of the Dunning-Kruger effect agree that there is a relationship between task-related metacognitive accuracy and task performance (Krueger & Mueller, 2002); they disagree about whether metacognitive accuracy explains the accuracy of self-perception (which we have not tested). If we were to be conservative, we should be wary that our difference score metrics might be susceptible to similar problems that have been pointed out for self-perception difference scores (namely, positive bias plus regression to the mean). Additional work is needed to assess the scope of this concern (see the Supplementary Materials for additional discussion of individual differences). Importantly, however, our trial-by-trial correlation metric is free of this criticism, as it decouples bias (intercept) from accuracy (slope); the results from our correlation metric nicely converge with our overestimation metric (perceived – actual performance), supporting our conclusion that metacognitive accuracy predicts working memory performance.