If competency is required to recognize incompetence, truly incompetent people will be both incompetent and unaware of their incompetence (Dunning, Johnson, Ehrlinger, & Kruger, 2003; Kruger & Dunning, 1999). The failure to recognize incompetence among the incompetent—often referred to as the Dunning–Kruger effect—has far-reaching implications because, presumably, one of the prerequisites of voluntary self-improvement is recognizing the need for improvement. Indeed, those who would benefit the most from understanding the limits of their reasoning are the ones least likely to do so. From a theoretical standpoint, estimating the degree of miscalibration among the incompetent is also paramount for understanding the causes and consequences of biases in judgment and decision making writ large. In the present work, we examine incompetence in the realm of biased or intuitive responding in reasoning tasks and provide evidence that Dunning–Kruger effects extend to one’s self-reported analytic-thinking disposition.

Dual-process theory and analytic-thinking disposition

The distinction between autonomous/intuitive (Type 1) processes and deliberative/analytic (Type 2) processes, as formalized in dual-process theories (De Neys, 2012; Evans & Stanovich, 2013; Handley & Trippas, 2015; Kahneman, 2011; Pennycook, Fugelsang, & Koehler, 2015b; Sloman, 2014; Thompson, Prowse Turner, & Pennycook, 2011), is now exceedingly popular in many areas of psychology. One of the key implications of this distinction is that analytic thinking proceeds via at least some form of volitional control, which in turn indicates that people differ in terms of their dispositions toward analytic thinking. Put simply, some people appear to be more willing than others to engage in deliberative thought (Stanovich, 2012; Stanovich & West, 1998, 2000). Recent research has shown that individual differences in analytic-thinking disposition (also referred to as “analytic cognitive style”) are consequential for a wide range of psychological domains over and above individual differences in cognitive ability or intelligence (Pennycook, Fugelsang, & Koehler, 2015a).

Analytic-thinking disposition is also assessed using self-report scales, such as the Need for Cognition (NC) scale (Cacioppo & Petty, 1982; Cacioppo, Petty, Feinstein, & Jarvis, 1996; Epstein, Pacini, Denes-Raj, & Heier, 1996; Fleischhauer et al., 2010). According to Petty, Brinol, Loersch, and McCaslin (2009), need for cognition “refers to the tendency for people to vary in the extent to which they engage in and enjoy effortful cognitive activities” (p. 318). The scale includes items such as “I am not very good at solving problems that require careful logical analysis” (reverse-scored) and “I enjoy solving problems that require hard thinking.” Thus, “effortful cognitive activities” in Petty et al.’s definition refers specifically to thinking and problem solving.

Pacini and Epstein (1999) further specified the NC scale by creating subscales that distinguish between “ability” and “engagement.” This distinction can be seen in the two examples offered above: The NC scale includes items that index both the self-reported ability to engage in effortful thought and the enjoyment derived therefrom. Contrast, for example, the item “I am much better at figuring out things logically than other people” (Ability subscale) with the item “I try to avoid situations that require thinking in depth about something” (Engagement subscale). Thus, Pacini and Epstein’s NC scale distinguishes two important components of general analytic-thinking dispositions in a way that parallels performance-based measures.

NC scores have been used frequently in studies on information evaluation and recall, attitude formation, and judgment and decision making (see Petty et al., 2009, for a review), along with a variety of other domains not traditionally associated with information processing. For example, high NC is associated with decreased religious and paranormal belief (Pennycook, Cheyne, Barr, Koehler, & Fugelsang, 2014; Svedholm & Lindeman, 2013), increased life satisfaction (Gauthier, Christopher, Walter, Mourad, & Marek, 2006), decreased support for punitive responses to crime (Sargent, 2004), and utilitarian moral judgment (Conway & Gawronski, 2013). Indeed, Petty et al. (2009) noted that over 1,000 publications have either cited the article that introduced the NC scale (Cacioppo & Petty, 1982) or the article that introduced a shortened version of the scale (Cacioppo, Petty, & Kao, 1984).

Methodological implications of the Dunning–Kruger effect

Although the NC scale has been used in hundreds of studies (Cacioppo et al., 1996; Petty et al., 2009), research on self-report thinking disposition faces a dilemma: People who are genuinely unwilling to engage in analytic thinking may not be well suited to estimate their degree of NC. Indeed, there is evidence that high NC is associated with increased levels of thinking about one’s own thinking (Petty et al., 2009). Moreover, Kruger and Dunning (1999) found that people in the bottom quartile of logical-reasoning performance estimated that they were just above average in accuracy, whereas, if anything, those in the top quartile somewhat underestimated their performance. Mata, Ferreira, and Sherman (2013) demonstrated that relatively analytic people have a metacognitive advantage over those who rely primarily on their intuition because they are aware of both the intuitive answer and the deliberative (typically correct) response. Similar to Kruger and Dunning, Mata and colleagues found that intuitive reasoners strongly overestimate their performance relative to deliberative reasoners. Given that people’s “chronic self-views” (i.e., opinions about one’s abilities independent of actual performance) have been shown to influence their estimates of performance (Atir, Rosenzweig, & Dunning, 2015; Critcher & Dunning, 2009; Ehrlinger & Dunning, 2003), previous research suggests that people overestimate their performance on high-level reasoning tasks precisely because they view themselves as more reasonable than their objective performance justifies.

On the basis of this research, we hypothesize that self-report thinking-disposition scales are miscalibrated in a systematic way, such that those who are genuinely not analytic should overstate their relative analyticity, whereas people who are genuinely analytic should report their NC fairly accurately or perhaps even underreport it relative to others in the sample. In other words, the association between NC and objective performance should parallel the association between estimated and objective performance: Overestimation should be largest among the most biased and smallest (or reversed) among the least biased.

Theoretical implications of the Dunning–Kruger effect

The Dunning–Kruger effect has important—although heretofore unspecified—theoretical implications for recent work in the field of heuristics and biases in reasoning (but see Mata, Ferreira, & Sherman, 2013, for related empirical work). Namely, there is growing evidence that people can recognize (if implicitly) the conflict inherent in many reasoning problems (see De Neys, 2012, 2014, for reviews). For example, consider the following item from the cognitive reflection test (CRT; Frederick, 2005):

A bat and ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?

Typically, around 65% of participants respond “10 cents” to this problem (e.g., Pennycook, Cheyne, Koehler, & Fugelsang, 2016), even though that answer is incorrect: If the ball cost 10 cents, the bat would have to cost $1.10, and together they would cost $1.20. This is thought to occur because people are cognitive misers (i.e., they conserve mental resources when possible; see Toplak, West, & Stanovich, 2011) and unduly rely on the first thing that comes to mind.
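To make the arithmetic explicit, let b denote the price of the ball in dollars. The problem’s two constraints then give

$$ b + (b + 1.00) = 1.10 \;\Longrightarrow\; 2b = 0.10 \;\Longrightarrow\; b = 0.05, $$

so the correct answer is 5 cents. The intuitive “10 cents” comes from splitting $1.10 into $1.00 and $0.10, which satisfies the total cost but makes the bat only $0.90 more expensive than the ball.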

If the Dunning–Kruger effect applies to this type of problem, those who give incorrect intuitive responses would be unlikely to recognize their bias (Mata, Ferreira, & Sherman, 2013). Surprisingly, however, there is some evidence that even people who get this problem incorrect have some sense that there is something “off” about the problem. Namely, participants who give the incorrect intuitive response to conflict problems are less confident than when they answer a nonconflict version of the task (i.e., one that does not cue an incorrect intuitive response; De Neys, Rossi, & Houdé, 2013)—a pattern of results that remains even after participants are put under cognitive load (Johnson, Tubau, & De Neys, 2016). Such conflict detection has been evidenced using a variety of different measures (e.g., response time: De Neys & Glumicic, 2008; memory recall: Franssens & De Neys, 2009; skin conductance response: De Neys, Moyens, & Vansteenwegen, 2010; fMRI: De Neys, Vartanian, & Goel, 2008; and ERPs: Banks & Hope, 2014) and a wide range of tasks (e.g., syllogisms: De Neys & Franssens, 2009; the conjunction fallacy: De Neys, Cromheeke, & Osman, 2011; and the ratio bias: Bonner & Newell, 2010).

Conflict detection during reasoning has been referred to as “omnipresent, regardless of whether participants answer problems correctly or incorrectly” (De Neys et al., 2008, p. 488). Indeed, there is evidence for conflict detection even among particularly biased participants, using subtle low-level measures such as skin conductance (De Neys et al., 2010). Interestingly, however, this line of research does not appear consistent with the strong evidence for the Dunning–Kruger effect: If people are good at detecting conflict during reasoning, why are the incompetent so unaware of their incompetence?

One possibility is that the low-level detection of conflicting outputs, however efficient, may not translate into changes in behavior and, ultimately, reductions in biased responding. Indeed, recent research has indicated that failures of conflict detection (which may be due to either a lack of a conflict detection signal or a lack of responsiveness to a present conflict signal) are more common than had previously been thought (Pennycook, Fugelsang, & Koehler, 2012, 2015b). Moreover, there is evidence that less-analytic individuals are less likely to respond to conflict during reasoning (Mevel et al., 2015; Pennycook et al., 2014; Pennycook et al., 2015b; Thompson & Johnson, 2014). Thus, metacognitive monitoring may be more effective among genuinely analytic individuals, regardless of the effectiveness of conflict detection per se. This suggests that nonanalytic participants should overestimate their accuracy on problems that include an intuitive yet incorrect lure, like the bat-and-ball example above (Mata, Ferreira, & Sherman, 2013). More generally, the foregoing discussion indicates that analytic individuals may be better suited to assess their relative degree of analyticity.

The present work

We report two studies in which participants completed a popular performance-based measure of analytic-thinking disposition, the CRT (Frederick, 2005), and were subsequently asked to estimate how many of the items they had gotten correct (Mata, Ferreira, & Sherman, 2013; Noori, 2016). Following Kruger and Dunning (1999), we hypothesized that participants who performed poorly on the CRT would overestimate their performance to a greater extent than would those who performed well (i.e., less-analytic people should be more poorly calibrated). In addition, participants were asked to self-report their need or desire to think analytically using the NC scale. We predicted a Dunning–Kruger effect, such that participants who performed particularly poorly on the CRT (indicating an intuitive or non-analytic-thinking disposition) would overreport the degree to which they were disposed to analytic thinking. In Study 2, we used an independent assessment of analytic thinking—the heuristics-and-biases battery (Toplak, West, & Stanovich, 2011, 2014)—to assess whether nonanalytic individuals are genuinely worse at recognizing their bias. Put differently, participants who were decidedly nonanalytic on the performance measure should be less suited to assess their degree of analyticity on a self-report measure, leading to poor calibration in terms of both estimated CRT accuracy and self-reported NC.

Study 1

Kruger and Dunning’s (1999) primary finding was a larger difference between actual and estimated performance (i.e., miscalibration in the form of overestimation) for the incompetent than for the competent. Thus, participants who perform relatively poorly on the CRT should overestimate their performance, and those who do relatively well should be better calibrated (Mata, Ferreira, & Sherman, 2013), which suggests that those who are genuinely nonanalytic may not be aware of their lack of analyticity.

Assuming there was good evidence for overconfidence on the estimated CRT task, we could then assess the association between this miscalibration and self-reported analytic-thinking style (via the NC scale). If the tendency to overestimate CRT performance translates into a tendency to rate oneself relatively high in NC, estimated CRT performance should predict NC over and above actual CRT performance. Similarly, the well-established association between CRT and NC (e.g., Pennycook et al., 2016) should no longer be evident once CRT scores are calibrated by subtracting estimated from actual performance. Furthermore, these findings should hold for both the Ability and Engagement subscales of the NC scale.

Method

In all studies, we report how we determined our sample size, all data exclusions, and all measures. The data are available online via the Open Science Framework: https://osf.io/3kndg/.

Participants

We recruited our participants through Amazon’s Mechanical Turk and chose a target N of 200 in order to have sufficient power (90%) to detect a moderate effect size (r = .20). We removed 14 participants because they responded affirmatively when asked whether they had responded randomly at any point during the survey. We also removed a participant who had missing values for the CRT questions. Two additional participants were removed because they gave numerical answers to a CRT question that required a nonnumerical response. The resulting sample (N = 183; mean age = 33.7, SD = 9.5) consisted of 99 males and 84 females.
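For readers who want to check what such a power target implies, here is a minimal sketch of the calculation under the Fisher z approximation (the exact procedure used is not reported here; a one-tailed test and the function below are assumptions for illustration):

```python
import numpy as np
from scipy import stats

def power_for_correlation(n, r, alpha=0.05, tails=1):
    """Approximate power to detect a population correlation r with
    sample size n, using the Fisher z approximation."""
    z_effect = np.arctanh(r) * np.sqrt(n - 3)   # expected z statistic
    z_crit = stats.norm.ppf(1 - alpha / tails)  # critical value
    return stats.norm.sf(z_crit - z_effect)     # P(reject | true r)

# With N = 200 and r = .20 this yields roughly .89 one-tailed,
# consistent with the stated 90% power target.
print(round(power_for_correlation(200, 0.20), 3))
```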

Materials and procedure

Cognitive reflection test

Frederick’s (2005) original CRT consisted of three math problems that reliably cue intuitive but incorrect answers. The test is considered an index of analytic-thinking disposition because the items require the participant to question a compelling intuitive response—a process that requires a willingness to think analytically (Pennycook & Ross, 2016; Travers, Rolison, & Feeney, 2016). Because the original CRT has been used extensively on Mechanical Turk, we used two recently developed tests designed to measure the same underlying construct instead of the original measure: Toplak, West, and Stanovich’s (2014) four-item CRT and Thomson and Oppenheimer’s (2016) four-item CRT. The numbers of correct responses on the two tests were significantly correlated, r(183) = .336, p < .001, and were combined to form a scale with acceptable reliability, α = .70. The eight CRT items were presented one at a time, in a random order for each participant. After they had completed all CRT items, participants were asked to estimate the number of CRT questions they had answered correctly.
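To make the scoring concrete, the following sketch shows how the eight items might be combined and their internal consistency computed (the items matrix and variable names are hypothetical, and the function is simply the standard Cronbach’s α formula, not the code actually used):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an N-participants x k-items score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# items: a 183 x 8 matrix of 0/1 accuracy scores
# (columns 0-3: Toplak et al., 2014, CRT; columns 4-7: Thomson & Oppenheimer, 2016, CRT)
# crt_total = items.sum(axis=1)  # combined 0-8 CRT score
# scale_r = np.corrcoef(items[:, :4].sum(axis=1),
#                       items[:, 4:].sum(axis=1))[0, 1]  # correlation between the tests
```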

Need for cognition

Participants then completed Pacini and Epstein’s (1999) 20-item NC scale, which consists of Ability and Engagement subscales; these were treated as separate scales for the present purposes. Items were presented in a randomized order, and participants responded on a scale from 1 (definitely not true of myself) to 5 (definitely true of myself). Both the Ability (NC-A) and Engagement (NC-E) subscales had excellent reliability, αs = .90 and .95, respectively. The full NC scale was also reliable, α = .95.

Results and discussion

Dunning–Kruger effects in estimated CRT performance

As predicted, participants overestimated their total accuracy on the eight CRT items. On average, participants estimated that they had correctly solved 5.59 CRT problems (SD = 1.52), but mean performance was only 3.88 (SD = 2.11), t(182) = 11.14, SE = 0.15, p < .001, d = 0.82. The correlation between estimated and actual CRT performance was modest, r(183) = .379, p < .001, such that actual CRT performance explained only 14.4% of the variance in estimated CRT performance. Following Kruger and Dunning (1999), we split the sample into groups, based on accuracy, to test whether those who did poorly on the CRT were more strongly miscalibrated than those who did better. Although there are nine possible scores on the eight-item CRT, and in theory we could create nine different groups, only a small number of participants scored either 0 (N = 4, 2.2% of the sample) or 8 (N = 8, 4.4% of the sample). Thus, we increased the N in each CRT group by combining across accuracy scores, creating four groups (0–2, 3–4, 5–6, and 7–8).

Using a mixed-design analysis of variance (ANOVA), we found an interaction between CRT group and score type (actual vs. estimated CRT score; see Fig. 1), F(3, 179) = 56.24, MSE = 1.13, p < .001, η² = .49. As is evident from Fig. 1, overestimation decreased (i.e., calibration increased) systematically as accuracy increased. To investigate the association between calibration and analytic thinking, we computed a difference score between the estimated and actual CRT scores and compared it across the four levels of CRT performance. A post-hoc Tukey honest significant difference (HSD) test on this calibration score indicated that each of the four groups emerged as a separate, homogeneous subset (ps < .05). Notably, those who scored very low on the CRT (0–2; M = 1.42, 17.8% accuracy) estimated that they had answered 4.78 questions correctly (59.8% accuracy)—that is, they overestimated by a factor of 3.4. Those who correctly answered three or four of the CRT problems overestimated by a factor of 1.7, whereas those who answered five or six problems correctly overestimated by a factor of only 1.1 (see Fig. 1).

Fig. 1 Miscalibration between actual cognitive reflection test (CRT) accuracy and estimated CRT accuracy as a function of CRT group (those who scored 0–2, 3–4, 5–6, or 7–8 out of 8) in Study 1. The y-axis represents the number of correct/estimated-correct responses out of the eight CRT problems. Error bars represent 95% confidence intervals. Group 1 (0–2), N = 50; Group 2 (3–4), N = 65; Group 3 (5–6), N = 38; Group 4 (7–8), N = 30.
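The grouping and the group-by-score-type analysis can be sketched as follows (a hypothetical reconstruction; the software actually used is not reported, and the data frame df and its column names are assumptions):

```python
import pandas as pd
import pingouin as pg  # one of several packages offering a mixed-design ANOVA

# df: one row per participant, with 'actual' and 'estimated' CRT scores (0-8).
df['group'] = pd.cut(df['actual'], bins=[-1, 2, 4, 6, 8],
                     labels=['0-2', '3-4', '5-6', '7-8'])  # four accuracy bins
df['calibration'] = df['estimated'] - df['actual']         # overestimation score

# Long format: score type (actual vs. estimated) is the within-subjects factor.
long = df.reset_index().melt(id_vars=['index', 'group'],
                             value_vars=['actual', 'estimated'],
                             var_name='score_type', value_name='score')

anova = pg.mixed_anova(data=long, dv='score', within='score_type',
                       subject='index', between='group')
tukey = pg.pairwise_tukey(data=df, dv='calibration', between='group')
```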

In contrast, and similar to Kruger and Dunning’s (1999) results, those who scored 7 or 8 underestimated their performance by a factor of 1.1, t(29) = 3.21, p = .003, d = 0.59. This was not driven by the few participants (N = 8) who scored a perfect 8 out of 8 (for whom overestimation was impossible)—those who scored 7 out of 8 (N = 22) estimated that they had gotten, on average, 6.36 (SD = 1) out of 8 correct, significantly lower than their true score of 7, t(21) = 2.98, p = .007, d = 0.64. Importantly, those who scored 5 or 6 out of 8 significantly overestimated their performance, t(37) = 4.35, p < .001, d = 0.71, indicating that the monotonic improvement in calibration as CRT performance increased cannot be attributed to those who scored too high for overestimation to be possible. Moreover, these results cannot be attributed to mere regression to the mean: Particularly intuitive individuals (those who scored 0–2 out of 8) overestimated by a factor of 3.4, whereas particularly analytic individuals (those who scored 7 or 8 out of 8) underestimated by a factor of 1.1. That is, the reported miscalibration is asymmetric in a way that is predicted by previous research.

Association between overconfidence and need for cognition

If the NC scale is a good measure of actual analytic-thinking disposition, it should correlate more strongly with actual CRT performance than with estimated performance. However, if anything, estimated CRT performance was more strongly correlated with NC, r(183) = .307, p < .001, than was actual CRT performance, r(183) = .246, p = .001. This pattern was more evident for the Ability subscale (NC-A), which correlated more strongly with estimated CRT performance, r(183) = .352, p < .001, than with actual CRT performance, r(183) = .242, p = .001, although a Williams test indicated that the difference between these correlations was not significant, t(180) = 1.42, p = .159. The correlation between the Engagement subscale (NC-E) and estimated CRT performance, r(183) = .229, p = .002, was more similar to the subscale’s correlation with actual CRT performance, r(183) = .216, p = .003. Thus, if anything, miscalibration was strongest when participants self-reported their ability (as opposed to willingness) to engage in analytic thinking.
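The Williams test for comparing two dependent correlations that share a variable is easy to implement directly; the sketch below follows the formulation in Steiger (1980), and plugging in the Study 1 values reproduces the reported t(180) = 1.42:

```python
import numpy as np
from scipy import stats

def williams_test(r_jk, r_jh, r_kh, n):
    """Williams's t (df = n - 3) for whether variable j correlates equally
    strongly with variables k and h, given their intercorrelation r_kh."""
    det_r = 1 - r_jk**2 - r_jh**2 - r_kh**2 + 2 * r_jk * r_jh * r_kh
    r_bar = (r_jk + r_jh) / 2
    t = (r_jk - r_jh) * np.sqrt(
        ((n - 1) * (1 + r_kh)) /
        (2 * ((n - 1) / (n - 3)) * det_r + r_bar**2 * (1 - r_kh)**3))
    p = 2 * stats.t.sf(abs(t), n - 3)
    return t, p

# j = NC-A, k = estimated CRT, h = actual CRT; r_kh = .379 in Study 1:
print(williams_test(0.352, 0.242, 0.379, 183))  # t = 1.42, p = .16
```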

If the tendency to overestimate CRT performance is linked with the tendency to rate oneself relatively high on the NC scale, estimated CRT scores should predict NC over and above actual CRT performance. We therefore entered both accuracy and estimated accuracy (along with their interaction) as predictors in two regression analyses, with NC-A and NC-E as the dependent variables (see Table 1). Not only did estimated CRT accuracy significantly predict self-reported NC-A and NC-E once actual CRT scores were taken into account, but actual CRT accuracy was not a robust predictor once estimated accuracy was taken into account. We also found an interaction between estimated and actual accuracy for each of the NC subscales; this interaction emerged because estimated CRT accuracy was more strongly positively correlated with NC among those who did well on the CRT (see Fig. S1 in the supplementary materials). These results indicate a link between miscalibration in analytic thinking and self-reported need for cognition. As a follow-up analysis, we created a calibration score by taking the difference between estimated and actual CRT performance. This calibration score was not significantly correlated with either NC-A, r(183) = −.012, p = .871, or NC-E, r(183) = .051, p = .491. This, again, indicates that self-reported NC is as much a measure of estimated CRT performance as of actual CRT performance.
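A minimal sketch of these regressions (hypothetical column names; whether predictors were centered before forming the product term is not reported, so centering is shown as one common choice):

```python
import statsmodels.formula.api as smf

# df: one row per participant, with actual ('acc') and estimated ('est')
# CRT scores and NC subscale means ('nc_a', 'nc_e').
df['acc_c'] = df['acc'] - df['acc'].mean()
df['est_c'] = df['est'] - df['est'].mean()

# 'acc_c * est_c' expands to both main effects plus their interaction.
model_nca = smf.ols('nc_a ~ acc_c * est_c', data=df).fit()
model_nce = smf.ols('nc_e ~ acc_c * est_c', data=df).fit()
print(model_nca.summary())

# Follow-up: calibration (estimated minus actual) vs. the NC subscales.
df['calibration'] = df['est'] - df['acc']
print(df[['nc_a', 'nc_e']].corrwith(df['calibration']))
```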

Table 1 Final step of the hierarchical multiple regression analysis predicting self-reported need for cognition (NC), with estimated CRT accuracy (CRT Est), actual CRT accuracy (CRT Acc), and their interaction (Acc × Est) as predictors

Study 2

In Study 1, those who were prone to errors on the CRT were more likely to overestimate their performance—a Dunning–Kruger effect. Moreover, consistent with previous research in other domains (Atir et al., 2015; Critcher & Dunning, 2009; Ehrlinger & Dunning, 2003), estimates of CRT accuracy were as predictive of self-reported analytic-thinking disposition as was actual CRT accuracy. These findings suggest that self-report measures of thinking disposition may lack precision: Overconfidence among those genuinely low in analytic thinking may translate into overestimates on self-report measures of analytic thinking, whereas proper calibration (or, if anything, underconfidence) among those genuinely high in analytic thinking may translate into relative underestimates. In other words, it may be that those who rely on their intuition are not analytic enough to know that they are not analytic, whereas analytic people are analytic enough to know the limits of their analyticity.

Although the results of Study 1 were consistent with the proposed Dunning–Kruger effect in self-reported thinking disposition, the evidence was indirect. Specifically, the same performance-based analytic-thinking disposition measure, the CRT, was used to make inferences about miscalibration for both estimated CRT scores and self-reported NC. To overcome this limitation, we included a separate measure of analytic thinking: the heuristics-and-biases battery (H&B; Toplak, West, & Stanovich, 2011). Having a second performance-based (“objective”) measure of analytic thinking would allow us to more directly test the hypothesis that NC scores are systematically miscalibrated. That is, the difference between actual and estimated CRT performance (calibration) should be associated with objective H&B performance, but not with self-reported NC. This would illustrate a correspondence between objective measures of analytic thinking even after miscalibration has been taken into account. There should be no similar correspondence with self-report NC because, presumably, this miscalibration also affects people’s perception of their analytic-thinking disposition.

Including a second objective measure of analytic thinking would also allow us to compare relative performance on our two analytic-thinking benchmarks (CRT and H&B) with relative self-reported NC by creating a number of additional calibration scores. Individuals who are not particularly analytic (based on their relative CRT and H&B scores) should nonetheless rate themselves as relatively analytic (based on self-reported NC). We predicted that the difference between relative CRT performance and self-reported NC would correlate positively with H&B performance (and, likewise, that the difference between H&B and NC would correlate with CRT scores). In other words, there should be miscalibration between NC and our objective measures of analytic thinking in the same way that there is miscalibration between estimated and actual CRT scores. This would be akin to a Dunning–Kruger effect in self-reported analytic-thinking disposition. As in Study 1, this should be evident for both the Ability and Engagement subscales of the NC scale.

Method

The data are available online via the Open Science Framework: https://osf.io/zrjje/.

Participants

In total, 400 participants were recruited using Mechanical Turk. We removed 56 participants because they responded affirmatively when asked whether they had responded randomly at any point during the survey, and three participants who had missing data for the CRT. The resulting sample (N = 341; mean age = 33.9, SD = 10.5) consisted of 200 males and 139 females (two participants did not indicate their gender).

Materials and procedure

Cognitive reflection test

The CRT was administered as in Study 1. The two CRT scales were significantly correlated, r(341) = .362, p < .001, but the scale had weaker reliability than in Study 1, α = .65. One participant estimated that they had solved nine out of eight CRT problems correctly, so we changed the value to 8.

Heuristics-and-biases battery

As an independent measure of analytic thinking, we used Toplak et al.’s (2011) heuristics-and-biases battery. The scale consists of 16 problems that were derived from Kahneman and Tversky’s heuristics-and-biases research program (see Kahneman, 2011, for a review). Items include biases such as the gambler’s fallacy and the conjunction fallacy (see the Appendix in Toplak et al., 2011). The scale had weak reliability, α = .68.

Need for cognition

The NC scale was administered as in Study 1, except in this case it followed the heuristics-and-biases battery and was presented before a paranormal belief scale (which will not be considered further; see the supplementary materials). One participant had an outlying NC score (three SDs below the mean)—subsequent analyses are reported with the outlier removed, although the pattern of results was identical when this participant was included. The subscales were reliable, α = .89 and .94 for NC-A and NC-E, respectively.

Results and discussion

Dunning–Kruger effects in estimated CRT performance

Estimated CRT accuracy (M = 5.62, SD = 1.59) was greater than actual accuracy (M = 3.85, SD = 1.98), t(340) = 16.42, SE = 0.11, p < .001, d = 0.89. As in Study 1, we split the sample into four groups based on CRT accuracy, to examine the relationship between actual and estimated CRT scores. Using a mixed-design ANOVA, we observed an interaction between CRT group and score type (actual vs. estimated; see supplementary Fig. S2), F(3, 337) = 92.64, MSE = 1.09, p < .001, η² = .45. A post-hoc Tukey HSD test comparing the differences between estimated and actual CRT performance (i.e., calibration) indicated that all four groups emerged as separate, homogeneous subsets (ps < .05), reflecting a decrease in overestimation at each increasing level of CRT performance. As in Study 1, those who scored lowest on the CRT (0–2 correct) overestimated their performance by a factor of 3.4. Even those who scored 5 or 6 out of 8 significantly overestimated their performance, t(52) = 6.63, p < .001, d = 0.67. In contrast, those who scored highest (7 or 8 correct) significantly underestimated their performance, t(33) = 3.86, p = .001, d = 0.66.

Association between overconfidence and need for cognition

Here we replicated the associations between estimated and actual CRT performance and NC. Correlations among the primary variables can be found in Table 2. As in Study 1, the correlations between NC and actual CRT performance did not differ significantly from the correlations between NC and estimated CRT performance, ts < 1. To estimate the extent to which actual and estimated CRT accuracy predicted NC, we entered both variables (along with their interaction) as predictors in two regression analyses, with NC-A and NC-E as the dependent variables (see Table 3). Unlike in Study 1, actual and estimated CRT accuracy (along with their interaction; see Fig. S3 in the supplementary materials) each significantly and independently predicted NC-A, whereas only actual CRT accuracy significantly predicted NC-E. These independent associations between NC-A and CRT (actual and estimated) were relatively modest, however; βs ranged from .120 to .181.

Table 2 Correlations (Pearson’s r) among the primary variables in Study 2
Table 3 Final step of a hierarchical multiple regression analysis predicting self-reported need for cognition with estimated CRT accuracy (CRT Est), actual CRT accuracy (CRT Acc), and their interaction (Acc × Est) as predictors

As in Study 1, neither NC-A nor NC-E was associated with the difference between actual and estimated CRT performance, rs < .085, ps > .130 (Calibration 1; see Table 2). In contrast, H&B performance was positively associated with Calibration 1, r(341) = .148, p = .006. Thus, we found a correspondence between the objective measures of analytic thinking (CRT and H&B) even after miscalibration (CRT estimates) had been taken into account. This pattern was not evident for self-reported NC because, presumably, the miscalibration reflected in estimated CRT performance also affects people’s perceptions of their analytic-thinking disposition.

Dunning–Kruger effects in self-reported need for cognition

To further test the hypothesis that participants who are genuinely low in analytic thinking overreport their NC, we created three additional calibration scores by converting the CRT, H&B, and NC raw scores into z scores and computing difference scores (Table 2). The conversion to z scores allowed us to compare participants’ relative positions on the three ostensibly linked measures. The goal was to treat self-reported NC in the same way as estimated CRT scores, as sketched below. For this analysis, we decreased the number of comparisons by focusing on the Ability subscale (NC-A), which relates more directly to people’s assessments of their analytic-thinking disposition (as opposed to mere enjoyment of analytic thinking).
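A sketch of these indices (column names hypothetical; df is assumed to hold one row per participant with total scores for each measure):

```python
from scipy import stats

# Standardize each measure across the sample.
for col in ['crt', 'hb', 'nc_a']:
    df['z_' + col] = stats.zscore(df[col])

# Calibration 2: relative CRT performance minus relative self-reported NC-A.
df['calibration2'] = df['z_crt'] - df['z_nc_a']
# Calibration 3: relative H&B performance minus relative self-reported NC-A.
df['calibration3'] = df['z_hb'] - df['z_nc_a']

# Predictions: Calibration 2 should correlate positively with H&B performance,
# and Calibration 3 with CRT performance.
print(df['calibration2'].corr(df['hb']))
print(df['calibration3'].corr(df['crt']))
```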

If NC-A is miscalibrated in the same way as estimated CRT scores, as we have predicted, there should be a correspondence between the parallel estimate-based and NC-based calibration scores. Indeed, the difference between the z scores for CRT accuracy and NC-A did correlate positively with H&B performance (“Calibration 2”; Table 2), r(341) = .172, p = .001. Correspondingly, the difference between the z scores for H&B accuracy and NC-A correlated positively with CRT performance (“Calibration 3”; Table 2), r(341) = .165, p = .002. Participants who were less analytic on the basis of H&B performance were more likely to rate themselves as relatively higher in NC than would be warranted, given their CRT performance (and vice versa). There was no corresponding correlation between NC-A and the difference between the z scores for CRT and H&B, r(341) = −.007, p = .894, which indicates a dissociation between self-reported and performance-based measures of analytic thinking.

This analysis illustrates that self-reported NC (Ability subscale) is more similar, in terms of its correlates, to estimated CRT performance than to either actual CRT or actual heuristics-and-biases performance. To summarize: Self-reported NC does not predict the difference between actual and estimated CRT performance, but heuristics-and-biases performance does predict this difference. Moreover, when the difference between relative (actual) CRT and NC scores is used as a calibration score, heuristics-and-biases performance predicts that as well. Similarly, when the difference between heuristics-and-biases performance and NC scores is used as a calibration score, CRT performance predicts that. Finally, completing the full set of possible comparisons, NC scores do not predict the difference between CRT and heuristics-and-biases performance; because both are objective performance measures, their difference is not a meaningful index of miscalibration (unlike the difference between either performance measure and NC, as hypothesized).

To illustrate the source of the Dunning–Kruger effect on self-reported NC, we focus on the parallel between two independent patterns of results: (1) the difference between actual and estimated CRT scores as a function of CRT group (Fig. 2) [interaction: F(3, 337) = 92.64, MSE = 1.09, p < .001, η² = .45], and (2) the difference between H&B performance and self-reported NC as a function of CRT group (Fig. 3) [interactions: NC (full scale), F(3, 337) = 2.92, MSE = 0.77, p = .034, η² = .03; NC-A, F(3, 337) = 2.79, MSE = 0.79, p = .040, η² = .02; NC-E, F(3, 337) = 2.87, MSE = 0.78, p = .012, η² = .03]. In both cases, we observed an interaction such that estimated accuracy/self-reported NC was higher than objective performance for those who scored relatively low on the CRT, but this difference decreased and eventually reversed as CRT performance increased. Indeed, the comparison between H&B performance and NC indicates that, if anything, particularly analytic individuals underreported their NC relative to the remainder of the sample. It should be noted, however, that the confidence intervals overlapped substantially at every level in this analysis, despite the overall evidence for an interaction (see Fig. 3).

Fig. 2 Miscalibration between actual cognitive reflection test (CRT) accuracy and estimated CRT accuracy as a function of CRT group (those who scored 0–2, 3–4, 5–6, or 7–8 out of 8) in Study 2. The y-axis represents the number of correct/estimated-correct responses out of the eight CRT problems. Error bars represent 95% confidence intervals. Group 1 (0–2), N = 96; Group 2 (3–4), N = 113; Group 3 (5–6), N = 98; Group 4 (7–8), N = 34.

Fig. 3 Miscalibration between heuristics-and-biases performance (H&B) and self-reported need for cognition (Ability subscale, NC-A; Engagement subscale, NC-E) as a function of CRT group (those who scored 0–2, 3–4, 5–6, or 7–8 out of 8) in Study 2. The y-axis represents z scores. Error bars represent 95% confidence intervals. Group 1 (0–2), N = 96; Group 2 (3–4), N = 113; Group 3 (5–6), N = 98; Group 4 (7–8), N = 34.

General discussion

Our results provide empirical support for Dunning–Kruger effects in both estimates of reasoning performance and self-reported thinking disposition. Particularly intuitive individuals greatly overestimated their performance on the CRT—a tendency that diminished and eventually reversed among increasingly analytic individuals. Moreover, self-reported analytic-thinking disposition—as measured by the Ability and Engagement subscales of the NC scale—was correlated at least as strongly with estimated CRT performance as with actual CRT performance. In addition, an analysis using an additional performance-based measure of analytic thinking—the heuristics-and-biases battery—revealed a systematic miscalibration of self-reported NC, wherein relatively intuitive individuals reported that they were more analytic than their objective performance justified. Together, these findings indicate that participants who are low in analytic thinking (so-called “intuitive thinkers”) are at least somewhat unaware of (or unresponsive to) their propensity to rely on intuition in lieu of analytic thought during decision making. This conclusion is consistent with previous research suggesting that the propensity to think analytically facilitates metacognitive monitoring during reasoning (Pennycook et al., 2015b; Thompson & Johnson, 2014). Those who are genuinely analytic are aware of the strengths and weaknesses of their reasoning, whereas those who are genuinely nonanalytic are perhaps best described as “happy fools” (De Neys et al., 2013).

This research has both methodological and theoretical implications. With respect to methodological implications, self-report measures of thinking disposition such as the NC scale have been used in hundreds of studies (Cacioppo et al., 1996; Petty et al., 2009). Nonetheless, correlations with NC are often modest. In the present work, for example, correlations (Pearson’s r) between performance-based measures of analytic thinking and the Ability and Engagement NC subscales ranged from .190 to .242. Given the evidence for systematic miscalibration among genuinely nonanalytic individuals, we suggest that the predictive power of analytic-thinking dispositions may have been underestimated in research that has focused on self-report measures. Future research should continue to explore whether performance-based measures of analytic thinking yield stronger associations than do self-report measures.

In terms of theoretical implications, there has been a recent surge in metacognitive perspectives in the realm of dual-process theory. For example, Thompson and colleagues (2011; Thompson et al., 2013) have provided evidence that “feelings of rightness” are predictive of the extent and quality of analytic thinking. Mata, Fiedler, Ferreira, and Almeida (2013) found a metacognitive advantage for deliberative reasoners because they are aware of both the intuitive and reflective responses to reasoning problems. In contrast, much research has indicated that people are capable of detecting conflict between competing reasoning outputs (see De Neys, 2012, 2014, for reviews). Indeed, as mentioned earlier, conflict detection has been referred to as “omnipresent” (De Neys et al., 2008, p. 488), and even particularly biased participants have been shown to have an increased skin conductance response when faced with a conflict-inducing reasoning problem (De Neys et al., 2010). This indicates that even the most biased of reasoners can be sensitive to stimuli that cue conflicting responses.

The contrast between these lines of research is particularly stark when it comes to the CRT. Whereas De Neys et al. (2013) found that even people who gave the intuitive response to the bat-and-ball problem (highlighted in the introduction) were only around 82% confident in their response (as compared to 97% confidence on a control version of the problem), the present results indicate that those who perform poorly on the CRT massively overestimate their accuracy (i.e., by a factor greater than 3). This contrast suggests that the neurological (e.g., De Neys et al., 2008) and physiological (De Neys et al., 2010) conflict-detection signals may be relatively effective, but that the response to those signals may be rather ineffective. De Neys et al. (2013) found a large (15%) decrease in confidence for the bat-and-ball problem relative to a control, but 82% confidence is still quite high.

There is also evidence from verbal protocols that conflict detection is often implicit (De Neys & Glumicic, 2008), and it may become explicit only upon further reflection via analytic processing. Thus, given that analytic thinking relies (to some extent) on volitional control, the presence of a signal to think analytically does not guarantee that the individual will engage in more than cursory levels of analytic thought. This line of reasoning is supported both by the present findings and by previous work showing that the propensity to think analytically correlates with increased response times for biased responses to incongruent (conflict) base-rate problems (Pennycook et al., 2014; Pennycook et al., 2015b). That is, more-analytic individuals engaged in more substantive analytic thinking, even in cases in which they ultimately rationalized their initial (“biased”) response. Collectively, these findings reveal pervasive differences between intuitive and analytic individuals at various levels of cognitive functioning.

The present results are also unique in that they illustrate the everyday consequences of metacognitive differences as a function of analytic thinking. Namely, less-analytic people are not only less effective at metacognitive monitoring when given a reasoning task, but they may also be less accurate at self-reporting their relative level of analytic thinking. This metacognitive advantage for analytic individuals may be part of the reason why analytic thinking is associated with a wide range of important psychological factors, such as morality, religiosity, and creativity (see Pennycook et al., 2015a). Thinking about how one thinks may influence how and what people think about. Indeed, our results suggest that part of the reason why debates in areas such as politics are often futile is that those for whom overconfidence is most consequential (i.e., those who need the most correcting, given their low level of analyticity) are the least likely to recognize their overconfidence. Those most likely to be biased are also the least likely to recognize their bias.