The Cognitive Reflection Test (CRT; Table 1) is a three-item measure of reflective reasoning first introduced by Frederick (2005). Each of the problems reliably cues a compelling intuitive response that participants must reflect upon in order to reject it as mistaken. Although the requisite mathematical operations are neither complicated nor difficult, people tend to perform poorly on the CRT. Web-based and college samples typically produce means of 0.5 to 1 correct, out of a possible maximum of 3, and samples from elite colleges such as Princeton and the Massachusetts Institute of Technology typically yield means of 1.5 to 2 (Frederick, 2005).

Table 1 The Cognitive Reflection Test

The observed difficulty of the CRT is consistent with a basic understanding of cognitive architecture that has arisen from the field of reasoning and decision-making. According to dual-process theory, two general types of processes operate in the mind (e.g., Evans & Stanovich, 2013): Type 1 processes that generate so-called “intuitive” outputs autonomously and with little effort, and Type 2 processes that require a more effortful implementation of working memory capacity, often with the goal of overriding the Type 1 output. According to this account, low scores on the CRT suggest that rapidly accessible intuitive responses typically dominate reasoning, perhaps because humans have evolved to conserve mental resources (and time) in cases in which the context cues a computationally simple but functionally adequate solution (Stanovich & West, 2003). Indeed, that humans rely on intuitive heuristics when making decisions has been known for some time, dating at least as far back as Kahneman and Tversky’s heuristics-and-biases research program in the 1970s and 1980s (see Kahneman, Slovic, & Tversky, 1982, for a review). This research, along with studies of formal-reasoning paradigms (e.g., Evans, 1989; Stanovich, 1999), suggests that the willingness to engage analytic reasoning processes is an important component of general cognitive function (see Stanovich, 2004, 2009).

Because the CRT consists of math problems, it is clear that mathematical ability is important for performance on this test. Nonetheless, there is strong evidence that the CRT is not just another numeracy test (Campitelli & Gerrans, 2014; Cokely & Kelley, 2009; Liberali, Reyna, Furlan, Stein, & Pardo, 2011; Toplak, West, & Stanovich, 2011, 2014; but see Weller et al., 2013). Under the assumption that one must engage reflective reasoning processes to override a prepotent intuitive response, the willingness or disposition to engage Type 2 processing should be an important determinant of performance. Although this line of reasoning appears straightforward, there is disagreement about the form that such a disposition may take. For example, Toplak et al. (2011, 2014) have argued that successful CRT performance relies on “rational thinking,” or the tendency to avoid miserly cognitive processing. In other words, those who fail to question their intuitions by using Type 2 processing do worse on the CRT (see also Baron, Scott, Fincher, & Metz, 2014, for a discussion of “reflection-impulsivity”). Other researchers (e.g., Campitelli & Gerrans, 2014; Campitelli & Labollita, 2010; Liberali et al., 2011) have argued that CRT performance relies on “actively open-minded thinking,” or the search for alternative responses. Since these alternative responses may themselves be intuitive, the latter account differs somewhat from the former. Nonetheless, both accounts suggest that successful CRT performance relies on additional analytic processing that can undermine an inadequate prepotent response (whatever its provenance) and that is subject to an individual-difference analysis.

Given the generality of the cognitive mechanisms thought to contribute to scores on the CRT and their relevance to cognitive theories such as dual-process theory, along with the ease with which it can be administered, it is not surprising that the measure has become widely employed in research on human reasoning and decision making. Perhaps more surprising is the scope and importance of its correlates. Accuracy on the CRT is associated with better performance on multiple decision-making (e.g., Campitelli & Labollita, 2010; Frederick, 2005; Hoppe & Kusterer, 2011; Koehler & James, 2010; Oechssler, Roider, & Schmitz, 2009; Toplak et al., 2011, 2014) and reasoning (e.g., Lesage, Navarrete, & De Neys, 2013; Sirota, Juanchich, & Hagmayer, 2014; Toplak et al., 2011, 2014) tasks, as well as with utilitarian moral judgment (Paxton, Unger, & Greene, 2012; Pennycook, Cheyne, Barr, Koehler, & Fugelsang, 2014), less traditional moral values (Pennycook et al., 2014; Royzman, Landy, & Goodwin, 2014), religious disbelief (Gervais & Norenzayan, 2012; Pennycook, Cheyne, Seli, Koehler, & Fugelsang, 2012; Shenhav, Rand, & Greene, 2012), paranormal disbelief (Cheyne & Pennycook, 2013; Pennycook et al., 2012), improved scientific understanding (Shtulman & McCallum, 2014), and creativity on complex tasks (Barr, Pennycook, Stolz, & Fugelsang, 2015).

CRT scoring techniques

There are three possible answer types for each CRT item: correct, intuitive incorrect, and “other” incorrect. Intuitive incorrect responses are defined as plausible but incorrect responses that come to mind quickly and fluently as a consequence of the structure or wording of the question (e.g., “10 cents” for the bat-and-ball question in Table 1). This definition is supported by the observation that the majority of incorrect answers are indeed the cued “intuitive” answer (Campitelli & Gerrans, 2014; Frederick, 2005). The standard way to score the CRT is simply to add up the number of correct responses. This scoring strategy will be referred to as CRT–Reflective, since the goal, consistent with the test’s name, is to assess individual differences in the ability to reflect upon and ultimately override the intuitive responses. This strategy does not distinguish between intuitive incorrect responses and “other” incorrect responses (e.g., “$1.05” for the bat-and-ball question). In contrast, in some recent publications (Brosnan, Hollinworth, Antoniadou, & Lewton, 2014; Piazza & Sousa, 2014; Shenhav et al., 2012), CRT responses have been scored by adding up the number of intuitive incorrect responses. This strategy, in turn, does not distinguish between correct responses and “other” incorrect responses. The goal of this scoring, which will be referred to as CRT–Intuitive, is, effectively, to invert the standard use of the CRT and make it a measure of intuitiveness.
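To make the two scoring strategies concrete, the following minimal sketch (in Python; the response coding and variable names are hypothetical illustrations, not taken from the original studies) tallies both scores from per-item response classifications:

    # Each participant's three CRT answers, classified per item as
    # "correct", "intuitive" (the cued incorrect answer), or "other" (incorrect).
    responses = ["correct", "intuitive", "other"]  # one hypothetical participant

    crt_reflective = sum(r == "correct" for r in responses)    # standard scoring: 0-3
    crt_intuitive = sum(r == "intuitive" for r in responses)   # intuitive incorrect: 0-3

    # Neither score distinguishes the remaining category: "other" errors are
    # lumped with intuitive errors by CRT-Reflective and with correct answers
    # by CRT-Intuitive.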

Beyond the restricted meaning within the context of the CRT, intuitiveness conventionally refers to the trust or faith that a person has in his or her “gut feelings,” which, at least in principle, is separate from, though not necessarily opposed to, the willingness to engage in analytic reasoning (e.g., Pacini & Epstein, 1999). This distinction has some bearing on current theoretical debates. For example, Shenhav et al. (2012) utilized CRT–Intuitive scoring to support their claim that intuition leads to increased religious belief, a claim that is different from the claim that reflection leads to decreased religious belief (cf. Pennycook et al., 2012). As another example, Brosnan et al. (2014) used CRT–Intuitive scoring to investigate the relative roles of intuition and reflection in empathizing (i.e., striving to understand others’ thoughts and feelings) and systemizing (i.e., striving to understand nonhuman systems). In contrast, Piazza and Sousa (2014) used the CRT as a potential mediator between religiosity/conservatism and judgments about taboo moral dilemmas and cited the desire to “avoid scoring nonintuitive incorrect responses as intuitive” (p. 339) as a justification for using CRT–Intuitive scoring. However, none of the reported comparisons in these studies has provided evidence for differential utility between the CRT–Intuitive and CRT–Reflective scorings, suggesting that, up to this point, the CRT–Intuitive scoring technique has been implemented primarily for rhetorical reasons.

The logic for CRT–Intuitive scoring is simply that participants who give more intuitive responses do so because they are relatively more intuitive thinkers. The goal of the present work was to investigate this claim both theoretically and empirically. To do this, we introduced a potential CRT–Intuitive scoring strategy that would address statistical issues (discussed subsequently) that otherwise structurally confound CRT–Intuitive and CRT–Reflective scoring.

The present investigation

Although CRT–Intuitive scoring has been used in previous work, there has been no attempt to validate this measure. Convergent measures of “intuitiveness” are unfortunately rather rare (likely for theoretical reasons; see the Discussion). One exception is the Faith in Intuition scale (FI; Epstein, Pacini, Denes-Raj, & Heier, 1996; Pacini & Epstein, 1999), which was developed to assess how much individuals trust their intuitions and instincts. It includes items such as “I hardly ever go wrong when I listen to my deepest gut feelings to find an answer” and “I believe in trusting my hunches.” The Faith in Intuition scale may be contrasted with the Need for Cognition scale (NFC; Cacioppo & Petty, 1982; Cacioppo, Petty, Feinstein, & Jarvis, 1996), which was developed to assess how much a person engages in and enjoys effortful thinking. Although the two scales may appear conceptually to be polar opposites on a single dimension, FI and NFC typically emerge as separate factors and are generally not strongly negatively correlated (Epstein et al., 1996). Moreover, both scales have been used to predict (differentially, in some cases) a wide range of psychological measures (Cacioppo & Petty, 1982; Cacioppo et al., 1996; Epstein et al., 1996; Pacini & Epstein, 1999), similar to the recent uses of the CRT.

Although the FI and NFC are self-report measures of preferences (for intuition and effortful cognition) and the CRT is a performance-based ability measure, we expect preferences to be positively correlated with ability, though possibly attenuated because of unshared method variance. We therefore predicted that CRT–Intuitive should be more strongly correlated with FI, whereas CRT–Reflective should be more strongly correlated with NFC. There is, however, a logical and statistical problem with the two CRT measures: their ipsative nature (i.e., the forced-choice format for each question means that positively choosing one option requires negatively choosing all others, forcing negative correlations among the items). Moreover, because a relatively small proportion of “other” incorrect responses is typically observed, CRT–Intuitive and CRT–Reflective will be highly negatively correlated for purely artificial structural reasons. It is therefore impossible to know to what extent the strong negative correlation between the measures (e.g., r = –.75; Shenhav et al., 2012) is determined empirically rather than structurally. The unqualified use of the CRT–Intuitive measure, as in the previous research discussed above, thus accomplishes little more than reversing the sign of the correlations and is otherwise redundant with, and largely indistinguishable from, the conventional CRT–Reflective measure.
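The structural component of this negative correlation can be illustrated with a small simulation (a sketch only; the response rates p_correct and p_other are assumptions chosen for illustration, not estimates from the present data). Even when every response is generated by the same chance process, with no trait differences between participants at all, the rarity of “other” errors forces CRT–Reflective and CRT–Intuitive into a strong negative correlation:

    import random

    def simulate_participant(p_correct=0.35, p_other=0.08):
        # Answer three items independently; the remaining probability mass
        # (1 - p_correct - p_other) goes to the cued intuitive error.
        reflective = intuitive = 0
        for _ in range(3):
            u = random.random()
            if u < p_correct:
                reflective += 1
            elif u >= p_correct + p_other:
                intuitive += 1
        return reflective, intuitive

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var_x = sum((x - mx) ** 2 for x in xs)
        var_y = sum((y - my) ** 2 for y in ys)
        return cov / (var_x * var_y) ** 0.5

    scores = [simulate_participant() for _ in range(5000)]
    reflective, intuitive = zip(*scores)
    print(pearson(reflective, intuitive))  # strongly negative (about -.85 here)

Under these illustrative rates the simulated correlation is strongly negative even though the generating process contains no individual differences whatsoever, which is exactly the confound described above.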

Nonetheless, a potential measure can be derived from the CRT that might assess individual differences in intuitiveness independently of CRT–Reflective. Specifically, we may shift focus entirely to the incorrect responses, under the hypothesis that more intuitive individuals should be more likely to give intuitive incorrect responses than “other” incorrect responses. Individuals who select an “other” incorrect response on a CRT item should either be less likely to generate the intuitive answer cued by the wording of the question or have less faith in that intuition than those who ultimately provide an intuitive incorrect response. Thus, if individual differences in intuitiveness can be assessed using the CRT, they should be reflected in the proportion of incorrect responses that are intuitive. Using this measure, there is no structurally necessary correlation between the proportion of “intuitive” to “nonintuitive” incorrect responses and the number of correct responses (CRT–Reflective), because the former is derived within errors whereas the latter is the number of correct responses relative to all responses.
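A sketch of how this within-error proportion could be computed (again with the hypothetical response coding used above): the measure is defined only for participants who make at least one error, and it never references the number of correct responses.

    def proportion_intuitive(responses):
        """Proportion of a participant's incorrect responses that are the
        cued intuitive answer; None when the participant made no errors."""
        errors = [r for r in responses if r != "correct"]
        if not errors:
            return None  # the proportion is undefined for a perfect score
        return sum(r == "intuitive" for r in errors) / len(errors)

    print(proportion_intuitive(["correct", "intuitive", "other"]))  # 0.5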

Method

Participants

Undergraduate students at the University of Waterloo participated in an online study that included the CRT along with additional reasoning measures not of interest here. Participants who completed the CRT were then permitted to sign up for a second online study that included a number of questionnaires. Although the two studies were not presented as being directly related in any way, some participants may have been aware that the first study was a prerequisite for the second (along with a number of other studies). Students received course credit for both studies. We had complete data for 497 participants (343 female, 154 male; mean age = 20.5 years, SD = 4.6). Because the CRT had been administered in some previous studies conducted with this population, we asked participants whether they had seen any of the CRT problems before. In total, 125 (25.2 %) of the participants responded “yes” to this question or failed to respond, and were excluded from the subsequent analysis. This left us with 372 participants (268 female, 104 male).

Materials

The CRT is presented in Table 1. As we discussed, there are two possible types of incorrect responses for the CRT: (1) cued intuitive incorrect responses (e.g., “100” for the widget question), and (2) “other” incorrect responses (e.g., “20” for the widget question). We derived several scores from CRT performance. We summed the number of correct responses (CRT–Reflective) and the number of intuitive incorrect responses (CRT–Intuitive). We also computed the proportion of incorrect responses that were intuitive (PI) for each CRT item. Finally, we computed the mean proportion of intuitive responses out of the total incorrect answers across the three items.

We used Pacini and Epstein’s (1999) Rational–Experiential Inventory, which includes a 20-item Need for Cognition (NFC) scale and a 20-item Faith in Intuition (FI) scale. Both scales had acceptable reliability: Cronbach’s alpha = .86 (NFC) and .87 (FI). Participants were given items such as “Reasoning things out carefully is not one of my strong points” (NFC, reverse scored) and “I like to rely on my intuitive impressions” (FI). They were asked to respond using a 5-point scale, from 1 (Definitely not true of myself) to 5 (Definitely true of myself). We converted each item to a Percent of Maximum Possible (POMP) score to create interpretable values and then computed the means for the two scales separately (Cohen, Cohen, Aiken, & West, 1999).
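For reference, the POMP conversion (Cohen et al., 1999) linearly rescales each raw item score to a percentage of the maximum possible range. A minimal sketch for the 5-point scale used here (the function name is ours, for illustration):

    def pomp(raw, minimum=1, maximum=5):
        # Percent of Maximum Possible: 1 -> 0.0, 3 -> 50.0, 5 -> 100.0
        return 100 * (raw - minimum) / (maximum - minimum)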

Results

The proportions of each response type (correct, intuitive incorrect, and “other” incorrect) for the three individual CRT items are presented in Table 2. The majority of participants (>80 % for each item) either entered the intuitive incorrect response or correctly solved the problem. Very few participants gave more than one “other” incorrect response (5.3 %), and most gave zero (73.7 %); see Table 3. Thus, as expected, the CRT–Reflective and CRT–Intuitive scores were highly negatively correlated (r = –.85; see Table 4). In contrast, FI and NFC were not significantly correlated (r = .05, p = .315).

Table 2 Numbers (and proportions) of participants who gave each response type for each CRT problem
Table 3 Numbers (and proportions) of participants scoring 0, 1, 2, or 3 (out of 3) for each response type
Table 4 Correlations between the Cognitive Reflection Test (CRT) and Rational–Experiential Inventory measures (i.e., Faith in Intuition and Need for Cognition)

Correlations between the CRT and the self-report measures are presented in Table 4. The CRT measures include the CRT–Reflective and CRT–Intuitive scoring strategies, as well as the proportions of intuitive incorrect responses (PIs) for each CRT item and for the entire scale. CRT–Reflective (the number of correct responses) is correlated with both NFC and FI. As predicted, this correlation is nominally larger for NFC than for FI. CRT–Intuitive (the number of intuitive incorrect responses) is also correlated with both NFC and FI, but these correlations are essentially indistinguishable from those for CRT–Reflective. Moreover, CRT–Intuitive is more strongly correlated with NFC than with FI; the opposite pattern would be expected if the number of intuitive incorrect responses on the CRT indexed intuitiveness.

Although this pattern of correlations is informative, a more stringent PI measure of intuitiveness derived from the CRT might reveal a stronger association with the FI scale. For this analysis, we compared the NFC and FI scores of participants who gave intuitive incorrect responses with those of participants who gave “other” incorrect responses (Table 4). Importantly, participants who answered correctly were excluded from this analysis. We turn first to the analysis of PIs derived separately from each of the three CRT items. This approach is beneficial because it requires no assumptions about the proportion of incorrect responses across items, which would increase the structural dependency with CRT–Reflective. Despite the relatively small proportion of participants who gave “other” incorrect responses, the large sample size meant that an adequate number of observations was still available for each item to permit this analysis (see Table 2). Moreover, unlike the CRT–Intuitive measure, the proportions of intuitive incorrect responses (PI) did not significantly correlate with CRT–Reflective (with one exception, discussed below). They did, however, correlate positively with CRT–Intuitive and with each other (Table 4).

As is evident from Table 4, FI scores were not higher for participants who gave an intuitive incorrect response than for those who gave an “other” incorrect response (rs = .04, .06, and .02). This result raises serious questions about the validity of the CRT as a measure of intuitiveness. Curiously, there was a difference in NFC scores for one of the three CRT items. Participants who gave an “other” incorrect response on the lily pad item had higher NFC (M = 64.5, SD = 13.2) than did those who gave the intuitive incorrect response (M = 59.1, SD = 12.3), t(209) = 2.82, SE = 1.92, p = .005. The lily pad item was notably easier than the other two items for this sample and was the only item for which the proportion of intuitive incorrect responses correlated with the overall CRT–Reflective score (Table 4). This may be partly because the item was presented last in this experiment, although the lily pad item is usually associated with the highest accuracy (e.g., Campitelli & Gerrans, 2014). It may be that the lower NFC among those who gave an “other” incorrect response can be accounted for by differences in numeracy.

As an additional analysis, we computed the mean proportion of intuitive incorrect responses across the three CRT items (Table 4, variable 3). Scores on this aggregate PI measure can range from 0 to 1, with 0 indicating no intuitive incorrect responses and at least one “other” incorrect response and 1 indicating at least one intuitive incorrect response and no “other” incorrect responses. This measure did not significantly correlate with either FI (r = .05) or NFC (r = –.09). Again, this is inconsistent with the idea that CRT–Intuitive can be used as a measure of relative intuitiveness.

Finally, our results replicated previous work demonstrating gender differences in CRT performance (e.g., Campitelli & Gerrans, 2014; Frederick, 2005; Toplak et al., 2011). Males (M = 1.42, SD = 1.08) had more correct responses (CRT–Reflective) than did females (M = .90, SD = 1.00), t(369) = 4.39, SE = 0.12, p < .001. This result was also reflected in a higher number of intuitive responses (CRT–Intuitive) from females (M = 1.78, SD = 1.06) than from males (M = 1.27, SD = 1.03), t(369) = 4.12, SE = 0.12, p < .001. Females (M = 0.33, SD = 0.61) were no more likely to give “other” incorrect responses than were males (M = 0.31, SD = 0.54), t < 1, and the mean proportions of intuitive incorrect responses (CRT–PI) did not differ between males (M = .80, SD = .32) and females (M = .83, SD = .30), t < 1. There were, however, gender differences in the self-report thinking dispositions. Namely, males had a higher NFC (M = 67.7, SD = 13.1) than did females (M = 62.0, SD = 13.3), t(369) = 3.70, SE = 1.54, p < .001, and females had a higher FI (M = 56.4, SD = 13.4) than did males (M = 52.5, SD = 12.2), t(369) = 2.57, SE = 1.52, p = .01. This replicates previous work using these self-report scales (e.g., Pacini & Epstein, 1999).

Discussion

Given the ubiquity of the CRT’s use in research, it is necessary to determine how best to interpret what it measures. Our results very clearly indicate that the CRT is a questionable measure of the propensity to rely on or trust “gut feelings.” Although the CRT–Intuitive measure assessed in previous literature (i.e., the number of intuitive incorrect responses) was correlated with Faith in Intuition, a self-report measure of intuitiveness, this correlation was not robust and, in fact, was nominally [though not significantly: t(372) = 1.37, p = .171] smaller than the corresponding correlation with Need for Cognition, a self-report measure of how much one engages in and enjoys effortful thinking. Moreover, these correlations were essentially indistinguishable from the parallel correlations for the CRT–Reflective measure (i.e., the number of correct responses). The success of the CRT–Intuitive measure in previous research (e.g., Brosnan et al., 2014; Piazza & Sousa, 2014; Shenhav et al., 2012) may be entirely explained by its strong negative correlation with the CRT–Reflective measure.

We also attempted to derive a measure of intuitiveness from the CRT that was not structurally related to or correlated with the standard CRT–Reflective score. For this measure, we compared participants who gave intuitive incorrect responses with those who gave “other” incorrect responses, under the assumption that the former group would have relatively more faith in their intuition. This prediction was not borne out for either individual items or the mean across items, despite the strong intercorrelations among the PI measures.

Theoretical considerations

Our results raise questions about the role of intuition in the CRT. Part of the power of the CRT is that the cued responses have a very high likelihood of coming to mind (i.e., they appear to be intuitive insofar as they are both rapidly available and compelling). Scoring based on accuracy assumes that producing the correct response requires the participant to perform the requisite mental operations (unless, of course, the respondent has seen the problem before). If the intuitive response is a default common to most, if not all, people, as the logic of the test assumes, then the test is, in principle, an inefficient instrument for assessing people on the basis of intuitive ability, though it might be a measure of intuitive preference. That is, “intuitive” individuals may or may not detect the need to think analytically about the problem, but they decide nonetheless to “go with their gut.” Indeed, a recent investigation showed that participants were less confident on the bat-and-ball item than on an isomorphic control version that required the same mathematical operation but did not cue an intuitive response (De Neys, Rossi, & Houdé, 2013). This decrease in confidence suggests that the participants recognized, at some level, a problem with the intuitive answer to the CRT item. Crucially, this finding was evident even for those who gave the intuitive response on the bat-and-ball problem, suggesting that individuals who respond with the intuitive answer likely do so largely because of a lack of willingness or ability to engage in analytic reasoning to question the default answer.

More generally, it is unclear how “intuitiveness” would affect performance on the CRT. Some forms of intuition may be associated with highly overlearned tasks (e.g., Kahneman & Klein, 2009; Lieberman, 2000), and hence are employed only within particular domains. A chess player may become a very “intuitive” player through years of practice, but this does not imply that she is dispositionally an “intuitive” person in terms of preferred cognitive style. In this regard, using the CRT as a measure of intuitiveness could only distinguish people for whom the intuitive response does not come to mind (though, arguably, those who give “other” incorrect responses may have just as intuitive an initial response, but simply make a mathematical error). Such people, we speculate, would falsely appear to lack “intuitiveness” in this domain because they are particularly experienced with math problems, not because they are dispositionally less intuitive. Indeed, their mathematical intuitions may be quite different from the intuitions of those with low mathematical ability. At the very least, even if high-ability individuals have the same initial intuitions as low-ability individuals, they likely have readier access to alternative intuitions. Regardless, multiple investigations have established that CRT performance is not fully explained by numeracy or cognitive ability (Campitelli & Gerrans, 2014; Cokely & Kelley, 2009; Liberali et al., 2011; Toplak et al., 2011, 2014).

The logic of the CRT requires the assumption that the cued intuitions are common and available to all or nearly all test-takers, but that the disposition and ability to override these highly available intuitions vary across individuals. The literature cited above and the present results provide evidence for the validity of that assumption. Thus, although intuition is clearly an important component of the CRT, the logic of the test and the present evidence suggest that individual differences in “intuitiveness” cannot be reliably measured by performance on the CRT.