The cognitive reflection test is robust to multiple exposures

The cognitive reflection test (CRT) is a widely used measure of the propensity to engage in analytic or deliberative reasoning in lieu of gut feelings or intuitions. CRT problems are unique because they reliably cue intuitive but incorrect responses and, therefore, appear simple among those who do poorly. By virtue of being composed of so-called “trick problems” that, in theory, could be discovered as such, it is commonly held that the predictive validity of the CRT is undermined by prior experience with the task. Indeed, recent studies have shown that people who have had previous experience with the CRT score higher on the test. Naturally, however, it is not obvious that this actually undermines the predictive validity of the test. Across six studies with ~2,500 participants and 17 variables of interest (e.g., religious belief, bullshit receptivity, smartphone usage, susceptibility to heuristics and biases, and numeracy), we did not find a single case in which the predictive power of the CRT was significantly undermined by repeated exposure. This occurred despite the fact that we replicated the previously reported increase in accuracy among individuals who reported previous experience with the CRT. We speculate that the CRT remains robust after multiple exposures because less reflective (more intuitive) individuals fail to realize that being presented with apparently easy problems more than once confers information about the task’s actual difficulty.

Consider the bat-and-ball problem from the CRT (Frederick, 2005): A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? The intuitive response is 10 cents. Yet it is incorrect: If the ball costs 10 cents, the bat would have to cost $1.10, and in total they would cost $1.20. The correct response is 5 cents, an answer that is reached by only roughly 30% of university undergraduates, depending on the university (De Neys, Rossi, & Houdé, 2013; Frederick, 2005; Pennycook, Cheyne, Koehler, & Fugelsang, 2016). However, errors are also not random: Almost all who get CRT questions wrong give the intuitive response (Campitelli & Gerrans, 2014; Frederick, 2005; Pennycook, Cheyne, et al., 2016). Moreover, the majority of participants who answer correctly are aware of the incorrect intuitive answer, whereas those who get it wrong naturally fail to consider the correct answer (Mata, Ferreira, & Sherman, 2013; Pennycook, Ross, Koehler, & Fugelsang, 2017). Although the CRT obviously requires some degree of numeracy, it is thought to also capture the propensity to think analytically (Toplak et al., 2011). That is, those who do well on the CRT are also less prone to rely on heuristics and biases even after measures of cognitive ability have been taken into account (Toplak, West, & Stanovich, 2011, 2014). Moreover, the CRT predicts a wide range of variables after controlling for numeracy (Pennycook et al., 2015a). Perhaps because it is short and strongly predictive of a variety of outcome variables, the CRT has become a nearly ubiquitous psychological test.
It is widely held that so-called “trick problems” like the above bat-and-ball problem will not be robust to multiple testing, because participants will realize that the problems only seem easy at first blush (Haigh, 2016; Stieger & Reips, 2016). Indeed, prior experience with the CRT is associated with higher scores (Haigh, 2016; Stieger & Reips, 2016). This casts doubt on whether the test can continue to be used as a valid tool for assessing analytic cognitive style. As a consequence, much effort has gone into developing newer versions of the CRT (Primi, Morsanyi, Chiesi, Donati, & Hamilton, 2016; Thomson & Oppenheimer, 2016; Toplak et al., 2014).
Although there is strong agreement that multiple exposures to the CRT invalidate it as a test, remarkably, no studies (that we are aware of) have empirically tested this claim. Does the CRT continue to predict important psychological factors even among people who have been given the test before? There is good reason to believe that it should: Namely, if the people who do poorly on the CRT are genuinely intuitive (i.e., are not willing to think analytically), the most intuitive among them will either not realize that they are seeing a repeated problem, or if they realize this, they will not consider that the (apparently simple) problem is being repeated for a reason. Put differently, repeated exposure can be thought of as an additional sort of CRT test. Researchers can be confident that those who continue giving intuitive responses after being presented with the problem more than once are strongly intuitive. Moreover, those who do relatively well on the CRT originally do not gain much from repeated exposure. The effect of repeated exposure, then, may only cause researchers to mislabel a genuinely intuitive person as reflective (on the basis of accuracy) in a relatively small proportion of cases.
To investigate this issue, we used previously collected datasets to test how prior exposure to the CRT affects the strength of the reported correlations with different behavioral and cognitive factors of theoretical interest.

Empirical tests
A set of six experiments with almost 2,500 participants was used to test the effect of previous exposure on the CRT. In the reported reanalyses, we compared participants who declared (when asked directly) that they had seen at least one CRT item before (“experienced”) with participants who had no recollection of prior exposure (“unexperienced”). Prior experience was probed using the following question: “Have you seen any of the last 3 word problems before?” The participants who indicated “yes” were considered experienced, and those who selected “maybe” or “no” were considered unexperienced. The results include data from every published or submitted manuscript (in which one of the present authors was a lead or co-lead author of the study) that had asked participants about prior experience with the CRT. This was the only inclusion criterion. The data for all studies are available at the Open Science Framework: https://osf.io/kawv8/.
CRT scores are affected by prior experience
Table 1 presents average accuracies on the three-item CRT for individuals with and without prior experience. The mean accuracies on the CRT are followed by a direct t test, an estimation of the effect size, and a Bayesian test that weighs the evidence for the CRT score being lower in the unexperienced group than in the group of individuals with experience. The Bayes factor (BF) is interpreted in a continuous manner as the strength of evidence supporting one model against the other, but BF > 3 is often interpreted as the minimum acceptable evidence to support a particular hypothesis (Dienes, 2014; Masson, 2011). As is evident from Table 1, the individuals who had prior exposure to the CRT (about a third of our tested sample) scored higher (d = 0.57). This is consistent with the previously reported findings of Haigh (2016) and Stieger and Reips (2016), although their effect sizes were slightly smaller (d = 0.48 and d = 0.41, respectively).
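The group comparison described here (a t test plus a standardized effect size) can be illustrated with a short sketch. The Python below is a minimal illustration, not the authors' analysis script: it assumes Cohen's d is computed with a pooled standard deviation and that the t test is the equal-variance (Student) version, and the scores are invented for the example.

```python
import math
from statistics import mean, variance


def pooled_variance(group_a, group_b):
    """Pooled sample variance of two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    weighted = (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    return weighted / (n_a + n_b - 2)


def cohens_d(group_a, group_b):
    """Standardized mean difference (group_a minus group_b)."""
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_variance(group_a, group_b))


def student_t(group_a, group_b):
    """Equal-variance t statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    se = math.sqrt(pooled_variance(group_a, group_b) * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / se, n_a + n_b - 2


# Hypothetical CRT scores (0-3 items correct); not data from the studies.
experienced = [3, 2, 3, 1, 2, 3]
unexperienced = [0, 1, 2, 0, 1, 2]

d = cohens_d(experienced, unexperienced)
t, df = student_t(experienced, unexperienced)
print(f"d = {d:.2f}, t({df}) = {t:.2f}")
```

A positive d here means the first group (experienced) scored higher, which matches the direction of the effect reported in the text.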

Prior experience does not affect the CRT's predictive validity
We compared the magnitudes of the CRT's correlations with a variety of outcome variables using Fisher's z test (Table 2). A significant result on the z test indicates that the relationship between the CRT and another variable was different between the unexperienced and experienced groups. Finally, we present a suggested method for future testing: partial correlations, in which the correlations between the CRT and the dependent variables are computed while controlling for the CRT-exposure score. If prior exposure undermines the predictive validity of the CRT, the test should not be predictive for the experienced group (or, at least, it should be less predictive than it is for the unexperienced group).
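As a concrete illustration of the two procedures named in this paragraph, the sketch below implements the textbook Fisher z comparison of two independent correlations and the first-order partial correlation formula. The function names and all input values are hypothetical, and the two-tailed p value is an assumption; the text does not state which variant was used for Table 2.

```python
import math
from statistics import NormalDist


def fisher_z_test(r1, n1, r2, n2):
    """Compare two independent Pearson correlations.

    Returns the z statistic (group 1 minus group 2) and a two-tailed p value.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher r-to-z transform
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical values, chosen only for illustration.
z, p = fisher_z_test(r1=0.30, n1=400, r2=0.35, n2=150)
print(f"z = {z:.2f}, two-tailed p = {p:.3f}")

# Here x = CRT score, y = outcome, z = reported prior exposure (all hypothetical).
print(f"partial r = {partial_correlation(0.30, 0.25, 0.05):.3f}")
```

With the unexperienced group entered first, a negative z indicates a stronger correlation in the experienced group.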
As is evident from Table 2, the Fisher z tests comparing the correlations between experienced and unexperienced participants were significant in only three out of 23 cases. However, in all three cases the CRT was more predictive among participants who reported having previous experience. Among these exceptions were two cases in which the CRT became more strongly predictive of performance on the heuristics-and-biases battery (via Toplak et al., 2011), a set of tasks that includes problems taken from Kahneman and Tversky's heuristics-and-biases program of research (Kahneman, Slovic, & Tversky, 1982). The battery includes problems relating to the conjunction fallacy, base-rate neglect, the gambler's fallacy, and so on. Perhaps the most straightforward explanation for the increase in the strength of the correlation between the CRT and the heuristics-and-biases battery is that the two tests are often used in conjunction with each other (prior exposure to the heuristics-and-biases battery was not assessed).
For the other case, smartphone Internet use, it is unclear why the CRT correlation would be stronger among experienced participants. One possibility is that the CRT is more reliable among the experienced participants. However, we tested for changes in the reliability (Cronbach's alpha) of the CRT for experienced relative to unexperienced groups and found no consistent differences (reliability was greater for the experienced group in three studies, but was smaller in two others, and in one case the two groups were equal): Study 1, α_unexp = .65, α_exp = .72; Study 2, α_unexp = .64, α_exp = .60; Study 3, α_unexp = .61, α_exp = .53; Study 4, α_unexp = .57, α_exp = .65; Study 5, α_unexp = .63, α_exp = .63; Study 6, α_unexp = .62, α_exp = .76. In fact, CRT reliability was lower in the experienced group in the study that included the smartphone Internet use question (Study 3).
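For readers who want to run the same reliability check on their own data, the following is a minimal sketch of the standard Cronbach's alpha formula applied to item-level accuracy. The function name and the response matrix are hypothetical; the sketch is not taken from the authors' materials.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [variance([person[i] for person in responses]) for i in range(k)]
    total_variance = variance([sum(person) for person in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 0/1 accuracy on the three CRT items for six respondents.
responses = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

Computing alpha separately for the experienced and unexperienced subsamples of a study gives the kind of side-by-side comparison listed above.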
It should be noted that nonsignificant differences between reported correlations cannot be used as positive evidence against group differences (Dienes, 2014). For example, one reason for a nonsignificant Fisher's z test result would be a study having too little power to detect small differences between the groups. In our case, however, we are concerned with the practical issue of the CRT's predictive validity, and therefore small significant differences were not our primary concern. Nonetheless, the reported correlations are fairly consistent across the experienced and naive groups, and our least-powered comparison (Study 6) was sensitive enough to detect differences in correlations of r = .161. Such differences are small enough to justify our claim that any smaller differences would be of little practical importance (Ellis & Steyn, 2003; Taylor, 1990).
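To make the sensitivity argument concrete, the sketch below shows one common normal-approximation way to estimate the smallest difference between two independent correlations that a Fisher z comparison can detect. It is an illustration under stated assumptions, not the procedure that produced the r = .161 figure: the function name, the group sizes, and the default alpha and power are all hypothetical, and the result is expressed in Fisher z units (Cohen's q) rather than as a raw difference in r.

```python
import math
from statistics import NormalDist


def min_detectable_q(n1, n2, alpha=0.05, power=0.80):
    """Smallest difference in Fisher z units detectable by a two-tailed test."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    return (z_alpha + z_power) * se


# Hypothetical group sizes; not the Ns of any study reported here.
for n_unexp, n_exp in [(200, 100), (800, 400)]:
    q = min_detectable_q(n_unexp, n_exp)
    print(f"n = ({n_unexp}, {n_exp}): minimum detectable q = {q:.3f}")
```

Setting power to 0.5 reduces the expression to the critical value times the standard error, that is, the difference that would be just significant.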
In two cases, an originally significant correlation was not significant among individuals who had prior experience with the CRT. In one case, the correlation between CRT performance and religious belief was not significant among experienced participants (Study 2), r(114) = -.14, p = .131. Nonetheless, this is well within the range of results from previous studies on the topic, as summarized in a prior meta-analysis. Moreover, the CRT negatively predicted religious belief among experienced participants in two larger studies (Studies 1 and 5; see Table 2). In the other case, the CRT did not predict faith in intuition among those with experience on the CRT. Nonetheless, the correlation with faith in intuition among the unexperienced was small in the first place, r(372) = -.205. Moreover, to reiterate, the difference between the correlation coefficients for experienced and unexperienced participants was not significant.
Table 2 (fragment). Values in each row, in order: full sample; unexperienced; experienced; Fisher's z test; partial correlation.
[variable label missing]: .589 (N=639); .559 (N=494); .580 (N=144); z = -0.33, p = .372; .564
Base rate neglect (accuracy): .257 (N=520); .182 (N=411); .235 (N=108); z = -0.56, p = .306; .194
Heuristics/biases (accuracy): .478 (N=267); .446 (N=200); [remaining values missing]
Note. All ps < .05 unless otherwise indicated. “Full sample” is the sum of all participants, regardless of whether or not they declared their experience with the CRT, but particular subsamples do not include the individuals who failed to declare their previous experience with the CRT. * Via Toplak, West, & Stanovich (2014). ** The Ns were variable for the marked study because the data set was actually the combination of four separate studies that had used different combinations of variables (but always the original CRT, verbal intelligence scale, numeracy scale, and religious belief scale). The reported correlations are from three of the four studies in that data set because one of the studies (Study 3) focused on data from Barr et al. (2015).

Discussion
Across six data sets with close to 2,500 participants and 17 variables of theoretical interest, we did not find a single case in which prior exposure to the CRT had a significant negative impact on the test's predictive validity. Indeed, the correlation coefficients stayed fairly consistent, regardless of prior exposure, with three exceptions in which the correlations became stronger with additional exposure to the CRT. We therefore conclude that exposure does not adversely affect the correlation of the CRT with other variables.
We replicated the finding that CRT accuracy increases with experience (Haigh, 2016; Stieger & Reips, 2016). Hence, there is a chance that, when doing an experiment that compares accuracy across two groups (rather than using the CRT as an individual difference measure, as was done here), one group will have artificially higher scores because it contains more experienced participants than the other. To avoid this problem, researchers could ask participants whether they have prior experience with the CRT and set an analysis plan in which they use CRT experience as a covariate (thus controlling for experience level differences between conditions).
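The covariate adjustment suggested above can be sketched as an ordinary least-squares regression of CRT score on condition plus a prior-experience indicator. The example below is a minimal pure-Python illustration, not a recommended analysis pipeline: the helper names are hypothetical, and the data are artificial, noise-free scores constructed so that the condition with more experienced participants looks better than it is until experience is controlled.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Artificial data: condition 1 happens to contain more experienced participants.
condition = [0, 0, 0, 0, 1, 1, 1, 1]
experience = [0, 0, 0, 1, 0, 1, 1, 1]
crt_score = [1.0, 1.0, 1.0, 2.0, 1.5, 2.5, 2.5, 2.5]

design = [[1.0, c, e] for c, e in zip(condition, experience)]
intercept, condition_effect, experience_effect = ols(design, crt_score)

mean_1 = sum(s for s, c in zip(crt_score, condition) if c == 1) / condition.count(1)
mean_0 = sum(s for s, c in zip(crt_score, condition) if c == 0) / condition.count(0)
raw_difference = mean_1 - mean_0

print(f"raw difference between conditions: {raw_difference:.2f}")
print(f"difference adjusted for experience: {condition_effect:.2f}")
```

In this constructed example the unadjusted difference between conditions is inflated by the imbalance in experience, and the adjusted coefficient isolates the part attributable to condition.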
The present results indicate that the CRT remains strongly predictive of a variety of outcome variables, even given prior experience. However, the results offer no clues as to why this is the case. We see two explanations that are not mutually exclusive. The first hypothesis relates to the possibility that those who do poorly on the CRT have a metacognitive disadvantage (Mata et al., 2013; Pennycook, Ross, Koehler, & Fugelsang, 2017). Namely, only relatively analytic individuals may increase their performance on the CRT with repeated testing, whereas relatively less analytic and more intuitive individuals will continue to do poorly on the test. This may occur because relatively intuitive individuals fail to realize that being presented with apparently easy problems more than once confers information about the task's actual difficulty. This coincides with research showing that participants who do poorly on the CRT massively overestimate their performance (i.e., they do not realize they are doing poorly; Pennycook et al., 2017), which indicates that intuitive individuals may have a metacognitive disadvantage (see also Mata et al., 2013). Other research has indicated that more reflective individuals are also better able to detect conflict during reasoning (Pennycook, Cheyne, Barr, Koehler, & Fugelsang, 2014a; Pennycook, Fugelsang, & Koehler, 2015b), which may, in turn, explain their metacognitive advantage.
It is also possible that the increase in accuracy on the CRT upon repeated exposure is actually the result of a self-selection effect; that is, smarter individuals are more likely to complete more studies. Hence, their CRT score is higher not because multiple exposures allow them to solve the problems, but because this subgroup of participants is simply more reflective than the rest of the population (i.e., a selection instead of a treatment effect). This would increase the overall accuracy on the CRT but would not affect how it correlates with various outcome variables. Further research will be required in order to distinguish between these accounts (and, of course, additional accounts are possible). An experiment aimed directly at the effects of retesting could isolate which people show improved performance over time and whether the test's predictive validity changes over time.
The present data also do not allow us to claim that many repeated exposures do not eventually undermine the CRT's predictive validity. In Studies 2-5, the participants were asked how many times they had seen the CRT, and the majority (85%) declared they had only seen the CRT one to three times. However, focusing on the study that had the most participants who reported seeing the CRT more than three times (Study 5), the correlations between CRT and religious belief (for example) were very similar between highly experienced participants, r(37) = -.20, and those with no prior CRT experience, r(828) = -.18 (this pattern emerged for all other variables of interest, as well). Nonetheless, this sample is not sufficient to definitively test whether the predictive validity of the CRT will eventually break down after repeated exposure. Naturally, this is less of a pragmatic concern for researchers who are interested in using the CRT since, in most cases, a large number of repeated exposures will be less common than a small number.
Another limitation of the present analysis is that none of the participants were asked where they had previously encountered the CRT. Haigh (2016) found that, among those who reported prior exposure to the CRT, 30.6% had been exposed via popular media, and 22.2% had been exposed via a course in school or university. In such contexts, it is likely that the CRT was not simply presented, but explained. It is quite possible that this type of exposure has relevance for whether the CRT remains a potent predictor. We suggest that future CRT studies ask “have you seen the last 3 word problems before?” (response options: “yes,” “maybe,” “no”) and give participants an opportunity to indicate the context where this occurred (response options, selecting all that apply: “previous research studies,” “in popular media, such as books, websites, and social media,” “school or university,” and “not sure”). Exclusions based on these questions should be preregistered or, at least, fully reported along with alternative analyses to demonstrate that the presented results do not rely on a particular post hoc exclusion criterion.
The CRT is far from a perfect measure. Apart from the issues that may emerge from familiarity with the task (however overblown), the CRT consists of only three items, is not particularly reliable, and suffers from range restriction issues (across the 1,624 naive participants in our sample, 42.2% got none of the questions correct, and 12.9% got all of the questions correct). Thus, despite the present results, the continued development and testing of expanded versions of the CRT remains imperative (see Primi et al., 2016; Thomson & Oppenheimer, 2016; Toplak et al., 2014). Indeed, Stanovich and colleagues have developed a more comprehensive measure of rational thinking (the “rationality quotient”; Stanovich, 2016; Stanovich, West, & Toplak, 2016) that will be an important tool for future work in psychology and education.
Our analysis provides a clear answer to the potential problem of prior exposure to the CRT: The CRT is robust to multiple testing, and there is no need to abandon it as an individual difference measure. Although some people may benefit from experience with the task, these individuals are evidently among the more analytic ones in the sample, and therefore do not impact the overall predictive validity of the test.