Women and ‘the philosophical personality’: evaluating whether gender differences in the Cognitive Reflection Test have significance for explaining the gender gap in Philosophy

The Cognitive Reflection Test (CRT) is purported to test our inclination to overcome impulsive, intuitive thought with effortful, rational reflection. Research suggests that philosophers tend to perform better on this test than non-philosophers, and that men tend to perform better than women. Taken together, these findings could be interpreted as partially explaining the gender gap that exists in Philosophy: there are fewer women in Philosophy because women are less likely to possess the ideal ‘philosophical personality’. If this explanation for the gender gap in Philosophy is accepted, it might be seen to exonerate Philosophy departments of the need to put in place much-needed strategies for promoting gender diversity. This paper discusses a number of reasons for thinking that this would be the wrong conclusion to draw from the research. Firstly, the CRT may not track what it is claimed it tracks. Secondly, the trait tracked by the CRT may not be something that we should value in philosophers. Thirdly, even if we accept that the CRT tracks a trait that has value, this trait might be of limited importance to good philosophising. Lastly, the causal story linking the gender gap in CRT score and the gender gap in Philosophy is likely to be far more complex than this explanation implies.


Introduction
The Cognitive Reflection Test (CRT) is purported to test our inclination to overcome impulsive, intuitive thought with effortful, rational reflection. Research suggests that philosophers tend to perform better on this test than non-philosophers, and that men tend to perform better than women. Taken together, these findings could be interpreted as partially explaining the gender gap that exists in Philosophy: there are fewer women in Philosophy because women are less likely to possess this aspect of the ideal philosophical personality. If this explanation for the gender gap in Philosophy is accepted, it might be seen to exonerate Philosophy departments of the need to put in place much-needed strategies for promoting gender diversity. This paper discusses a number of reasons for thinking that this would be the wrong conclusion to draw from the research. Firstly, the CRT may not track what it is claimed it tracks. The dominant interpretation of the CRT is that it tracks an aspect of rationality, but it may be that the CRT tracks numeracy or confidence instead. Secondly, the trait tracked by the CRT may not be something we should value in philosophers. Even if we currently select for this trait in Philosophy, it may be that this trait is not, in fact, an asset to good philosophising. Thirdly, even if we accept that the CRT tracks a trait that has value, this trait might be of limited importance to good philosophising. A whole range of virtues and skills can plausibly be postulated as part of the ideal philosophical personality, and it is not clear to what extent the trait tracked by the CRT is an important philosophical virtue or skill. Lastly, the causal story linking the gender gap in CRT score and the gender gap in Philosophy is likely to be far more complex than this explanation implies. If the CRT gender gap is explanatory for the Philosophy gender gap, it is likely that it will be one of several, interacting causal factors.
The research at present does not allow us to draw clear conclusions over which route (or combination of routes) we should take in response to the findings. However, one response is clear. Even if it is the case that the CRT gender gap is somewhat explanatory of the Philosophy gender gap, and even if it is right that the CRT tracks a trait that is conducive to good philosophising, this does not justify inaction on the part of Philosophy departments or wider society. Rather, it points to the need for a self-conscious analysis of the discipline, including looking at what other obstacles there may be to women's participation in Philosophy. Additionally, since gender differences in CRT score are likely to be (at least partly) the result of environmental factors, it points towards the need for action aimed at rectifying injustices in wider society, so that women can develop their skills at whatever it is that the CRT tracks.

The Cognitive Reflection Test
In 2005, Shane Frederick proposed the CRT as a measure of one type of cognitive ability. A participant's CRT score is the number of questions that he or she answers correctly on the following three-item test:
1. A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? Answer: ___ cents.
2. If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets? Answer: ___ minutes.
3. In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake? Answer: ___ days.
The test questions have been chosen because they invite an intuitive, wrong answer. For example, consider the first question. The answer '10 cents' springs to mind, but this "impulsive" answer is incorrect (Frederick 2005, p. 26). Further reflection on the problem leads one to realise that the difference between $1.00 and 10 cents is only 90 cents, not $1.00, and "catching that error is tantamount to solving the problem, since nearly everyone who does not respond '10 cents' does, in fact, give the correct response: '5 cents.'" (Frederick 2005, p. 27). In his original paper, Frederick discusses how CRT score is positively and significantly correlated with various other measures of cognitive ability (for example, the Wonderlic Personnel Test and the Scholastic Achievement Test). However, he argues that the CRT tests something distinctive, "cognitive reflection", which he defines as "the ability or disposition to resist reporting the response that first comes to mind" (Frederick 2005, p. 35).
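As a rough check on the three items, the correct answers can be derived mechanically. The sketch below is illustrative only and is not part of Frederick's materials: the names `CORRECT`, `INTUITIVE` and `crt_score` are hypothetical, and the intuitive answers listed for items 2 and 3 (100 minutes and 24 days) are the commonly reported lures rather than figures quoted in the passage above.

```python
from fractions import Fraction

# Item 1 (bat and ball), working in cents:
#   ball + bat = 110 and bat = ball + 100, so 2 * ball + 100 = 110.
ball = (110 - 100) // 2
bat = ball + 100

# Item 2 (widgets): 5 machines need 5 minutes for 5 widgets, so one widget
# takes 5 * 5 / 5 = 5 machine-minutes of work.
machine_minutes_per_widget = (5 * 5) // 5
minutes_for_100 = (100 * machine_minutes_per_widget) // 100

# Item 3 (lily pads): the patch doubles daily and fills the lake on day 48,
# so on day d it covers 1 / 2**(48 - d) of the lake.
half_day = next(
    d for d in range(1, 49) if Fraction(1, 2 ** (48 - d)) == Fraction(1, 2)
)

CORRECT = {"bat_and_ball": ball, "widgets": minutes_for_100, "lily_pads": half_day}
# Assumption: the typical intuitive (wrong) responses to the three items.
INTUITIVE = {"bat_and_ball": 10, "widgets": 100, "lily_pads": 24}


def crt_score(responses):
    """Count how many of the three items were answered correctly (0 to 3)."""
    return sum(
        1 for item, answer in CORRECT.items() if responses.get(item) == answer
    )


if __name__ == "__main__":
    print(CORRECT)
    print(crt_score(INTUITIVE), crt_score(CORRECT))
```

On this scoring, a respondent who gives the intuitive answer to every item scores 0 out of 3, while one who gives 5 cents, 5 minutes and 47 days scores 3.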
Frederick links performance on the CRT with the distinction between two types of cognitive processing, referred to by Stanovich and West (2000) as "System 1" and "System 2". Nobel Prize winner Daniel Kahneman has brought this dual-process model of decision-making (as well as the CRT itself) to the attention of the public through his internationally bestselling Thinking, Fast and Slow. 1 System 1 operates quickly and automatically, with little effort and no sense of voluntary control (Kahneman 2011, p. 20). It is what gives us the immediate, wrong answers to the CRT questions. System 2 involves slower, more deliberate and effortful thinking (2011, p. 13). It has a "supervisory function" (2011, p. 48), monitoring and controlling the thoughts and actions being 'suggested' by System 1 (2011, p. 44). Thus, if System 2 is activated in response to a CRT question, it can override System 1 to give the right answer.
Kahneman cautions us against interpreting 'systems' too literally: the terms do not describe two parts of the brain enacting distinct functions (2011, p. 29). Writing with Frederick, he clarifies that:
[The terms System 1 and System 2] may suggest the image of autonomous homunculi, but such a meaning is not intended. We use systems as a label for collections of processes that are distinguished by their speed, controllability, and the contents on which they operate. (Kahneman and Frederick 2002, p. 51)
Different test scores on the CRT are said to indicate individual differences in the way this dual-system functions. Performing poorly on the test indicates a "lazy" System 2 that relies on System 1 to do the work (2011, p. 48). In Kahneman's words, these individuals are "impulsive, impatient, and keen to receive immediate gratification" (2011, p. 48). In contrast, avoiding the intuitive incorrect answer indicates a "more active" mind (2011, p. 45). These individuals are more likely to invest the effort required to check their intuitions in other circumstances, and more likely to defer gratification (2011, p. 48).
The CRT has since become a "tremendously influential measure of reflective thinking" (Thomson and Oppenheimer 2016, p. 107) and has been utilised in dozens of research studies. At the time of writing, Frederick's paper has been cited 2835 times (Google Scholar, 5 October 2018). The dominant view remains that the CRT measures something unique (Szaszi et al. 2017, p. 207), marking it out from other cognitive tests. For example, Toplak et al. (2011, p. 1275) concluded that "the CRT was a unique predictor of performance on heuristics-and-biases tasks" and that it tracks "miserly processing" in a way no other test does. The quest for identifying correlations between CRT scores and other individual differences has continued until the present day. 2

Interpreting the CRT
The standard interpretation of the CRT has been that it tracks 'reflectivity', understood as an inclination to stop and reflect on one's intuitions. In his original paper, Frederick gives a number of reasons to support this, including that many of the participants who gave right answers had scribbles in the margin or gave verbal reports indicating that the wrong, intuitive answer was considered first (Frederick 2005, p. 27).
However, some recent studies have questioned this standard interpretation. It makes sense to assume that 'stopping and reflecting' will take additional time, yet Stupple et al. (2017) found only a weak correlation between CRT response times and accuracy. In their 'thinking aloud' study, Szaszi et al. (2017) found that a significant proportion (77%) of 'Correct Answer' respondents started their thought process with a 'Correct Start' thinking lead (as opposed to the wrong intuition being reflected upon and corrected). Moreover, 39% of 'Wrong Answer' respondents did attempt to reflect on their answers. Perhaps these latter participants suffered not from a lack of reflectivity, but from a 'mindware problem':
individuals lack the declarative knowledge and strategic rules that are needed to solve some problems. Consequently, even when individuals put considerable mental effort into the problem-solving process, the lack of this necessary knowledge can lead to thinking failures. (Szaszi et al. 2017, p. 223)
These studies raise important doubts over the standard interpretation of what the CRT tracks, and this should be kept in mind in the discussion that follows (especially Sect. 6.1.1). However, these studies are far from conclusive. Jimenez et al. (2018) also tested response times and reached the opposite conclusion to Stupple et al. (2017): "impulsive subjects complete the test quicker than reflective subjects" (Jimenez et al. 2018, p. 41). Szaszi et al.'s study was a 'thinking aloud' study, relying on participant self-reports of the thought process, which may be incomplete. Their data can be explained in ways that are consistent with the standard interpretation. For example, the 'Correct Start' participants may have considered the intuitive, wrong answer but never voiced this. 3

The CRT and gender
Frederick's original study indicated a significant gender gap in CRT score. The average score for men was 1.47 and for women was 1.03 (out of a maximum score of 3). The significance of group difference was calculated as p < 0.0001 (Frederick 2005, p. 38). Frederick suggests that the test maps "something that men have more of" and concludes that "men are more likely to reflect on their answers and less inclined to go with their intuitive responses" (2005, p. 37).
The finding that there is a gender gap in CRT score has been reliably replicated ever since (e.g. Szaszi et al. 2017, p. 216; Zhang et al. 2016; Thomson and Oppenheimer 2016, p. 106; Livengood et al. 2010), including in studies with participants from different age groups, educational levels and countries, and using the original CRT as well as some modified versions (Primi et al. 2018, p. 259). For example, a 2016 study gave typical results when it found that women are more likely than men to answer all three questions incorrectly and that the average CRT score of men is significantly higher than that of women (1.12 vs. 0.58, p < 0.001) (Cueva et al. 2016, p. 82). One 2017 study reported that "males scored 83.8% higher on the CRT" than females (Agnew 2017, p. 8). In their meta-analysis of 118 CRT studies (comprising 44,558 participants across 21 countries), Brañas-Garza et al. (2015) found a negative correlation between being female and giving correct answers to the CRT test questions. Amongst 'Wrong Answer' respondents, women are more likely than men to give the intuitive response (e.g. Frederick 2005; Cueva et al. 2016; Pennycook et al. 2016).
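To put the reported means on the same footing as the percentage figure quoted from Agnew (2017), the gap can be expressed as the amount by which the men's mean exceeds the women's mean. The following is a minimal sketch using only the means quoted in this section; the percentages it produces are derived here for illustration and are not figures reported by the studies themselves, and the helper name `percent_higher` is hypothetical. Agnew's underlying means are not given above, so that study is not included.

```python
# Mean CRT scores (out of 3) as quoted in the text: (men, women).
reported_means = {
    "Frederick 2005": (1.47, 1.03),
    "Cueva et al. 2016": (1.12, 0.58),
}


def percent_higher(men, women):
    """Percentage by which the men's mean exceeds the women's mean."""
    return 100.0 * (men - women) / women


gaps = {
    study: percent_higher(men, women)
    for study, (men, women) in reported_means.items()
}

if __name__ == "__main__":
    for study, gap in gaps.items():
        print(f"{study}: men's mean is {gap:.1f}% higher")
```

Because the women's mean sits in the denominator, the same absolute gap looks larger in relative terms when overall scores are low, which is worth bearing in mind when comparing percentage figures across samples.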

The CRT and Philosophy
The research on the gender gap in CRT score makes for some rather uncomfortable reading, especially in light of the association of the CRT with an aspect of rationality. It appears to support the stereotype that women are more guided by intuition than rationality. On the face of it, it lends credibility to the view that "rationality is masculine", a view that forms "a backdrop to common Western conceptions of gender difference that have a deep influence on everyday life" (Haslanger 2012, p. 47).
These feelings of discomfort escalate when we pair these findings with a look at how CRT scoring has been used in some recent research in experimental philosophy. In a study involving data on 4472 participants, Livengood et al. (2010) investigated the relationship between philosophical training and CRT score. They found that the mean CRT score for participants with some training in Philosophy (0.98) was more than double the mean CRT score for participants with no training in Philosophy (0.44). Further, the mean CRT score for participants with some graduate training in Philosophy (1.32) was triple the mean CRT score for those with no training in Philosophy (Livengood et al. 2010, p. 316).
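The 'more than double' and 'triple' comparisons follow directly from the three means quoted above. A small sketch of the arithmetic, with hypothetical group labels and nothing assumed beyond the quoted means:

```python
# Mean CRT scores quoted from Livengood et al. (2010, p. 316).
mean_crt = {
    "no_philosophy": 0.44,
    "some_philosophy": 0.98,
    "graduate_philosophy": 1.32,
}

baseline = mean_crt["no_philosophy"]

# Ratio of each group's mean to the mean of the no-training group.
ratios = {group: mean / baseline for group, mean in mean_crt.items()}

if __name__ == "__main__":
    for group, ratio in ratios.items():
        print(f"{group}: {ratio:.2f} x baseline")
```

The ratios are roughly 2.23 for those with some training and 3.00 for those with graduate training. Note that these are raw comparisons of group means and do not by themselves control for education level or gender.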
Those with more training in Philosophy tend to be better educated than those with no training in Philosophy, and so Livengood et al. sought to isolate 'training in Philosophy' as a factor. They found that even when controlling for relevant factors such as levels of education and gender, people with more philosophical training tend to exhibit higher CRT scores. 4 For example, out of those participants who reported having had some college education, the mean CRT score of those who had taken some Philosophy courses was nearly 70% higher than that of those who had not taken Philosophy courses (Livengood et al. 2010, p. 316).
The authors comment that their data suggests that there are some "deep commonalities" among philosophers. They hypothesise that philosophers share a "philosophical temperament"-a "cluster of dispositions that distinguishes philosophy from other intellectual endeavours" (Livengood et al. 2010, p. 318). Livengood et al. do not suggest what other dispositions might make up this 'philosophical temperament', instead focusing their discussion entirely on the "single, but important aspect of philosophical temperament" that they suppose is tracked by the CRT (2010, p. 318). This aspect is "cognitive reflectivity", understood as "a disposition to challenge one's own intuitions whenever presented with a novel problem, rather than simply relying on whatever first comes to mind" (2010, p. 314). They suggest that "philosophers are less likely to blindly accept their intuitions and more likely to submit those intuitions to scrutiny" (2010, p. 319). They conclude that their data suggests that this reflectivity is "an important facet of philosophical personality" (2010, p. 314).
More recently, Justin Sytsma (2016) has used CRT scoring to hypothesise that religious philosophers are "less analytic", perhaps explaining the alleged "poor health" of the sub-discipline of Philosophy of Religion. Sytsma implies that the CRT tracks what type of 'thinking style' you have ("analytic" versus "intuitive") and speculates that this might correlate with an ability to evaluate arguments. Putting his controversial hypotheses to one side, we can note that there are two assumptions at work here. Firstly, the CRT tracks analyticity (understood as a propensity to think analytically as opposed to intuitively). Secondly, possessing the feature tracked by the CRT contributes to 'healthy' philosophising. Both of these assumptions will be questioned in this paper.

Criticism of the idea of a 'philosophical personality'
The view taken by Livengood et al., that philosophers are in some sense 'expert-intuiters', is a version of what has become known as the 'expertise defence'. 5 Whilst some have seen philosophical expertise as lying in having better intuitions, Livengood et al.'s understanding seems to be that philosophers possess a special trait enabling them to overcome biases of judgement and to reflect appropriately on intuitions. They talk about it being part of "expert philosophizing" to employ "intuition-poking practices" (2010, p. 319), "a range of possible practices, which all have in common that they are meant to determine whether the intuition is trustworthy and should thus be endorsed" (2010, p. 318). Philosophers are "more reflective than their peers: they are less likely than their peers to embrace what seems obvious without questioning it, and they are disposed to submit to scrutiny their intuitive inclination to judge that something is the case" (2010, p. 314).
Recently, a number of empirical studies have led to the expertise defence coming under scrutiny (Tobia et al. 2013;Schulz et al. 2011;Horvath and Wiegmann 2016). Notably, Schwitzgebel and Cushman (2012)'s research suggests that philosophers are as easily trapped by unreflective intuitions as non-philosophers. Earlier studies had found that how people judge a hypothetical moral scenario is affected by the order in which these scenarios are presented. That is, moral judgements are subject to 'order effects'. Since order of presentation is a factor that seems irrelevant to the rightness or wrongness of a scenario, we would hope that philosophers would be protected against this source of bias. Yet Schwitzgebel and Cushman found that philosophers judging moral scenarios were also subject to these order effects. Moreover, the order in which scenarios were presented substantially influenced which general moral principles the philosophers then endorsed. Contra Livengood et al., this suggests that there is no distinctive personality trait of 'reflectivity' that gives philosophers a special ability to overcome biases of judgement.
It might be thought that these new findings prevent the idea of a 'philosophical personality' from getting off the ground. However, there are a number of ways of explaining Schwitzgebel and Cushman's results that leave the expertise defence intact (Rini 2015). For example, it might be that Schwitzgebel and Cushman get the results that they do for philosophers because they are forcing them to make a binary choice on a question which the philosopher believes does not have a clear 'yes' or 'no' answer. This is supported by Bourget and Chalmers's (2014) study of professional philosophers, which indicated that philosophers are disinclined to make binary judgements on moral principles similar to those asked for in Schwitzgebel and Cushman's study. Forcing a binary judgement in response to a moral scenario or principle already identified by philosophers as problematic is therefore unlikely to reveal much about ordinary philosophical practice (and correspondingly, about what philosophical expertise consists in), since ordinary philosophical practice seems to involve refraining from forming these judgments (Rini 2015, p. 445). Thus it might still be the case that philosophers, when acting qua philosophers (engaged in their ordinary philosophical practice), are particularly careful when drawing conclusions based on certain intuitions, and here lies an element of their expertise.
Footnote 5 (continued): ... simply have better intuitions than non-philosophers, and one that says that philosophers make better use of their intuitions. Note that both of these are different from how I formulate the defence here. I try to do so in a way that is consistent with the strong performance of philosophers on the CRT, which seems to be about simply discarding intuitions, rather than starting off with correct intuitions or making good use of our intuitions. For defences of philosophical expertise, see Singer (1972), Ludwig (2007), Grundmann (2010), Williamson (2007) and Williamson (2011). For a theoretical challenge to the expertise defence, see Weinberg et al. (2010).
More recently, Drożdżowicz (2018) has argued that even in light of the empirical studies, there remains room for a task-based version of the expertise defence, where philosophical expertise lies in (i) devising and discussing arguments, (ii) proposing, modifying, and refuting theories, and (iii) articulating and applying distinctions. She actually cites the Livengood et al. study as an example of one potentially fruitful way of testing whether philosophers have this kind of expertise. If philosophers have "extensive training in argumentation", which plausibly involves "evaluating one's intuitions as premises and blocking them, if needed, then it could perhaps be hypothesized that philosophers will score better in the CRT than non-philosophers…" (Drożdżowicz 2018, p. 268). Since this is precisely what was found in Livengood et al.'s study, the idea that such a disposition might be part of the philosophical personality remains somewhat plausible.

How do these findings bear on the under-representation of women in Philosophy?
The empirical research on the CRT and its relation to gender and Philosophy appears to be telling us two things: Women tend to perform worse on the CRT than men, and philosophers tend to perform better than non-philosophers. These purported facts could be interpreted as shedding light on a further fact: women are underrepresented in Philosophy. 6 Whilst most career paths and subject areas have seen a steady increase in women's participation, often to the point of equal representation or over-representation, there remains a lack of gender parity in Philosophy, comparable to the under-representation of women in 'STEM' subjects (Science, Technology, Engineering and Mathematics). A steady decline (often referred to as a 'leaky pipeline') can be seen in women's participation in Philosophy as we move 'up the stages'. For example, in the UK, a drop was seen from 46% at undergraduate level, through to 31% at PhD level, to 24% of permanent staff and 19% of professors (Beebee and Saul 2011). In the US, women make up about 30% of those earning Philosophy PhDs, far less than the average for all disciplines (Figdor and Drabek 2016). According to the Survey of Earned Doctorates in the US in 2009, Engineering, Computer Science and Physics are the only subjects where women earn fewer PhDs than in Philosophy (Healy 2011). Women are also poorly represented in the highest-ranked Philosophy journals, even when compared to the number of women working in elite universities. Sally Haslanger's survey of seven top Philosophy journals from 2002 to 2007 found that 12.4% of all authors were women (Haslanger 2008). Greeted with these three streams of research (on the CRT and gender, on the CRT and Philosophy, and on the Philosophy gender gap), one might be tempted to propose something like the following: Perhaps women are less likely to possess the aspect of the ideal philosophical personality that is tracked by the CRT, and this contributes to the gender gap in Philosophy. 
7 If this is right, then it might be thought that the current trend towards encouraging Philosophy departments to engage in affirmative action strategies is misguided. 8 Since this is a natural (perhaps 'intuitive') response to the research, I call this the 'Quick Conclusion'.

Quick Conclusion:
The CRT tracks something valuable in Philosophy, an aspect of the ideal philosophical personality, which women tend to lack. The gender gap in the CRT is therefore explanatory for the gender gap in Philosophy. Philosophy departments and wider society are therefore exonerated of the need to institute or maintain practices intended to decrease the gender gap in Philosophy.
Many readers will find this a highly unpalatable explanation for the gender gap in Philosophy. At its most crude, this view suggests that there are fewer female philosophers because women are less rational. Livengood et al. and Sytsma do not draw this conclusion (in fact, they do not discuss the implications of their findings for women and Philosophy). But there is a risk that others who encounter these findings will do, for it is hard to deny that this kind of explanation for the gender gap has at least some prima facie plausibility.
First, it is suggested by the ideas presented above: the important role that intuitions play in philosophical practice, the dominance of intuitions in discussions of philosophical expertise, and the CRT as a particularly potent measure of how people tend to respond to intuitions. 9 As has been discussed, high CRT score is said to indicate a predisposition to careful reflection rather than reliance on intuitions. Some see intuitions as the "raw data" of Philosophy, with the role of the philosopher being to rigorously analyse these intuitions (Hutchison 2013, p. 112). Kahneman talks of System 2 as sometimes acting as an "apologist" for the automatic responses provided by System 1 (2011, p. 103). Thus we might see the practice of Philosophy as hyper-exercise of System 2, in order to scrutinise, justify and in some cases, override the intuitions provided by System 1. If women are more inclined to simply go with the first intuition that pops into their heads rather than employing System 2 processes, then perhaps this amounts to being less inclined to philosophical thinking.
Footnote 7: Someone pushing a philosophical personality explanation of the gender gap would need to propose other aspects of the philosophical personality that women supposedly lack, perhaps an ability to withstand harsh criticism and aggressive questioning (Beebee 2013) or a propensity to enjoy topics that do not appear practically useful or relevant to one's life (Thompson et al. 2016). Following Livengood et al. (2010), my focus throughout this paper is solely on the one aspect of the philosophical personality apparently tracked by the CRT. There are a number of reasons for this. First and foremost, it is because the fundamental question of this paper arose when I encountered three apparent facts (the low CRT score amongst women, the high CRT score amongst philosophers, and the low representation of women in Philosophy) and had the 'Quick Conclusion' offered by my interlocutors in response to these facts. Thus the paper's main aim is to make sense of this combination of facts together. Second, the idea that a special skill relating to intuitions is central to good philosophical practice has prima facie plausibility and has been a dominant position in Philosophy, evidenced by the wide amount of discussion of the role of intuitions in Philosophy and of the idea that philosophers are 'expert-intuiters'. Third, there is a general consensus that this skill is particularly well-tracked by the CRT, and there is a vast amount of evidence, interest and literature surrounding the CRT to draw on, including in the experimental philosophy and philosophical methodology literature. Thanks to an anonymous reviewer for Synthese for pressing me on this issue.
Footnote 8: For examples of this trend, see APA (2017), BPA/SWIP (2011), and Hassoun (2017).
Second, appeals to a cognitive gap-to a cognitive trait that women tend to have less of than men and which arguably has an important role in philosophical practice-have some explanatory merit over other explanations that have dominated the literature on the gender gap, such as stereotype threat and implicit bias. 10 These other explanations require generalising from research conducted in other fields or in the laboratory, and it is not yet clear to what extent it is legitimate to extrapolate to Philosophy. According to a recent review of the research into stereotype threat and implicit bias "there is little empirical evidence of their effects within Philosophy" (Thompson 2017, p. 5). It also remains unclear why these mechanisms would have had more of an effect on women in the field of Philosophy than in other disciplines. In contrast, appealing to a cognitive gap helps to explain the distinctive situation of Philosophy. Indeed, it fits with the finding by Livengood et al. that the opposite pattern can be found in Psychology (a field where women are significantly over-represented): those with more psychological training tend to exhibit lower CRT scores (2010, p. 328, n. 10). Livengood et al. do not attempt to explain this finding, nor do they report data for other disciplines. However, a defender of the Quick Conclusion might hypothesise that whilst women trickled into Psychology as the negative effects of discrimination and stereotype threat were gradually overcome, a matching trend did not happen in Philosophy because additional obstacles were (and remain) present. One such additional obstacle might be that women tend to lack an important aspect of the personality required to engage properly in philosophical practice.
Moreover, the research by Livengood et al. removes one obstacle to pursuing cognitive gap explanations for the under-representation of women in Philosophy. It has been argued by Thompson (2017, p. 3; 10, n.5) and Lemoine (2017) that it is not worth pursuing cognitive gap explanations for the gender gap in Philosophy because we do not know which cognitive abilities are correlated with philosophical aptitude. But since the research by Livengood et al. suggests one such correlation, this particular obstacle to pursuing cognitive gap explanations is now removed.
If this explanation for the gender gap in Philosophy is accepted, it might be seen to exonerate Philosophy departments of the need to put in place much-needed strategies for promoting gender diversity. If women are simply not up to doing Philosophy, there is little point in investing time and effort into making Philosophy departments more hospitable places for women. My view is that this would be the wrong response to the empirical research, since there are many plausible interpretations of the findings that avoid this implication. In the remainder of this paper, I show how thoughtful reflection on the research points against the Quick Conclusion, towards other interpretations that would necessitate different practical responses.
In order to properly assess the Quick Conclusion, it will be helpful to disaggregate it into several different claims that are at stake. To begin with, we can note that talk of the 'ideal philosophical personality' and a trait being 'valuable in Philosophy' is ambiguous, allowing for either a descriptive or normative interpretation:
Descriptive Ideal Philosophical Personality Hypothesis (IPP D):
The CRT tracks something that is currently valued within the discipline of Philosophy, an aspect of what is (consciously or unconsciously) viewed as part of the 'ideal philosophical personality', which women tend to lack. The gender gap in the CRT is therefore explanatory for the gender gap in Philosophy. 11
Normative Ideal Philosophical Personality Hypothesis (IPP N):
The CRT tracks something that (as a matter of fact) is a valuable philosophical trait, an aspect of the ideal philosophical personality, which women tend to lack. The gender gap in the CRT is therefore explanatory for the gender gap in Philosophy.
Even if both of these claims were true, it would not necessarily result in the following, action-guiding claim that is part of the Quick Conclusion:
Inaction Conclusion:
Philosophy departments and wider society are exonerated of the need to institute or maintain practices intended to decrease the gender gap in Philosophy.
In what follows, I will give reasons to question all three of these claims. However, the only claim that we can dismiss with confidence is the Inaction Conclusion. As I will discuss below, neither IPP D nor IPP N entails the Inaction Conclusion. It is worth at the outset pointing to one reason why this is so: the gender gap in CRT may be caused by environmental factors (as opposed to it being part of the 'female nature' that there is a tendency to exhibit less of the trait(s) tracked by the CRT). If this is the case, then action is still required. This would primarily need to take place outside Philosophy departments, in wider society, in order to rectify widespread, far-reaching structural injustices that result in women's poorer performance at this cognitive skill. The reason that changes within Philosophy capture more of my attention in what follows is simply that as philosophers, there is more that we can do to make an impact within the discipline than we can in society as a whole.

Does the CRT track what it is claimed to track?
The CRT has been seen as an indicator of rationality (Toplak et al. 2011, p. 1283), reflectivity (Livengood et al. 2010; Szaszi et al. 2017, p. 208) and analyticity (Sytsma 2016; Stahl and van Prooijen 2018). These are all traits that have prima facie plausibility as part of the ideal philosophical personality. 12 Yet it is far from clear whether we can straightforwardly associate CRT performance with these traits. Two alternative possibilities for what the CRT tracks stand out in the literature: numeracy and/or confidence.
11 The 'ideal philosophical personality' in the sense of IPP D is somewhat similar to the idea of the "philosophical personality" discussed by Peña-Guzmán and Spera (2017). They see the "philosophical personality" as "the profile of the contemporary philosopher that emerges from the organization and interaction of two specific forces" (Peña-Guzmán and Spera 2017, p. 911). First, the philosopher as imago: "the figure of the professional philosopher who has succeeded by the standards established by his field" (2017, p. 914). Second, the philosopher as idea(l): the mental representation that philosophers have of 'the philosopher'. Since both philosopher as imago and philosopher as idea(l) are dictated by current sociological trends, neither term captures my normative understanding of the ideal philosophical personality (IPP N).

Numeracy
Numeracy is one's ability to store, represent and process mathematical operations (Peters 2012). It has been widely discussed how difficult it is to disentangle cognitive reflection from numeracy (Thomson and Oppenheimer 2016, p. 101). All three test questions involve numbers, lending prima facie plausibility to the suggestion that the CRT tracks numeracy. There also exists a large body of research suggesting that the CRT measures both cognitive reflection and numeracy. 13 In Frederick's original study, only one other cognitive test showed a gender difference: the SAT maths scores (Frederick 2005, p. 37). Frederick comments that "men generally score higher than women on math tests" and he cites various studies from the 1980s and 1990s to support this claim. As I will discuss below, there is now strong counter-evidence to this. However, some studies do continue to point towards gender differences in maths ability, particularly as the age of participants and the complexity of the test increase (e.g. Ganley and Vasilyeva 2014; Lindberg et al. 2010, p. 1132; Benbow et al. 2000). Primi et al. (2018, pp. 261-262) suggest that the strongest available evidence for gender differences in maths performance comes from the Programme for International Student Assessment (PISA), which assesses the competencies of 15-year-old students from 65 countries in various subjects, including Mathematics. On average across OECD countries, boys outperform girls in Mathematics by eight score points. The difference is most notable amongst the highest achieving students: the highest-scoring 10% of boys score 16 points higher than the best-performing 10% of girls (OECD 2016, p. 196). 14
If there is a numeracy gender gap, it seems plausible that this might be explanatory for the CRT gender gap. This explanation is supported by research by Thomson and Oppenheimer (2016). They piloted the 'CRT-2', a test designed to measure cognitive reflection whilst avoiding conflation with numeracy. The CRT-2 uses "trick questions" that "do not require a high degree of mathematical sophistication" (2016, p. 101). 15 Two hundred participants were tested on both the CRT and the CRT-2 and it was found that the gender gap significantly lessened on the CRT-2. Whilst men (M = 65.9% correct) significantly outperformed women (M = 36.0% correct) on the original CRT (p < 0.001), men (M = 60.5% correct) and women (M = 53.3% correct) were not reliably different on the CRT-2 (p > 0.05) (Thomson and Oppenheimer 2016, pp. 106-107). This finding is consistent with differences in numeracy being a cause of the gender gap on the original CRT.
This explanation is further supported by a recent study by Primi et al. (2018), which found that the direct effect of gender was no longer statistically significant once the variables of mathematical reasoning and maths anxiety were taken into account. Additionally, Szaszi et al. (2017) suggest that we simply cannot separate numeracy from reflectivity on the CRT, since good numeracy is likely to deliver you the right intuitions from the start. Indeed, it is notable from reading their examples of participants' vocalised thought processes that 'Correct Answer' respondents often recognised that there was an equation that needed solving in the bat and ball question (Szaszi et al. 2017, p. 218).
The research at present does not, however, lead us to a position where we can say that gender differences in the CRT can be entirely explained via gender differences in numeracy. Firstly, we should note that Thomson and Oppenheimer's CRT-2 has not gained popularity, nor is it agreed whether it tests the cognitive skill that behavioural economists and psychologists have become so interested in. As Primi et al. (2018, p. 274) point out, the correlations between the CRT-2 and various measures of rational thinking and decision-making skills were generally weaker than the correlations between these measures and the original CRT. Secondly, other studies, including Frederick's original study, claim to have controlled for numeracy and yet found that a significant gender gap remains (Frederick 2005, p. 37; Agnew 2017, p. 12). Thirdly, it is far from clear to what extent there is, in fact, a gender numeracy gap. In their meta-analysis of 242 studies published between 1990 and 2007, representing the testing of 1,286,350 people, Lindberg et al. (2010, p. 1131) conclude that "there is no longer a gender difference in mathematics performance". This is consistent with Hyde et al.'s (2008) study, which (using data from over seven million students) found no evidence of gender differences on US state math tests among students between Grade 2 and Grade 11. Where gender differences in favour of males are seen (for example, in complex problem-solving at high school level), these differences appear to be attributable to multiple possible environmental explanations (for example, that parents and teachers give higher ability estimates to boys than girls, and that patterns of interest are affected by cultural influences) (Lindberg et al. 2010, p. 1132).
This latter possibility would also help explain why gender differences in maths differ across countries, as well as the fact that these differences correlate with gender inequality measures for those countries (Else-Quest et al. 2010; Guiso et al. 2008; Penner 2008).
Nevertheless, a consensus does seem to have developed that numeracy is at least one component in performance on the CRT (Thomson and Oppenheimer 2016, p. 101; Szaszi et al. 2017, p. 207; Primi et al. 2018). What is the significance of this for explaining the gender gap in Philosophy?
It might be that the CRT tracks numeracy, and numeracy is required for success in Philosophy. 16 This fits with the high regard in which philosophers have historically held mathematics. It may be that maths skills are closely related to philosophical skills, particularly those required for Logic, which is often a compulsory component of Philosophy programmes. Evidence suggests that studying advanced mathematics develops some aspects of conditional reasoning, including the ability to reject invalid inferences (Inglis and Attridge 2016, p. 130), and so there is good reason to think that maths skills and logical skills are linked. Some have even argued that mathematical competence is crucial to good Philosophy. Boghossian and Lindsay (2016) declare that "If you want to be a good philosopher, don't rely on intuition or comfort. Study maths and science." Their reason is that "Philosophers who can think like mathematicians are better at clear thinking, and thus philosophy." However, evidence supporting this view seems rather sparse. As Thompson (2017, p. 3) says, the extent to which maths skills are required for success in Philosophy is not yet clear. Moreover, evidence of good numeracy is rarely, if ever, an entry requirement for university Philosophy programmes. 17 Given the research at present, it is unclear (i) whether women are worse at numeracy, (ii) the extent to which the CRT measures numeracy and (iii) whether numeracy is required for success in Philosophy. We therefore cannot adequately justify the conclusion that women's tendency towards a low CRT score represents low numeracy, which contributes to their low participation in Philosophy.

Confidence
Some have praised the CRT for being a "performance measure rather than a self-report measure" (Toplak et al. 2011, p. 1275), but this neglects the important effect that self-perception of one's abilities can have on performance. It may be that the CRT tracks confidence in numerical abilities rather than (or in addition to) actual cognitive abilities. Zhang et al. (2016, p. 427) found that when differences in quantitative self-efficacy (perceived fluency with numerical information) are controlled for, gender differences on the CRT disappear. They conclude that "men perform better on the CRT because they are more confident in their quantitative abilities" (2016, p. 427). This is consistent with research on maths anxiety and gender differences, which has found that females suffer more from maths anxiety than males (Else-Quest et al. 2010; Devine et al. 2012). Ganley and Vasilyeva's (2014) research suggests that females' heightened worry on maths tests utilizes their visuospatial working memory resources, leading to poorer performance. This would fit with Szaszi et al.'s (2017) suggestion (discussed in Sect. 2.1) that those answering the CRT questions incorrectly may be failing to bring to mind the strategic rules needed to solve the questions.
It also fits with the wider picture given by research on confidence, which has suggested that women tend to have lower levels of self-confidence than men. 18 We might hypothesise that pursuing Philosophy to higher levels requires a degree of confidence in one's abilities that women are less likely to possess. There is at least some prima facie reason to think that confidence contributes to successful progression within the field. For example, the level of confidence with which you deliver your question or paper, or the conviction with which you profess your conclusion, is likely to affect the way that it is received by others (see Schwitzgebel (2010) on the potential effects of "being good at seeming smart"). Additionally, effectively 'batting away' opponents requires not just intellect, but an element of performance (Larvor 2015). As Justin Weinberg (2015) comments, most graduate students in Philosophy are advised to "project confidence". Perhaps women's poorer performance on the CRT tracks their high anxiety and low confidence, and these traits affect their levels of participation and performance in Philosophy.
Footnote 16 continued: could refer to producing new, plausible ideas that take us closer to the truth, or inspiring others to engage thoughtfully in philosophical issues, or some other measure of what it means to be a successful philosopher that is not dictated by one's success in the academy. These two kinds of success could, and perhaps sometimes do, come apart. Where I wish to distinguish between these two kinds of success, I refer to the first kind of success as 'successful progression in the field' and to the second kind of success as 'good philosophising'.
17 Of course, this does not mean that it is right that maths skills are ignored as a selection criterion in Philosophy. In the UK, the largest 'drop-off' of women tends to happen between undergraduate and Masters level (BPA 2011, p. 9). One (amongst many) possible explanations of this is that some female undergraduates find that they are just not 'up to it', because of reasons linked with their poorer numeracy.
However, a concern with this line of reasoning is that Zhang et al.'s study, like many others, does not account for the possibility that people's beliefs about ability are accurate (Lemoine 2017;Jussim 2012). That is, the self-report measure of quantitative self-efficacy may track numeracy, because the people that lack confidence in their quantitative abilities do so because they are, as a matter of fact, less competent at numeracy. This is consistent with research by Primi et al. (2018, p. 273), which found a direct link between maths anxiety and cognitive reflection, but found that the effect of maths anxiety on cognitive reflection was partially mediated by mathematical reasoning.
If quantitative self-efficacy is strongly linked with actual mathematical ability, then we are back to our unanswered question of whether numeracy is relevant to success in Philosophy.

Implications
The research discussed in this section does not point to clear conclusions about what the CRT tracks. Nor is it clear what the relevance to explaining the gender gap in Philosophy would be. However, it does suggest that we should, at the least, be sceptical about a straightforward equating of CRT score with rationality, reflectivity or analyticity. It therefore attacks a version of IPP that suggests that it is a lack of these particular traits that holds women back in Philosophy.
The discussion so far has not attempted to deny that there may be traits that women tend to lack which might help explain the gender gap in Philosophy. Rather, it has explored the possibility that the CRT tracks numeracy or confidence. The absence of relevant empirical research on the roles that numeracy and confidence play in Philosophy means that we are unable to say to what extent these attributes are currently valued in Philosophy and whether they contribute to successful progression in the field as it stands. There is, however, some anecdotal evidence suggesting that confidence might contribute to successful progression in the field, lending at least some limited support to IPP D.

Is the trait tracked by the CRT something we should value in philosophers?
There is clearly a question mark over what the CRT tracks. But whatever it tracks, this is something that women tend to have less of than men and philosophers tend to have in abundance. So, we can raise a second question asking why we should think that the CRT tracks something that we should value in philosophers. That is, even if IPP D is true, why should we think that IPP N is true?
The idea that the CRT tracks something we should value in philosophers seems to be assumed by Livengood et al. When talking of the 'philosophical personality', the authors say that they seek only to describe "who philosophers are" (2010, p. 314). But at points they slip from this descriptive exercise by implicitly adopting the normative assumption that they have identified a philosophical virtue. For example, they imply that what the CRT tracks is part of the expertise of philosophers (2010, pp. 319, 320).
But who philosophers are and who philosophers should be are different questions. The fact that some norm exists amongst philosophers which correlates with their good performance on CRTs does not, in itself, tell us that this trait is an asset to philosophising. Imagine that there was evidence suggesting that philosophers are more likely to exhibit social awkwardness than non-philosophers. It would be wrong to conclude from this research that social awkwardness is part of the ideal philosophical personality (even in the sense of IPP D, for this trait might appear accidentally, rather than being consciously or unconsciously selected for). Rather, this trait is irrelevant (or even detrimental) to good philosophising.
Similarly, we might generalise from the finding about CRT tracking quantitative self-efficacy to say that philosophers have a tendency to be more confident about their cognitive abilities. But again, this attribute does not necessarily make for better philosophising. The philosophers discussed in the previous section who have offered anecdotal support for the role of confidence in Philosophy have tended to see this as a flaw in current philosophical practice, a mark of a deep methodological problem with the way that Philosophy currently operates (Larvor 2015). Indeed, one might even think that those with lower confidence actually make for better philosophers, because they may be more open to counter-arguments. When evaluating IPP N, the salient question in assessing the relevance of the CRT should be whether whatever it tracks is an epistemologically relevant trait, one that we should value as conducive to the pursuit of knowledge (or whatever we see as the aim of Philosophy).
This idea that certain traits might be dominant in Philosophy without necessarily being conducive to good philosophising becomes more plausible when we consider the flaws in the supposedly meritocratic system used to select philosophers (onto courses, and into posts). It has been well-discussed that meritocratic selection may be subject to biases at the level of deciding whether a candidate fulfils certain criteria. 19 But it may also be that there is bias present in deciding what these criteria are. 20 The 'success criteria' of what it is to be a good philosopher are (at least partially) decided by those already successful in the discipline, so that the norms and values of these individuals are reproduced in those selected, in a kind of feedback loop (Jenkins 2013). For example, Haslanger (2008, p. 217) and others have expressed concern over the dominance of a hyper-rational norm in Philosophy, which is often taken to represent the high-end of the discipline, but which may not necessarily contribute towards good philosophising.
So, it may be that the CRT tracks trait T, and those possessing T are (intentionally or unintentionally) more likely to be recruited to Philosophy. But this does not, in itself, tell us that T is important for good philosophising. This 'irrelevant trait hypothesis' resists the move from IPP D to IPP N, as it suggests that although the trait(s) tracked by the CRT may be part of the philosophical personality, this does not mean that they are part of the ideal philosophical personality. It suggests that the CRT tracks a trait that is not relevant to good philosophising, but either (1) just so happens to be well-represented in philosophers, despite not being selected for (as in the social awkwardness example) or (2) is unconsciously or consciously selected for because it is mistakenly thought to be part of the ideal philosophical personality (as in the confidence and hyper-rationality examples). If this were the case, then we certainly should not settle for the Inaction Conclusion. Rather, we should seek changes to the status quo in the discipline, such as re-evaluation of the criteria used when assessing applicants for Philosophy jobs.
This response has flagged that there is an open question as to whether we should be valuing whatever it is that the CRT tracks. But there are difficulties with pursuing this 'irrelevant trait hypothesis'. Though there may be scope for debate over the purposes and methodology of the discipline, there is also wide agreement that Philosophy aims at the truth. The person who does badly in the CRT gets the wrong answers, and philosophers are after right answers. Moreover, as has been discussed, it seems plausible to say that it is part of good philosophising to engage in careful reflection over one's intuitions, and to be especially immune to biases of judgement. We therefore might not want to press too hard with the idea that there is nothing of value in what is tested by the CRT.

How important is this trait to good philosophising?
We might concede that the CRT tracks something of value, but argue that it is only one small part of the cognitive skills that contribute to good philosophising.
Imagine a test used to assess physical fitness for the military that has press-ups as the key element. Since women tend to have lower levels of arm strength than men, they might find it harder to pass this test. But it would be wrong to conclude that the women who fail this test are 'physically unfit'. Arm strength is just one small part of physical fitness; core strength and endurance also have an important role. In the same way, we might allow that the CRT tracks one aspect of rationality that women tend to have less of than men, but without drawing any conclusions about overall levels of rationality.
20 Studies suggest that people alter what criteria they say are relevant for a particular job according to the characteristics of the person that they want to hire (Uhlmann and Cohen 2005; Luzadis et al. 2008). On the inherent difficulties with neutrally assessing merit in specific domains, see Crosby et al. (2003), Kane (1998), and Cicchetti (1994).
In our military fitness example, the important practical question is whether a certain level of arm strength is required for success in the military. Analogously, the salient question for us is whether the aspect of rationality potentially tracked by the CRT is an essential element in good philosophising. There seem to be good reasons to think that it is not, and rather, that the type of reflectivity tracked by the CRT is only one, fairly minor skill utilised by philosophers. It might make for a good start to one's philosophical project to begin with sound thoughts that have already been subject to some System 2 scrutiny, but it seems that the bulk of philosophical work comes later.
Consider how Livengood et al. set the scene for explaining the aspect of the philosophical personality that they are interested in:
"An intuition is a spontaneous intellectual sensation: p seems to be true without being consciously inferred. In considering the first question of the CRT, for example, it intuitively seems that the answer must be 10 cents. Similarly, in the Gettier case, it intuitively seems that the agent does not have any knowledge…" (Livengood et al. 2010, p. 318)
There seems something odd about this analogy. In the CRT, intuition delivers the wrong answer, and getting the right answer requires overriding intuition (rather than making use of it). In the Gettier case, we have an intuition which then becomes the subject of further philosophical exploration. By presenting thought experiments invoking certain intuitions, Gettier's (1963) paper far from closed the question of whether knowledge is justified true belief. Rather, it was the starting point of an ongoing philosophical project. Further philosophical work has consisted in: (i) suggesting additional conditions that might be added in order to avoid Gettier cases, such as the 'no false lemmas condition' (e.g. Armstrong 1973, p. 152; Clark 1963), (ii) engaging in further thought experiments to question the conditions for knowledge (e.g. Goldman's (1976) 'fake barn' cases), (iii) making distinctions within 'justification' and exploring what it takes for a belief to be justified (e.g. Feldman and Conee 2001), and (iv) suggesting alternative accounts of what constitutes knowledge, such as reliabilism (e.g. Nozick 1981). If it makes sense to talk of 'getting the right answer' to a Gettier case, arriving at this 'right answer' when the case is first presented seems to be only a small and insignificant part of the process, and it is not clear to what extent getting the answer wrong at the start would be damaging to the long-term philosophical project. 21 Philosophers have far more than the few seconds or minutes spent on the CRT questions to properly evaluate Gettier cases and come to a judgement on what knowledge really consists in. Thus although there might be something in the reflectivity that is tested in the CRT, it seems like there is another, broader type of reflectivity that is of more value and importance to the long-term philosophical project, perhaps one involving an indefatigable pursuit of answers, even where these are particularly hard to find. 22
The above discussion has given just one reason to question the relative value of the trait(s) tracked by the CRT compared to other traits that are potentially part of the ideal philosophical personality. Given the precise nature of the CRT questions, set against the range of virtues and skills that we might plausibly postulate as part of the ideal philosophical personality, my view is that we probably need not hang too much on whatever the CRT tracks. Not all philosophers perform well on the CRT and so it is, at the least, possible to successfully progress in the field whilst lacking this particular skill. And even if the trait that the CRT tracks contributes to good philosophising, it is far from clear that this trait is essential to good philosophising and therefore to the ideal philosophical personality in the sense of IPP N.
Moreover, it could be that there is a correlation between possessing above-average levels of analyticity (or whatever we suppose it is that the CRT tracks) and lacking other skills that are valued amongst philosophers, such as creativity. This is purely speculative, but it is conceivable that a high CRT score comes at the expense of other virtues that we need more of in Philosophy. 23 Kahneman says that "absence of bias is not always what matters most" (2011, p. 192), and this surely applies to Philosophy. It could be that relief from the constraints of analyticity allows for more creative thinking, increasing the likelihood of hitting upon unusual, divergent ideas. If this were true, low CRT score should not be viewed as indicative of a poor philosopher.
Reflecting back on the military fitness example may be helpful here. Let us say that (1) men have more arm strength than women, (2) military personnel have more arm strength than those outside the military, (3) there are more men than women in the military and (4) there is a good prima facie case for thinking that arm strength contributes to doing your military service well. This state of affairs is perfectly consistent with there being other attributes that contribute to success in the military that women have more of than men (for example, endurance or emotional literacy). If this were the case, in addition to checking that entry tests for the military are not overly-focused on arm strength, it would also be important to look at how other factors such as discrimination and unconscious bias might be contributing to the under-representation of women.
Applying the same reasoning to our case: even if it is true that the trait T tracked by the CRT is currently valued amongst philosophers (i.e. there is some truth to IPP D), and even if possessing T does, as a matter of fact, make some contribution to good philosophising (i.e. there is some truth to IPP N), it would still be wrong to think that the empirical research entirely explains the gender gap in Philosophy. Since T is one amongst many possible philosophical virtues and skills, women tending to exhibit less of T should not be having such a dramatic effect as to produce the wide gender gap we see in Philosophy. If that is not the case, and T is in fact a significant factor because possessing it is wrongly being prioritised as an important selection criterion, we might speculate that this is detrimental to the discipline of Philosophy, since prioritisation of T might come at the expense of other valuable philosophical virtues and skills. Regardless of the truth of this last hypothesis, the Inaction Conclusion would be unjustified. Rather, as in the military case, we should turn a critical eye to entry criteria, as well as onto whether there are other obstacles to women's participation such as discrimination and unconscious bias.
22 The ideas in this paragraph are heavily influenced by Emily Perry's excellent response to this paper at the London-Berkeley Graduate Conference 2018.
23 The idea that diversity amongst participants will benefit the discipline itself has been argued for in relation to other disciplines, particularly Science (Rubin and O'Connor 2018; Harding 1991). Rubin and O'Connor (2018) outline the potential benefits of diverse collaborations in Science, pointing to research by Zollman (2010) suggesting that a diversity of beliefs within an epistemic community is key to ensuring that the group eventually arrives at true beliefs. Moreover, Page (2008) and Phillips et al. (2006) have found that a diversity of perspectives can aid complex problem-solving, as well as creative work. It seems plausible that these arguments can be extended to Philosophy, in order to say that a diversity of philosophers may positively influence the methodology, content, outcomes and practice of the discipline. For examples of this kind of argument applied to Philosophy, see Wylie (2011). For criticisms of using this kind of argument to justify affirmative action, see Anderson (2002).

How should we understand the causal story?
Lastly, and importantly, we should note that the direction of causation has not been established between CRT scores and philosophical training. We cannot say whether it is philosophical training that leads to the increased CRT score amongst philosophers, or whether possessing the trait(s) tracked by the CRT to a high degree leads people to undergo more philosophical training. 24 At least three possibilities explain the current data. Firstly, it may be that people with a higher CRT score are more likely to take up further philosophical training (Fig. 1). 25 Since women tend to have a lower CRT score, fewer women continue in Philosophy.
Secondly, it might be that the two facts are independent and we should draw no conclusions from the gender gap in CRT score and the increased CRT score of philosophers (Fig. 2).
However, given all that has been said so far, it seems unlikely that these facts are entirely independent. Imagining our social awkwardness example to be true, we would probably want to posit at least some causal relation between the phenomena. For example, we might hypothesise that you need to be clever to be a philosopher and being clever makes it harder to talk to other people. Analogously, there is likely to be some causal story that can be told between the gender gap in CRT score and the gender gap in Philosophy.
Thirdly, it could be that practising Philosophy brings up your CRT score, but fewer women are continuing with Philosophy (for reasons unrelated to CRT score) (Fig. 3). Women may be put off staying in Philosophy by contingent features of the discipline in its present state, features that are amenable to change by the actions of university faculties. 26 If this were the case, it would be an injustice that we should actively seek to rectify, for women would be missing out on opportunities to develop their capacities in whatever it is that the CRT tracks.
This third explanation denies both IPP D and IPP N, since it denies that the gender gap in the CRT is explanatory for the gender gap in Philosophy. It says that although the CRT may track a trait that is part of the philosophical personality, it is the study of Philosophy that nurtures this trait, and so we must look elsewhere for explanations of why women tend not to continue studying Philosophy beyond their tendency to have a lower CRT score.
However, given the small number of participants in Philosophy, clearly the gender gap in Philosophy cannot account for the gender gap in CRT score on its own. If the hypothesis that the direction of causation runs this way is to be at all plausible, we would need to speculate that Philosophy is one of a number of disciplines or activities that improve CRT score and which men are more likely to engage in than women (Fig. 4).
The causal story behind the gender gap in Philosophy is likely to be far more complex than is allowed by any of the possibilities discussed so far. An intelligent supporter of the Ideal Philosophical Personality Hypothesis would not claim that the only cause of the gender gap in Philosophy is that women lack the aspect of philosophical personality tracked by the CRT. A 'perfect storm' explanation of the gender gap seems more plausible, where many factors combine to produce the dramatic gender gap we see in Philosophy (Antony 2012). Figure 5 illustrates this with some hypothetical (but plausible) examples of other causal factors.
The interesting question is then whether the tendency for women to exhibit less of the trait(s) tracked by the CRT is one cause amongst many. Where a factor F is one cause amongst several (mutually independent) causes, we should be able to vary the other causes without this leading to a change in F. And yet, it is not clear that this would be the case here. For example, it seems plausible that were we to vary one of the social norms that contribute to the gender gap in Philosophy, this would also lead to a change in the CRT gender differential. In that case, this social norm would be a common cause of both the CRT gender differential and the gender gap in Philosophy, and the tendency for women to exhibit less of the trait(s) tracked by the CRT would not be a mutually independent cause of the gender gap in Philosophy.
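The contrast between a mutually independent cause and a common cause can be made concrete with a toy model. The following is only an illustrative sketch: the function names, the linear form and every coefficient are hypothetical assumptions chosen for readability, not estimates drawn from any of the research discussed. It shows that, under the independent-causes picture, weakening a social norm leaves the CRT gender differential untouched, whereas under the common-cause picture the same intervention shrinks both the CRT differential and the gender gap in Philosophy.

```python
# Toy causal model; all names and numbers are hypothetical illustrations.

BASELINE_CRT_GAP = 0.5  # assumed CRT gender differential when the norm plays no role


def crt_gap(norm, norm_affects_crt):
    """CRT gender differential for a given strength of the social norm (0 to 1)."""
    if norm_affects_crt:
        # Common-cause picture: the norm itself produces part of the CRT gap.
        return 0.5 * norm
    # Independent-causes picture: the CRT gap does not depend on the norm.
    return BASELINE_CRT_GAP


def philosophy_gap(norm, norm_affects_crt):
    """Gender gap in Philosophy: a direct effect of the norm plus an effect of the CRT gap."""
    return 0.3 * norm + 0.4 * crt_gap(norm, norm_affects_crt)


if __name__ == "__main__":
    for label, flag in [("independent causes", False), ("common cause", True)]:
        for norm in (1.0, 0.5):  # intervene by halving the strength of the norm
            print(
                f"{label}: norm={norm:.1f}, "
                f"CRT gap={crt_gap(norm, flag):.2f}, "
                f"Philosophy gap={philosophy_gap(norm, flag):.2f}"
            )
```

On this sketch, the test for independence described above amounts to checking whether `crt_gap` changes when `norm` is varied. If it does, the CRT differential is not a mutually independent cause of the gender gap in Philosophy.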

[Figure: causal diagram; recoverable node labels: "Fewer women in Philosophy", "Lower self-confidence"]
To make this thought more concrete, we can take as an example the stereotype that women are more intuitive and less logical. This stereotype might make women less likely to imagine themselves as philosophers, with the consequence that they are less likely to continue in the discipline (Demarest et al. 2017). In that case, this stereotype is a causal factor in the gender gap in Philosophy. But the stereotype might also contribute by a more indirect route. For example, the stereotype may have the effect that adults are less likely to give girls toys that develop logic (Oksman 2016), with the consequence that girls have fewer opportunities to develop skills at whatever the CRT tracks. In that case, the stereotype acts as a causal factor in the CRT gender differential, which then feeds into women appearing less likely to have the 'philosophical personality' and there being fewer women in Philosophy. The gender gap in Philosophy, as well as the poorer performance of women on tasks like the CRT, would then provide further evidence for the stereotype. So, the stereotype would be causing several environmental interventions, which have effects that validate the stereotype (Fig. 6). 27 Stories like this, where there is a kind of causal feedback loop operating between different factors, seem plausible. To endorse this particular story would be to endorse a version of the Ideal Philosophical Personality Hypothesis, since it allows that there are fewer women in Philosophy partially as a result of women tending to lack an aspect of the philosophical personality. But it is a story that points against the Inaction Conclusion, because it blames women's low CRT score not on innate differences in aptitude, but on contingent structural norms and cultural practices that would lessen or disappear in a fairer, more equal society.
An implication of this is that tackling the structural injustice leading to the gender gap in CRT score would require far more than simply making changes within the discipline of Philosophy.
Whatever we think of that story, it should at least be clear that a straightforward causal arrow from CRT aptitude to the gender gap in Philosophy is highly implausible. The causal story is likely to be far more complex than any of the initial hypotheses allowed.

Conclusion
We have seen that there are several routes by which we can argue against the Quick Conclusion. Firstly, we can dispute whether the CRT tracks what it is claimed it tracks. Secondly, we can question whether the trait tracked by the CRT is something we should value in philosophers. Thirdly, even if we allow that the CRT does track a valuable trait, we can question how important this is when compared to all the other traits that contribute to good philosophising. Lastly, we should question the implausibly simplistic causal story that crude versions of the Ideal Philosophical Personality Hypothesis imply.
The empirical research in this area is still in its infancy and there remain many unanswered questions. It is unclear exactly what the CRT tracks, and the extent to which whatever it tracks is currently selected for when recruiting philosophers onto courses and into posts. It is therefore impossible to draw firm conclusions about the truth of IPP_D. However, if IPP_D is true, it is only plausible to claim that the gender gap in the CRT is somewhat explanatory for the gender gap in Philosophy. Women exhibiting less of the trait tracked by the CRT will be one amongst many factors, and it is likely that there will be a number of interaction effects between these causal factors, resulting in a complex causal story where causal connections run in multiple directions.
A plausible case can be mounted against IPP_N, particularly when we think about the range of philosophical virtues and skills that plausibly might constitute the ideal form of the philosophical personality. This, however, is a matter for debate; there is, at the least, a prima facie case for thinking that the skill tracked by the CRT has at least some value for good philosophising (though this may be offset if it comes at the expense of other, valuable traits). What we can say with confidence is that the interpretation of the empirical research which says that there are fewer women in Philosophy because women naturally lack the personalities required to be good philosophers is unconvincing.
Rather than endorsing one particular response, the intention of this paper is to open up discussion of how best to make sense of the research as it currently stands, and to prompt reflection on what practical responses are appropriate in light of the different hypotheses. For example, if it is right that poor CRT performance is an indicator of low confidence, then this would add urgency to the already growing calls to find ways of increasing self-confidence amongst women. Even simple interventions, such as giving more explicit encouragement to undergraduates (Saul 2013, p. 51) or emphasising the importance of effort rather than 'brilliance' (Thompson et al. 2016), might partially stem the flow out of Philosophy's leaky pipeline. 28 Although not all the routes discussed have dismissed the part of the Quick Conclusion that claims that the gender gap in the CRT is explanatory for the gender gap in Philosophy, all routes imply that it would be the wrong response to the empirical findings for Philosophy departments to simply relax and take no action aimed at narrowing the gender gap in Philosophy. Instead, the discussion points towards the view that making relevant structural changes to the environment (both inside and outside of Philosophy) should remain our focus when thinking about the gender gap in Philosophy.