1 Introduction

The expertise defense is a common reply to concerns about the legitimacy of appeals to intuition in philosophical arguments. It asserts that we have good reason to expect research conducted on ordinary people not to generalize to professional philosophers. The reason is that, given their training, professional philosophers tend to make more reliable judgments about philosophically puzzling cases than the folk. This is plausible as far as it goes. However, even if there are differences in intuitions between philosophers and laypeople, it is possible that they do not arise from differences in expertise. For one thing, people who enroll to study philosophy may have atypical intuitions to begin with. For another, philosophy students whose intuitions are at odds with the community-wide consensus may either try to conform, or else withdraw from the philosophy program altogether.

In this paper, we present the results of a controlled longitudinal study designed to shed light on how the putative cognitive skills invoked by the expertise defense develop over the course of a philosophical education. We begin by sketching the two standard challenges to the method of cases that the expertise defense is intended to address (Sect. 2). Both are based on experimental philosophy findings which indicate that ordinary people’s judgments about philosophically puzzling cases depend to a significant extent on epistemically irrelevant factors. We then present the expertise defense, taking care to reconstruct its assumptions as to the nature of the cognitive skills it attributes to professional philosophers and the likely time at which they develop. We also characterize three hypothetical models of the scope of philosophical expertise that are amenable to empirical investigation. Next, in Sect. 3, we review existing research aimed at testing the expertise defense. All of this research follows a cross-sectional design, which has certain limitations. We argue that a longitudinal design is well suited to address some of them. The main part of the article (Sect. 4) presents the results of our study, and Sect. 5 discusses their philosophical implications.

2 The expertise defense as a reply to the Diversity Challenge and the Questionable Evidence Challenge

Empirical findings have been interpreted as raising two kinds of concerns about using judgments of philosophically puzzling cases as premises in philosophical arguments. The Diversity Challenge (as it is called by Mortensen & Nagel, 2016) appeals to correlational studies indicating that such judgments depend on a variety of demographic variables (Machery, 2017). For example, intuitions about the reference of proper names and natural kind terms differ across cultures (Machery et al., 2004; Beebe & Undercoffer, 2015, 2016; Machery et al., 2010; Machery, 2017; see Dongen et al., 2021 for a meta-analysis), young people are more likely to ascribe knowledge in Fake Barns scenarios than older ones (Colaço et al., 2014), and judgments about free will and moral responsibility vary according to where the person is on the introversion–extraversion dimension (Feltz & Cokely, 2016, 2019).

Findings of this kind are disconcerting because not only are demographic variables epistemically irrelevant, but they are also unchangeable. It is impossible for an extravert to become an introvert, or for someone raised in India to be raised in the US. Nor should such accidental features matter to what kind of philosophical views one espouses. The upshot is that, when a philosophical intuition of one demographic group conflicts with that of another, it makes little sense to say that either one is mistaken. This obviously complicates conceptual analysis, suggesting that different groups disagree on philosophically puzzling cases by virtue of having different concepts. It also frustrates attempts at exploiting philosophical intuitions to discover the nature of causation, knowledge, free will, etc., as these attempts assume that case judgments are objectively right or wrong (Machery, 2017; Stich & Tobia, 2016).

The Questionable Evidence Challenge, by contrast, appeals to research into effects of manipulable variables, such as the ordering and framing of the stories and questions used in the surveys (again, we borrow the name of the challenge from Mortensen & Nagel, 2016). Some experimental evidence indicates, for example, that people are more likely to ascribe knowledge in the Truetemp case when it is preceded by a clear lack-of-knowledge case than by a clear knowledge case (Swain et al., 2008; Wright, 2010, but see failed replication attempts in Ziółkowski, 2021 and Ziółkowski et al., 2023) and attribute responsibility, free will (Nichols & Knobe, 2007) and causal bases of behavior (Kim et al., 2017) depending on whether the question is asked in abstract or concrete terms. The concern here is that, being susceptible to such cognitive distortions, judgments about cases are not as trustworthy as many philosophers take them to be. Because it is possible to separate the influence of manipulable variables on case judgments from what may be regarded as unadulterated philosophical intuitions, evidence of this kind need not automatically threaten objectivist projects undermined by the Diversity Challenge, but it does suggest caution in pursuing them.

The strength of the Diversity Challenge is the subject of an ongoing debate. As Mortensen and Nagel (2016) point out, many of the demographic effects reported in the older literature have not been detected by subsequent studies, and Knobe (2019) even goes as far as to claim that the robustness of a wide range of philosophical intuitions, as indicated by 30 studies performed on a total of 12,696 subjects, suggests that the intuitions are innate. Yet Stich and Machery (2023) take a very different view of the literature, citing 100 studies, done on a total sample of over 40 million participants, that report differences in philosophical intuitions between various populations.

The expertise defense—a term coined by Weinberg et al. (2010)—sidesteps much of this debate because a vast majority of the findings under discussion concern the philosophical intuitions of the folk. This creates a problem for both challenges, for according to the proponents of the expertise defense, data about case judgments collected from the general population cannot undermine any aspect of philosophical research practices because philosophers have mastered a set of cognitive skills that enable them to outperform ordinary people on tasks involved in doing philosophy. Since construction and evaluation of thought experiments are arguably tasks involved in doing philosophy, it stands to reason that philosophical expertise, fostered by formal training and professional experience, encompasses some cognitive competences that improve performance on those tasks.

While it is not exactly clear what those competences are, they have been suggested to include sensitivity to the structure of philosophical concepts (Ludwig, 2007), and the abilities to closely analyze philosophical texts, construct and evaluate arguments, apply general concepts to specific situations with an eye to relevant detail (Williamson, 2011), and use the tools of formal logic (Weinberg et al., 2010, p. 335). Importantly, the cognitive skills posited by the expertise defense are by no means mysterious or occult. They are much like other familiar kinds of expertise, exhibited by, say, lawyers (Williamson, 2007), physicists (Hales, 2006) and mathematicians (Ludwig, 2007), who all cope better than the untutored person with problems representative of their domains.

2.1 When philosophical expertise is formed

In psychological research, the term “expertise” denotes the ability to consistently deliver outstanding performance in a given domain. Shanteau (1992, pp. 255–256), a prominent expertise researcher, writes “a naïve decision maker has little or no skill in making decisions in a specific area. For example, graduate students generally are naïve about the kinds of decisions made by experts. Novices are intermediate in skill and knowledge; they frequently have studied for years and may even work at subexpert levels.... Typically, advanced (graduate students) are novices in making skilled decisions.”

By contrast, proponents of the expertise defense tend to assume philosophical expertise to arise much earlier than it would if it were defined within psychology. Williamson (2007, p. 191) says that “philosophy students have to learn how to apply general concepts to specific examples with careful attention to the relevant subtleties, just as law students have to learn how to analyze hypothetical cases. Levels of disagreement over thought experiments seem to be significantly lower among fully trained philosophers than among novices.” In a later paper, he asks rhetorically: “But who ever claimed that the difference in skill at thought experimentation between a professional philosopher and an undergraduate is as dramatic as the difference in skill at chess between a grandmaster and a beginner?” (Williamson, 2011, p. 224).

The reason for this discrepancy is that the expertise defense imposes a special constraint on the cognitive skills involved in making case judgments: that they are sufficiently developed in most members of the philosophical community for appeals to intuition in philosophical arguments to be effective. To put this the other way around, if only a minority of philosophers had the cognitive skills necessary to make credible case judgments, philosophical expertise could not account for community-wide agreement about those judgments.

If, however, philosophical expertise were defined as the set of cognitive skills that enable outstanding performance on philosophical tasks, including tasks involved in thought experimentation, then expert philosophers, thus understood, would not be sufficiently numerous. People of outstanding ability are simply in the minority regardless of the domain, and philosophy is no exception.

A second problem is that, given psychological evidence, it is doubtful whether philosophical training gives rise to genuine expertise, as it is understood in psychology. Weinberg et al. (2010) point out that while training in some domains, including chess, mathematics, physics and meteorology, clearly gives rise to genuine expertise, there are also areas where no amount of experience seems to improve performance—for example, clinical psychology, psychiatry, polygraph testing, and stock brokerage (see, e.g., Shanteau, 1992, p. 258). The crucial difference is that domains where there is genuine expertise rely on well-developed training regimens characterized, among other things, by the availability of large amounts of clear and reliable feedback. Since training in thought experimentation does not rely on such a training regimen, it is unlikely to give rise to genuine expertise.

Given the constraint imposed on the notion of philosophical expertise, it is reasonable to resist the temptation to adapt the standard psychological notion of expertise to the skill of thought experimentation. Accordingly, it is reasonable to suppose that philosophical expertise, in the appropriate sense, develops already during philosophical studies, rather than, say, within the first 10 years after the person has obtained a PhD in philosophy. This, as we shall see, is the approach taken by existing experimental studies into the impact of philosophical expertise on case judgments.

2.2 The likely scope of philosophical expertise

A consequence of the decision to define philosophical expertise in terms of acceptable performance is that one has to be cautious about exploiting psychological expertise research when considering the question of the likely scope of the cognitive skills involved in making case judgments. This means that, at this stage, we know very little indeed about the likely mechanisms underlying case judgment evaluation as well as their domains of operation. The situation is not hopeless, though. It is reasonable to suppose that the putative cognitive abilities developed through philosophical training are more or less restricted to a domain. The only problem is that the models of philosophical expertise that we can propose are tentative and somewhat speculative.

We propose to distinguish three distinct possibilities. First, formal instruction in philosophy may enable students to master a set of skills whose exercise affects case judgments in all areas of philosophy. We may provisionally identify these putative skills with the method of philosophical thought experimentation and suppose that students of philosophy become increasingly adept at making relevant intuitive judgments because, in the course of their education, they encounter and engage with many thought experiments. On this Method Model of Expertise, we would expect all case judgments to vary together with the level of competence in appraising thought experiments.

Second, the relevant cognitive skills developed by virtue of studying philosophy may be specific to a subfield. This kind of competence may arise from learning to deploy appropriate theories or having developed domain-specific conceptual schemata. If this Subfield Model is accurate, then we would expect training in a given subfield of philosophy, such as epistemology or ethics, to affect all judgments about cases relevant to that subfield, without necessarily influencing judgments in other subfields.

Third, the cognitive skills making up philosophical expertise may be even more specific than that, perhaps being restricted to only one concept or even part of a concept. If this Restricted Expertise Model is accurate, then we would expect case judgments to change piecemeal as the person gradually acquires a rich mental representation of the structure of a particular concept.

3 Testing the expertise defense

One way to find out if philosophical training enhances the capacity to make judgments of philosophically puzzling cases is to compare the responses of philosophers and non-philosophers. The expertise defense would be blocked if the responses in both groups were the same.

According to available data, they are not. It has been reported that philosophers’ intuitions about phenomenal consciousness do not coincide with those of ordinary people: while philosophers tend to treat diverse experiences such as feeling pain and seeing red as belonging to a single class of phenomenal mental states, the folk tend to distinguish between mental states that essentially have a valence (e.g. feeling pain), those that do not have a valence (e.g. seeing red) and those that have both a valence and a perceptual component (e.g. smelling bananas) (Sytsma & Machery, 2010). Similarly, according to Machery (2012), although most people regardless of education are Kripkeans about the reference of proper names, the proportions of the causal–historical vs. descriptivist case judgments vary depending on background. Philosophers of language and semanticists have been found to have more Kripkean intuitions than comparably educated laypeople, whereas the judgments of linguists specializing in discourse analysis, historical linguistics, anthropological linguistics and sociolinguistics are more descriptivist. Differences associated with education have also been discovered in the area of knowledge attribution. According to Starmans and Friedman (2020), subjects holding a PhD in philosophy are less likely than non-philosophy academics or laypeople to attribute knowledge in some Gettier-style scenarios than in standard true justified belief situations. Knowledge attributions made by professional philosophers are also less sensitive to skeptical pressure than those made by either other academics or laypeople. Lastly, non-philosophy academics exhibit more skepticism about knowledge attributions than do philosophers and laypeople.

Although such findings serve to keep the expertise defense in the game, they are also consistent with the claim that philosophers’ intuitions are in fact no better than those of the folk. Consequently, studies detecting a significant difference between philosophers and ordinary subjects are often followed up with an investigation into susceptibility to various forms of bias.

The only study focused on a demographic variable we know of speaks against the expertise defense: compatibilist intuitions of professional philosophers, like those of the folk (Feltz & Cokely, 2016), are positively correlated with extraversion (Schulz et al., 2011). As for research into the influence of manipulable variables on case judgments, a majority of studies so far have focused on ethics. Most of their findings also undermine the expertise defense. Professional philosophers engaged in moral reasoning have been found to be affected by persistent ordering effects (Schwitzgebel & Cushman, 2012, 2015), the cleanliness bias (Tobia et al., 2013a, 2013b), the “Asian disease” framing bias (Horvath & Wiegmann, 2022; Schwitzgebel & Cushman, 2015), and the actor–observer bias (Tobia et al., 2013a, 2013b), though it must be noted that this last effect did not replicate (Horvath & Wiegmann, 2022). Other empirical results that weaken the expertise defense indicate that philosophers are subject to the status quo bias in experience machine scenarios (Löhr, 2019) and are as susceptible to certain modal illusions as other academics except mathematicians (Kilov & Hendy, 2022). But there are also data that confirm a positive influence of philosophical training on intuition: philosophers gave more consistent responses to different versions of the experience machine than laypeople (Löhr, 2019), and ethicists, unlike the folk, were unaffected by question-focus bias (Horvath & Wiegmann, 2022). Furthermore, subjects with a PhD in philosophy outperformed laypeople at identifying information relevant to judgments elicited by thought experiments modeled on Gettier cases, the Chinese room, Mary, Fake Barns, and Twin Earth, though the effect was small (Schindler & Saint-Germier, 2022).

However, all the studies to date suffer from an important limitation of all cross-sectional research, in which subjects are compared at a single point in time. Cross-sectional studies can provide a snapshot of dependencies between selected variables in the sample, but they are silent on what kind of processes caused those dependencies to arise. In the case of comparisons made between samples drawn from different populations, there is an indefinite number of differences between the samples that may contribute to the study’s outcome. The upshot is that, when subjects from a sample of philosophers exhibit a different pattern of responses from laypeople, this need not be due to a discrepancy in expertise. Because people do not choose their studies at random, philosophy students may have atypical intuitions to start with, or students whose intuitions conflict with those of their teachers may either drop out or strive to align their responses with what they perceive as the mainstream view.

In order to assess whether observed differences between philosophers and laypeople result from training or social selection, it is therefore necessary to conduct longitudinal studies, in which the intuitions of subjects are probed repeatedly over an extended period of time, providing a diachronic picture of variation in the responses. Although observational rather than experimental in character, longitudinal studies can provide invaluable information for causal inference. When employed with this aim in mind, they feature a control group selected in such a way as to resemble the experimental group as closely as possible but not be affected by factors hypothesized to influence the variables of interest.

In what follows, we report the results of a longitudinal study in which two cohorts of undergraduate students in philosophy (the experimental group) and in cognitive science (controls) were tested every semester for 3.5 years on their intuitions regarding ten widely discussed philosophical cases taken from a broad range of subfields. If the assumptions of the expertise defense are true, we should observe changes in intuitions resulting from training in the group of philosophy students. The predicted direction of these changes is, at least according to the proponents of philosophical expertise, pretty clear. The intuitions should become increasingly aligned with the consensus in a given area because only then could the expertise defense succeed. Because the courses are spread over time, we can also assess the generality of the putative competences developed by the training (see Sect. 2.2 for the discussion). By comparing the responses of philosophy students with those of the students of cognitive science, we can assess the extent to which observed patterns of changes in the experimental group could be explained in terms of philosophical training as opposed to being the result of a general academic education, age or some other factor present in both groups.

In the study, we investigated (1) whether formal training in philosophy affects case judgments and, if so, whether they are stable over time and whether they converge on textbook consensus, (2) whether the effects of formal training, if any, apply to all case judgments or only to some, and (3) whether the differences in case judgments, if any, between philosophers and laypeople can be explained by appeal to two types of social selection mechanisms: (a) people who enroll to study philosophy already have different intuitions from others, and (b) philosophy students whose intuitions do not conform to the community consensus tend to withdraw from the philosophy program.

4 Longitudinal study

4.1 Method

4.1.1 Materials

We selected ten classical cases from the philosophical literature to be tested. The choice was based on three criteria. First, the cases were selected from a wide range of philosophical subfields. Second, we included cases widely recognized in the philosophical literature. These are either thought experiments backed by a well-established philosophical theory that resolves what the judgment evoked by the case should be (e.g., Gettier cases undermining knowledge defined as a belief that is true and justified) or cases that are related to a certain well-known theory, albeit one that has its competitors in the philosophical market (e.g., the Truetemp case as an argument against externalist conceptions of knowledge). Third, we were mainly interested in scenarios that had already been the subject of experimental research. Thus, we chose the Gettier case, Fake Barns, and Truetemp scenarios from epistemology, Putnam’s Twin Earth and Kripke’s Gödel/Schmidt cases from the philosophy of language, a Knobe-like harm scenario (Footnote 1) from the philosophy of action, Nozick’s Experience Machine, Thomson’s Violinist and Frankfurt’s (1969) case from ethics and moral philosophy, and Parfit’s Teleportation case from metaphysics.

Because many of the philosophical thought experiments in their original form were unsuitable for a questionnaire study (Footnote 2), we adapted experimental materials from previous experimental philosophy studies whenever possible. An additional advantage of this solution is that it enables us to use existing data as a baseline for interpretation purposes because we can compare obtained results to the known estimates in the general population. Table 1 presents a list of the scenarios we used in the study together with their original sources. Appendix 1, Table 15 contains the full text of the scenarios in Polish (the language of the survey) and their English translations.

Table 1 Scenarios used in the study, together with the original sources of the thought experiments and the empirical studies from which the scenarios were adapted

Each scenario was followed by three questions. The first concerned the philosophical intuition elicited by the scenario. It was presented in a forced-choice format (yes/no or choose one from several possibilities). For example, in the Fake Barns case, we asked participants whether the protagonist knew that near the road there was a barn. The second question was concerned with subjective confidence in the answer (“What level of confidence would you ascribe to your answer?”) and was answered on a pseudo-Likert 5-point scale ranging from “very low” to “very high”. From the fourth semester onwards, we also asked participants a yes/no question about whether they had discussed this kind of thought experiment in class. The data is presented in Appendix 2, Table 16.

4.1.2 Translation procedure

The scenarios, originally in all but one case formulated in English, were translated into Polish by a person with a formal education in English–Polish translation. The translations were then reviewed by two members of our team who have a background in philosophy and experience in translating philosophical texts from English to Polish. In a few cases, the scenarios were slightly altered to address the exact problem we intended to study, or to ask the type of question we chose. When necessary, the scenarios were adapted to the Polish participants’ knowledge (for example, in the Gettier case, information was added that the Buick is an American car). Suggested corrections were then consulted with the translator.

4.1.3 Procedure

The study consisted of seven measurement points: six at the beginning of each semester of the undergraduate program and the seventh at the beginning of the academic year following participants’ completion of the program. At each measurement point, the same questionnaire was administered with minor changes aimed to shorten the duration of the study (several repeated demographic questions were dropped).

In order to increase the response rate, we employed a mixed-mode survey design that combined a traditional pen & paper questionnaire with an online survey. At the first and second measurement points, a paper questionnaire was administered during an obligatory class. Participation was voluntary. For those who were unable to participate in the pen & paper version of the study, an online survey was also available. During the rest of the study, an online survey was the dominant mode of participation. The second author stayed in touch with all participants and reminded them each semester via e-mail to complete the next part of the study.

The research received ethical approval from the appropriate University Research Ethics Committee and informed consent was obtained from all participants. The surveys were anonymized by assigning each participant a unique identifier, which participants used to sign each completed questionnaire.

4.1.4 Participants

The participants were undergraduate students of philosophy and undergraduate students of cognitive science at the University of Warsaw, Poland. The sample of cognitive science students was used as a matching control group in order to evaluate the confounding effects of age and education.

The philosophy program at the University of Warsaw is relatively fixed during the first four semesters. In the first year, students are required to participate in a two-semester epistemology course. In the second year, there are obligatory two-semester courses in ethics and ontology. In the third year, philosophy of language and philosophy of mind are discussed as part of a compulsory course in the history of analytic philosophy. Third-year students are also offered elective courses in philosophy of action, philosophy of language, and philosophy of mind.

Courses offered at the University of Warsaw typically consist of two parts: lectures and tutorials in small groups. In the philosophy program, many classical philosophical thought experiments are discussed in lectures and especially tutorials. Gettier cases and Fake Barns are discussed thoroughly in epistemology classes from the middle to the end of the first semester. Criteria of identity over time are discussed in detail during the ontology tutorial (fourth semester), where students are also introduced to various thought experiments designed to elicit intuitions about personal identity. During the third semester, also in ontology classes, when discussing the concept of particular objects and modal notions, students learn about the Gödel/Schmidt and Twin Earth thought experiments, the latter of which is briefly described earlier in an epistemology lecture on externalism. The Violinist case is discussed during the second year in ethics classes, whereas the Frankfurt case receives a cursory mention in the ethics lecture at the end of the fourth semester. In the third year (fifth and sixth semesters), students discuss the Gödel/Schmidt and Twin Earth cases in depth in the history of analytic philosophy course. The Knobe experiment is mentioned (but not discussed in detail) in elective philosophy of action classes (third year). As far as we know, the Truetemp and Experience Machine thought experiments are not covered in class at an undergraduate level.

Our control group was not perfect. Ideally, the students from the control group should not take any courses in philosophy. Unfortunately, this was not the case. The cognitive science program includes some elements of philosophy, which is common for all undergraduate programs at the University of Warsaw. The students are required to take a short introductory course in philosophy in the second semester, followed by obligatory courses in philosophy of language (where both Gödel/Schmidt and Twin Earth are discussed) and philosophy of mind in the third semester. They can also choose advanced courses in philosophy of mind and philosophy of language during the fourth semester. In addition, both cognitive science and philosophy students take a compulsory 120-h logic course during the first year of their studies. Hence, if learning logic has a significant influence on the formation of philosophical expertise (cf. Weinberg et al., 2010, p. 335), the influence should be the same in both groups.

Two cohorts participated in the study: students who started their program in Fall 2017 and those who started in Fall 2018. Table 2 presents sample sizes for each of the measurement points. In total, 226 students took part in the study [112 men, 107 women, 1 other gender and 6 participants who refused to answer the question; mean age at the first measurement point: 20.0 years old (SD = 1.48)]. A total of 180 subjects participated in the study from the beginning, 33 students started at the second measurement point, 4 students at the third, 3 at the fourth, 4 at the fifth, and 1 student at the sixth (Footnote 3).

Table 2 The number of participants who took part in each stage of the study

4.2 Results

For all scenarios, we computed a combined score (e.g., Turri, 2016a) in the following manner: if the answer was “yes,” we multiplied the confidence rating by 1, and if the answer was “no,” we multiplied the rating by −1. For answers that were neither “yes” nor “no,” we multiplied the confidence rating by 1 if the answer was consistent with philosophical consensus and by −1 otherwise. We thereby obtained the main numerical dependent variable, which ranged from −5 to +5. For each such measurement, we fitted a linear mixed model using the lme4 package for R (Bates et al., 2015) with group (cognitive science students vs. philosophy students) and time of the measurement (1–7) as predictors. For time, we used successive differences contrast coding (cf. Venables & Ripley, 2002, pp. 147–149). The motivation behind this choice was that we were mainly interested in changes from semester to semester and not general linear trends. In cases where closer inspection of the results was called for, we fitted two additional models to the data. The first was a linear mixed model in which, instead of using successive differences contrast coding, we coded the time of the measurement as a numeric variable. This allowed us to investigate simple linear trends that would not be visible in a semester-by-semester analysis. The second additional model was the same as the main one but fitted only to the data from philosophy students. It was used to investigate the data in cases where the main analysis yielded interaction effects that were difficult to interpret. We also conducted an analysis of bare categorical responses, for which we fitted a generalized linear mixed model with a logit link function. The results of these analyses, together with additional data, are reported in Appendix 3, Figs. 12–21 and Tables 17–26 (Footnote 4).
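The scoring rule and the contrast coding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis code used in the study (the models were fitted in R with lme4). The function names are hypothetical, and the contrast matrix assumes the standard successive-differences form described by Venables and Ripley, in which each coefficient estimates the difference between one measurement point and the preceding one.

```python
from fractions import Fraction


def combined_score(affirmative, confidence):
    """Signed confidence rating (hypothetical helper).

    affirmative: True for a "yes" answer (or, for multi-option questions,
    an answer consistent with philosophical consensus), False otherwise.
    confidence: rating on the 5-point scale, 1 (very low) to 5 (very high).
    """
    if confidence not in (1, 2, 3, 4, 5):
        raise ValueError("confidence must be an integer from 1 to 5")
    return confidence if affirmative else -confidence


def successive_difference_contrasts(n_levels):
    """Contrast matrix with n_levels rows and n_levels - 1 columns.

    Assumed form: entry (i, j) is (j - n) / n when i <= j and j / n
    otherwise (1-based indices). With this coding, coefficient j is the
    mean at level j + 1 minus the mean at level j.
    """
    n = n_levels
    return [
        [Fraction(j - n, n) if i <= j else Fraction(j, n) for j in range(1, n)]
        for i in range(1, n + 1)
    ]


def level_means(grand_mean, differences):
    """Recover per-level means from the intercept and difference coefficients."""
    contrasts = successive_difference_contrasts(len(differences) + 1)
    return [
        grand_mean + sum(c * d for c, d in zip(row, differences))
        for row in contrasts
    ]
```

For the seven measurement points of the study, `successive_difference_contrasts(7)` would yield a 7 × 6 matrix, so that each of the six time coefficients in the model corresponds to one semester-to-semester change rather than to a deviation from a baseline semester.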

4.2.1 Gettier case

The results are presented in Fig. 1. The participants in both groups began with similar confidence scores and their responses were consistent with a large body of available research on the epistemic intuitions of subjects from the general population (e.g., Machery et al., 2017, 2018; Nagel et al., 2013; Turri, 2013). The majority shared Gettier intuitions, refusing to ascribe knowledge to the protagonist of the story. The groups began to diverge from the second measurement point onwards. We observed a large change in scores between the first and second semesters in the experimental group (philosophy students), whereas in the controls (cognitive science students) the ratings remained stable.Footnote 5 The difference in the experimental group between the first and second measurements was statistically significant (interaction between Group and Semester 2-1: b = −2.26, p < 0.001; see Table 3). Overall scores in the experimental group were significantly lower than those in the control group (b = −1.39, p < 0.001). A closer examination of the data based on a model with semester coded as a numeric variable revealed a statistically significant interaction between Semester and Group (b = −0.27, p = 0.005). Given the absence of statistically significant first-order effects in this model, this should be interpreted as evidence for a negative linear trend in the experimental group that is not present in the controls.

Fig. 1 Changes in intuitions about the Gettier case over time. The combined score of +5 represents attribution of knowledge with maximum confidence and −5 represents a denial of knowledge with maximum confidence. Error bars correspond to the standard error of the mean

Table 3 Linear mixed-effects model for the combined scores in the Gettier case
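
Terms such as "Semester 2-1" in the model tables come from the successive differences coding of the time factor. The following Python sketch shows what this coding does in the simplest setting, a model with the time factor only. It assumes the standard definition of the contrast matrix (the one produced by `contr.sdif` in the R package MASS); the text does not say which function was used, and the function names and example means below are hypothetical.

```python
# Sketch: successive differences contrasts for a factor with n ordered levels.
# ASSUMPTION: standard definition, as in MASS::contr.sdif. With this coding,
# coefficient j is the difference between the means of levels j+1 and j, and
# the intercept is the unweighted mean of the level means.
from fractions import Fraction


def sdif_contrasts(n_levels):
    """Return the n x (n-1) contrast matrix as a list of rows (exact fractions)."""
    n = n_levels
    return [
        [Fraction(-(n - j), n) if i <= j else Fraction(j, n) for j in range(1, n)]
        for i in range(1, n + 1)
    ]


def coefficients_from_level_means(means):
    """Intercept (mean of the level means) and the successive differences."""
    intercept = sum(means, Fraction(0)) / len(means)
    diffs = [later - earlier for earlier, later in zip(means, means[1:])]
    return intercept, diffs


def level_means_from_coefficients(intercept, diffs):
    """Rebuild the level means as intercept + contrast row . coefficients."""
    rows = sdif_contrasts(len(diffs) + 1)
    return [intercept + sum(c * d for c, d in zip(row, diffs)) for row in rows]


if __name__ == "__main__":
    # Hypothetical mean combined scores at seven measurement points.
    means = [Fraction(x) for x in (1, -1, -2, -1, 0, -1, -2)]
    intercept, diffs = coefficients_from_level_means(means)
    print(diffs[0])  # the "2-1" coefficient: second mean minus first mean
    assert level_means_from_coefficients(intercept, diffs) == means
```

In the full models, which also include group and its interaction with time, the exact reading of each coefficient additionally depends on how the group factor is coded, so the sketch covers only the time contrasts themselves.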

4.2.2 Fake Barns

Judgments about Fake-barn cases have been found to be sensitive to some demographic variables, such as age (Colaço et al., 2014) and gender (Bergenholtz et al., 2023), and to presentation order (Wright, 2010), although there are doubts as to the replicability of these results. Nonetheless, the results so far have regularly shown that lay people tend to attribute knowledge to the protagonist of the Fake-barn scenario (Colaço et al., 2014; Turri et al., 2015; Turri, 2016b, 2017). Our results are presented in Fig. 2. At the first measurement point, participants in both groups tended to attribute knowledge to the protagonist, although the mean combined score was slightly lower in the experimental group. The tendency to attribute knowledge reversed for the experimental group from the second semester onward (b = −1.86, p = 0.005, see Table 4). For the rest of the study, philosophy students tended to refrain from attributing knowledge to the protagonist (mean score below the midpoint), whereas scores for the cognitive science students remained stable (and positive). The overall difference between the groups was statistically significant (b = −2.42, p < 0.001). No general linear trend was found (p > 0.05), which suggests that the observed change in intuitions occurred mainly between the first two measurement points.

Fig. 2 Changes in intuitions about the Fake Barns case over time. The combined score of +5 represents attribution of knowledge with maximal confidence and −5 represents a denial of knowledge with maximal confidence. Error bars correspond to the standard error of the mean

Table 4 Linear mixed-effects model for the combined scores in the Fake Barns case

4.2.3 Truetemp

The results of previous studies show that lay people tend either to deny knowledge in Truetemp cases or to give average knowledge ratings close to the midpoint of the scale (Swain et al., 2008; Wright, 2010; Ziółkowski, 2021). In our study, we observed an overall difference between groups (b = 0.91, p = 0.24, see Table 5). The mean rating in both groups was close to the midpoint (see Fig. 3), but the ratings were generally lower in the experimental group. Interestingly, the pattern was reversed at the first measurement point. The change from the first to the second semester (b = 1.67, p < 0.001) and an interaction between the semester factor and group (b = −2.25, p = 0.002) were statistically significant. After the first two measurement points, the intuitions remained more or less stable. An analysis with measurement points coded as numbers revealed a small but statistically significant positive effect of the semester variable (b = 0.23, p = 0.003) and an interaction between semester and experimental group with the opposite sign (b = −0.30, p = 0.009), which means that the linear trend was present only in the control group.

Table 5 Linear mixed-effects model for the combined scores in the Truetemp case
Fig. 3 Changes in intuitions about the Truetemp case over time. The combined score of +5 represents attribution of knowledge with maximum confidence and −5 represents a denial of knowledge with maximum confidence. Error bars correspond to the standard error of the mean

4.2.4 Knobe harm case

Given the robustness of the Knobe effect, we expected a positive answer, that is, an attribution of intentionality to the action’s side-effect.Footnote 6 Surprisingly, neither philosophy nor cognitive science students tended to ascribe intentionality in the Knobe-like scenario (see Fig. 4). The intuitions were stable over time, but the means were much closer to the midpoint in the experimental group than in the control group, where we observed a moderately strong negative response (b = 1.55, p < 0.001, see Table 6). Analyzing only the group of philosophy students, we found a statistically significant difference between the fourth and the fifth semesters in the direction opposite to the orthodox theory of intentional action (b = 1.34, p = 0.035). No simple linear trends were observed.

Fig. 4 Changes in intuitions about the Knobe harm case over time. The combined score of +5 represents attribution of intentionality with maximum confidence and −5 represents a denial of intentionality with maximum confidence. Error bars correspond to the standard error of the mean

Table 6 Linear mixed-effects model for the combined scores in the Knobe harm case

4.2.5 Twin Earth

Most previous studies have used more complex scenarios than ours. Their results undermine both pure internalism and pure externalism as two opposing folk theories of natural kind terms regardless of the type of natural kinds used in the scenario, and suggest the need for some hybrid account, according to which natural kind terms are ambiguous or polysemous (Jylkkä et al., 2008; Genone & Lombrozo, 2012; Nichols et al., 2016; Tobia et al., 2020). The participants of our study were asked whether XYZ is water. In both groups, the dominant answer was “no” and the intuitions about the case were relatively strong (see Fig. 5). The only statistically significant effect was a positive change at the second measurement point (b = 0.98, p = 0.007, see Table 7 for detailed results), which is also statistically significant for philosophy students analyzed separately (b = 1.01, p = 0.004). An analysis with semester coded as a numeric variable revealed a weak although statistically significant overall linear trend in the direction of agreeing with the claim that XYZ is water (b = 0.20, p < 0.001). All in all, the participants were less likely over time to agree with the answer implied by Putnam’s account of natural kinds.

Fig. 5 Changes in intuitions about the Twin Earth case over time. The combined score of +5 represents a belief in the statement that XYZ is water with maximum confidence and −5 represents a denial of that statement with maximum confidence. Error bars correspond to the standard error of the mean

Table 7 Linear mixed-effects model for the combined scores in the Twin Earth case

4.2.6 Gödel/Schmidt

Previous studies (e.g., Beebe & Undercoffer, 2015, 2016; Machery et al., 2004, 2009, 2015; Sytsma et al., 2015) have shown that, in Western cultures, people’s judgments about the reference of proper names in Gödel/Schmidt cases are consistent with Kripke's causal-history theory. In our study, this was the first case where the question was not in the “yes/no” format. Instead, we asked the participants who the protagonist of the story was talking about when he used the name “Gödel”. Following the standard description of this case, there were two possible answers: the author of the theorem or the fraud. For analysis purposes, we decided to code the answer consistent with the causal theory of reference (the fraud) as 1 and the descriptivist answer (the author) as −1. Thus, the answers combined with confidence ratings form a variable ranging from −5 (strong confidence and descriptivist intuitions) to +5 (strong confidence and causal–historical intuitions). Overall, the intuitions of philosophy students were more in line with the Kripkean theory of reference (b = 0.92, p = 0.022, see Table 8) than the intuitions of the controls, but the means in both groups were very similar at the first and sixth measurement points (see Fig. 6). A statistically significant change towards the negative answer was observed between the sixth and the seventh semesters (b = −1.97, p = 0.001). However, a marginally significant interaction with the opposite sign (b = 1.83, p = 0.051) indicates that the drop in ratings occurred predominantly in the control group. A separate analysis of the data from the experimental group revealed a statistically significant change in the direction of philosophical orthodoxy between the second and the third measurement points (b = 1.23, p = 0.013). No statistically significant simple linear trends were found.

Table 8 Linear mixed-effects model for the combined scores in the Gödel/Schmidt case
Fig. 6 Changes in intuitions about the Gödel/Schmidt case over time. The combined score of +5 represents a belief in the statement that the name “Gödel” refers to the fraud with maximum confidence and −5 represents a belief in the statement that it refers to the author of the proof with maximum confidence. Error bars correspond to the standard error of the mean

4.2.7 Experience Machine

According to available research, a large majority of ordinary people who respond to vignettes modeled closely on Nozick’s original scenario refuse to be connected to the experience machine (de Brigard, 2010; Hindriks & Douven, 2018; Weijers, 2014). In our study, participants were presented with two choices: remain in the real world or plug into the experience machine. Following Nozick’s original analysis, we coded the former option as 1 and the latter as −1. Combined with the confidence ratings, the dependent variable ranged from −5 (a strong preference for connecting to the machine) to +5 (a strong preference for staying in the real world). We did not find any statistically significant effects (see Table 9) and, as Fig. 7 shows, the intuitions remained rather stable across all measurement points. The participants tended to agree with Nozick’s interpretation of this case and rather confidently said that they would remain in the real world. Again, no overall linear trends were found.

Table 9 Linear mixed-effects model for the combined scores in the Experience Machine case
Fig. 7 Changes in intuitions about the Experience Machine case over time. The combined score of +5 represents a preference for staying in the real world with maximum confidence and −5 represents a preference for connecting to the machine with maximum confidence. Error bars correspond to the standard error of the mean

4.2.8 Violinist

Subjects were asked whether they had a moral duty to stay connected to the violinist. Mean scores below the midpoint indicate that they disagreed with the claim, but, as can be seen in Fig. 8, the means were relatively close to the midpoint. The scores did not differ significantly between the two groups and no general linear trends were observed. In the experimental group, we observed an increase in scores between the third and fourth semesters (interaction: b = 1.74, p = 0.013, separate model: b = 1.05, p = 0.045) and a decrease between the fourth and fifth semesters (interaction: b = −1.53, p = 0.038, separate model: b = −1.13, p = 0.043, see Table 10 for detailed results).

Fig. 8 Changes in intuitions about the Violinist case over time. The combined score of +5 represents a belief that one has a moral duty to stay connected to the violinist with maximum confidence and −5 represents a denial of the existence of this duty with maximum confidence. Error bars correspond to the standard error of the mean

Table 10 Linear mixed-effects model for the combined scores in the Violinist case

4.2.9 Frankfurt case

A study by Miller and Feltz (2011) indicates that people are willing to ascribe moral responsibility and blameworthiness in cases where there were no alternative possibilities available to an agent. In our study, subjects were asked three questions: whether it was possible for Frank not to kill Furt (Possible not to kill?), whether he was responsible for Furt’s death (Responsible?) and whether he was blameworthy for killing Furt (Blameworthy?). The only statistically significant effect found in the responses to the first question was a difference between the sixth and seventh measurement points in the model fitted only to responses by philosophy students (b = −1.50, p = 0.032). Participants in both groups tended to believe that it was not possible for Frank not to kill Furt. They were also highly confident that Frank was both responsible and blameworthy for the killing. Interestingly, we found a statistically significant decrease in scores between the first and second semesters (Responsible?: b = −0.75, p = 0.044; Blameworthy?: b = −0.89, p = 0.013). A closer examination of this first-order effect suggests that the control group was responsible for it, which is reflected in a statistically significant interaction in the Blameworthy? question (b = 1.31, p = 0.008, see Fig. 9 for the visual comparison and Table 11 for detailed results). Again analyzing only the group of philosophy students, we found a statistically significant positive difference between the sixth and seventh semesters with regard to the Responsible? question (b = 1.03, p = 0.040), which is consistent with the previously noted negative difference with regard to the Possible not to kill? question.

Fig. 9 Changes in intuitions about the Frankfurt case over time. The three panels of the plot correspond to the three questions that were asked. The combined score of +5 represents a positive answer to a given question with maximum confidence and −5 represents a negative answer with maximum confidence. Error bars correspond to the standard error of the mean

Table 11 Linear mixed-effects models for the combined scores in three questions about the Frankfurt case

4.2.10 Teleportation

Results of six studies by Weaver and Turri (2018) suggest that people allow for the possibility that one and the same individual can be in two different places at the same time. Our Teleportation case, which was closely modeled on Parfit’s thought experiment, involved split teleportation, where a malfunctioning teleporter reconstructed the teleported person in two copies. Participants were able to select one of four possible answers. Two answers implied that one of the two copies was identical to the original person, the third implied that both copies were identical, and the fourth implied that neither copy was the original person. We coded the last option as +1 and the rest of the possible answers as −1. We did not observe any statistically significant effects, regardless of the model used to analyze the data. Participants did not exhibit a strong preference for any of the answers (see Fig. 10). The results are presented in Table 12. Appendix 4, Fig. 22 and Table 27 contain a detailed breakdown of the answers. The participants were fairly equally split between “neither” and “both” answers, which is consistent with previous research.

Fig. 10 Changes in intuitions about the Teleportation case over time. The combined score of +5 represents an answer that neither copy was the original with maximal confidence and −5 represents the opposite view (other answers) with maximal confidence. Error bars correspond to the standard error of the mean

Table 12 Linear mixed-effects model for the combined scores in the Teleportation case

4.2.11 Confidence ratings

We also analyzed confidence ratings separately. To that end, we fitted a linear mixed-effects model for each scenario, in a way analogous to the previous analyses. Instead of using the combined score, we entered raw confidence ratings as a dependent variable. Table 13 presents the results for all scenarios. The overall pattern that can be seen in Fig. 11 is that philosophy students generally had slightly less confidence in their answers. In 9 out of the 12 analyzed questions, the effect of group (philosophy vs. cognitive science students) was statistically significant.

Table 13 Linear mixed-effects model for the confidence scores in all tested cases
Fig. 11 Confidence levels for two groups of subjects for all questions. Frankfurt 1 refers to “Possible not to kill?” question, Frankfurt 2 to “Responsible?” question and Frankfurt 3 to “Blameworthy?” question

4.2.12 Attrition

We wanted to see whether students whose intuitions did not conform to the textbook consensus were more likely than their colleagues to withdraw from the philosophy program and, thus, to drop out of our study. To this end, we took all the responses at the first measurement point and compared the answers given by participants who later completed the questionnaire at the third measurement point with those given by participants who did not. The third measurement point was chosen because it splits the sample into two groups of comparable size. If students whose intuitions matched the literature consensus were more likely to become academically trained philosophers, we would expect differences at this stage. The results are presented in Table 14. Overall, there were no statistically significant differences between these two groups.

Table 14 Combined scores at the first measurement point for participants who completed the third semester (Continued) and those who dropped out of the study (Dropped)
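
The grouping behind this comparison can be sketched in a few lines of Python. The record layout, function name, and toy data below are hypothetical, and the sketch only forms the two groups and their means; the text does not specify which statistical test was applied, so none is shown.

```python
# Sketch: split first-measurement scores by whether the participant later
# completed the third measurement point.
# ASSUMPTION: responses are stored as (participant_id, measurement_point, score)
# tuples; this layout and the toy data are illustrative only.
from statistics import mean


def split_first_wave(records, first_wave=1, later_wave=3):
    """Group first-wave scores into 'Continued' and 'Dropped'."""
    completed_later = {pid for pid, wave, _ in records if wave == later_wave}
    groups = {"Continued": [], "Dropped": []}
    for pid, wave, score in records:
        if wave == first_wave:
            key = "Continued" if pid in completed_later else "Dropped"
            groups[key].append(score)
    return groups


if __name__ == "__main__":
    toy = [
        ("p1", 1, -3), ("p1", 3, -4),   # responded at points 1 and 3
        ("p2", 1, 2),                   # responded only at point 1
        ("p3", 1, 4), ("p3", 2, 3),     # no response at point 3
        ("p4", 1, -1), ("p4", 3, 0),
    ]
    groups = split_first_wave(toy)
    for name, scores in groups.items():
        print(name, mean(scores))
```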

4.2.13 Analyses on a reduced dataset

To check the robustness of our findings, we decided to re-run the main part of the analysis (linear mixed-effects model with the combined score as the DV) on a reduced dataset. This dataset contains observations only for those participants who successfully completed the questionnaire at least six out of the seven times, and if one of the measurement points was missing, it came from either the sixth or seventh semester. The idea behind this analysis is that it enables us to completely eliminate the effect of selection bias, although at the cost of lower sample size and reduced statistical power.Footnote 7 The full models and plots can be found in Appendix 5. Here, we only summarize the main findings.
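
The inclusion rule for the reduced dataset can be written as a short predicate. This is a sketch of the rule as described above, not the authors' code; the function name and the representation of a participant as a set of completed measurement points are assumptions.

```python
# Sketch: inclusion criterion for the reduced dataset.
# A participant is kept if at most one of the seven measurement points is
# missing, and a missing point may only be the sixth or the seventh.
# ASSUMPTION: each participant is represented by the set of measurement
# points (1-7) at which the questionnaire was completed.

ALL_POINTS = frozenset(range(1, 8))
ALLOWED_MISSING = frozenset({6, 7})


def in_reduced_dataset(completed_points) -> bool:
    missing = ALL_POINTS - set(completed_points)
    return len(missing) <= 1 and missing <= ALLOWED_MISSING


if __name__ == "__main__":
    print(in_reduced_dataset({1, 2, 3, 4, 5, 6, 7}))  # True: nothing missing
    print(in_reduced_dataset({1, 2, 3, 4, 5, 7}))     # True: only point 6 missing
    print(in_reduced_dataset({2, 3, 4, 5, 6, 7}))     # False: point 1 missing
    print(in_reduced_dataset({1, 2, 3, 4, 5}))        # False: two points missing
```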

For the Gettier case, in the philosophy group we found a clear drop in combined scores between the first and the second semester (see Fig. 23), which is indicated by a statistically significant interaction (b = −2.28, p = 0.010; see Table 28). This is congruent with the results of the original analysis. In the Fake Barns case, we observed an almost identical pattern of responses to the original analysis (see Fig. 24), with a very large drop in scores between the first and second semesters in the philosophy group (b = −2.36, p = 0.013). The overall difference between groups remained significant (b = −2.79, p < 0.001; see Table 29). For the Truetemp case, all the effects that were found in the main analysis remained significant, except for the overall difference between the two groups (see Fig. 25; Table 30).

For the Knobe harm case, all the effects that reached statistical significance remained such when the analysis was re-run on the reduced dataset (see Table 31). The overall pattern of the responses also did not change (see Fig. 26). In the Twin Earth case, the only effect that was statistically significant in the original analysis was the change between the first and second semesters in the direction opposite to philosophical orthodoxy. In the analysis on the reduced dataset, this effect ceased to be statistically significant (b = 0.90, p = 0.090, see Table 32). It must be noted, however, that the regression coefficient is virtually the same (b = 0.98 vs. 0.90, see Fig. 27) and we think that the fact that it did not reach the level of statistical significance is the consequence of reducing statistical power by limiting the number of observations. For the Gödel/Schmidt case, the overall shape of the results stayed the same (see Fig. 28), but the significance of the individual predictors changed slightly (see Table 33). The large change in the intuitions of the philosophy students between the second and third semesters was not significant in the original analysis (b = 0.47, p = 0.539), probably due to a similar trend in the sample of cognitive science students. In the reduced dataset, this tendency is not present, and a large shift towards orthodoxy in the sample of philosophy students is now statistically significant (b = 2.29, p = 0.030). Interestingly, whereas in the original analysis no differences between groups were observed for the Experience Machine case, in the reduced dataset we can observe a clear and consistent difference (see Fig. 29): philosophy students are much less likely to wholeheartedly decide to remain in the real world (b = −1.35, p = 0.033; see Table 34). In the Violinist case, the main effect that we found in the original analysis was a statistically significant interaction indicating a change in intuitions between the third and fourth semesters in the philosophy group. This finding was replicated in the analysis on the reduced dataset (b = 2.48, p = 0.004, see Fig. 30; Table 35).

In the original analysis, we did not find any statistically significant predictor for the Possible not to kill? question about the Frankfurt case. In the reduced dataset, we observed a change between the first and second semesters towards the positive response (b = 1.55, p = 0.015; see Table 36), which is mainly driven by the control group. For the Responsible? question, in the original analysis we observed a trend towards negative responses between the first and second semesters, but in the re-run analysis no effects reached statistical significance. For the Blameworthy? question, in the original analysis we found changes in intuitions between the first and second semesters that went in opposite directions in the two groups. This pattern is still present in the reduced dataset, but in a much weaker form that did not reach statistical significance (b = −0.71, p = 0.135; interaction: b = 0.94, p = 0.199; see Table 36; Fig. 31).

The original analysis of answers to the Teleportation case did not reveal any statistically significant predictors. However, in the analysis on the reduced dataset, we found two statistically significant effects: a tendency to move towards philosophical orthodoxy between the third and fourth semesters in the control group (b = 1.06, p = 0.046), and a trend in the opposite direction in the group of philosophy students, indicated by a statistically significant interaction (b = −1.89, p = 0.023, see Table 37; Fig. 32).

Overall, additional analyses on the reduced dataset support the robustness of the findings of the main analyses. The main findings and the general pattern of responses remained largely the same.

4.3 Discussion

We found statistically significant changes in the intuitions of philosophy students in six out of the ten thought experiments we tested. Some of these changes can be directly connected to the classes the students were required to take at a particular time. First of all and most strikingly, we observed a massive change in intuitions regarding the Gettier case and the Fake Barns case after the first semester—which is to say, after both were covered extensively in epistemology. The changes were in the direction of philosophical orthodoxy. Moreover, a brief survey among lecturers in epistemology has confirmed that, in their opinion, philosophy students' judgments about knowledge as true justified belief change after discussing Gettier’s examples. Less pronounced changes occurred in participants’ responses to the Violinist case after obligatory courses in ethics, but they did not persist. In the philosophy group, we observed an increase in non-Thomsonian intuitions between the third and the fourth measurements, which then bounced back to the previous level at the beginning of the fifth semester. Although the change took an unexpected direction, we think it can be linked to the ethics course. The second change that can be related to the participants’ taking of an ethics course was a slight change in responses to the Frankfurt case. Philosophy students after this course tended to be more confident that it was impossible for Frank not to kill Furt. At the same time, they tended to agree more decisively that he was responsible for Furt’s death. This result is in line with the original interpretation of this case given by Frankfurt. These findings suggest that, at least in some cases, professional training has an influence on philosophical intuitions but in certain cases the change does not last.

We found little to no changes in judgments about cases that were not directly discussed in class. Importantly, the pattern of responses did not depend on the subject being taught. For example, taking epistemology did not affect the students’ intuitions about a wide range of thought experiments concerning knowledge. This is best illustrated by contrasting intuitions about the Fake Barns case, which is discussed in the epistemology course, and the Truetemp case, which is not. In the Fake Barns case, we observed a large change in intuitions in the philosophy students, whereas in the Truetemp case the change was very small compared to the Gettier and Fake Barns cases, and a statistically significant effect of interaction could not be straightforwardly attributed to the philosophy students’ correction of intuitions towards orthodoxy. This suggests a limited carryover of the effect of philosophical training even if we consider cases from the same philosophical subdiscipline.

For two cases we do not have clear explanations, but we would like to offer tentative ones. The first is the Gödel/Schmidt case, where we observed a change in intuitions in the direction of philosophical orthodoxy earlier than expected. Recall that the first time that our participants could encounter this case was during ontology classes, taken in the third and the fourth semesters. Unexpectedly, the most significant change in the philosophy group was observed after the first two semesters. Indeed, while the participants were, on average, descriptivists (mean score below the midpoint) at the second measurement point, they were pretty firmly in the Kripkean camp at the third. Note that, in the control group, this change occurred after the fourth semester, which is perfectly consistent with the change occurring as the result of the students’ taking philosophy of language in the second year. We think that this result could be related to the fact that this example was discussed during the logic course, which the students took in the first two semesters. Unfortunately, our data on exposure to cases start from the fourth semester, and because of that we were unable to confirm this hypothesis. The second problematic case was the Twin Earth scenario. We observed a weak but statistically significant trend toward agreement with the statement that XYZ is water. However, a closer look at the data reveals that participants’ judgment about the case did not shift; what changed was their confidence in the answers. Because the effect was present in both groups, we think that it might be a reflection of the general critical attitude and cautiousness developed during a tertiary education. However, it is difficult to square this explanation with the fact that a similar drop in confidence between the first and second measurement points did not occur in judgments associated with any other scenario.

In two cases (Knobe harm and Fake Barns), we found a considerable initial difference in intuitions between philosophy and cognitive science students. In the Fake Barns case, this difference increased at the second measurement point, but in the Knobe case, it remained stable. This finding indicates that there may be some peculiarities in the cognitive profile of people who decide to study philosophy at an academic level. It is interesting to note that, for the Knobe harm case, the side-effect effect was stronger for philosophy students than for students of cognitive science: they were more likely to judge that the protagonist brought about the side effect intentionally. One possible explanation of this finding is that we used a variation of the harm vignette that was not extensively tested.Footnote 8 That may have resulted in an unexpected pattern of responses.

Another interesting finding of our study is that philosophy students display generally lower confidence ratings compared to cognitive science students. Two possible explanations should be considered. First, individuals who choose the philosophy program exhibit a different cognitive profile compared to the general population. They might be more cautious and intellectually humble, which is why they have more doubts about their judgments on tested cases. Second, for philosophy students, the stories and the questions matter because they concern problems relevant to the field of their study; by contrast, students of cognitive science may regard the scenarios and the probes as irrelevant puzzles.

4.4 Objections and limitations

The present study has several limitations. First, our control group was not perfect. As we have mentioned, students of cognitive science at the University of Warsaw do have some exposure to philosophy. They are, inter alia, required to take courses in philosophy of language and philosophy of mind (which add up to a total of 150 hours of compulsory philosophy classes during their studies). Another problem is that at least some of those students may have developed an interest in philosophy in general: the Program of Cognitive Sciences at the University of Warsaw is generally considered in the Polish academic community to be rather philosophy-heavy. Nonetheless, we think that their responses provide a reasonable baseline for the analysis of how the intuitions of philosophy students changed over time.

A second limitation of the study is that we had no direct control over which cases were discussed during classes by different instructors and how they were discussed. Much might depend on the teaching style of an individual instructor and on the subject matter of the course. Some instructors may have encouraged students to challenge the textbook consensus whereas others may have been more focused on explaining the thought experiments in a way that promoted intuitions associated with mainstream views. We feel that the former approach might be more widely adopted in courses in ethics and the latter in epistemology, where there is strong community-wide agreement about certain thought experiments. After the study, we conducted an informal survey with the instructors about this matter. Epistemology lecturers unanimously declared that they took pains to make sure that the students understood the Gettier and Fake Barns cases. Ontology lecturers made a similar declaration about Putnam's Twin Earth thought experiment. As to the other relevant courses, the lecturers reported that, while they had discussed the thought experiments in class, they did not expect the students to acquire a thorough, in-depth understanding of them.

One may also raise concerns about the representativeness of our sample. The question is how representative the training offered at the University of Warsaw is of philosophical training in general. The structure of the undergraduate program is typical of European universities, with the focus divided between contemporary analytic and continental philosophy, on the one hand, and history of philosophy, on the other. That being said, it is possible that a different educational approach with more electives, characteristic of American and British universities, might yield different results. However, we suspect that, given the narrow scope of typical elective courses, the pattern of responses that we obtained (namely, that most changes in case judgments are course-driven) would still hold. Nevertheless, given the pioneering nature of the present study, the generalizability of its findings cannot be assessed right away and requires further empirical investigation.

5 Philosophical implications

Our study addresses the question of how case judgments made by philosophy students evolve over time compared to the judgments made by subjects from the control group. This means that, although we can attribute observed differences in case judgments to differences in the curriculum, we cannot establish the further claim that those differences in case judgments reflect a developing philosophical expertise. Instead, we have to introduce this claim as a working assumption. Accordingly, in the first part of what follows, we assume that philosophical studies give rise to the kind of cognitive skills required by the expertise defense. This will allow us to evaluate the three models of philosophical expertise described in Sect. 2.2, but our conclusions will only be conditional: if the expertise assumption is true, then our data support some hypotheses about the influence of expertise on case judgments while undermining others. But, naturally, we will still be left with an unanswered question about the truth of the expertise assumption. Although it would be impossible, at this stage, to address it in a fully satisfactory manner, we will provide a provisional answer to it based on data from both our study and existing cross-sectional research.

Assuming that formal training in philosophy leads to the development of cognitive skills that improve the ability to make credible case judgments, the results of our study speak against two out of the three models of philosophical expertise sketched in Sect. 2.2. According to the Method Model, the student masters a general method of philosophical thought experimentation applicable to any area of philosophy. This model predicts that increased proficiency at philosophical thought experimentation informs all philosophical case judgments regardless of subfield. We found no such pattern in our data. In fact, all observed changes in case judgments were restricted to specific areas of philosophy. This would seem to support the Subfield Model of philosophical expertise. However, the Subfield Model also predicts that changes in discipline-related cognitive skills affect all case judgments in the relevant subfield, and our data indicate otherwise. For example, in the domain of moral philosophy, we found significant changes in responses to the Violinist and Frankfurt cases during and after the second year, when the students were required to take a two-semester course in ethics, but we observed no changes in judgments regarding the Experience Machine. Thus, the only model that comports with our data is the Restricted Expertise Model, which predicts that cognitive skills involved in making case judgments are highly specific, affecting only some of the judgments relevant to a particular subfield or even concept.

This model is very weak, however. It would be supported even if each case judgment turned out to be affected by a separate cognitive skill. This is a problem because, intuitively, the ability to consistently make a single kind of case judgment hardly deserves the name of a skill. Since we found no persuasive evidence for a robust carryover effect—meaning that no cognitive skill acquired within a particular period seems to have had a significant impact on judgments about cases not discussed in class—we have to consider the possibility that formal training in philosophy does not improve the ability to make such judgments.

A natural strategy to handle this difficulty would be to argue for an interpretation of the data that moves beyond the simplified picture of the three models of philosophical expertise we have been assuming. This is fairly easy to do. Given how little is known about the determinants of philosophical case judgments, there are indefinitely many such interpretations to choose from. While we cannot discuss all the possibilities here, it is worth noting three kinds of moves that can be made in this connection. First, one can maintain that the cognitive skills involved in making case judgments do not correspond with the subfields of philosophy, so there may exist carryover effects that do not respect traditional subdisciplinary boundaries. For example, based on our data, one can hypothesize a causal link between a putative cognitive skillset acquired in the first year of philosophical studies and case judgments relevant to epistemology and philosophy of language: this is the period in which we observed statistically significant changes in the judgments made by philosophy students about the Gettier case, Fake Barns and also the Gödel/Schmidt case. Second, perhaps some philosophically relevant case judgments, such as those elicited by the Gödel/Schmidt and Fake Barns scenarios, are affected by expertise whereas others are not (e.g., the Knobe harm case, Teleportation and the Experience Machine). Third, it is possible that many, perhaps all, case judgments can be improved by expertise, but some of the relevant cognitive skills develop later than others, so we would observe more training-related changes if we had followed the participants of our study for a longer period of time.

Although impossible to exclude, these more complex accounts are open to the charge of being ad hoc. While we admit that one of them may eventually turn out to be true, we would argue that, at present, they all lack sufficient theoretical motivation. It would be difficult to explain why the Gödel/Schmidt intuition should be informed by cognitive abilities affecting the Gettier case and Fake Barns, but not Twin Earth. Likewise, we have no idea why only some intuitions should be affected by expertise—what could be the relevant difference between the Twin Earth and the Gödel/Schmidt scenarios, for example? The hypothesis positing delays in the development of selected cognitive skills faces similar problems.

To recapitulate, under the expertise assumption, our data undermine all but the weakest model of philosophical expertise: a model on which specific cognitive skills developed in the course of an undergraduate program in philosophy each affect a very narrow set of case judgments. In fact, the only way to square our data with the existence of a carryover effect relies on introducing implausible hypotheses about the nature of philosophical expertise.

The weaknesses of an expertise-based account of our findings suggest that a better explanation of the data may be available, one that does not assume that variation in the curriculum influences case judgments via acquired cognitive skills. We believe that there is such an alternative explanation that fits well with the data, though it cannot account for all our observations. This alternative explanation says that most of the changes we have observed did not result from the students’ deploying new cognitive skills, but from the fact that they simply adopted specific beliefs endorsed by their teachers. This is not to say that philosophy instructors are bent on preserving textbook consensus or indoctrinating their students. Rather, in many classes, the student is required to know the canonical analysis and interpretation of certain thought experiments, and, having learned what they are, may simply adopt the corresponding beliefs without much deliberation. In sum, in light of our data, it is a plausible supposition that, when it comes to making case judgments elicited by philosophical thought experiments, professional philosophers do not have any special skills distinguishing them from laypeople. The significant difference between the two populations is that philosophers have accepted the “standard” interpretation of a number of philosophical thought experiments whereas the folk have not.

Besides accounting well for our data, this hypothesis seems simpler and more conservative than its expertise-based competition. It is not ad hoc, since the mechanism it invokes is familiar and well-established in psychology and social science. Furthermore, as things stand now, it meshes well with the findings of existing cross-sectional research on philosophical expertise. As we saw in Sect. 3, available cross-sectional studies indicate that professional philosophers asked to make case judgments are susceptible to many of the same biases as the folk. This suggests that, different though they may be, case judgments made by professional philosophers are by no means superior to those made by ordinary people.