Thought experiments are used in many domains, but their use in philosophy and physics stands out for particular inventiveness. To imagine travelling at the speed of light, chasing a light beam; or being confined to a black-and-white room for a large initial segment of one’s life; or to undergo, as a victim of a recent road crash, a hemispherectomy followed by partial brain transplants to one’s triplet siblings – these undertakings are rather more exciting than standard academic or textbook content.

Moreover, thought experiments are used for quite different purposes. They are sometimes used as heuristic aids or vivid illustrations of general principles, which is didactically handy, especially if the principles are abstract or difficult to grasp. They are also sometimes offered as interesting, puzzling phenomena, inviting novel theorizing and explanation. But the role which set philosophical debate about thought experiments really going – around 1990 – is the one where thought experiments function as tests of some general theory or claim, much like ordinary empirical experiments in a lab or in the field. [Footnote 1]

In his rich and sprawling book, Nenad Miscevic (2021) assimilates thought experiments primarily to the third role, that of tests, commenting that we may view “scientific TEs [thought experiments] as a relevant sub-species” of “scientific testing procedures” (p. 13), but his first chapter makes it clear that he thinks that thought experiments serve this role in philosophy as well. Noting, as any sober observer of the literature must, that the term is used in many different, and differentially liberal, senses, he in effect stipulates that it be reserved for this role. Other activities in the vicinity, such as the use of intuitions about felicity or syntactical well-formedness in linguistics, or of the imagination in political utopias and dystopias, fiction in general, meditation, and more, are subsumed together with thought experiments under the wider genus of “imaginative enactments in thought” (IET). [Footnote 2]

For the striking aspect of thought experiments, in any of the three roles I separated above, is of course that they make essential use of the imagination, like their brethren in the IET family. By contrast, ordinary experiments make critical use of observation, which is strikingly absent in the functioning of thought experiments even in their role as tests (save for the propaedeutic use of drawings, diagrams, matrices and suchlike in connection with some of them).

Hence, if we accept that thought experiments at least sometimes do indeed function as tests, several questions arise. The most immediate ones concern the epistemic credentials of their use in this role, which comes in two different stripes, depending on disciplinary context.

For thought experiments used as tests in physics, this was the topic of the initial debate in philosophy of science, in which Miscevic was involved both through his central role in the IUC in Dubrovnik, where several of the seminal contributions by Brown, Norton and others were first presented, and as a contributor and proponent, along with Nancy Nersessian, of an account drawing on mental models (Miscevic, 1992) – I will return briefly to this idea, which Miscevic elaborates in the book, further on.

For thought experiments serving to test philosophical theories, their epistemic value (or otherwise) is to a large extent the topic of current debates over philosophical methodology. These debates draw on the findings, from around 2000 onwards, of experimental philosophers like Weinberg, Stich, Machery and many others.

To recapitulate, what experimental philosophers claimed to find was that intuitions – I will mostly use the more neutral and less confusing term “case judgements” from now on – about cases varied with factors not easily construed as philosophically relevant, i.e. as pertinent to what is true in the domain covered by the theory under testing. Weinberg et al. (2001) and Machery et al. (2004) suggested that case judgements in epistemology and semantics vary with cultural background. Experimental philosophers have since gone on to claim that besides broadly speaking demographic effects (including cultural background, age, and gender), order effects, framing effects, and environmental effects significantly affect case judgements (see e.g. Weinberg, 2015, p. 177; Miscevic plausibly assimilates order effects to framing effects).

If such variation is widespread and large enough, philosophers’ case judgements start looking shaky, or so the restrictionist argument went (Miscevic uses “the negative program” for the same camp of x-phiers invoking empirical findings to curtail or pause the use of thought experiments as tests).

In my view, the epistemological debate surrounding the use of thought experiments as tests in philosophy has been rather more important than the corresponding one concerning thought experiments in physics or other sciences. This is because this role looms larger in philosophy (despite claims to the contrary by Deutsch (2010, 2015), Cappelen (2012) and recently Horvath (2022)). I also tend to agree with Miscevic that “… a big part of history of analytic philosophy, and some parts of general history of philosophy, can be understood as consisting of traditions starting with particular thought experiments… ” (p. 14). Note that the latter claim somewhat undercuts the centrality of testing: if philosophy cannot pride itself on very many stable findings, but rather on a proliferation of theories and ideas, that proliferation does not plausibly have its roots in something designed to test the theories and ideas. Hence, the latter claim seems most plausible for thought experiments serving as puzzle cases or anomalies, or showcase illustrations adding rhetorical oomph to an idea. But it still underscores the importance of thought experiments in philosophy. And as I stated, I do think that in analytic philosophy, their role as tests is indeed central.

By contrast, and despite the many striking imaginary cases invoked by great physicists – Miscevic offers lavish discussion of some of the most famous thought experiments by Galileo, Huygens, and Einstein – scientists don’t actually base their theory choices on thought experiments. They don’t have to. So whatever insight was provided by, for instance, Galileo’s thought experiment on falling bodies, the course of the history of mechanics didn’t really pivot on it. For many areas of philosophy, however, there seems little else besides thought experiments to test one’s theories on. There are exceptions, of course (philosophy of space and time, logic, parts of philosophy of language, perhaps theorizing on mental representation). But for normative ethics, metaphysics, epistemology and several other areas of philosophy, the prospects of alternative forms of data are meagre (as Miscevic stresses, p. 87). Hence the importance of current metaphilosophical debates.

For this reason, I will focus on the defence Miscevic offers in Chap. 6 against restrictionist critiques of thought experiments. This defence draws on his account of, as it were, the life cycle of thought experiments, offered throughout the book but especially in Chap. 5: the stages account. Miscevic accordingly dubs it The stages defence. And the stages defence is itself – or is at least presented as – a variant of the so-called expertise defence.

The expertise defence took off from the fact that the empirical studies first showing case judgements to vary with irrelevant factors used lay people, with little or no philosophical training, as subjects. Proponents of the defence argued that this blocks the induction from problematic variation among the public at large to professional, highly trained experts. To fully succeed, of course, the expertise defence has to render plausible that trained philosophers are in fact less susceptible to various biases and confounders than the lay populations sampled in early studies. But the gist of the argument is that trained philosophers should be credited with at least somewhat reduced susceptibility to – as opposed to immunity against – error either by default assumption (Williamson, 2011) or on empirical grounds (e.g. Schindler & Saint-Germier, 2022).

There is much to be said about the expertise defence; much has already been said, and this short commentary allows no room for a general discussion of it. Let me just note that the defence has probably fared a bit worse than Miscevic seems to suggest when he writes that it is a “widely accepted view” and constitutes “the most successful proposal [for defending the use of thought experiments as tests against restrictionism]” (p. 99).

In connection with the expertise defence, Miscevic distinguishes between what he calls a principled and a pedagogical issue (p. 101). The former concerns the question of how expertise might emerge at all; the latter how it may be imparted to a novice in a way that respects her intellectual autonomy, i.e. without prescribing her case judgements. In the book, it isn’t always clear which of these the Stages defence is meant to address, although Miscevic does occasionally flag the distinction in his discussion (e.g. at p. 112), but I take it that it is at least partly intended to address the principled issue. One thing to note, then, is that philosophy does seem peculiar in that the corresponding issue does not arise for expertise in, say, cancer diagnostics, chess, or meteorology. In these areas, feedback from the world in tandem with learning abilities makes the existence of genuine expertise wholly non-mysterious (and also, of course, partly answers the pedagogical issue). As noted above, many areas of philosophy do by contrast lack a corresponding corrective which might serve to calibrate judgements and enhance reliability.

Miscevic reckons with no fewer than eight or nine stages in the “biography” of a typical thought experiment. With apologies to him, I’ll somewhat recklessly compress these as follows. [Footnote 3] First, the thought experimenter presents a case, whose verbal presentation the subject has to interpret correctly, along with understanding the point of the case at hand. Then, based on her understanding of the scenario, the subject constructs a mental model of it – Miscevic calls this phase “tentative conscious production” (pp. 18, 86, 107). This is then often subconsciously elaborated, perhaps by specialised modules. As a result of such elaboration, an “immediate, spontaneous verdict, often non-conscious” (p. 109) is issued and surfaces to consciousness when the case judgement is expressed. The case judgement is then generalized into a broader intuitive judgement by a process of (somewhat Aristotelian) intuitive induction. The general judgement is then fitted, perhaps after modification, into an (ultimately wide) reflective equilibrium.

Although I have some doubts about the applicability of these stages to thought experiments in general, I will bracket these for present purposes and focus on the reliability-enhancing countermeasures proposed by Miscevic at various steps.

Miscevic rightly points out that much ameliorative work can be achieved at the early stages: the thought experiment may be worded in a way that avoids (i) misleading implicatures, (ii) misconstruals of the point of the case, and (iii) decreased motivation to engage with it due to needless complexity or unfamiliarity with its vocabulary and notions. This idea is in line with other ambitious proposals to explain some of the variation uncovered by experimental philosophers as (various sorts of) performance errors (such as Nagel, 2012). However, it is unclear to what extent this addresses more than the pedagogical issue. A benevolent teacher or careful experimentalist will (as Miscevic’s own classroom anecdote about his improved Gettier case exemplifies) seek to ensure that noise due to these factors is minimized. But it seems to me that this does not preclude influential cases, case judgements, and indeed whole research programs and philosophical traditions – to revert to Miscevic’s point about the influence of thought experiments – being based on ineliminable linguistic priming.

What I have in mind here is the idea, largely due to Eugen Fischer, that some historically famous thought experiments, like the so-called argument from illusion (in e.g. Hume, Russell, Ayer) used to motivate the existence of sense-data, are entirely driven by subconscious, automatic processes mimicking broadly Gricean pragmatic heuristics (Fischer & Engelhardt, 2016). The judgement in such cases that the protagonist does not perceive, say, a round object when looking at a coin at an angle, is – according to Fischer and Engelhardt – the product of automatic but erroneous stereotype-driven inferences implemented by processes working perfectly well but delivering false conclusions in the cases in question. And what I mean by “ineliminable” is that these cases can’t simply be cleansed from misleading pragmatics – there just is no such thing as argument-from-illusion cases anymore, if the stereotype-driven inferences are blocked.

In the later stages of a thought experiment, Miscevic sets great store by the ameliorative effects of collating and comparing judgements/intuitions, and eventually, maybe, even reaching what he calls a “collective wide equilibrium” (p. 115) including and accommodating insights from various fields and sciences. If achieved, this might result in “explicit understanding” (p. 113). As he puts it in summing up: “More comprehensive understanding is the goal, and collective reflective equilibrium the desired state.” (p. 114). Although Miscevic rejects the idea that such an equilibrium has to be strongly idealized, I have to confess to some difficulty in understanding what the notion of a collective wide equilibrium amounts to, unless heavily idealized. My main problem with the recipe offered, however, has to do with a basic problem with the method of reflective equilibrium once stressed by Cummins (1998), but not discussed much by Miscevic.

Cummins’ objection is that the value – as in the epistemic value of truth, or of getting things right (which is possibly different and more demanding than the goal of understanding) – of reflective equilibria depends on the trustworthiness of one half of them, namely the particular judgements involved. For many areas, the reliability of sources of data or particular judgements can be calibrated by other sources or forms of access to the target domain, or via theories based on such sources. But for many philosophical case judgements (“intuitions”) such access seems absent by the nature of the target domains, as already noted. Cummins argues that if the areas of philosophy – such as ethics or modal metaphysics – making use of intuitions did have the means for calibrating them, the intuitions wouldn’t be needed; short of calibration, however, the judgements don’t merit the requisite epistemic weight. Cummins further argues that the only credible source of intuitions in the areas is either tacit theory (which is highly idiosyncratic and contingent, and not to be trusted) or explicit theory (which makes particular judgements uninformative as evidence, hence as counterbalance in equilibria). There is no need to follow Cummins that far – I don’t find Nagel’s (2012) and others’ ideas about, in effect, detection abilities as the source of epistemic intuitions implausible, for instance – but the basic question remains of why we should care about achieving equilibrium with judgements we cannot calibrate. Again, this holds of course on the assumption that the goal is truth (or something like it) as opposed to just putting one’s own beliefs in order.

Cummins’ attack preceded experimental philosophy but anticipated many of its themes. Miscevic has an interesting discussion of the three main arguments of a leading restrictionist: Édouard Machery. One of Machery’s (2017) arguments is that case judgements in many philosophical thought experiments are unreliable (as suggested by an induction from empirical studies of seminal cases). The measures recommended by Miscevic at the early stages of a thought experiment’s “biography”, which I related above, are meant to partly mitigate unreliability.

But there is an interesting aspect of Machery’s unreliability argument which Miscevic doesn’t discuss, and on which I am curious to hear his view. Machery argues that unreliability pertains especially to “modally immodest” thought experiments that have certain “disturbing characteristics” – they pull apart properties that normally go together, they are unusual, and they entangle “superficial” (philosophically non-essential) and “target” (essential) content. Moreover, such cases are likely to be unreliable because of these characteristics. Interestingly, Machery claims that philosophical cases have these traits, which make for unreliability, non-accidentally: to fulfil their function of tests capable of differentiating between modally ambitious theories, they must have them. It would be interesting to learn what Miscevic thinks about this claim, especially in view of the fact that he does offer at least an implicit modal epistemology in connection with his notion of mental models, as in this passage (p. 61): “models of possibilities are constructed in a combinatorial manner from components, representations of actual items”. This is said by way of countering Brown’s (1991) platonism, but also seems potentially pertinent to the restricted modal skepticism defended by Machery.

The other two main arguments in Machery (2017) are Dogmatism and Parochialism. These form the two horns of a dilemma, but are somewhat curiously treated by Miscevic as independent – he doesn’t claim to fully rebut them, but optimistically hopes that they might be “tempered” by the collating and comparison of case judgements between subjects, and “cleared from … irrelevance” (p. 112). However, for Machery, they are quite different responses to two different paths a defender of armchair methodology may choose in the face of apparent variation, especially cultural variation.

One path is to insist (somewhat like Deutsch, 2010) that one judgement (typically one’s own) is correct, even in the face of dissent from others. Dogmatism argues that one should not do this, but instead suspend judgement, since the dissenters are often enough one’s epistemic peers. So Dogmatism is an argument from the epistemology of peer disagreement and is used on the assumption that variation reflects genuine disagreements.

Parochialism, by contrast, applies if the defender opts for relativism, and suggests that the variation reflects application of different concepts rather than genuine disagreement. In that case, Parochialism urges philosophers with orthodox case judgements to take a break from the study of – or conducted by means of – their own, parochial concepts rather than those they attribute to the people making differing judgements, i.e. those they would take themselves to be disagreeing with if they had instead opted for the first, dogmatist, line. After all, research allocation is largely “a zero sum game” (Machery, 2022, 336), and it is not clear from the outset which concepts (or properties) are most fruitfully used (or studied).

Much therefore depends on whether the defender of armchair judgements opts for relativism or not. Miscevic plausibly rejects the idea – not actually proposed by anyone, I think – that, say, East Asians and North Americans operate with extremely different concepts whose terms just happen to be homonyms (in the case of “knowledge”). But he seems open to a “moderate pluralism” of concepts, and suggests that “we as philosophers can live with this [form of pluralism or concept-relativism]” (p. 104). The interesting issue, then, is whether his pluralism extends so far as to face Parochialism (and if so, what he thinks of the argument as specifically directed against such relativism), or not (and if so, what his response to Dogmatism specifically would be – as opposed to just an openness to revision of one’s views as a result of comparing judgements and striving for collective reflective equilibrium, which appears to be Miscevic’s panacea for dogmatism in a more general sense), or whether he perhaps rejects Machery’s dilemma as false, perhaps on the grounds that it does not exhaust the options here.

There are many more issues in this rich book which it would be nice to discuss, but I will stop here. I am honoured to have been invited to this symposium.*

* Thanks to James Nguyen for helpful comments.