1 Introduction

This special issue on what some regard as a crisis of replicability in cognitive science (i.e. the observation that a large proportion of experimental results across a number of areas cannot be reliably replicated) is informed by three recent developments. First, philosophers of mind and cognitive science rely increasingly on empirical research, mainly in the psychological sciences, to back up their claims. This trend has been noticeable since the 1960s (see Knobe 2015). This development has allowed philosophers to draw on a wider range of relevant resources, but it also makes them vulnerable to relying on claims that may not survive further scrutiny. If we have reasons to believe that a large proportion of findings in the psychological sciences cannot be reliably replicated, this would be a problem for philosophers who use such findings in their work.

Second, philosophers are increasingly designing and carrying out their own experiments to back up claims, or to test claims earlier made from the armchair, for example, on the perceived permissibility of diverting trolleys or on the nature of free will. This growing field of experimental philosophy has diversified the intellectual field in philosophy, but may also be vulnerable to issues of replicability that philosophers did not face before.

Third, the recent evidence of apparently widespread non-replicability in the social sciences (and other fields) has forced philosophers of science to grapple with long-standing questions in their field from a new perspective. To what extent does replicability matter for theory construction? How do the notions of replicability and scientific progress interact? How can normative insights from philosophy of science be used to improve scientific practice?

It is with these three developments in mind—the increasing importance of empirically informed philosophy, of experimental philosophy, and of work in the philosophy of science on replicability—that this special issue has been conceived.

2 Background

The epidemiologist and meta-scientist John Ioannidis’ groundbreaking work in the medical and biological sciences shed light on the fact that a worryingly large percentage of scientific findings that inform everyday medical practice may not be reproducible (2005a, b). This work was the catalyst for large-scale meta-analyses of prior studies and a raft of new studies seeking to quantify and understand the root causes of non-replicability in other fields such as economics (Camerer et al. 2016; Camerer et al. 2018; Ioannidis et al. 2017), marketing (Hunter 2001; Aichner et al. 2016), sports science (Halperin et al. 2018), water resource management (Stagge et al. 2019), computer science (Ferrari Dacrema et al. 2019; Ekstrand et al. 2011), and psychological science (Open Science Collaboration 2015; Klein 2018; Stanley et al. 2018; Sprouse 2011).

Specifically in the psychological sciences (the scientific field to which the guest editors and ROPP are most closely connected), recent work documenting unsuccessful attempts at replication has been fascinating from a sociological, methodological, and philosophical point of view. Despite some high-profile instances of fraud, such as the so-called Stapel affair (Carpenter 2012), much of the focus in this literature has been on the large-scale production of non-reliable findings that may be a product of structurally systematic causes going beyond individual fraudulent acts. The Open Science Collaboration project (OSC 2015) estimated the replicability of 100 experiments from three of experimental psychology’s most prominent journals through collectively organized direct replication attempts. Their overall finding was that only 36% of the replication attempts produced statistically significant findings using conventional methods of statistical analysis (i.e., Null Hypothesis Significance Testing). By comparison, 97% of the studies in the original published set reported statistically significant findings using the same method of analysis. Camerer et al. (2018) carried out a similar large-scale study examining the replicability of 21 papers from the social and behavioral sciences appearing in Nature and Science, and found that only 13 (roughly 62%) replicated according to similar criteria. A final project from the Center for Open Science (Klein et al. 2018) put together a team of 186 researchers from 60 different laboratories to conduct direct replications of 28 high-profile classic and modern findings from psychology. The authors found that only half of the studies replicated, but that those that did replicate tended to do so in most samples.
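To see how a literature can be almost entirely “significant” in print and yet far less replicable, it may help to consider a toy simulation: if only a fraction of tested hypotheses describe real effects, studies are modestly powered, and only significant results are published, then direct replications will succeed much less often than the original significance rate suggests. The sketch below is purely illustrative; the proportion of true effects, the effect size, and the sample sizes are our assumptions, not parameters estimated from the OSC data.

```python
# Illustrative simulation (not an analysis of the OSC data): a literature in
# which only some tested hypotheses are true, and only significant results
# get published, can be nearly 100% "significant" in print yet show a much
# lower direct-replication rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_per_group = 5000, 30      # assumed values, for illustration only
prop_true, true_d = 0.3, 0.4           # 30% real effects of size d = 0.4 (assumed)

def run_experiment(effect):
    a = rng.normal(effect, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    return stats.ttest_ind(a, b).pvalue

is_true = rng.random(n_studies) < prop_true
original_p = np.array([run_experiment(true_d if t else 0.0) for t in is_true])

published = original_p < 0.05                       # publication filter
replication_p = np.array([run_experiment(true_d if t else 0.0)
                          for t in is_true[published]])

print(f"Original published studies significant: 100% by construction "
      f"({published.sum()} of {n_studies} attempts)")
print(f"Direct replications significant: {np.mean(replication_p < 0.05):.0%}")
```

On assumptions like these, the replication rate falls to a fraction of the original significance rate simply because underpowered false positives are overrepresented among published findings.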

Together this massive body of work compellingly documents some of the challenges modern experimental psychology faces with regard to the interpretability and reliability of its findings. Complementary work has attempted to understand the root causes of wide-scale non-replicability and to propose practical solutions to help rectify the problem. The root causes that have been discussed can be divided into “ultimate causes”, which relate to underlying motivations and rewards, and “proximate causes”, which relate to specific sub-optimal practices present as experiments are being designed, conducted, and analyzed. One useful rule of thumb in thinking about this distinction is that ultimate causes, unlike proximate causes, typically predict not only that there will be error but also the direction of error, as we will see below.

Prominent ultimate causes that have received attention in recent years include publication bias and the related “file drawer problem” (Rosenthal 1979; Ioannidis 2005a), whereby null results are less likely to be published than positive results; financial and career incentives for positive and surprising findings (Heesen 2018); the high costs associated with direct replication attempts (Everett and Earp 2015); confirmation and experimenter bias (Strickland and Suben 2012; Rosenthal and Fode 1963); and even a desire on the part of the experimenter to see the truth propagated (Bright 2017). All of these types of causes predict a specific direction of expected error or bias. For example, as Bright (2017) argued, if scientists have an earnest desire to sway the opinions of their peers toward what they think is true, they may be incentivized to illegitimately produce results in line with their perceived truth. Thus the direction of expected error here would be towards conformity with experimenters’ prior beliefs. Proximate causes, in contrast to ultimate causes, do not offer insight into the direction of expected error. They instead explain why results may end up being imperfect or noisy, regardless of directionality. Prominent proximate causes which have been discussed include substandard analytic and statistical practices (such as allowing too many degrees of freedom in statistical analyses; Simmons et al. 2011); low statistical power (Ioannidis 2005a); and suboptimal measurement practices (e.g. Doyen et al. 2012).
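The effect of such analytic flexibility is easy to demonstrate. The following sketch is written in the spirit of Simmons et al.’s (2011) argument rather than as a reproduction of their simulations: it assumes there is no true effect at all and allows just two common degrees of freedom (a choice between two correlated outcome measures, plus one round of adding participants and re-testing). Even this modest flexibility pushes the false-positive rate well above the nominal 5%. The sample sizes and correlation are our own illustrative choices.

```python
# Illustrative sketch of "researcher degrees of freedom": with no true effect,
# testing two correlated outcomes and peeking once to add more data inflates
# the false-positive rate well beyond the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n1, n_extra = 10000, 20, 10   # assumed sample sizes, illustration only
false_positives = 0

for _ in range(n_sims):
    # Two correlated dependent variables, no true group difference.
    group_a = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], n1 + n_extra)
    group_b = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], n1 + n_extra)
    p_values = []
    for n in (n1, n1 + n_extra):              # peek at n = 20, then at n = 30
        for dv in (0, 1):                     # try either dependent variable
            p_values.append(stats.ttest_ind(group_a[:n, dv],
                                            group_b[:n, dv]).pvalue)
    false_positives += min(p_values) < 0.05   # report whichever test "worked"

print(f"False-positive rate with flexible analysis: {false_positives / n_sims:.1%}")
```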

Early in the replication crisis, Ioannidis (2012) advanced the idea that widespread non-replicability was evidence that science is not necessarily self-correcting, or at least not as self-correcting as one would hope. However, the wide-scale response in the years that have followed brings this idea directly into question and, in our view, actually shows clear evidence of self-correction, at least in terms of the norms and practices that respond to systemic issues around reliability (even if not every non-reliable study ever published will undergo explicit correction, nor would this necessarily be desirable given the huge amount of resources it would take to replicate every study).

Along these lines, a number of solutions and practical responses to the problem of non-replicability have been proposed and, in some cases, adopted at scale. Standards are now generally higher for the statistical power required in order to publish (Cumming 2012), as well as for norms of reporting and the public availability of data (Shrout and Rodgers 2018). Pre-registration, meant to curb problematic flexibility in statistical analysis as well as data cherry-picking, is increasingly widely practiced in the psychological sciences (Simmons et al. 2011; Kupferschmidt 2018). Journals also often encourage direct replications prior to publication and are more tolerant of publishing non-replications (Nelson et al. 2018), though this latter trend needs to be balanced against the general informativeness and interest of published findings. Finally, many journals have moved towards accepting registered reports prior to the results of a study being known, which additionally helps address the file drawer problem. It is highly likely that these practices will reduce the overall non-replicability rate of published findings in the field (Stromland 2019), though it remains to be seen what costs, particularly in terms of innovation, these new approaches may incur (Goldin-Meadow 2016).
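As a concrete illustration of what a higher standard for statistical power involves in practice, the sketch below runs a standard prospective power calculation of the kind now commonly expected at submission or pre-registration (here using the statsmodels library, though any power calculator would do). The effect sizes, alpha level, and 80% power target are conventional illustrative choices on our part, not values mandated by any journal.

```python
# A minimal prospective power calculation: how many participants per group
# are needed to detect a given standardized effect with adequate power?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for effect_size in (0.2, 0.4, 0.8):          # small, moderate, large (Cohen's d)
    n_per_group = analysis.solve_power(effect_size=effect_size,
                                       alpha=0.05,
                                       power=0.80,
                                       alternative='two-sided')
    print(f"d = {effect_size}: about {n_per_group:.0f} participants per group")
```

The steep growth in required sample size as the target effect shrinks is one reason why raising power standards changes the economics of running studies.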

Building on this family of responses to the replicability crisis, a next generation of practices and tools is being created that pushes the boundaries, in terms of scale and breadth, of how the social and behavioral sciences are conducted. These include, for example, prediction markets that identify candidate findings meriting direct replication attempts (Dreber et al. 2015), meta-analytic methods that can help identify “p-hacking” (van Aert et al. 2019), and large-scale “conceptual” replication efforts that, unlike the direct replication studies described above, help give a sense of how theoretically robust a finding is across a number of experimental parameters where one would expect a given type of finding to appear (Landy et al. 2020).

3 Content of this Special Issue

Our special issue appears in this broader context. Experimental philosophers, philosophers of science, and philosophically-minded psychologists have offered unique insights into the problem of replicability, what its causes might be, and what responses might be promoted in order to improve replicability rates.

The special issue includes one large-scale, collectively organized replicability study which concentrates specifically on experiments coming from the field of experimental philosophy (x-phi) (Knobe 2015). This study, by Cova et al., was proposed to the editorial board at Review of Philosophy and Psychology as part of the proposal for the current special issue. It involved coordinating 20 research teams across 8 countries in order to directly replicate 40 individual experiments from the field. The study found a successful replication rate of about 70% according to a range of criteria. Quantifying replicability in x-phi has proven useful, as it strongly suggests that it is not inevitable that replicability rates hover at or below the 50% mark observed so far in the social and behavioral sciences. While 70% may sound very good, it is hard to provide a benchmark for what percentage of studies should be replicable in a field, since a lot depends on tradeoffs regarding experimental design, how easy it is to recruit participants, and other factors. Furthermore, the study provides some potential insight into why replicability may vary from field to field and thus helps us improve overall practice. In particular, x-phi studies are generally run online using tools and methods where replication is cheap and easy, for example, Amazon’s Mechanical Turk (MTurk) platform for crowd-sourcing tasks. This dynamic may mean that many groups replicate internally before publishing, and may also change the cost-benefit analysis of publishing “risky” results. Given the low cost of replication, researchers may rationally think there is a good chance that others will attempt to reproduce their work, making them more cautious.

The study also provides some understanding of more particular proximate factors that may explain high vs. low replicability rates. In particular, when the main finding of a study was an observed difference between experimental conditions within a given sample (for example, a difference in the mean value of a distribution of scores), it tended to replicate at a high rate. However, when the main finding highlighted a difference between samples drawn from different populations (e.g. when looking at cultural variability), it tended to replicate at a lower rate.

While this study provides reasons to be optimistic about the field of experimental philosophy, Andrea Polonioli, Mariana Vega-Mendoza, Brittany Blankinship and David Carmel caution in their paper “Reporting in Experimental Philosophy: Current Standards and Recommendations for Future Practice” that the field relies extensively on null hypothesis significance testing, but has only partially adopted additional measures that help to bolster the results (especially in light of proposed shortcomings of significance testing against the null; see Trafimow and Earp 2017; Cumming 2012). In their review of 134 recent experimental philosophy papers, they find that only 53% of the papers report an effect size, 28% report confidence intervals, 1% examine prospective statistical power, and 5% report observed statistical power. Intriguingly, the extent to which these additional measures are adopted does not affect how often a paper is cited.
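For readers unfamiliar with these reporting practices, the sketch below shows what it looks like to report a standardized effect size and a confidence interval alongside the p-value rather than the p-value alone. The data are simulated and the interval uses a common normal approximation to the standard error of Cohen’s d, so this is an illustration of the practice rather than a recommended analysis pipeline.

```python
# Illustrative reporting of an effect size and confidence interval alongside
# the p-value.  Data are simulated; the CI uses a standard approximation to
# the standard error of Cohen's d.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(0.5, 1, 60)   # hypothetical condition A
b = rng.normal(0.0, 1, 60)   # hypothetical condition B

t, p = stats.ttest_ind(a, b)
n1, n2 = len(a), len(b)
pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1))
                    / (n1 + n2 - 2))
d = (a.mean() - b.mean()) / pooled_sd
se_d = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))  # approximation
ci = (d - 1.96 * se_d, d + 1.96 * se_d)

print(f"t({n1 + n2 - 2}) = {t:.2f}, p = {p:.3f}, "
      f"d = {d:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```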

Other articles in the special issue examine a range of ultimate and proximate causes of non-replicability within the social and behavioral sciences at large, and propose novel measures that deserve consideration as candidate scalable solutions. In “The Alpha War”, Edouard Machery argues in favor of decreasing the significance threshold required for publishability by an order of magnitude, a measure that would be practical in many cases, likely effective, and could thus be broadly implemented.
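One way to appreciate the force of such a proposal is through the positive predictive value framework popularized by Ioannidis (2005a): holding fixed the prior probability that a tested hypothesis is true and the power of the test, a lower significance threshold sharply raises the proportion of significant findings that reflect real effects. The sketch below works through that arithmetic with illustrative values of our own choosing (it also holds power constant, which overstates the benefit somewhat, since at a fixed sample size a lower alpha also means lower power).

```python
# Positive predictive value (PPV) of a significant result, in the style of
# Ioannidis (2005a): PPV = power*prior / (power*prior + alpha*(1 - prior)).
# The prior and power values below are illustrative assumptions.
def ppv(alpha, power, prior):
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

prior, power = 0.2, 0.8   # assumed: 20% of tested hypotheses true, 80% power
for alpha in (0.05, 0.005):
    print(f"alpha = {alpha}: PPV of a significant finding = "
          f"{ppv(alpha, power, prior):.0%}")
```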

Deborah Mayo’s “Significance tests: Vitiated or Vindicated by the Replication Crisis in Psychology?” pushes back against the idea that statistical significance testing is to blame for unreplicable results on the grounds that it makes it too easy to find effects. As Mayo argues, such claims frequently miss the mark of what statistical significance testing is supposed to do. Even Ronald Fisher, a statistician who contributed to the theory behind null hypothesis testing in the 1930s, cautioned that one cannot demonstrate a genuine experimental phenomenon on the basis of a single small p-value; yet treating isolated small p-values as demonstrations is precisely what critics charge the field with doing collectively. Moreover, Mayo argues that alternatives to significance testing, such as likelihood ratios, Bayes factors, or Bayesian updating, do not fare better than statistical significance based on alpha thresholds and p-values, and might even give bias a free pass: unlike significance tests, these methods cannot pick up on how data-dredging alters the capabilities of tests to distinguish genuine effects from noise.

Relatedly, in their paper “Statistical Reform and the Replication Crisis”, Lincoln John Colling and Dénes Szűcs compare the frequentist and Bayesian approaches as two very different perspectives on evidence and inference. They argue that the frequentist approach prioritizes error control, while the Bayesian approach offers a formal method for quantifying the relative strength of evidence for hypotheses.
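The contrast can be made concrete with a textbook binomial example, in which the same data are summarized both by a two-sided p-value (in the service of long-run error control) and by a Bayes factor (a quantification of relative evidential strength). The uniform prior on the alternative hypothesis below is our own simplifying assumption, chosen because it makes the Bayes factor exact, and the data are hypothetical; the example is not drawn from Colling and Szűcs’s paper.

```python
# One data set, two summaries: a two-sided binomial p-value versus a Bayes
# factor comparing H0: theta = 0.5 against H1: theta ~ Uniform(0, 1).
# Under a uniform prior the marginal likelihood of k successes in n trials
# is 1 / (n + 1), which makes the Bayes factor exact and easy to compute.
from scipy import stats

k, n = 61, 100                      # hypothetical data: 61 successes in 100 trials

# Frequentist summary: error control via a significance test.
p_value = stats.binomtest(k, n, p=0.5, alternative='two-sided').pvalue

# Bayesian summary: relative evidence for H1 over H0.
likelihood_h0 = stats.binom.pmf(k, n, 0.5)
marginal_h1 = 1.0 / (n + 1)         # integral of the likelihood over the uniform prior
bf_10 = marginal_h1 / likelihood_h0

print(f"p-value = {p_value:.3f}  (reject H0 at alpha = .05: {p_value < 0.05})")
print(f"Bayes factor BF10 = {bf_10:.2f}  (evidence for H1 relative to H0)")
```

On these made-up numbers the significance test rejects the null at the conventional .05 threshold while the Bayes factor indicates only weak evidence for the alternative, which is exactly the kind of divergence that makes the choice of framework consequential.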

Finally, Mary Amon and John Holden advocate a general systems framework that can serve as a complement to standard inferential statistics and better accommodate intrinsic fluctuations and contextual adaptations.

In our view, this collection of articles provides a unique perspective on the problem of replicability from philosophers of science and experimental philosophers. We hope that this special issue might spur further dialogue within these communities around this important topic.