1 Introduction

Philosophers of science have long debated the comparative epistemic value of novel prediction and accommodation: does a theory receive stronger confirmation from data that was not used in its construction than from data it was specifically designed to fit? One influential approach to this problem argues that novel prediction provides an advantage due to the disadvantages of accommodation (see Mayo, 1996, pp. 294–318; Lipton, 2004, pp. 164–183; Hitchcock & Sober, 2004). These predictivist theories posit that accommodation is associated with various methodological issues that should prompt evaluators to lower their confidence in the accommodative theory. Novel prediction provides evidence that these problematic forms of accommodation have not occurred, and hence a novelly successful theory warrants greater confidence than an accommodatively successful theory.

The argument from low-quality accommodation supplies a compelling motivation for predictivism, and its appeal has also been recognized in the sciences by scientists expressing both approving (e.g. Kerr, 1998, p. 207) and skeptical (e.g. Weinberg, 1993, p. 96) views about predictivism. The thought is straightforward: if a theory is designed with particular evidence in mind, an opportunity arises for the theorist to tinker with their theory or methodology to ensure a fit with the evidence. If a theory fits the evidence without having been designed for this purpose, such suspicions about the theory’s origin are alleviated. Other things being equal, there is thus a reason to prefer the predictively successful theory to the accommodative one. Several philosophers of science have argued that such considerations underlie a predictivist advantage in theory confirmation: novel prediction provides, under certain circumstances, greater confirmation to a scientific theory than accommodation because it alleviates concerns about the possibility of low-quality accommodation (see Mayo, 1996, pp. 294–318; Lipton, 2004, pp. 164–183; Hitchcock & Sober, 2004).

The purpose of this paper is to critically evaluate the predictivist argument from low-quality accommodation. Does novel prediction provide a confirmatory advantage in science due to potential problems with accommodation? Contrary to commonly held predictivist views (for recent summaries, see Rubin & Donkin 2022; Dellsén 2023), I argue that there is currently insufficient evidence to conclude that novel prediction is advantageous due to issues with accommodation. Novel prediction is also associated with certain disadvantages in theory confirmation, and accommodation with certain advantages. Other evidential factors can influence whether novel prediction or accommodation becomes preferable, or often render the distinction between them irrelevant to theory evaluation. Finally, recent evidence about the consequences of novel prediction and accommodation in the sciences suggests that their differences may not be very significant. All in all, novel prediction and accommodation appear roughly on a par, or accommodation is even superior to novel prediction in the current context (cf. Dellsén 2023).

The paper provides new evidence about the evidential differences between novel prediction and accommodation and the usefulness of this distinction to the philosophy of science. The predictivism debate has also attracted increasing interest in the sciences in connection with what is known as ‘the replication crisis,’ in which a significant proportion of research findings have failed to replicate in several fields (e.g. Open Science Collaboration, 2015; Head et al., 2015; Nosek et al., 2018). Scientists have engaged in an active debate about the causes of the replication crisis, and as one possible culprit, they have raised the issue of HARKing, i.e. “hypothesizing after results are known” (see Kerr, 1998; Rubin, 2017, 2022). HARKing corresponds roughly to the philosophical notion of accommodation (see Rubin, 2017), and a number of scientists have argued that it could be among the causes of low replicability (e.g. Bosco et al., 2016; Nosek et al., 2018). Recently, Rubin (2022) provided a critical evaluation of the potential costs of HARKing to science raised by Kerr (1998). The current paper tackles predictivist arguments that have appeared in the philosophy of science literature, focusing in particular on the epistemic question of whether novel prediction provides greater confirmation to a theory than accommodation.

The paper is organized as follows. In Sect. 2, I outline the prediction vs. accommodation controversy and the predictivist argument that is based on the problems of accommodation. Section 3 explores situations and contexts where philosophers of science have argued for a predictivist advantage due to problems with accommodation. Section 3.1 discusses the problem of overfitting, where a statistical model is fit too closely to the idiosyncrasies of a particular dataset (see Hitchcock & Sober, 2004). Section 3.2 addresses ‘hypothesis hunting,’ which involves searching for statistically significant effects after statistical tests have already been performed (see Mayo, 1996, pp. 294–318). Section 3.3 explores ‘fudging,’ i.e. the generation of convoluted or overly complex theories for the purpose of accommodating particular results (see Lipton, 2004, pp. 164–183). Section 4 concludes the discussion and explores some consequences of the results for both science and the philosophy of science.

2 Preliminaries

If theory T was designed on the condition that it fit evidence e, T is said to accommodate e. If T fits e but T was not designed to fit e, T is said to (use-)novelly predict e.Footnote 1 Philosophers of science have held different attitudes about the impact of this distinction on theory confirmation. Predictivists argue that e provides stronger confirmation to T in the latter case, either generally or at least in some important scientific contexts. Anti-predictivists claim that this distinction is in no way special: novel prediction does not stand out in any particular way among other epistemic considerations in science (e.g. Harker, 2008). Finally, accommodationists endorse the reverse thesis, according to which accommodation is sometimes or even generally superior to novel prediction (see Dellsén 2023).

Most contemporary predictivists have defended a ‘weak,’ ‘local’ version of predictivism (see Hitchcock & Sober, 2004, pp. 3–4; Barnes, 2022). Local predictivism holds that the predictivist advantage applies in certain circumstances only rather than universally (versus ‘global’ predictivism). The relevant situations are identified by the predictivist theory in question, and several such situations have been argued to arise in science (see Sect. 3). Local predictivism intersects with local accommodationism, so that one can be a predictivist in certain contexts but an accommodationist in others (see Hitchcock & Sober, 2004; Dellsén 2023). Weak predictivism argues that novel prediction is not inherently more valuable than accommodation (versus ‘strong’ predictivism), but rather the advantage is an indirect one. Novel prediction has value because it correlates with or is symptomatic of some other epistemically relevant factor that counts for theory confirmation.

Several candidates have been put forward for what the other evidential factor that novel prediction correlates with might be (see Barnes, 2022). One popular way to make the argument is to appeal to the disadvantages of accommodation. In these predictivist theories, the distinction between novel prediction and accommodation is argued to speak to the methodological practices that were employed in designing the theory. Accommodation is associated with certain problematic methodologies that are either not an issue or are less of an issue with novel prediction. Novel prediction provides evidence that these problematic methodologies have been avoided, and so it provides stronger confirmation.Footnote 2

The accommodation-based predictivist argument has provided a solution to multiple problems that anti-predictivists have raised with predictivism, making it an attractive approach to defend predictivism. First, the novelty of evidence is a contingent, historical matter that concerns the question of whether a theorist has made use of some evidence. Yet, many have held that a philosophical theory of scientific confirmation should only involve the contents of the theory, the evidence, and scientific background knowledge (see Musgrave, 1974). How on earth could the biography or intentions of the theorist be relevant to theory confirmation (e.g. Lipton, 2004, pp. 165–166)? The accommodation-based approach provides a simple solution: when theorists use evidence, they may make problematic methodological choices that in themselves should lead to a lower assessment of the theory. Thus, other things being equal, the theory warrants greater confidence in cases where certain evidence that fits the theory has not been used by the theorist, as this provides a reason to think that these problematic practices have not occurred.

Another issue concerns the confirmatory relevance of the distinction between novel prediction and accommodation. Predictivism is essentially a counterfactual thesis where we evaluate if it was better, or if it would have been better, to novelly predict rather than accommodate particular evidence in a particular situation (see Lipton, 2004, p. 165; Barnes, 2008, p. 5). Predictions and accommodations occur, however, in a wide variety of contexts in science, intersecting with many other epistemic factors and considerations. It may thus be easy to mistake another evidential distinction for a predictivist advantage (see Barnes, 2008, p. 5). For example, in any case where new evidence is produced for an existing theory, the evidence automatically becomes use-novel (as the original theorist could not have used evidence that did not exist when designing their theory), and confirmation for the theory also increases as the theory is now supported by more evidence.Footnote 3 A predictivist advantage in such a situation means that the fact that the evidence is use-novel rather than accommodated counts for some extra degree of confirmation, over and above the simple observation that having more evidence is better than having less evidence. The accommodation-based approach meets this challenge and provides a substantive predictivist argument. In certain situations, evidence does become better for the specific reason that it is use-novel, because if it were accommodated this would have raised concerns about the possibility of low-quality accommodation (see Mayo, 1996, pp. 294–318; Lipton, 2004, pp. 164–183; Hitchcock & Sober, 2004). Novel prediction thus provides a distinct contribution to confirmation: the fact that the evidence was not used by the theorist brings a further epistemic benefit because it provides counterevidence about the possibility of problematic accommodation.

A final concern raised with predictivism is the following: weak predictivists argue that novel prediction matters because it speaks to certain other epistemic factors, e.g. the use of appropriate methodological practices. But, scientific theories and the evidence for them are generally made available for public evaluation (e.g. Howson, 1988; Howson & Franklin, 1991). Whatever these other epistemic factors are, scientists will simply evaluate them as such, and the predictivist distinction is after all irrelevant in actual scientific evaluations (see, for example, Horwich, 1982, p. 117). Here, weak predictivists have appealed in one way or another to scientific evaluators’ access to epistemically relevant information about the theory, the evidence, and the methods. Even if we have the theory and the evidence explicitly on the table, scientists may for a variety of reasons be unable to determine whether the problems associated with accommodation have occurred (see, for example, Lipton, 2004, pp. 177–180). Thus, novel prediction matters to science even if scientists are generally expected to present their theory and evidence publicly.

In the following section, I provide a critical evaluation of weak predictivist theories in the philosophical literature that have adopted this accommodation-based strategy.

3 Forms of low-quality accommodation

If a theorist collects evidence and uses it in designing their theory, why can that be disadvantageous from the epistemic point of view? In this section, I discuss answers that philosophers of science have given to this question and evaluate whether novel prediction provides epistemic advantages over accommodation for these reasons. In each case, I raise several skeptical questions about the value of novel prediction in comparison to accommodation.

3.1 Problem #1: overfitting

A common research practice in the sciences involves fitting a curve to a dataset. The aim of curve fitting is to approximate underlying patterns in the data by selecting a model and tuning its parameters to achieve a fit between the model’s predictions and the actual data points. Often, a desirable end-goal is to produce a model that will achieve predictive accuracy with out-of-sample data, i.e. a model that has the highest expected predictive accuracy with further data samples drawn from the target population.

Hitchcock and Sober (2004) set up an example where we compare the evidential value of novel prediction and accommodation in relation to this problem. They consider a prediction competition between two modelers that involves modeling the relationship between certain variables, e.g. X = daily caloric intake and Y = weight, in dataset D. Both modelers posit a functional relationship between X and Y that is polynomial in form: Y = a_rX^r + a_{r−1}X^{r−1} + ··· + a_1X + a_0. ‘Penny Predictor’ constructed her model (Mp) as follows. She first fit a function to a subset of the dataset (D1), and she then predicted the remaining data (D2) to a high degree of accuracy. ‘Annie Accommodator’ posited her model (Ma) based on the entire dataset, i.e. she accommodated all of the data. The question then becomes, should we prefer Mp or Ma in terms of expected predictive accuracy?

Hitchcock and Sober (2004) provide a nuanced answer, developing their argument around a commonly encountered issue in statistical modeling: the problem of overfitting. A well-known finding in statistics is that models that are overly complex often perform poorly in predicting new data. Real-world datasets typically include ‘noise’ that is not reflective of any underlying data generating process, arising from factors such as measurement errors, the influence of unmeasured confounding variables, and other peculiarities in the data collection process. More complex models tend to ‘overfit’ the data by becoming sensitive to the noise in the particular dataset, which leads to impaired predictive performance in other datasets. In statistical modeling, therefore, steps must be taken to address the overfitting problem.

Several methods are used for this purpose. Hitchcock and Sober discuss in particular the Akaike Information Criterion (AIC). Akaike (1973) famously showed that an unbiased estimate of the predictive accuracy of a model can be obtained by balancing its fit-to-data and complexity. The estimate of the predictive accuracy of model M ≈ log[Pr(Data | L(M))] − k, where L(M) is the likeliest member of M and k is the number of adjustable parameters in M. The AIC rewards models for their fit-to-data but also includes an increasing penalty for the complexity of the model; the more adjustable parameters are added to the model, the greater the penalty becomes. In this way, the AIC achieves a balance between goodness-of-fit and complexity and provides a potential solution to the overfitting problem.
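To make the Akaike-style comparison concrete, the following Python sketch (my own illustration, not Hitchcock and Sober’s; the simulated ‘caloric intake vs. weight’ data and all numerical choices are assumptions) fits polynomial models of increasing degree to the same data and scores each as the log-likelihood of its best-fitting member minus its number of adjustable parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "daily caloric intake vs. weight" data: an approximately linear truth plus noise.
calories = rng.uniform(1500, 3500, size=60)
weight = 50 + 0.01 * calories + rng.normal(0, 4, size=60)

# Standardize the predictor so that higher-degree polynomial fits stay well conditioned.
x = (calories - calories.mean()) / calories.std()
y = weight

def akaike_score(x, y, degree):
    """Log-likelihood of the best-fitting degree-`degree` polynomial (Gaussian noise,
    maximum-likelihood variance) minus its number of adjustable parameters."""
    coeffs = np.polyfit(x, y, degree)          # L(M): the likeliest member of the family M
    residuals = y - np.polyval(coeffs, x)
    sigma2 = residuals.var()                   # maximum-likelihood estimate of the noise variance
    n = len(y)
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = degree + 1                             # adjustable parameters (polynomial coefficients)
    return log_lik - k                         # higher score = better estimated predictive accuracy

for d in (1, 2, 6):
    print(f"degree {d}: estimated predictive accuracy = {akaike_score(x, y, d):.1f}")
```

Higher scores correspond to lower values of the conventional AIC (AIC = 2k − 2 log-likelihood), so the two formulations select the same model; with an approximately linear data generating process, the low-degree models typically win despite the higher-degree model’s slightly better raw fit.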

With this as background, Hitchcock and Sober consider different types of cases in the competition between Penny and Annie. They argue that differences in the confirmatory value of novel prediction and accommodation turn crucially on what we know about the respective models and model selection methods:

A) If we know the contents of Mp and Ma, or at least know enough to calculate their AIC-scores, Hitchcock and Sober argue that we should prefer the model with the better AIC-score (i.e. the lower AIC, corresponding to the higher estimated predictive accuracy). Here, they hold that an accommodationist advantage in favor of Ma comes into effect: when the overfitting problem is dealt with, it is simply better to use more rather than less data in fitting one’s model. Hitchcock and Sober (2004, p. 17) emphasize that when both Penny and Annie use AIC in selecting their model, Penny’s model cannot have a higher estimated predictive accuracy than Annie’s, given that Penny has used only part of the data, and thus accommodationism holds.

B) If we do not know either the contents of Mp and Ma or the methods that were used to select them, Hitchcock and Sober argue that the situation can now reverse and Mp has the advantage. In cases where we (a) have specific reasons to suspect that both Annie and Penny could have overfitted their model, or (b) do not have any information about whether they have addressed overfitting, we should prefer Mp over Ma. In these cases, Annie could be guilty of overfitting, but Penny’s novel predictive success provides evidence that she has successfully avoided overfitting her model. Thus, predictivism holds.

Hitchcock and Sober’s predictivist and accommodationist arguments connect to certain broad, widely recognized epistemic principles. The accommodationist argument aligns with ‘the Principle of Total Evidence,’ which maintains that we should use all the evidence that we have. When the overfitting problem is dealt with, Hitchcock and Sober argue that this principle applies, and the use of more data becomes advantageous. The predictivist argument relies on a certain form of inductive reasoning. Predictive success with a use-novel dataset provides evidence that the model has not overfitted the data, which provides some inductive grounds for expecting that the predictive success will carry on in the future in other datasets (providing that the world continues to cooperate) (see Hitchcock & Sober, 2004, pp. 19–20).

Hitchcock and Sober note that the extent to which their predictivist or accommodationist arguments apply in actual scientific contexts is unclear, as this depends on whether scientific cases more closely resemble cases of type A or type B introduced above.Footnote 4 Their chief purpose at this point has been to provide a framework for exploring situations where a predictive advantage could sometimes arise, calling for others to explore further cases (see Hitchcock & Sober, 2004, pp. 20–21). Varying the relevant conditions, or challenging specific assumptions, could give rise to further predictivist (or accommodationist) arguments. For example, Douglas and Magnus (2013, pp. 582–584) develop a more extensive predictivist argument, arguing that novel prediction is associated with greater advantages than Hitchcock and Sober suggest. They argue that whether the assumptions underlying AIC are met in practice is often unclear, so achieving novel predictive success with another dataset provides, after all, more compelling evidence about the model’s predictive performance than accommodation with AIC.

In this section, I seek to expand the discussion on this problem by considering further relevant aspects of this issue. To what extent are there reasons to believe that either predictivist or accommodationist advantages apply in scientific contexts due to the overfitting issue? In brief, I argue for two points. First, something that has been overlooked in the discussion so far is that there are also several important advantages to accommodating more data instead of predicting it. In evaluating whether predictivist or accommodationist advantages apply, evaluators should thus also take into account the advantages of accommodation over prediction in this context. Second, there are further epistemic factors that impact the overfitting issue, whose influence has not yet been much explored. Appealing to these factors, as well as to the advantages of accommodation, I discuss ways in which the predictivist arguments of both Hitchcock and Sober (2004) and Douglas and Magnus (2013, pp. 582–584) could be challenged in various circumstances in the sciences. At the end of the section, I draw some general conclusions about the value of novel prediction and accommodation in this context. Overall, accommodation may often have greater value, or at least in many situations there are good reasons to prefer more accommodation to prediction.

Let us first discuss some advantages that accommodation has over prediction in this context. To recall, the prediction versus accommodation problem asks us to consider the relative evidential value of particular data D2, as either accommodated or predicted, for the expected predictive accuracy of a model that has been fitted to some data D1. We will again call a model that accommodates D1 and D2 model Ma and a model that accommodates D1 and is then used to predict D2 model Mp. For example, D1 could be a dataset gathered to estimate the relationship between particular variables (X and Y) in a target population, and D2 is another sample drawn from the population. Which model can we expect to achieve greater predictive accuracy in the target population, the one that accommodates or predicts D2?Footnote 5

Predictivists argue that in various contexts, Mp is preferable, because novel predictive success with D2 provides evidence that Mp has not overfitted the data. Yet, this overlooks that accommodating D2 also provides several benefits to model Ma. Firstly, it is worth observing that accommodating D2 also provides some further protections against overfitting. As observed by Yarkoni and Westfall (2017, p. 1108), another well-known finding in statistics is that one of the best methods to avoid overfitting one’s model is to just use more data. They note that larger datasets provide natural safeguards against overfitting, as models that are fitted to more data become less sensitive to random fluctuations in the data. When datasets are large enough, even overly complex models are unlikely to overfit the data (see Yarkoni & Westfall, 2017, p. 1108). Thus, instead of novelly predicting D2, the risk of overfitting can also be reduced by using D2 in fitting one’s model. However, in using D2, several further advantages come into effect that favor the accommodating model Ma over Mp:

(1) A certain factor that speaks in favor of Ma is that the use of less data is necessarily associated with greater variability in statistical estimates (e.g. Ioannidis, 2008; Yarkoni & Westfall, 2017, pp. 1109–1110). Smaller samples are more susceptible to random variability in sampling (for example, more extreme values have greater influence in smaller samples). This can lead estimates to deviate from the population parameters, even if modelers do not select an overly complex model. This issue (in association with the preferential publication of larger effects) has contributed to a common finding in the sciences, where as sample sizes have grown, effect sizes have tended to shrink (see Ioannidis, 2008; Yarkoni & Westfall, 2017, pp. 1109–1110). Model Mp that is fitted to only D1 encounters greater risks from random variability, while the accommodation of both D1 and D2 decreases these risks for Ma. This enables Ma to potentially become more tuned to the actual population parameters and thus achieve greater predictive accuracy in the target population than Mp.

(2) In cases where D2 constitutes an independent data sample, another point that speaks in favor of Ma is that the overall sample that is used to fit Ma may now become more representative of the target population. In just about any data collection process conducted by a particular research group (in a given time and place), the resulting sample can become biased if the participants that are available are not adequately representative of the target population. This is a widely recognized problem in the sciences, where the samples that are used in research are often convenience samples, i.e. they are not truly randomly collected, but are rather collected due to easier accessibility (see, for example, Peterson, 2001; Arnett, 2008; Zhao, 2021). In these cases, models are biased by the unrepresentative sample and thus achieve lower predictive accuracy in further samples drawn from the population. Using two (or more) independent datasets in fitting one’s model can alleviate this problem, providing Ma with another potential advantage over Mp.Footnote 6

(3) A further advantage in accommodating rather than predicting D2 is that this also provides safeguards against underfitting the data (cf. Hitchcock & Sober, 2004, p. 17; Yarkoni & Westfall, 2017, p. 1111).Footnote 7 Another problem in using less data to fit one’s model is that if the real data generating process is more complex, the model may not observe enough variation in the data to enable appropriate parameter adjustments. For example, in a case where the real data generating process in the world is non-linear, a linear model could nonetheless appear to fit the data well in a smaller sample before a larger sample is obtained. Using both D1 and D2 in curve fitting enables Ma to potentially pick out more intricate patterns in the data, which may further improve its predictive accuracy in the population compared to Mp.

In evaluating whether predictivist or accommodationist advantages apply in particular cases, we should thus observe that the question is not only about the potential advantages of prediction over accommodation, but rather evaluators must weigh the advantages of prediction and accommodation against each other. Accommodating additional data provides several epistemic benefits. Accommodating more data provides safeguards against both overfitting and underfitting, it diminishes the influence of random variability in sampling, and it may enable tuning the model with a more representative sample from the target population. Each of these benefits allows the accommodating model to potentially become more tuned to the real data generating process in the world, improving its predictive accuracy in the target population. Predicting this data also provides evidence of the model’s predictive performance; however, when this comes at the increased risk of several biases that reduce model performance, accommodating the data instead may after all result in the selection of a model that has greater expected predictive accuracy.
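The following simulation sketch illustrates these points under stated assumptions (a mildly non-linear population relationship, two samples of 100 observations each, and quadratic models for both modelers; all of these choices are mine, for illustration only). Mp is fitted to D1 alone, as in the competition above, while Ma is fitted to D1 and D2 combined, and both are then scored on fresh data drawn from the same population:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n observations from an assumed population with a mildly non-linear relationship."""
    x = rng.uniform(-3, 3, size=n)
    y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 1, size=n)
    return x, y

def out_of_sample_mse(coeffs, n_test=20_000):
    """Mean squared prediction error on a large fresh sample from the population."""
    x, y = sample(n_test)
    return np.mean((y - np.polyval(coeffs, x)) ** 2)

mse_p, mse_a = [], []
for _ in range(500):
    x1, y1 = sample(100)                                  # D1
    x2, y2 = sample(100)                                  # D2
    m_p = np.polyfit(x1, y1, 2)                           # Mp: accommodates D1, would predict D2
    m_a = np.polyfit(np.r_[x1, x2], np.r_[y1, y2], 2)     # Ma: accommodates D1 and D2
    mse_p.append(out_of_sample_mse(m_p))
    mse_a.append(out_of_sample_mse(m_a))

print(f"mean out-of-sample MSE, Mp (fitted to D1 only):   {np.mean(mse_p):.4f}")
print(f"mean out-of-sample MSE, Ma (fitted to D1 and D2): {np.mean(mse_a):.4f}")
```

In runs of this kind, the model fitted to the larger combined sample tends to track the population relationship slightly more closely on average, simply because its parameter estimates are less affected by sampling variability; how large the difference is depends on the sample sizes and noise levels assumed.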

An anonymous reviewer objected to this point, arguing that if the predictive model Mp does actually go on to achieve predictive success with D2 (rather than predictive failure), then it would be at least as good as Ma, and even better in cases where overfitting is a concern. However, in evaluating models for their expected predictive accuracy, we should observe that the question is about the degree of predictive accuracy that we expect the model to achieve in the target population, rather than a categorical comparison between predictive success and failure. A model that is fitted to some data and is then used to make predictions about further data—e.g. model Mp—will achieve a certain degree of accuracy with those data points (under some relevant metric, e.g. mean squared error), which provides some evidence of its degree of predictive accuracy in the target population (cf. Hitchcock & Sober, 2004, p. 18). My suggestion is that there are reasons to think that the accommodative model Ma could at least sometimes be expected to achieve a greater degree of predictive accuracy in the target population than Mp because of the advantages it gets through accommodating D2. This is particularly relevant in circumstances that are typical in the sciences, where models fitted to particular datasets achieve a certain degree of predictive accuracy with further data samples, but this degree of accuracy is often significantly lower in the new sample than in the original sample, due, for example, to the problem factors discussed above (e.g. random variability, unrepresentative samples, overfitting, underfitting) (see, for example, Peterson, 2001; Ioannidis, 2008; Arnett, 2008; Yarkoni & Westfall, 2017; Zhao, 2021). In these circumstances, I suggest that it is useful to consider the possibility that a model that accommodates more data could after all be superior in terms of expected predictive accuracy.Footnote 8

Which model, then, becomes preferable to evaluators, Ma or Mp? Hitchcock and Sober (2004) introduce certain important evidential factors which impact this evaluation: (a) the utilization of model selection methods such as the AIC and (b) evaluators’ knowledge about the contents of the model and the use of model selection methods. They argue that in cases where evaluators lack knowledge about the model and the model selection methods, Mp becomes preferable. Douglas and Magnus (2013) argue further that Mp is still often preferable to Ma even if evaluators know that AIC has been used in model selection, because whether the assumptions behind AIC are met in practice may be unclear.

I suggest that if we include the advantages of accommodation in the discussion, these arguments are affected in a meaningful way. Introducing further epistemically important factors into the mix (cf. Douglas & Magnus, 2013, pp. 583–584), both Hitchcock and Sober’s and Douglas and Magnus’s predictivist advantages could potentially reverse to accommodationist advantages under various circumstances:

First, one important factor whose impact has not so far been fully explored is the sample size (which also relates to the actual effect size in the population). Generally, the larger the sample size, the less of an issue overfitting becomes, and the larger the actual effect size, the less data is needed for reliable statistical inferences (see, for example, Yarkoni & Westfall, 2017; see also Douglas & Magnus, 2013, p. 583). If we take into account the sample size and consider a Hitchcock and Sober (2004) type competition between an accommodative model Ma and a predictive model Mp, where we do not know the contents of either model or what model selection methods have been used, one way that Ma could end up on top after all is if we learn that a large enough sample has been used so that overfitting is not a significant concern.Footnote 9 In such a case, knowing that overfitting is unlikely to be a problem, we might then prefer the model Ma that is based on more data, as Ma is less likely to suffer from the biases that are associated with the use of less data. So, one way that Hitchcock and Sober’s predictivist advantage could reverse to an accommodationist advantage is through knowing that a large enough dataset has been used in fitting the relevant model.

Second, even in cases where the sample sizes that are used in the predictivist comparison are not very large, recognizing the advantages of accommodation in this context enables us to see that even these types of cases are not as clear-cut as they might seem. It is not obvious that the predictivist advantage of Hitchcock and Sober (2004) would still always hold in these cases, because smaller samples will also make all the advantages of accommodation more relevant. For example, consider a scenario with two smallish datasets of 100 data points each (for example, two datasets measuring variables X and Y in a certain population), and the same type of models as before, where a certain model accommodates one dataset and is then used to make predictions about the other dataset (Mp) while another model accommodates both datasets (Ma). In the case of model Mp, we have evidence that the model achieves a certain degree of predictive accuracy with further data. However, we can also observe that in the case of Ma, the doubling of a relatively small sample size of 100 to 200 could enable Ma to become much better tuned to the actual population parameters, ultimately making it the more predictively accurate model in the target population. Even if we do not know details about the contents of either model or what statistical method has been used to address overfitting, the benefits of accommodation could still sometimes overcome the benefits of prediction, depending on evaluators’ concerns in the situation. In fact, given the problems that are known to be associated with actual data samples in scientific practice (see above) (e.g. Arnett, 2008; Ioannidis, 2008; Peterson, 2001; Yarkoni & Westfall, 2017; Zhao, 2021), this could become significant in scientific contexts. If it is commonly known that models tend to be fitted to suboptimal samples, we might after all prefer the accommodative model in a Hitchcock and Sober type case, because the accommodative model is less likely to suffer from the problems associated with the use of suboptimal samples.

In addition to the influence of sample size, a further factor that can impact this comparison is the role of background knowledge in evaluating the model (cf. Douglas & Magnus, 2013, p. 584). Patterns in statistical data can be more or less plausible in light of scientific background knowledge. This could also affect the comparison between Ma and Mp in the type of case introduced above. For example, if background knowledge already strongly implies that a relationship exists between particular variables (e.g. caloric intake and weight), and that this relationship is relatively simple, overfitting becomes less of a concern, because we are aware that modelers will likely be targeting a simpler model in this case. We may then prefer the model Ma that has been fitted to more rather than less data, as this helps deal with the risks associated with the use of smaller, suboptimal samples.Footnote 10

Finally, we can also apply these observations to briefly address the disagreement between Hitchcock and Sober (2004) and Douglas and Magnus (2013) on the impact of model selection methods on the predictivist comparison. Hitchcock and Sober argue that the transparent use of AIC provides a general accommodationist advantage over prediction, while Douglas and Magnus question this argument due to potential issues with AIC in practice. I suggest that further recognizing the advantages of accommodation over prediction speaks in favor of Hitchcock and Sober’s accommodationism over Douglas and Magnus’s predictivism. Even if model selection methods are imperfect, as Douglas and Magnus observe, the combined effect of the advantages of accommodation and the advantages of model selection techniques may overcome any advantages that would have been gained through novel prediction. At the very least, if there are differences in expected predictive accuracy, these may be very small and vary considerably by context (due to the influence of other factors such as background knowledge, the exact sample size, etc.). In general, when appropriate model selection methods are used, more accommodation may often be preferable to novel prediction, as Hitchcock and Sober (2004) hold.

If we pull these observations together, we can derive some general conclusions about predictivism and accommodationism in this context. Novel prediction has been argued to have a possible advantage in that it can provide evidence that the model has not overfitted the data. However, once we include the impact of other evidential factors in the calculation, as well as recognize the advantages of accommodation, it can be seen that if novel prediction is to have an advantage, this at least requires a very specific type of context. First of all, we need a context where the sample size is small enough (relative to the complexity of the targeted data generation process) to make overfitting a relevant concern. Second, there must be a lack of transparency, so that evaluators do not have detailed information about the contents of the model or about whether appropriate model selection methods have been used (see Hitchcock & Sober, 2004). Finally, if the conditions above apply, a predictivist advantage further requires that the benefits of novel prediction overcome the advantages that would have been gained through more accommodation. At this point, the comparison between novel prediction and accommodation becomes highly complex, depending on the exact sample sizes, effect sizes, prior knowledge, etc. in the situation. Any general advantages one way or the other are unlikely to exist in these cases, and as shown above, accommodation may still be advantageous under various circumstances.Footnote 11

Accommodation, in comparison to novel prediction, is associated with several benefits. Using more data to fit one’s model decreases the influence of random variability in sampling, it can help address both overfitting and underfitting, and it may enable tuning the model with a more representative sample from the target population. Arguably, the most effective general solution to the overfitting problem is often to increase sample sizes (i.e. more accommodation), as statistical inference becomes altogether more reliable when more data is used (e.g. Yarkoni & Westfall, 2017).Footnote 12 This is something that Hitchcock and Sober (2004) also appear to endorse in that they hold that, in so far as appropriate model selection methods are used, it is simply better to use more rather than less data. Combining the advantages of using additional data with the use of appropriate model selection methods may often result in accommodation becoming advantageous. Finally, for the purposes of drawing more general conclusions about the predictivist argument from low-quality accommodation, we should also take note of the importance of transparency about the contents of the model and the model selection methods for the predictivist evaluation. In so far as sufficient details about the model and the model selection methods are reported transparently to evaluators of the model, the advantages of accommodation over prediction become particularly pressing in this context (see Sect. 4 for further discussion).

3.2 Problem #2: hypothesis hunting

Mayo (1996, pp. 294–318) presents another weak predictivist argument that targets statistical inference in science, concerning the use of null hypothesis testing in statistical experiments.Footnote 13 In a typical statistical experiment in the sciences, a hypothesis is posited in advance, and data is then gathered to test that hypothesis against the null hypothesis (i.e. the assumption of no effect in the population). The hypothesis is evaluated against a predetermined significance threshold (often set at p = 0.05) for the probability of obtaining data at least as extreme as those observed, under the assumption that the null hypothesis is true. If the result passes this test, it is deemed statistically significant, which is taken to indicate that the observed effect is unlikely to be due to mere chance and instead constitutes a genuine finding about the target population.

In certain cases, researchers may deviate from this practice and instead theorize based on the results of the statistical experiments after they have already been performed. In these cases, a problem called ‘hypothesis hunting’ or ‘question trolling’ (see Murphy & Aguinis, 2019) can occur. In hypothesis hunting, researchers collect a large dataset with many variables, and then sift through the data looking for any statistically significant effects between the variables. If such associations are found, the researchers dismiss the non-significant results and present the significant results as if they had been targeting those from the outset. This creates a potential issue for statistical inference because the more associations that are tested, the greater the chances are of finding some relationships between variables that show statistically significant effects purely by chance. Thus, it becomes more likely that the researchers discover false positive results when they use this method to select their hypotheses (see Mayo, 1996, pp. 299–306).
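A simple simulation (my own sketch, not drawn from Mayo; the sample size, the number of variables, and the significance threshold are illustrative assumptions) shows why this practice inflates false positives: when many variable pairs are examined in data containing no real effects, most datasets will nonetheless yield at least one nominally significant correlation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_obs, n_vars, n_sims = 100, 10, 1000

datasets_with_hit = 0
for _ in range(n_sims):
    data = rng.normal(size=(n_obs, n_vars))        # all variables independent: no real effects
    found = False
    for i in range(n_vars):
        for j in range(i + 1, n_vars):             # examine every pair of variables
            _, p = stats.pearsonr(data[:, i], data[:, j])
            if p < 0.05:
                found = True                       # a spurious 'significant' correlation
    datasets_with_hit += found

print(f"share of null datasets with at least one p < .05 correlation: {datasets_with_hit / n_sims:.2f}")
# With 45 pairs examined per dataset, roughly 1 - 0.95**45 ≈ 0.90 of datasets contain such a hit.
```

The same arithmetic underlies Mayo’s severity point discussed next: a hypothesis selected because it happened to be the pair that reached significance has passed a far less severe test than a hypothesis designated in advance and tested once.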

Mayo argues that the hypothesis hunting problem justifies a preference for novel prediction over accommodation in the context of statistical experiments. When hypotheses are designated prior to conducting the statistical experiments, they are tested at the appropriate, predesignated level of statistical significance. If hypotheses are selected after the tests have already been performed, the likelihood that the researchers capitalize on chance results increases, as the researchers have now allowed more opportunities for such results to emerge. There is thus a predictivist advantage in theory confirmation in that predesignated hypotheses have passed a more severe statistical test than accommodated hypotheses. It should be observed that the hypothesis hunting argument does not entail that selecting hypotheses based on the known results of statistical experiments is necessarily an issue in all instances (see Mayo, 1996, pp. 314–316; see also Hollenbeck & Wright, 2017). For example, if researchers gather a large data sample and happen to stumble on one strong, independently plausible effect, using that finding to construct a new hypothesis may not constitute a problem (see, for an example, Hollenbeck & Wright, 2017, pp. 6–7). Hypothesis hunting becomes a problem when researchers systematically search for findings, looking through multiple possibilities.

In this section, I raise three problems with the hypothesis hunting argument for predictivism. First, there is another relevant aspect that bears on the question of whether the results of statistical experiments are seen as compelling in the sciences; namely, the theoretical rationale of the hypothesis, as evaluated in the light of scientific background knowledge. Considering this aspect of theory evaluation reveals certain limitations to the hypothesis hunting argument. Second, a predictivist advantage requires not only that there is an issue with accommodation, but also that novel prediction is in fact better. However, there are multiple reasons to question this assumption in scientific practice. Scientists have discussed several methodological issues, called ‘Questionable Research Practices’ (QRPs), many of which apply to novel prediction instead of accommodation. Third, novel predictions are also associated with a certain kind of publication bias that indicates they may not be very dependable in many fields. At the end of the section, I further consider what bearing the novel practice of preregistration has on these issues.Footnote 14

Background knowledge. Statistical experiments confront a hypothesis with a particular data sample, and confirmation for the hypothesis increases if statistically significant effects are observed. In addition to the direct empirical test, scientists also draw on their background knowledge to examine the theoretical rationale behind the hypothesis, including its various explanatory and theoretical virtues (see, for example, Oberauer & Lewandowsky, 2019; Szollosi & Donkin, 2021; Rubin, 2022; Rubin & Donkin 2022). Overall confirmation for the hypothesis is based on the integration of background knowledge and the empirical experiment (e.g. Mayo, 2014).Footnote 15

The hypothesis hunting argument has focused so far on the experimental side of theory evaluation: if multiple relationships between variables are explored in datasets, it becomes more likely that spurious chance effects are observed by the researchers (see Mayo, 1996, pp. 299–306). However, if we include background knowledge in the picture, this argument is affected in a meaningful way. A flip side to the practice of hypothesis hunting is that it is also possible for researchers to find more real effects by exploring relationships in datasets, as any unhypothesized effects that emerge in datasets are not necessarily spurious. When researchers engage in hypothesis hunting, a pool of potential hypotheses is generated, of which only a subset are false in any given scientific field. But this renders the issues associated with hypothesis hunting contingent on the state of background knowledge in the scientific field, as background knowledge provides another resource that researchers use to evaluate whether the results of statistical experiments are compelling. Problems for scientific inference begin to occur only if scientists cannot apply additional background criteria to separate out the real effects from the spurious ones.

To briefly illustrate how background knowledge can help rule out spurious effects, consider, for example, a psychological study examining the correlation between self-efficacy and various psychological variables. If such a study produces data indicating a correlation between generalized self-efficacy and indecision, this result will appear questionable in light of background knowledge in psychology, making it implausible to present in a psychology journal. Similarly, if the data demonstrate an association between self-efficacy and age within a particular age group (e.g. 30-year-olds), this result will also appear more likely spurious than real, given that there is little theoretical reason to expect meaningful differences in self-efficacy in adulthood based on small age differences. In this way, even if unhypothesized but statistically significant patterns emerge in datasets, many of them could be readily dismissed based on background knowledge, leaving researchers to choose from a smaller pool of theoretically plausible effects.

Introducing the role of background knowledge in theory evaluation into the picture does not establish that hypothesis hunting cannot become problematic in scientific contexts. However, I suggest that it sets a certain constraint on the hypothesis hunting argument, showing that hypothesis hunting may not be equally problematic in all scientific contexts. In general, hypothesis hunting will become more of a problem in contexts where the background constraints on hypothesis construction are laxer. In fields where background knowledge has low ability to rule out spurious patterns, i.e. a variety of effects appear theoretically plausible even if they are spurious, researchers have more opportunities to hunt for and publish such erroneous results. Conversely, in fields where the standards for hypothesis construction are stricter, the chances that researchers discover and publish spurious but credible findings are substantially diminished. In these contexts, background knowledge acts to constrain the space of plausible hypotheses in a truth-conducive way so that many false hypotheses are excluded from the outset.

I suggest that the background factor that impacts the degree to which hypothesis hunting poses problems could be called the accommodative plasticity of scientific background knowledge. In conditions where scientific background knowledge is accommodatively plastic in that it has low ability to discriminate between real and spurious effects, hypothesis hunting becomes a potential concern. If background knowledge is not accommodatively plastic, but rather acts to constrain hypothesis construction in a truth-conducive way, effectively ruling out most spurious results, unhypothesized but significant effects could become much more credible.Footnote 16

To be sure, there are reasons to think that in certain fields, chiefly, those associated with the recent replication crisis, background knowledge may currently be accommodatively plastic in the relevant way. For example, Muthukrishna and Henrich (2019) argue that the psychological sciences lack generally accepted theoretical frameworks that constrain hypothesis construction. They argue that hypotheses are constructed based on personal intuitions and culturally biased assumptions, which leads to a proliferation of disjointed results. In these conditions, the hypothesis hunting problem may thus become more pressing. In contrast, if the standards for hypothesis construction are stricter, such as perhaps in fields like physics, chemistry, and biology, or they are made stricter by developing better background theories (see Muthukrishna & Henrich, 2019), the hypothesis hunting issue can become less problematic.

For the purposes of this paper, we do not need to take a position on where the line between fields affected by accommodative plasticity of background knowledge and fields less affected by it lies. I suggest that accommodative plasticity of background knowledge should be recognized as a background constraint on the hypothesis hunting argument, one that makes this issue more pertinent in certain contexts. Next, we turn to examine contexts where this argument most plausibly comes into effect: i.e. the fields affected by the replication crisis. Does novel prediction have an advantage over accommodation in these contexts due to the hypothesis hunting problem?

QRPs. In scientific practice, hypothesis hunting is recognized as one type of Questionable Research Practice (QRP). QRPs are methods and techniques, sometimes used by researchers, that are argued to have detrimental effects on the reliability of scientific theorizing and experimentation. QRPs have generated an enormous amount of discussion in the social and medical sciences in recent years (see, for example, Simmons et al., 2011; Bakker et al., 2012; John et al., 2012; Open Science Collaboration, 2015; Head et al., 2015; Bosco et al., 2016; Shaw, 2017; Yarkoni & Westfall, 2017; Hollenbeck & Wright, 2017; Motyl et al., 2017; Rubin, 2017, 2022; Vancouver, 2018; Nosek et al., 2018; Murphy & Aguinis, 2019; Oberauer & Lewandowsky, 2019; Szollosi & Donkin, 2021; Rubin & Donkin 2022). Relatively low replication rates in these fields have prompted scientists to search for reasons why published results are so untrustworthy, and a number of QRPs have been implicated as possible causes.

In terminologies used in the sciences, hypothesis hunting counts as a particular form of ‘CHARKing’ (i.e. “constructing hypotheses after results are known”), which in turn falls under the general category of ‘HARKing’ (i.e. “hypothesizing after results are known”)—a category that is often defined more broadly in the sciences than just in reference to accommodation (see, for example, Bosco et al., 2016; Rubin, 2017).Footnote 17 Hypothesis hunting may increase the chances that researchers present false positive results or inflate real effect sizes, and it is a QRP that implicates accommodation unfavorably (e.g. Murphy & Aguinis, 2019; Oberauer & Lewandowsky, 2019). However, there are also many other QRPs that are associated with novel prediction instead. In what follows, I highlight three families of such QRPs discussed in the scientific literature. (It should be observed that, in the scientific literature, there is considerable variety in the terminology with regard to QRPs. The terminology that is used below is similar to the terminology applied by Oberauer & Lewandowsky, 2019, who distinguish between ‘HARKing’ and ‘p-hacking,’ where HARKing is practiced by accommodators and p-hacking by predictors. For some further discussion on the terminology, see the corresponding footnote.Footnote 18)

As opposed to hypothesis hunting, scientists have discussed another category of issues that impact theory evaluation, often called p-hacking (e.g. Head et al., 2015; Motyl et al., 2017; Oberauer & Lewandowsky, 2019). In p-hacking, researchers start with a predesignated hypothesis, and then perform various manipulations or adjustments on the data or statistical tests to ensure that they obtain a statistically significant result that appears to support the hypothesis. P-hacking corresponds to a concern that has also been raised previously in the philosophy and history of science, where several scholars have noted the possibility of ‘observational fudging’ in the pursuit of theory confirmation (see Brush 1989, p. 1127; Lipton, 2004, pp. 176–177; Harker, 2008, p. 440). Similar to the practice of hypothesis hunting, p-hacking may increase the chances that researchers present false positive results or inflate real effect sizes (e.g. Murphy & Aguinis 2019; Oberauer & Lewandowsky, 2019). Types of p-hacking documented in the scientific literature include experimenting with numerous statistical tests to find one that shows significant results, assessing data midway through data collection to decide whether to continue collecting more data, including or excluding outliers, including or excluding covariates, rounding down p-values, and stopping data collection if statistical significance is reached (e.g. Head et al., 2015; Motyl et al., 2017).
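To illustrate just one of these routes, the sketch below (my own, hedged illustration; the batch size, maximum sample, and number of simulations are arbitrary assumptions) simulates optional stopping: a researcher with a predesignated but false hypothesis checks a t-test after every batch of participants and stops as soon as p < .05, which inflates the false positive rate well above the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, batch, n_max = 1000, 10, 200

def significant_result(optional_stopping):
    """Run one simulated experiment under a true null; return True if it ends 'significant'."""
    group_a, group_b = [], []
    while len(group_a) < n_max:
        group_a.extend(rng.normal(size=batch))     # the null is true: both groups are identical
        group_b.extend(rng.normal(size=batch))
        _, p = stats.ttest_ind(group_a, group_b)
        if optional_stopping and p < 0.05:
            return True                            # stop early and declare 'success'
    return p < 0.05                                # otherwise only the test at the planned n counts

for flag, label in [(False, "fixed sample size"), (True, "optional stopping")]:
    rate = np.mean([significant_result(flag) for _ in range(n_sims)])
    print(f"false positive rate with {label}: {rate:.3f}")
```

The other practices on the list above (trying multiple tests or covariate sets, selective outlier exclusion) inflate error rates through essentially the same mechanism: many implicit chances at significance, only one of which is reported.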

Another category of issues raised in the scientific literature that implicates novel prediction unfavorably involves various forms of evidence suppression. Ulrich and Miller (2020, p. 2) discuss a problem where researchers conduct multiple studies in the pursuit of a novelly predicted effect. In these cases, the researchers postulate that a certain effect exists in the population and conduct multiple studies, going through multiple disconfirmations, until finally reporting the results of an isolated study that appears to confirm the hypothesized effect. In other words, the researchers keep trying until they find results that seem to support their hypothesis, suppressing other evidence against the hypothesis. Rubin (2017, pp. 316–317) highlights another problem where researchers start with numerous novel predictions on a particular subject, confirm only one or a few of them, and then suppress all the others. This creates essentially the same problem as hypothesis hunting, where the researcher is instead hunting for successful novel predictions.
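A small back-of-the-envelope illustration (mine, not Ulrich and Miller’s) of the first of these problems: if a hypothesized effect is in fact absent and each study carries the conventional 5% false positive rate, the chance that a researcher who runs several independent studies obtains at least one apparent ‘confirmation’ to report grows quickly with the number of attempts.

```python
# Probability of at least one spurious 'confirmation' across m independent studies
# of a false hypothesis, assuming each study has a 5% false positive rate.
for m in (1, 3, 5, 10):
    print(f"{m:2d} studies: P(at least one p < .05 result) = {1 - 0.95 ** m:.2f}")
```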

Finally, among the QRPs, there is also the problem of outright scientific fraud, e.g. fabrication of data. Dellsén (2023; see also Felgenhauer, 2021) argued recently that the possibility of data fraud is a problem with novel prediction. Novel prediction provides the motivation for engaging in fraud, whereas in the case of accommodation the data is already there and the task at hand is simply to find a hypothesis that fits the data. Dellsén argues that in cases of novel prediction, the data should therefore be seen as more uncertain by evaluators, lowering its confirmatory impact.

Introducing the issues that are associated with novel prediction enables a simple argument against the hypothesis hunting argument for predictivism. There are, in fact, potential risks to both novel prediction and accommodation. If we learn that a hypothesis was proposed prior to conducting statistical tests, we may have concerns about various forms of p-hacking, evidence suppression, or fraud. If we know that the hypothesis was formulated after the tests were performed, we may suspect that problematic hypothesis hunting has occurred. In either case, there may be reasons to reduce one’s confidence in the hypothesis compared to what appears justifiable based on the contents of the theory and the reported evidence alone. Without further reasons or evidence to think that either prediction or accommodation is associated with worse (or better) outcomes, evaluators then have little reason to prefer one to the other.

In actual scientific contexts, it is of course quite possible that either novel prediction or accommodation is associated with more severe problems. However, establishing that this is the case will now require more evidence about QRPs in actual scientific contexts. Whether prediction or accommodation is preferable becomes a fundamentally contingent issue that depends on the myriad factors that influence the reliability of statistical inference in a certain field, including typical sample sizes, typical effect sizes, the prevalence rates of hypothesis hunting, p-hacking, evidence suppression, and fraud, etc. Any epistemic differences will hinge on what specific types of practices researchers engage in when they predict or accommodate results. It is easy to think of situations where novel predictions could become less reliable than accommodations. For example, if we have a scientific field where researchers are reluctant to hypothesize based on known results, engaging in this practice only if they can rely on large sample sizes and effect sizes, but they routinely engage in various forms of p-hacking, evaluators have a reason to trust accommodation more than prediction. In contrast, if researchers work in a field with high accommodative plasticity of background knowledge and routinely engage in hypothesis hunting, while also actively taking measures to avoid p-hacking, evidence suppression, and fraud, novel prediction may become better (although see below for the community-level hypothesis hunting problem).

Recently, some attempts have been made in the scientific literature to evaluate the extent of the problems caused by the QRPs. Interestingly, there is some evidence that neither hypothesis hunting nor p-hacking may be associated with very significant problems in the current context in many fields. Studies show that both p-hacking and HARKing (which encompasses hypothesis hunting as one of its types) are sometimes practiced by researchers. For example, based on multiple surveys conducted chiefly among psychologists, Rubin (2017, p. 309) tallies a self-admission rate of 43%, where researchers admit to having HARKed “at least once.” In a survey of social and personality psychologists, Motyl et al. (2017, p. 39) find similar self-admission rates for various forms of p-hacking, where researchers admit to having at least once selectively reported studies that worked (84%), decided to collect more data after looking at the results (66%), excluded measures (78%), dropped data after looking at its impact (58%), rounded down p-values (33%), or stopped data collection early (18%) (see also John et al., 2012).

Even though these numbers are not insignificant, the actual prevalence rates may be significantly lower. Fiedler and Schwarz (2016) emphasize that estimating the overall prevalence of QRPs requires that we also consider how frequently they are practiced. In this respect, the surveys indicate that neither p-hacking nor HARKing is very common. In the survey by Motyl et al. (2017), researchers report that they “rarely” or “never” engage in either HARKing or p-hacking. (Only one potential QRP associated with novel prediction, ‘selectively reporting studies that work,’ was practiced somewhat more commonly.) In surveys by Fiedler and Schwarz (2016) among psychologists in Germany and by Latan et al. (2023) among business scholars in Indonesia, the overall prevalence of QRPs was found to be around 5%, with respondents in the latter study indicating that only a small minority had engaged in these practices “twice or more often.” Outright fraud appears even less common: based on multiple survey results (see Fanelli, 2009; Stroebe et al., 2012; Gross, 2016), Ulrich and Miller (2020) estimate that its prevalence rate is probably lower than 2%.

If these numbers are at least roughly reflective of actual prevalence rates, the impact of the QRPs may, after all, be relatively small. Recently, Murphy and Aguinis (2019) conducted a simulation study comparing the effects of hypothesis hunting (what they call ‘question trolling’) and a particular form of p-hacking, namely searching through the data with alternative measures and samples to find the strongest support for a hypothesized result (‘cherry-picking’) (footnote 19). They perform multiple simulations, varying the sample size (n = 100–280), the pool of results from which to choose (k = 2–10), the prevalence of cherry-picking and hypothesis hunting (20–80% of researchers), and the heterogeneity of the pool of effects from which hunters can draw, to estimate how much either practice would inflate a real effect size of 0.20. As one would expect, effect sizes are more inflated the smaller the sample size, the greater the number of results to choose from, and the higher the prevalence of the QRPs. However, as long as the prevalence rates of the QRPs are not very high, the differences and the effects are not very significant. For example, assuming a prevalence rate of 20% for both cherry-picking and hypothesis hunting and the worst-case scenario of a sample size of 100 and a pool size of 10, an actual effect size of 0.2 is inflated to 0.229 by cherry-picking and to 0.232 by hypothesis hunting, with a further 0.02 increase if the hunters are allowed to sample haphazardly from the entire literature in the field. Significant differences begin to emerge only at particularly high prevalence rates (60% or more), with hypothesis hunting then becoming more of a problem. However, most forms of p-hacking associated with novel prediction are still not included in this calculation, and there is no evidence to suggest that hypothesis hunting is nearly this common (footnote 20).
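
To make the mechanism behind this kind of inflation vivid, the following is a minimal simulation sketch, not Murphy and Aguinis’s actual model: the true effect of 0.20, the sample size of 100, the pool of 10 alternative analyses, and the 20% versus 80% prevalence rates are taken from the scenario above, while the two-group design, normally distributed outcomes, and the simple mixing of honest and cherry-picking researchers are simplifying assumptions of my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def observed_effect(true_d, n, k):
    """One researcher studies the same true effect with k alternative
    measures/samples and reports the largest standardized mean difference
    (k = 1 corresponds to an honest, single-analysis report)."""
    best = -np.inf
    for _ in range(k):
        treatment = rng.normal(true_d, 1.0, n)   # treatment group scores
        control = rng.normal(0.0, 1.0, n)        # control group scores
        pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
        best = max(best, (treatment.mean() - control.mean()) / pooled_sd)
    return best

def mean_reported_effect(prevalence, true_d=0.20, n=100, k=10, studies=20_000):
    """Average reported effect when a given share of researchers cherry-pick
    the best of k analyses and the rest report a single analysis."""
    effects = [
        observed_effect(true_d, n, k if rng.random() < prevalence else 1)
        for _ in range(studies)
    ]
    return float(np.mean(effects))

print(mean_reported_effect(prevalence=0.20))  # mild inflation above 0.20
print(mean_reported_effect(prevalence=0.80))  # substantially greater inflation
```

The exercise is only meant to illustrate why inflation stays modest at low prevalence rates: most reported effects still come from single, unselected analyses, so the cherry-picked maxima are diluted in the aggregate.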

Community level hypothesis hunting. If we take into consideration both the role of background knowledge in theory evaluation and the QRPs associated with novel prediction, I suggest that it remains unclear whether (or to what extent) there are confirmatory differences between novel prediction and accommodation in the sciences due to the hypothesis hunting problem. Per the evidence presented above, novel prediction and accommodation appear roughly on a par in their epistemic consequences, and neither may be associated with significantly worse outcomes than the other. However, there is one further problem that can be raised with novel prediction, which could be called ‘the community level hypothesis hunting problem.’ Considering this problem further indicates that novel predictions may be overall rather unreliable in many scientific fields.

The hypothesis hunting argument has targeted the research practices of individual researchers, highlighting that searching for effects in a particular dataset increases the chances that researchers find false positive results. We have seen that hypothesis hunting becomes a problem when researchers (i) systematically hunt for significant effects in datasets, (ii) suppress non-significant results, and (iii) the pool of hypotheses that appear plausible in light of scientific background knowledge contains a significant proportion of false hypotheses (i.e. there is a relatively low base rate of true effects in the field). It is, however, also possible to formulate an analogous argument at the community level targeting novel prediction, if we simply make the same assumptions about the scientific field in general. A community level hypothesis hunting problem can occur if (a) researchers actively generate novel hypotheses, (b) non-significant results are suppressed, and (c) the pool of hypotheses that appear plausible in light of scientific background knowledge contains a significant proportion of false hypotheses.

Assumptions a–c have, in fact, been raised as significant concerns in recent scientific discussions on the replication crisis. In multiple fields, scientists have raised the concern that prominent journals prefer novel and surprising, statistically significant effects, while null results and replications are not given much space (e.g. Antonakis, 2017; Bakker et al., 2012; Woznyj et al., 2018). This creates an active incentive for researchers to pursue novel hypotheses (assumption a), while research communities commonly act to suppress failures to confirm such hypotheses (assumption b). Based on the low replication rates, scientists have also estimated that the base rates of true effects in these fields are rather low. For example, Wilson and Wixted (2018) estimate that the base rate of true effects is 0.21 in cognitive psychology and 0.06 in social psychology (see also Ulrich & Miller, 2020) (assumption c). In these conditions, a community level hypothesis hunting problem now plausibly comes into effect: a certain form of hypothesis hunting takes place within the entire field of science, where researchers hunt for novel, significant results. Failures to discover such results are suppressed, and cases where they do appear are promoted. This can result in significant bias in the literature, as many of the published studies are then based on mere chance results (cf. Ulrich & Miller, 2020).
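
A rough positive-predictive-value calculation can make the community level problem concrete. This is a hedged illustration rather than a computation drawn from the cited studies: the base rates of 0.21 and 0.06 are Wilson and Wixted’s estimates quoted above, but the significance threshold of .05 and the assumed statistical power of 50% are stipulations chosen only for the sake of the example.

```python
def ppv(base_rate, power=0.50, alpha=0.05):
    """Proportion of statistically significant results that reflect true effects,
    assuming that only significant results end up being published (assumptions b-c)."""
    true_positives = power * base_rate          # true effects that reach significance
    false_positives = alpha * (1 - base_rate)   # null effects significant by chance
    return true_positives / (true_positives + false_positives)

# Base rates from Wilson and Wixted (2018), as cited above; the power and alpha
# values are illustrative assumptions, not figures reported in those studies.
print(f"cognitive psychology (base rate 0.21): {ppv(0.21):.2f}")
print(f"social psychology    (base rate 0.06): {ppv(0.06):.2f}")
```

Under these stipulated values, well under half of the significant results in the low base rate field would reflect real effects, which illustrates how a literature built around hunting for novel, significant findings can come to contain a large proportion of chance results.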

Arguably, considering the community level hypothesis hunting problem, novel prediction could even be associated with greater problems than accommodation in the current context. If a large community of researchers hunts for novel, significant results that are, however, only relatively rarely real, novel predictive successes can become highly undependable in the field in question. There is evidence that something like this may have happened in the replication crisis. A recent Open Science Collaboration (2015) study that estimated the replicability of 100 psychological studies published in important journals found evidence that ‘surprising original effects’ were among those least likely to replicate. In other words, using novel predictive success as a criterion for what counts as a compelling result would lead to highly pervasive false beliefs in this context. As long as novel predictive success has such low reliability, it is plausible that accommodations could after all be more reliable. For example, if researchers are relatively reluctant to hypothesize based on known results, doing this only or mostly when there are good reasons for it, accommodated hypotheses could become more compelling than predicted hypotheses, and thus accommodation superior to prediction (footnote 21).

Does preregistration help? Discussion of QRPs and other potential issues that may impact the replicability of research is relatively new in the scientific literature. However, scientists have had some time to react and explore possible solutions. One proposed solution that is pertinent to our discussion is the novel practice of preregistration, which in recent years has been increasingly applied particularly in the psychological sciences (see Bakker et al., 2020). In a preregistration, researchers register their hypotheses, planned methods, and analyses online before any statistical tests are conducted. This can provide a check on various researcher degrees of freedom, reducing the detrimental impact of QRPs (e.g. Nosek et al., 2018; Bakker et al., 2020; Heirene et al., 2021; van den Akker et al., 2023). In effect, preregistration could ultimately prevent many of the problems associated with novel prediction (e.g. the various forms of p-hacking), allowing novelly predicted results to become more reliable after all (footnote 22).

While it is true that preregistration can, in principle, reduce the detrimental effects of QRPs, recent findings suggest that preregistration as it is currently practiced is still far from perfect. Firstly, preregistration is not applied uniformly across the sciences, and there is considerable variation in the detail and comprehensiveness of different preregistration guidelines. In recent analyses by Bakker et al. (2020), Heirene et al. (2021), and van den Akker et al. (2023) of different guidelines and the preregistrations based on them, preregistrations tended on average to score rather low on various dimensions of transparency concerning methods and analyses. For example, van den Akker et al. (2023) find that many preregistrations in psychology lack methodological detail involving the measured variables, the statistical model, and the inference criteria, potentially leaving room for different types of p-hacking to occur. It is also very common for researchers to deviate from their preregistered plans (see Bakker et al., 2020; Heirene et al., 2021; van den Akker et al., 2023). In the study by van den Akker et al. (2023), one of the most strictly registered aspects of methodology concerned the data collection procedure; yet it scored among the lowest in actual adherence, i.e. researchers nonetheless deviated from their preregistered plans. Finally, preregistrations also remain subject to considerable interpretation. In the analysis by Bakker et al. (2020), independent coders could agree on the number of hypotheses in a preregistration only 14% of the time, indicating significant ambiguity in the preregistered hypotheses.

As it stands, I suggest that it is as yet unclear to what extent preregistration has altered the balance between the consequences of novel prediction and accommodation in scientific contexts. van den Akker et al. (2023), for example, note that their study “did not provide sufficient evidence for the claim that preregistration prevents p-hacking.” However, preregistration could, of course, improve in the future. If the preregistration guidelines are made stricter so that the possibility of p-hacking is truly eliminated, might this ultimately result in a predictivist advantage in theory confirmation?

At this point, it may be appropriate to consider the reverse counterfactual, i.e. what the results might be if accommodations were subjected to guidelines as strict as those applied to novel predictions. For example, Bakker et al. (2020) evaluate preregistrations based on 29 distinct criteria concerning hypotheses, methods, and analyses. If accommodations were required to adhere to similarly strict criteria concerning research practices, they would likely also become more trustworthy. A possible alternative strategy for improving the reliability of statistical inference in the sciences might thus be to develop and apply stricter methodological guidelines across the board. To ensure that such guidelines are followed, researchers could provide what Rubin (2020) calls ‘contemporary transparency’ about their research. Contemporary transparency requires that researchers report their actual hypotheses, methods, and analyses, and justify these over alternatives. By requiring greater contemporary transparency, accommodations could be constrained in the same way as novel predictions, which would likely further reduce whatever differences in their epistemic consequences remain.

In conclusion, taking into account the influence of scientific background knowledge, the QRPs associated with novel prediction and accommodation, the community level hypothesis hunting problem, and the issues with preregistration, I suggest that there is currently insufficient evidence to conclude that novel prediction should be favored over accommodation due to the hypothesis hunting problem. If the prevalence rates of QRPs are roughly at the level reported in the scientific literature, novel prediction and accommodation appear roughly on a par in their epistemic consequences. Insofar as there are differences, these depend on a myriad of contingent factors concerning the prevalence rates and effects of the QRPs, typical sample sizes and effect sizes, and so on. Furthermore, taking into account the community level hypothesis hunting problem, it is even plausible that novel predictions could be overall less reliable than accommodations in the current context. Preregistration may ultimately help make novel predictions more reliable, but similarly strict rules for accommodation could equally make accommodations more reliable.

3.3 Problem #3: fudging

In contrast to the two previous predictivist arguments, which target statistical inference in science, Lipton (2004, pp. 164–183) raises a broader concern with accommodation that applies to scientific theorizing in general. Lipton argues that the fundamental problem with accommodation is that if theorists create their theories with the intention of accommodating particular results, they may, intentionally or inadvertently, ‘fudge’ their theory in order to fit those results. In fudging, theorists introduce forced or unnatural changes to their theories. As a result, the theories become excessively complex, producing an inferior explanation of the evidence that warrants lower confidence.

Lipton argues that the underlying problem with accommodation ultimately concerns the theory and its properties as such, i.e. its lack of simplicity. However, adopting the weak predictivist strategy, he argues that the confirmatory distinction between novel prediction and accommodation cannot be eliminated simply by examining the theory and the evidence as such. Lipton holds that accommodation remains disadvantageous even if the theory and the evidence are made fully public and available for scientific evaluation, as the kind of inductive support that is available in science is “translucent, not transparent” (see Lipton, 2004, p. 178). Simply put, Lipton argues that scientists are not perfect judges of the degree of fudging that has gone into constructing a theory. Even if a scientist is directly in possession of all the relevant evidence and can seek to evaluate the virtues of the theory, problematic fudging may still have occurred in the case of accommodation (unbeknownst to the evaluating scientist), creating a predictivist advantage in theory confirmation.

Notably, Lipton’s theoretical argument has some support in the scientific literature. Recently, concerns that appear very similar to Lipton’s fudging argument have been raised in the scientific literature in association with the replication crisis (see Shaw, 2017; Rubin, 2017, pp. 313–314; Hollenbeck & Wright, 2017, pp. 14–15; see also Kerr, 1998, p. 210). Writing from the editorial perspective, Shaw (2017) argues that HARKing (i.e. accommodation) often results in article manuscripts that contain contorted theories, conceptual sloppiness, and mismatches between the theory and the operationalized measures. Theorists who HARK may invoke multiple theoretical frameworks in an incoherent manner, define their concepts loosely, and struggle to fit the measures used in the study to the newly chosen HARKed theoretical framework, which implicates other measures and mechanisms more directly. Hollenbeck and Wright (2017, p. 14) make similar observations, writing that HARKing results in manuscripts where the “[i]ntroduction sections are filled with tortured text that does not make any sense to the well-informed reader.” According to them, the reader’s reaction to such text is often “confused” and “incredulous.”

The scientists’ observations lend credibility to Lipton’s argument in showing that fudging can indeed pose a problem for accommodation in scientific research. However, this does not yet settle the matter of Lipton’s weak predictivist argument. For Lipton’s predictivist argument to hold in the sciences, it is not enough to show that fudging sometimes happens in scientific practice. Rather, what matters is whether information about prediction and accommodation is relevant to scientific evaluators with respect to the fudging problem. In other words, do the consequences of fudging—e.g. convoluted theories and constructs—speak for themselves, or do scientists need to resort to the distinction between novel prediction and accommodation to avoid the negative epistemic consequences of fudging?

There are, I suggest, three problems with the fudging argument that indicate that it may not be able to show a significant predictivist advantage in science:

First, there is a lack of evidence about the negative epistemic consequences of fudging. Lipton did not provide examples of cases in which fudging occurred and scientific evaluators were effectively fooled by it, nor any evidence that surreptitious fudging is a pervasive problem in scientific practice. In contrast, the writings of the scientists may provide at least some evidence against Lipton’s claim. Shaw’s (2017, pp. 819–820) take on the dangers of HARKing is framed as a warning to authors. According to him, HARKing is a “rejection-creating” practice that leads to unpublishable manuscripts. He argues that HARKing leaves “telltale signs” that lead to negative reactions from reviewers (Shaw, 2017, p. 820). Similarly, Rubin (2017, p. 314) argues that “the signs [of low-quality accommodation] are quite obvious.” He denies that novel prediction is needed to identify problematically adjusted theories, writing: “[p]eer reviewers and end-users are able to identify these problems and reduce their confidence in the reported research regardless of whether or not they suspect that researchers have engaged in HARKing” (Rubin, 2017, p. 314). Rubin (2022, pp. 552–553) further notes that after-the-fact adjustments to a theory could also result in improvements to the theory, when they are made for good theoretical reasons (supplied, for example, by peer reviewers or editors). In other words, scientists have expressed some confidence about being able to detect whether or not theories have been adjusted in problematic ways without resorting to novel prediction.

Second, the large scientific literature on QRPs and other possible contributors to low replication rates provides some further indirect evidence that fudging may not constitute a significant problem in scientific evaluations. In contrast to the issue of overfitting, the use of small sample sizes, p-hacking, hypothesis hunting, the selective publication of statistically significant results, and the low base rates of true effects in various scientific fields, the publication of fudged or convoluted hypotheses has not received similar attention in the scientific debates about the causes of the replication crisis. This does not show directly that fudging is not a real epistemic issue in the sciences. However, it provides a further indication that scientists have not regarded fudging as a major problem—despite several years of active discussion about a methodological crisis in multiple fields. Considering this menu of potential contributors to low replication rates, it is also unclear how much of the observed low replicability remains to be explained by factors that have not already been highlighted in the scientific literature. For example, certain analyses argue that much of the observed low replicability is already explained by just a subset of these factors (e.g. Ulrich & Miller, 2020).

In certain recent discussions, inadequacies in theory have been raised as an important problem behind low replication rates (see Oberauer & Lewandowsky, 2019; Szollosi & Donkin, 2021; see also Eronen & Bringmann, 2021). However, these discussions raise a different problem concerning scientific theorizing: flexible theorizing, i.e. building theories that are so vaguely formulated that they can fit any number of observations and phenomena. Oberauer and Lewandowsky (2019) and Szollosi and Donkin (2021) argue that a successful prediction, when achieved by a flexible theory that would have fit any number of results, does not provide much compelling evidence for the theory. The fundamental issue is the (weak) inferential link between the theory and its empirical implications, rather than whether the theory was introduced before or after the statistical tests. Both Oberauer and Lewandowsky (2019) and Szollosi and Donkin (2021) call for researchers to move beyond the presumed distinction between accommodation-based exploratory research and prediction-based confirmatory research and to focus instead on better theorizing overall, achieved by strengthening the link between theories and their empirical implications.

Finally, there is also a further, more principled problem with the fudging argument that is worth noting. The very nature of the argument implies that whenever fudging occurs, it is less likely to result in a hypothesis with the potential to substantially mislead scientific evaluators, as fudged theories become more complex by being adjusted to particular results. This restricts the influence that such hypotheses can have: even if they end up published, they remain confined to more isolated contexts. Rubin (2017, p. 313) gives the following example of the kind of hypothesis that we might consider a result of fudging: “prejudice only increases self-esteem among black women who are aged 50 years or more.” Fudging could very well result in the generation of such a hypothesis. However, by its very nature the impact of the hypothesis will be limited, and it is unlikely to generate substantial false beliefs within the scientific field.

In contrast, the potential for harm may be greater if researchers start with simpler novel hypotheses that are nonetheless often wrong, which is the issue that was identified with novel prediction in the previous section. A selection bias that prioritizes impressive but often unreliable new discoveries can result in a large pool of misleading findings, leading to more substantial and comprehensive false beliefs within the scientific field. For example, in the Open Science Collaboration (2015) study mentioned earlier, around half of the results in a set of 100 studies published in important psychology journals could not be replicated, with surprising original effects counting among those least likely to replicate. Trusting such results based on their novel predictive success would lead to highly pervasive false beliefs about psychological phenomena. In contrast, even if fudged hypotheses were to slip by scientific evaluators occasionally, their potential for causing epistemic harm will be limited to more isolated contexts.

In summary, the fudging argument for predictivism currently lacks evidence from scientific practice, and there are some indications that scientists themselves have not considered surreptitious fudging a significant concern in theory evaluation. Furthermore, the very nature of fudging limits the problematic consequences that accommodation can produce. Given the previous findings about problems with novel prediction, novel prediction may in fact be associated with greater epistemic issues than accommodation.

4 Conclusion: findings and implications

Philosophers of science have raised several issues with the accommodation of evidence in scientific practice, arguing that these problems speak to a predictivist advantage in theory confirmation. In this paper, I have raised multiple reasons to remain skeptical about the epistemic benefits of novel prediction in comparison to accommodation. Concerning the issue of overfitting, the best overall solution is often more accommodation, and accommodation also has multiple advantages over prediction in this context. Hypothesis hunting can become an issue in scientific fields where background knowledge is accommodatively plastic, but novel predictions may also be unreliable due to the various types of questionable research practices (QRPs) associated with them. Finally, fudging may be easily detected by scientists, and even when it escapes detection, the impact of the fudged hypotheses will be limited. Overall, novel prediction and accommodation appear roughly on a par in their epistemic consequences, or novel prediction could even be associated with greater epistemic issues than accommodation.

Some implications and consequences of these findings can be highlighted for both science and the philosophy of science:

Concerning scientific practice, our findings support, on the one hand, recent calls to grow the evidence base for theories and hypotheses, e.g. through the use of more data in original research (see Yarkoni & Westfall, 2017) or through the promotion of more independent replication studies (see Bakker et al., 2012). We have seen that several concerns can be raised about single, isolated studies, whether their results were novelly predicted or accommodated. However, if more data is used in these studies, this can help alleviate problems with either novel prediction or accommodation, as the use of more data allows for altogether more reliable statistical inference. The promotion of independent replication studies can provide similar advantages. An independent replication, if successful, can provide evidence that problems with neither prediction nor accommodation have occurred in the original study, and thus provide more compelling evidence for the hypothesis.

On the other hand, our results raise certain questions about the common distinction between exploratory and confirmatory research (i.e. accommodation-based and prediction-based research) that is applied in many scientific fields (see, for example, Nosek et al., 2018). Even though our results support the conclusion that there is a practical difference between these forms of research, in that accommodation and prediction are each associated with distinct advantages and disadvantages, it remains unclear to what extent there are overall epistemic differences between prediction-based confirmatory research and accommodation-based exploratory research (cf. Oberauer & Lewandowsky, 2019; Szollosi & Donkin, 2021; Rubin & Donkin 2022). Rather, both exploratory and confirmatory studies appear to provide potential avenues for advancing scientific research (perhaps with roughly comparable results), particularly if their respective problems are addressed adequately (cf. Rubin & Donkin 2022).

Finally, our results also support calls for increasing methodological transparency in the sciences (e.g. Aguinis et al., 2018; Rubin, 2020). If evaluators are provided with detailed information about the methodology used in a particular study, the epistemic implications of particular methodological choices can be evaluated directly, and any information about novel prediction or accommodation becomes less relevant (or perhaps altogether irrelevant). Based on the results of this investigation, it appears that in terms of transparency, only what Rubin (2020) calls ‘contemporary transparency’ may be required in theory evaluation, i.e. researchers should disclose their current hypotheses, methods, and analyses, and justify these over alternative possibilities. What Rubin (2020) calls ‘historical transparency,’ i.e. transparency about differences between what the researchers originally planned to do and what they ended up doing, may not ultimately count as much from the epistemic point of view, as long as contemporary transparency is provided. This also bears on the issue of preregistration in a certain important way: if it really is the case that the distinction between exploratory and confirmatory research is not associated with very significant epistemic differences, preregistration may not be the optimal way to improve the reliability of research (cf. Rubin, 2020). Rather, similar results could be achieved by developing overall stricter methodological guidelines and requiring more detailed contemporary transparency in research reports.

For the philosophy of science, the findings reveal a nuanced picture of scientific confirmation. The prediction versus accommodation issue has long been at the core of debates between logical and historical confirmation theories, where the presumed advantages of novel prediction over accommodation provide a reason to think that historical facts about hypothesis origin matter to theory confirmation over and above logical considerations (see Musgrave, 1974). In this paper, we have seen ample evidence that contingent factors related to the testing or construction of a scientific theory are of concern to scientists, which speaks in favor of historical theories of confirmation. However, contrary to the assumption about the importance of novel prediction, our results do not indicate a significant overall confirmatory asymmetry between novel prediction and accommodation in science. The results can instead be summarized as follows:

(A) Contingent factors related to scientific methodology matter to theory confirmation in the sciences in multiple ways, whether novel predictions or accommodations have been made in the process of research.

(B) Whether novel predictions or accommodations are more compelling in scientific practice is itself a contingent issue. Depending on the contingent methodological choices that researchers make in practice, either predictivist or accommodationist advantages could come into effect in different contexts in science. For example, if researchers commonly engage in hypothesis hunting under conditions of accommodative plasticity, but take strict measures to avoid p-hacking, evidence suppression, and fraud, prediction may become more compelling; if they do the opposite, accommodation may be preferable. (As per the evidence found in this study, the reality is most likely somewhere in between, with novel prediction and accommodation appearing roughly on a par.)

(C) Any confirmatory asymmetries between novel prediction and accommodation depend also importantly on (contemporary) transparency about hypotheses, methods, and analyses. The more transparently hypotheses, methods, and analyses are reported, the less information about hypothesis origin counts for theory confirmation, as the evidential impact of specific methodological practices can be evaluated directly. The confirmatory impact (if any) of the distinction between novel prediction and accommodation ultimately becomes less relevant, or disappears altogether, if transparency is increased about the research process (cf. Rubin, 2020).

(D) A more important contingent distinction in the sciences than the distinction between novel prediction and accommodation may, after all, concern the difference between original studies and independent tests. Both novel predictions and accommodations can become subject to various methodological problems; independent replications may provide more compelling evidence that problems with neither have occurred in the original studies.

(E) Nonetheless, pace all of the above, the sheer amount of data used in evaluating a hypothesis counts, perhaps over and above any of the previous factors. If there is enough data, any problems related to contingent methodological choices are significantly reduced, as the use of more data allows for altogether more reliable statistical inference. In other words: the impact of contingent factors on theory confirmation diminishes the more evidence is produced for the theory. Thus, in the limit, scientific confirmation may depend only on the relationship between the theory, the evidence, and background knowledge, as logical confirmation theories argue. In situations where any of these remain unclear, contingent methodological choices may be important.

Mayo (2014, p. 79) remarked recently that the issue of novel facts “will give an account of evidence a run for its money.” Providing epistemic principles that allow various contingent factors to count for theory confirmation, without highlighting either novel prediction or accommodation as particularly important, may further stimulate such challenges.