# Replication Bayes factors from evidence updating

## Abstract

We describe a general method that allows experimenters to quantify the evidence from the data of a direct replication attempt given data already acquired from an original study. These so-called replication Bayes factors are a reconceptualization of the ones introduced by Verhagen and Wagenmakers (*Journal of Experimental Psychology: General, 143*(4), 1457–1475, 2014) for the common *t* test. This reconceptualization is computationally simpler and generalizes easily to most common experimental designs for which Bayes factors are available.

## Keywords

Evidence synthesis, Hypothesis testing, Meta-analysis, Replication

The past 5 years have witnessed a dramatic increase in interest in replication studies, largely in response to psychology’s “crisis of confidence” (e.g., Pashler & Wagenmakers, 2012). While this crisis is not unique to the field of psychology by any means, psychologists have been at the forefront of efforts to assess and improve reproducibility in science by way of large-scale replication initiatives, such as the Reproducibility Project: Psychology (Open Science Collaboration, 2015), the *Social Psychology* special issue on replication (Nosek & Lakens, 2014), and the various ManyLabs efforts (Ebersole et al., 2016; Klein et al., 2014). Although the importance of direct replication has been contested by some (for an overview of the most common arguments see Zwaan, Etz, Lucas, & Donnellan, 2017), the increasing prominence of replication studies has prompted researchers to examine the question of how to assess, statistically, the degree to which a replication study succeeds or fails.

1. Completely ignoring the data of the original study, what is the evidence that the effect is present or absent in the replication attempt? (e.g., Marsman et al., 2017).

2. Taking the data of the original study fully into account, what is the evidence that the effect is present or absent in the replication attempt? (e.g., Verhagen & Wagenmakers, 2014).

3. Pooling the data from the original study and the replication attempt, what is the evidence that the effect is present or absent? (e.g., Scheibehenne, Jamil, & Wagenmakers, 2016).

4. Comparing the data from the original study and the replication attempt, what is the evidence that the effect sizes are similar or dissimilar? (e.g., Bayarri & Mayoral, 2002).

“The 1st hypothesis is that of the skeptic and holds that the effect is spurious; this is the null hypothesis that postulates a zero effect size, \(\mathcal {H}_{0}: \delta = 0\). The 2nd hypothesis is that of the proponent and holds that the effect is consistent with the one found in the original study, an effect that can be quantified by a posterior distribution. Hence, the 2nd hypothesis—the replication hypothesis—is given by \(\mathcal {H}_{r}: \delta \sim \) ‘posterior distribution from original study.’ The weighted-likelihood ratio [i.e., the replication Bayes factor] between \(\mathcal {H}_{0}\) and \(\mathcal {H}_{r}\) quantifies the evidence that the data provide for replication success and failure.” (Verhagen & Wagenmakers, 2014, p. 1457)

Verhagen and Wagenmakers (2014) proposed this replication Bayes factor in the context of the *t* test, and Wagenmakers et al. (2016b) extended it to the correlation test. The main idea is intuitive: first the original result is summarized by its posterior distribution, and, subsequently, this posterior is used as a prior for the replication attempt. Despite its intuitive appeal in terms of the coherent updating of information, the replication Bayes factor comes with at least three challenges: (1) the procedure is not exact, as the posterior distribution from the original study often needs to be approximated by a convenient function; (2) the procedure involves technical steps and is not easy to apply; (3) the procedure does not generalize well to more complicated designs such as ANOVA (but see George, Ročková, Rosenbaum, Satopää, & Silber, 2017; Harms, 2016; Wagenmakers, Verhagen, & Ly, 2016b).

Here we outline an alternative procedure that solves these challenges. Specifically, the rules of Bayesian updating reveal that the replication Bayes factor quantifies the change in evidence provided by the replication experiment, given that the evidence provided by the original study is already available. This means that any software package that is able to output ordinary Bayes factors can also be used to provide replication Bayes factors, by simply feeding it the combined data set.

Below we first describe the Bayes factor in general terms; subsequently we outline the new conceptualization of the replication Bayes factor and then apply it to a number of concrete examples. We end by discussing the method’s limitations and future challenges.

## The Bayes factor

The Bayes factor is “fundamental to the Bayesian comparison of alternative statistical models” (O’Hagan & Forster, 2004, p. 55) and it represents “the standard Bayesian solution to the hypothesis testing and model selection problems” (Lewis & Raftery, 1997, p. 648) and “the primary tool used in Bayesian inference for hypothesis testing and model selection” (Berger, 2006, p. 378).

Developed and promoted by Jeffreys (1961), the Bayes factor contrasts the predictive performance of two competing models (Etz & Wagenmakers, 2017; Kass & Raftery, 1995; Ly, Verhagen, & Wagenmakers, 2016a, b). Here we focus on the standard scenario that features a null hypothesis, \(\mathcal {H}_{0}\), which stipulates the absence of an effect, and an alternative hypothesis, \(\mathcal {H}_{1}\), which stipulates the presence of an effect. Both hypotheses are falsifiable in the sense that they make specific predictions about the to-be-observed data. This is accomplished by assigning the model parameters specific values, or, in case the values are unknown and require estimation from the data, entire distributions. For instance, in the case of the *t* test, \(\mathcal {H}_{0}\) assigns effect size *δ* in the population a single specific value, namely *δ* = 0 (i.e., the effect is absent); in contrast, \(\mathcal {H}_{1}\) assigns effect size *δ* a distribution that reflects the uncertainty about the true effect (e.g., \(\delta \sim \mathcal {N}(0,1)\); i.e., the effect is present but likely to be small).

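The Bayes factor for observed data *d* is given by the hypotheses’ relative predictive adequacy for those data (Wagenmakers et al., 2016a):

\[ \text{BF}_{10}(d) = \frac{p(d \mid \mathcal {H}_{1})}{p(d \mid \mathcal {H}_{0})} = \frac{\int p(d \mid \theta_{1}, \mathcal {H}_{1}) \, p(\theta_{1} \mid \mathcal {H}_{1}) \, \mathrm{d}\theta_{1}}{\int p(d \mid \theta_{0}, \mathcal {H}_{0}) \, p(\theta_{0} \mid \mathcal {H}_{0}) \, \mathrm{d}\theta_{0}}, \]

where *𝜃*_{1} is the parameter vector under \(\mathcal {H}_{1}\), and *𝜃*_{0} is the (typically shorter) parameter vector under \( \mathcal {H}_{0}\). Thus, when BF_{10}(*d*) = 3, the data *d* are three times more likely under \(\mathcal {H}_{1}\) than under \(\mathcal {H}_{0}\), and when BF_{10}(*d*) = 0.125 (or equivalently, BF_{01}(*d*) = 1/BF_{10}(*d*) = 8), the data are eight times more likely under \(\mathcal {H}_{0}\) than under \(\mathcal {H}_{1}\).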

The Bayes factor offers several advantages for the analysis of empirical data (e.g., Dienes, 2014; Rouder, 2014; Schönbrodt & Wagenmakers, 2018; Wagenmakers, Marsman, et al., 2018a). Specifically, the Bayes factor allows the researcher to quantify evidence and thereby to discriminate between absence of evidence (i.e., BF_{01}(*d*) ≈ 1) and evidence of absence (i.e., BF_{01}(*d*) ≫ 1). The Bayes factor also allows one to monitor the evidence as the data come in (Gronau & Wagenmakers, 2017) and to design experiments in order to ensure compelling evidence. Finally, the Bayes factor can also be used to quantify replication success, a topic to which we turn next. For a more detailed introduction to the various fundamental Bayesian concepts, see Wagenmakers et al. (2018a), Wagenmakers et al. (2018b), and Etz and Vandekerckhove (2018).

## Bayesian updating in action

“(...) an anticipatory looking test (originally developed for human infants) to show that three species of great apes reliably look in anticipation of an agent acting on a location where he falsely believes an object to be, even though the apes themselves know that the object is no longer there. Our results suggest that great apes also operate, at least on an implicit level, with an understanding of false beliefs.” (Krupenye et al., 2016, p. 110)

“(...) we tested 40 apes [19 chimpanzees, 14 bonobos, and 7 orangutans (...)]. Thirty subjects looked to either the target or the distractor during the central-approach period. Of these 30, 20 looked first at the target (*P* = 0.098, two-tailed binomial test)” (Krupenye et al., 2016, p. 113).

The parameter of interest is *𝜃*, the unknown proportion of apes in the population who first look at the target. The hypothesis that the apes are performing at chance level is specified as \( \mathcal {H}_{0}: \theta = 0.5 \). This hypothesis is contrasted with \(\mathcal {H}_{1}\), the hypothesis that *𝜃* can take on values other than 0.5. For illustrative purposes, under \(\mathcal {H}_{1}\) we assign *𝜃* a default prior distribution of Beta(1,1) that is uniform across the interval from 0 to 1. With the model in place, our uncertainty about the unknown parameter *𝜃* is then updated by the data (i.e., 20 out of 30 looks at the target), and this yields the results shown in Fig. 1.

In Fig. 1, consider the two grey dots that mark the height of the prior and posterior distribution at *𝜃* = 0.5, the null hypothesis of chance performance. These heights can be used to obtain the Savage–Dickey representation of the Bayes factor, an intuitive depiction of its strength and direction: If the dot at *𝜃* = 0.5 gets higher from prior to posterior, the Bayes factor will provide evidence in favor of the null hypothesis (and vice-versa); moreover, the ratio of the heights of the dots exactly equals the Bayes factor (Dickey & Lientz, 1970; Wagenmakers et al., 2010). In this analysis, the two dots are almost at an equal height, and the Bayes factor obtained is BF_{10}(*d*) = 1.153, which indicates that the data are non-diagnostic in choosing between the two hypotheses under scrutiny.
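The value reported above can be checked numerically. The following is a minimal sketch, assuming the Beta(1,1) prior and the 20-out-of-30 data described in the text; the helper names (`log_beta`, `beta_pdf`, `bf10_binomial`) are illustrative and are not part of JASP. It computes BF_{10} once as a ratio of marginal likelihoods and once as the Savage–Dickey ratio of prior to posterior height at *𝜃* = 0.5; the two should agree with each other and with the reported value of about 1.153.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    return exp((a - 1) * log(x) + (b - 1) * log(1 - x) - log_beta(a, b))


def bf10_binomial(k, n, a=1.0, b=1.0, theta0=0.5):
    """BF10 for H1: theta ~ Beta(a, b) versus H0: theta = theta0.

    The binomial coefficient appears in both marginal likelihoods and cancels.
    """
    log_m1 = log_beta(a + k, b + n - k) - log_beta(a, b)
    log_m0 = k * log(theta0) + (n - k) * log(1 - theta0)
    return exp(log_m1 - log_m0)


k, n = 20, 30  # 20 of 30 apes looked first at the target

# Route 1: ratio of marginal likelihoods.
bf_orig = bf10_binomial(k, n)

# Route 2: Savage-Dickey, prior height over posterior height at theta = 0.5.
savage_dickey = beta_pdf(0.5, 1, 1) / beta_pdf(0.5, 1 + k, 1 + n - k)

print(f"BF10 from marginal likelihoods: {bf_orig:.3f}")
print(f"BF10 from Savage-Dickey ratio:  {savage_dickey:.3f}")
```

Working on the log scale via `lgamma` avoids overflow in the beta functions for larger samples.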

We have gained hardly any evidence for one hypothesis over the other. However, assume we know that the null hypothesis is false, uninteresting, or generally unworthy of attention. Then we are left with \(\mathcal {H}_{1}\), and the corresponding posterior information about *𝜃* is shown as the solid curve in Fig. 1. The area under this curve to the right of *𝜃* = 0.5 is much larger than the area to the left of *𝜃* = 0.5; consequently, if we discard the null hypothesis that the apes are performing at chance and take only \(\mathcal {H}_{1}\) into consideration, the previously non-diagnostic data inform us that *𝜃* is likely to be higher than 0.5 (see also Etz & Vandekerckhove, 2018, Example 5); indeed, the 95% credible interval ranges from 0.486 to 0.808.

The idea of Verhagen and Wagenmakers was to use this posterior from the first experiment as an informed prior for a second experiment. This is in accordance with Bayesian parameter updating and the adage “today’s posterior is tomorrow’s prior” (Lindley, 1972, p. 2). The resulting “replication Bayes factor” quantifies the relative predictive adequacy of the null hypothesis versus an alternative hypothesis that is completely informed by the knowledge of the parameter obtained from the first study.

“In experiment two, we tested 30 subjects (29 from experiment one, plus one additional bonobo). Twenty-two apes made explicit looks to the target or the distractor during this period. Of these 22, 17 looked first at the target (*P* = 0.016, two-tailed binomial test)” (Krupenye et al., 2016, p. 113).

In order to compute the replication Bayes factor, we take the posterior distribution from Experiment 1 (i.e., the solid line in Fig. 1), and use it as a prior distribution for the analysis of the second experiment. Recall that the original uniform prior was a Beta(1,1) distribution; after incorporating the 20 successes and 10 failures from the first experiment, the posterior remains a beta distribution, namely, Beta(1 + 20, 1 + 10) = Beta(21, 11). This distribution can be specified in the Summary Stats module of JASP.

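After updating with the data from the second experiment, the posterior assigns little mass to values of *𝜃* near 0.5, and consequently the replication Bayes factor is relatively strong: the data are about 16 times more likely under the proponent’s \(\mathcal {H}_{r}\) than under the skeptic’s \(\mathcal {H}_{0}\).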
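The parameter-updating computation for this example can be sketched in a few lines. This is an illustration under the assumptions stated in the text (a Beta(21, 11) prior carried over from Experiment 1, and 17 of 22 first looks at the target in Experiment 2); the function names are illustrative and do not correspond to JASP functions.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bf10_binomial(k, n, a, b, theta0=0.5):
    """Bayes factor for theta ~ Beta(a, b) versus the point null theta = theta0."""
    log_m1 = log_beta(a + k, b + n - k) - log_beta(a, b)
    log_m0 = k * log(theta0) + (n - k) * log(1 - theta0)
    return exp(log_m1 - log_m0)


# Posterior from Experiment 1 (20 successes, 10 failures on a Beta(1, 1) prior)
# becomes the prior under the replication hypothesis H_r.
a_prior, b_prior = 1 + 20, 1 + 10

# Data from Experiment 2: 17 of 22 apes looked first at the target.
bf_rep = bf10_binomial(17, 22, a_prior, b_prior)

print(f"Replication Bayes factor BF_r0: {bf_rep:.3f}")
```

Because the beta prior is conjugate to the binomial likelihood, no approximation is needed here; that convenience is what more complex models lack.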

This process of updating to a posterior and then using it as a prior for the analysis of the next experiment is relatively straightforward for this simple example. For more complex models, however, the process can be burdensome, approximate, and intricate. In the remainder of this paper, we will propose an easier, more exact way forward that focuses on updating the evidence rather than the parameter priors.

## The replication Bayes factor reconceptualized

The example above demonstrated how the replication Bayes factor can be obtained by a standard Bayesian parameter updating process, that is, by using the posterior distribution from the first experiment as a prior distribution for the replication test of the second experiment.

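The same result can be obtained without updating the parameter prior. In the example, the original experiment yielded BF_{10}(*d*_{orig}) = 1.153 and the replication Bayes factor was 16.448.^{1} Multiplying these two Bayes factors yields 1.153 × 16.448 = 18.965, the same result as is obtained when all data are analyzed at once.^{2} In general, the replication Bayes factor can be written as

\[ \text{BF}_{10}(d_{\text{rep}} \mid d_{\text{orig}}) = \frac{\text{BF}_{10}(d_{\text{orig}}, d_{\text{rep}})}{\text{BF}_{10}(d_{\text{orig}})}, \]

where *d*_{orig} denotes the data from the original study, and *d*_{rep} the data from the replication attempt. Note that the replication Bayes factor is the change in the Bayes factor due to the observation of the replication data, and quantifies the additional evidence for the alternative hypothesis given what was already observed in the original study.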
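That the replication Bayes factor equals the pooled-data Bayes factor divided by the original Bayes factor follows from the chain rule of probability: under each hypothesis, \(p(d_{\text{orig}}, d_{\text{rep}} \mid \mathcal {H}) = p(d_{\text{orig}} \mid \mathcal {H}) \, p(d_{\text{rep}} \mid d_{\text{orig}}, \mathcal {H})\), so the ratio for the pooled data factors into the original Bayes factor times the replication Bayes factor. As a numerical check for the ape example, the sketch below (illustrative helper names, standard library only) computes the replication Bayes factor both by evidence updating and by parameter updating and compares the two.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bf10_binomial(k, n, a=1.0, b=1.0, theta0=0.5):
    """Bayes factor for theta ~ Beta(a, b) versus the point null theta = theta0."""
    log_m1 = log_beta(a + k, b + n - k) - log_beta(a, b)
    log_m0 = k * log(theta0) + (n - k) * log(1 - theta0)
    return exp(log_m1 - log_m0)


k_orig, n_orig = 20, 30  # Experiment 1
k_rep, n_rep = 17, 22    # Experiment 2

# Evidence updating: two ordinary Bayes factors under the default Beta(1, 1) prior.
bf_orig = bf10_binomial(k_orig, n_orig)
bf_all = bf10_binomial(k_orig + k_rep, n_orig + n_rep)
bf_rep_eu = bf_all / bf_orig

# Parameter updating: posterior of Experiment 1 used as prior for Experiment 2.
bf_rep_pu = bf10_binomial(k_rep, n_rep, a=1 + k_orig, b=1 + n_orig - k_orig)

print(f"BF10(d_orig)          = {bf_orig:.3f}")
print(f"BF10(d_orig, d_rep)   = {bf_all:.3f}")
print(f"EU replication BF     = {bf_rep_eu:.3f}")
print(f"Parameter-updating BF = {bf_rep_pu:.3f}")
```

The evidence-updating route only ever calls the default-prior Bayes factor, which is why any software that outputs ordinary Bayes factors suffices.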

## Example 1: a *t* test to assess whether superstition improves performance

Consider perhaps the most routine replication scenario, one where a researcher conducts a replication of a study whose analysis featured a *t* test. For a common *t* test, JASP allows the specification of a Cauchy, *t*, or normal prior for the effect size *δ* and the user is free to specify the center and scale of this prior (for technical details see Gronau, Ly, & Wagenmakers, 2017a). However, in contrast to parameter *𝜃* from the binomial test, the posterior for *δ* in a *t* test has no known distributional form. The applied scientist is therefore unable to use the posterior as a prior to calculate a replication Bayes factor in JASP.

To overcome this hurdle, Verhagen and Wagenmakers (2014) proposed to approximate the posterior on effect size obtained from the *t* test with a normal distribution; this normal distribution is then used as a prior for the analysis of the replication experiment. Unfortunately, this approximation in the intermediate step between the original and the replication study makes this method computationally involved and hard to generalize to other designs.

“Can superstitions actually improve performance? Damisch et al. (2010) reported a striking experiment in which manipulating superstitious feelings markedly increased golfing ability. Participants attempted ten putts, each from a distance of 100 cm. Some participants were primed for superstition prior to the task by being told ‘Here is the ball. So far it has turned out to be a lucky ball.’ Controls were simply told ‘This is the ball everyone has used so far.’ Remarkably, this manipulation produced a substantial increase in golf performance: Controls made 48% of putts while superstition-primed participants made 65% of putts (*d* = 0.83, 95% CI [0.05, 1.60]).” (Calin-Jageman & Caldwell, 2014, p. 239)

A classical *t* test^{3} of the original data yielded a statistically significant result, *t*(26) = 2.14, *p* = .042, *d* = 0.83. As shown in Fig. 4, a Bayesian independent-samples *t* test using the JASP Summary Stats module returns BF_{10}(*d*_{orig}) = 1.820, a level of evidence that is not compelling. Calin-Jageman and Caldwell (2014) performed a direct replication of this work. Their Experiment 1 featured 58 control participants and 66 “superstition-activated” participants. The latter group outperformed the controls by only 2%, a result that is not statistically significant (i.e., *t*(122) = 0.29, *p* = .77, *d* = 0.05).

To compute the EU replication Bayes factor, we first need to compute the complete Bayes factor for these two data sets. Since both the original and replication papers report the raw means and standard deviations for each of the two groups (which are sufficient statistics for the *t* test, see Ly, Marsman, Verhagen, Grasman, & Wagenmakers, 2017), we can straightforwardly compute the overall *t* value for the combined data (see Appendix A for a description of the algebra involved); this yields an overall *t* = 1.14, which corresponds to a complete Bayes factor of BF_{10}(*d*_{orig},*d*_{rep}) = 0.318. The replication Bayes factor can now be obtained by simply dividing the complete Bayes factor by the Bayes factor from the original data alone and leads to BF_{10}(*d*_{rep}|*d*_{orig}) = 0.175. In other words, the skeptic’s null hypothesis predicted the data from the replication attempt 1/0.175 = 5.72 times better than the proponent’s alternative hypothesis informed by the original data set.
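The pooling of summary statistics can be sketched as follows. This is one standard way to combine group sizes, means, and standard deviations from two studies into the statistics of the concatenated sample; it is an assumption that this matches the algebra in Appendix A, which may be organized differently. The raw scores below are hypothetical placeholders (not the published data), used only to obtain summary statistics and to check the pooling formula. The last line applies the evidence-updating step to the Bayes factors reported in the text.

```python
from math import sqrt
from statistics import mean, stdev


def summary(xs):
    """Return (n, mean, sample SD) for a list of scores."""
    return len(xs), mean(xs), stdev(xs)


def pool_group(n1, m1, s1, n2, m2, s2):
    """Combine (n, mean, SD) of the same group from two studies into one sample."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    # Total sum of squares = within-study part + between-study part.
    ss = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2
          + n1 * (m1 - m) ** 2 + n2 * (m2 - m) ** 2)
    return n, m, sqrt(ss / (n - 1))


def t_independent(na, ma, sa, nb, mb, sb):
    """Student's pooled-variance t statistic for group a minus group b."""
    sp2 = ((na - 1) * sa**2 + (nb - 1) * sb**2) / (na + nb - 2)
    return (ma - mb) / sqrt(sp2 * (1 / na + 1 / nb))


# Hypothetical raw scores, NOT the data from either paper.
orig_treat, orig_ctrl = [7, 6, 8, 5], [4, 5, 3, 6]
rep_treat, rep_ctrl = [5, 4, 6, 5, 5], [5, 5, 4, 6, 5]

# Pool each condition across the two studies using summary statistics only.
treat = pool_group(*summary(orig_treat), *summary(rep_treat))
ctrl = pool_group(*summary(orig_ctrl), *summary(rep_ctrl))
t_all = t_independent(*treat, *ctrl)

# Check: pooled summaries equal the summaries of the concatenated raw data.
direct = summary(orig_treat + rep_treat)

# Evidence updating with the Bayes factors reported in the text.
bf_rep = 0.318 / 1.820

print(f"overall t (hypothetical data): {t_all:.2f}")
print(f"EU replication BF10: {bf_rep:.3f}, BF01: {1 / bf_rep:.2f}")
```

The overall *t* value would then be entered, together with the combined group sizes, into whatever software provides the default Bayes factor.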

## Example 2: a contingency table analysis to test whether more valuable stimuli are judged to be relatively rare

The previous example featured a *t* test and therefore the replication Bayes factor could also have been approximated using the parameter-updating procedure outlined in Verhagen and Wagenmakers (2014). We now turn to an example for which this parameter-updating procedure is problematic: the default Bayesian test for independence in a contingency table (Gunel & Dickey, 1974; Jamil et al., 2017).

The test for independence involves the construction of a model that is more complex than the models used for the *t* test. Consequently, in JASP, the researcher can only input a parameter that governs the relative concentration of the joint prior distribution, and—for the special case of a 2 × 2 table—receive a posterior distribution for the log-odds ratio, a derived summary measure that quantifies the degree of association. This generic setup does not allow researchers to obtain a joint parameter posterior from past studies and use it as a prior for current studies, frustrating the parameter-updating version of the replication Bayes factor.

“(…) tested a novel heuristic for making judgments of relative frequency. According to this so-called value heuristic, ‘people judge the frequency of a class of objects on the basis of the subjective value of the objects’ (p. 18). Based on the principle that scarcity increases an object’s value, the authors [Dai et al.] formulate the hypothesis that individuals will assess more valuable stimulus classes to be less frequent even when value is not diagnostic of frequency.”

For the original data from Dai et al. (2008), a classical analysis yields *χ*^{2}(1, 56) = 4.51, *p* = .037, and a default Bayesian reanalysis (Gunel & Dickey, 1974) using JASP yields BF_{10}(*d*_{orig}) = 2.880.

Table 1. Data from Dai et al. (2008), who concluded that endowing a category may lead participants to judge that category to be relatively rare

| Endowed | Estimate: fewer flowers | Estimate: fewer birds | Total |
|---|---|---|---|
| Flowers | 15 | 12 | 27 |
| Birds | 8 | 21 | 29 |
| Total | 23 | 33 | 56 |

For the replication data, a classical analysis yields *χ*^{2}(1, 51) = 1.57, *p* = .21, which is not statistically significant. To reanalyze these data using our EU replication Bayes factor, we first combine the data into a single sample (see Table 3) and compute the complete Bayes factor, BF_{10}(*d*_{orig}, *d*_{rep}) = 0.298. To obtain the replication Bayes factor, we simply divide BF_{10}(*d*_{orig}, *d*_{rep}) by BF_{10}(*d*_{orig}), which yields BF_{10}(*d*_{rep} | *d*_{orig}) = 0.103. This means that the replication data are predicted 1/0.103 = 9.71 times better by the null hypothesis than by the alternative hypothesis informed by the original data set.^{4}

Table 2. Data from the replication experiment by Fuchs and colleagues

| Endowed | Estimate: fewer flowers | Estimate: fewer birds | Total |
|---|---|---|---|
| Flowers | 11 | 16 | 27 |
| Birds | 14 | 10 | 24 |
| Total | 25 | 26 | 51 |

Table 3. Data from the original and replication experiment combined

| Endowed | Estimate: fewer flowers | Estimate: fewer birds | Total |
|---|---|---|---|
| Flowers | 26 | 28 | 54 |
| Birds | 22 | 31 | 53 |
| Total | 48 | 59 | 107 |
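The bookkeeping for this example can be sketched as follows. The code pools the two 2 × 2 tables cell by cell, recomputes the classical Pearson *χ*^{2} statistics (without continuity correction, which is an assumption about how the reported values were obtained), and applies the evidence-updating division to the Bayes factors reported in the text. It does not compute the Gunel and Dickey Bayes factors themselves; those values are taken as given.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2 x 2 table, no continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Rows: endowed with flowers, endowed with birds.
# Columns: estimated fewer flowers, estimated fewer birds.
original = [[15, 12], [8, 21]]
replication = [[11, 16], [14, 10]]

# Pool the two studies cell by cell.
combined = [
    [o + r for o, r in zip(row_o, row_r)]
    for row_o, row_r in zip(original, replication)
]

chi2_orig = chi_square_2x2(original)
chi2_rep = chi_square_2x2(replication)

# Bayes factors as reported in the text (obtained there with JASP).
bf_all, bf_orig = 0.298, 2.880
bf_rep = bf_all / bf_orig

print("combined table:", combined)
print(f"chi-square original: {chi2_orig:.2f}, replication: {chi2_rep:.2f}")
print(f"EU replication BF10: {bf_rep:.3f}, BF01: {1 / bf_rep:.2f}")
```

The recomputed statistics should land close to the reported values; small differences in the last digit would be consistent with rounding.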

## Conclusions

The replication Bayes factor (Verhagen & Wagenmakers, 2014) provides an intuitive measure of replication success: rather than ignoring the original study, the replication Bayes factor uses the posterior distribution obtained from the original study as a prior distribution for the test of the data from the replication study.

Here we provided an additional perspective on the replication Bayes factor, namely as the change in evidence brought about by observing the results from the replication study. The advantage of this “evidence-updating” or EU perspective on the replication Bayes factor is that it does not require approximations, and that it can be easily applied to complex models. One reviewer noted that the EU replication Bayes factor follows directly from the general properties of the Bayes factor. Although this assessment is correct, we nevertheless believe that the EU replication Bayes factor represents a conceptual and practical advance. As is often the case in probability theory, solutions appear trivial only after they have been derived. In this particular case, Verhagen and Wagenmakers (2014) were unaware of the EU replication Bayes factor; in general, it is not immediately obvious that the parameter updating step—an integral part of the original Verhagen and Wagenmakers method—can be entirely omitted.

Both the original parameter-updating version and the current EU version of the replication Bayes factor are based on the idea of evidence synthesis and scientific learning (e.g., Marsman, Ly, & Wagenmakers, 2016; Scheibehenne et al., 2016; Silber et al., 2016). With more than two studies, the proposed method is similar to a fixed-effects meta-analysis that assumes the data to be exchangeable.^{5}

As with any statistical method, the EU replication Bayes factor can become vulnerable when its core assumptions are violated. The most serious threat to its validity arises when the replication is not close, and aspects differ that the model assumes to be the same. Consider the *t* test. The parameter-updating version updates only the test-relevant parameter *δ*; the nuisance parameters (e.g., the grand mean, which is common to \(\mathcal {H}_{0}\) and \(\mathcal {H}_{1}\)) are not updated. This small omission is rectified by the EU version, which automatically and implicitly updates the joint prior for all model parameters. However, this updating of nuisance parameters also creates a lack of robustness: when the nuisance parameters do undergo a large change from original to replication study, the results can be misleading. For instance, assume that a replication attempt successfully reproduces the main effect of condition, but all participants are 150 ms slower. When the raw data from the two studies are combined, this artificially inflates the variance and may make it appear as if the replication failed.

A similar warning applies for a correlation test, where the parameter of interest—the correlation coefficient *ρ*—may be of similar magnitude in the original and the replication study, but global changes in the location parameters of the bivariate normal distribution can skew the outcome of the EU replication Bayes factor. For instance, suppose one studies the relation between income and body weight. The replication attempt finds the same correlation but on average participants make $10,000 more and weigh 15 pounds less. Visually, this yields two clouds of points; each may have the same shape and orientation, but pooling the raw data may create a misleading impression.

The solution to this lack of robustness is two-fold. First, users must be aware that this is a potential problem. Second, the data may be transformed to absorb any changes in nuisance parameters. For instance, correlational data may be mean-centered before being combined.
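The effect of a location shift on pooled correlational data, and the remedy of mean-centering within each study, can be illustrated with a small sketch. The data are made up for illustration: the "replication" is the original sample shifted by +10 on one variable and −15 on the other, so the within-study correlation is identical by construction.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def center(values):
    """Subtract the sample mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]


# Hypothetical original study.
x_orig = [1, 2, 3, 4, 5, 6]
y_orig = [2, 1, 4, 3, 6, 5]

# Hypothetical replication: same cloud, shifted in location only.
x_rep = [x + 10 for x in x_orig]
y_rep = [y - 15 for y in y_orig]

r_orig = pearson_r(x_orig, y_orig)
r_rep = pearson_r(x_rep, y_rep)

# Pooling the raw data mixes the between-study shift into the correlation.
r_raw = pearson_r(x_orig + x_rep, y_orig + y_rep)

# Centering within each study before pooling absorbs the shift.
r_centered = pearson_r(
    center(x_orig) + center(x_rep),
    center(y_orig) + center(y_rep),
)

print(f"within-study r: {r_orig:.3f} (original), {r_rep:.3f} (replication)")
print(f"pooled raw r:      {r_raw:.3f}")
print(f"pooled centered r: {r_centered:.3f}")
```

In this constructed case the raw pooled correlation even changes sign, whereas the centered pooled correlation recovers the within-study value.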

Another vulnerability of the replication Bayes factor (regardless of whether it is the parameter-updating version or the EU version) is that, in rare cases, it brings about a replication paradox. The paradox is that when a replication attempt strongly suggests that the results go in the direction opposite to the one found in the original study, the replication Bayes factor may yield compelling evidence in favor of the alternative hypothesis that the effect has successfully replicated. As with all uses of probability theory, such paradoxes reveal a lack of proper understanding. Appendix C illustrates the paradox and explains that it can be resolved by imposing an order restriction.
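The paradox can be made concrete with the binomial setup from the ape example. The sketch below is an illustration with hypothetical replication data (5 of 100 first looks at the target) and is not the example from Appendix C. The original posterior, Beta(21, 11), favors proportions above 0.5, whereas the hypothetical replication points strongly below 0.5; nevertheless the replication Bayes factor favors \(\mathcal {H}_{r}\), because the point null *𝜃* = 0.5 predicts such extreme data even worse than the spread-out replication prior does.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bf10_binomial(k, n, a, b, theta0=0.5):
    """Bayes factor for theta ~ Beta(a, b) versus the point null theta = theta0."""
    log_m1 = log_beta(a + k, b + n - k) - log_beta(a, b)
    log_m0 = k * log(theta0) + (n - k) * log(1 - theta0)
    return exp(log_m1 - log_m0)


# Prior under H_r: posterior from the original study (20 of 30 successes).
a_prior, b_prior = 21, 11
prior_mean = a_prior / (a_prior + b_prior)

# Hypothetical replication in the opposite direction: 5 successes in 100 trials.
k_rep, n_rep = 5, 100
bf_rep = bf10_binomial(k_rep, n_rep, a_prior, b_prior)

print(f"prior mean under H_r: {prior_mean:.2f}")
print(f"observed replication proportion: {k_rep / n_rep:.2f}")
print(f"replication Bayes factor BF_r0: {bf_rep:.3g}")
```

An order restriction on the replication hypothesis, not implemented in this sketch, would remove the prior mass on the wrong side of 0.5.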

No single measure of replication success suffices to address all questions that surround the interpretation of a replication attempt. We advocate an inclusive approach to the statistical assessment of replication success, and we hope that the EU replication Bayes factor can be one of many tools that are at researchers’ disposal, to be applied not just across laboratories but also within laboratories.

## Footnotes

1.

2. The difference between 18.965 and 18.961 is due to rounding and vanishes as the number of decimal places in the calculation is increased. The number of decimal places that are displayed in JASP can be increased in the preference window.

3. This analysis is consistent with the one used in the original experiment and the replication attempt. A more appropriate statistical analysis arguably uses a hierarchical binomial model.

4. It is worth noting that the replication Bayes factor in this example can be well approximated by a Bayes factor based on a normal prior and a normal likelihood (possibly after a suitable transformation of the parameters and the data; see Dienes & Mclatchie, 2018), as was brought to our attention by a reviewer. The normal prior is used as an approximation of the posterior (as in Verhagen & Wagenmakers, 2014), and the normal likelihood is used as an approximation to the exact likelihood. Specifically, the reviewer approximated the likelihood by a normal distribution with a mean of −0.711 (i.e., the logarithm of the observed odds ratio in the replication study) and a standard deviation of 0.5699, and used as prior a normal distribution with a mean of 1.188 (i.e., the logarithm of the odds ratio observed in the original study) and a standard deviation of 0.568 (i.e., its standard error). The reviewer then used the calculator proposed by Dienes (2008, 2014) and Dienes et al. (2018), which resulted in a Bayes factor of 0.10. See van Doorn et al. (2016) for a similar use of an approximating normal likelihood to compute Bayes factors, and Ly et al. (2017) for some theoretical background.

5.

## Notes

### Acknowledgements

AL and EJW are supported by the starting grant “Bayes or Bust” awarded by the European Research Council (Grant #283876) and grant 016.Vici.170.083 from the Netherlands Organisation for Scientific Research (NWO). AE was supported by grant #1534472 from NSF’s Methods, Measurements, and Statistics panel, as well as the National Science Foundation Graduate Research Fellowship Program #DGE1321846.

## References

- Bayarri, M. J., & Mayoral, A. M. (2002). Bayesian analysis and design for comparison of effect-sizes. *Journal of Statistical Planning and Inference*, *103*, 225–243.
- Berger, J. O. (2006). Bayes factors. In S. Kotz, N. Balakrishnan, C. Read, B. Vidakovic, & N. L. Johnson (Eds.), *Encyclopedia of statistical sciences* (2nd ed., Vol. 1, pp. 378–386). Hoboken: Wiley.
- Calin-Jageman, R. J., & Caldwell, T. L. (2014). Replication of the superstition and performance study by Damisch, Stoberock, and Mussweiler (2010). *Social Psychology*, *45*, 239–245. https://doi.org/10.1027/1864-9335/a000190
- Dai, X., Wertenbroch, K., & Brendl, C. M. (2008). The value heuristic in judgments of relative frequency. *Psychological Science*, *19*, 18–19.
- Damisch, L., Stoberock, B., & Mussweiler, T. (2010). Keep your fingers crossed! How superstition improves performance. *Psychological Science*, *21*, 1014–1020.
- Dickey, J. M., & Lientz, B. P. (1970). The weighted likelihood ratio, sharp hypotheses about chances, the order of a Markov chain. *The Annals of Mathematical Statistics*, *41*, 214–226.
- Dienes, Z. (2008). *Understanding psychology as a science: An introduction to scientific and statistical inference*. Macmillan International Higher Education.
- Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. *Frontiers in Psychology*, *5*, 781.
- Dienes, Z., & Mclatchie, N. (2018). Four reasons to prefer Bayesian analyses over significance testing. *Psychonomic Bulletin & Review*, *25*(1), 207–218. https://doi.org/10.3758/s13423-017-1266-z
- Dienes, Z., Coulton, S., & Heather, N. (2018). Using Bayes factors to evaluate evidence for no effect: Examples from the SIPS project. *Addiction*, *113*(2), 240–246.
- Ebersole, C., Atherton, O., Belanger, A., Skulborstad, H., Allen, J., Banks, J., ..., Nosek, B. (2016). Many labs 3: Evaluating participant pool quality across the academic semester via replication. *Journal of Experimental Social Psychology*, *67*, 68–82.
- Etz, A., & Vandekerckhove, J. (2018). Introduction to Bayesian inference for psychology. *Psychonomic Bulletin & Review*, *25*(1), 5–34. https://doi.org/10.3758/s13423-017-1262-3
- Etz, A., & Wagenmakers, E.-J. (2017). J. B. S. Haldane’s contribution to the Bayes factor hypothesis test. *Statistical Science*, *32*(2), 313–329.
- George, E. I., Ročková, V., Rosenbaum, P. R., Satopää, V. A., & Silber, J. H. (2017). Mortality rate estimation and standardization for public reporting: Medicare’s hospital compare. *Journal of the American Statistical Association*, *112*(519), 933–947.
- Gronau, Q. F., & Wagenmakers, E.-J. (2017). Bayesian evidence accumulation in experimental mathematics: A case study of four irrational numbers. *Experimental Mathematics*. https://doi.org/10.1080/10586458.2016.1256006
- Gronau, Q. F., Ly, A., & Wagenmakers, E.-J. (2017a). Informed Bayesian *t*-tests. arXiv:1704.02479.
- Gronau, Q. F., van Erp, S., Heck, D. W., Cesario, J., Jonas, K. J., & Wagenmakers, E.-J. (2017b). A Bayesian model-averaged meta-analysis of the power pose effect with informed and default priors: The case of felt power. *Comprehensive Results in Social Psychology*, *2*, 123–138.
- Gunel, E., & Dickey, J. (1974). Bayes factors for independence in contingency tables. *Biometrika*, *61*, 545–557.
- Harms, C. (2016). A Bayes factor for replications of ANOVA results. Retrieved from https://arxiv.org/abs/1611.09341
- Jamil, T., Ly, A., Morey, R. D., Love, J., Marsman, M., & Wagenmakers, E.-J. (2017). Default “Gunel and Dickey” Bayes factors for contingency tables. *Behavior Research Methods*, *49*(2), 638–652.
- JASP Team (2018). JASP (Version 0.9.0.1) [Computer software]. Retrieved from https://jasp-stats.org/
- Jeffreys, H. (1938). Significance tests when several degrees of freedom arise simultaneously. *Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences*, *165*, 161–198.
- Jeffreys, H. (1961). *Theory of probability* (3rd ed.). Oxford: Oxford University Press.
- Kass, R. E., & Raftery, A. E. (1995). Bayes factors. *Journal of the American Statistical Association*, *90*, 773–795.
- Klein, R., Ratliff, K., Vianello, M., Adams, Jr. R. B., Bahník, V., Bernstein, M., ..., Nosek, B. (2014). Investigating variation in replicability: A “many labs” replication project. *Social Psychology*, *45*, 142–152. https://doi.org/10.1027/1864-9335/a000178
- Krupenye, C., Kano, F., Hirata, S., Call, J., & Tomasello, M. (2016). Great apes anticipate that other individuals will act according to false beliefs. *Science*, *354*, 110–114.
- Lewis, S. M., & Raftery, A. E. (1997). Estimating Bayes factors via posterior simulation with the Laplace–Metropolis estimator. *Journal of the American Statistical Association*, *92*(438), 648–655.
- Lindley, D. V. (1972). *Bayesian statistics, a review*. Philadelphia: SIAM.
- Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (2016a). Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. *Journal of Mathematical Psychology*, *72*, 19–32. https://doi.org/10.1016/j.jmp.2015.06.004
- Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (2016b). An evaluation of alternative methods for testing hypotheses, from the perspective of Harold Jeffreys. *Journal of Mathematical Psychology*, *72*, 43–55. https://doi.org/10.1016/j.jmp.2016.01.003
- Ly, A., Marsman, M., Verhagen, A. J., Grasman, R.P.P.P., & Wagenmakers, E.-J. (2017). A tutorial on Fisher information. *Journal of Mathematical Psychology*, *80*, 40–55. https://doi.org/10.1016/j.jmp.2017.05.006
- Ly, A., Raj, A., Etz, A., Marsman, M., Gronau, Q. F., & Wagenmakers, E.-J. (in press). Bayesian reanalyses from summary statistics and the strength of statistical evidence. *Advances in Methods and Practices in Psychological Science*. https://doi.org/10.31219/osf.io/7dzmk
- Marsman, M., Ly, A., & Wagenmakers, E.-J. (2016). Four requirements for an acceptable research program. *Basic and Applied Social Psychology*, *38*(6), 308–312. https://doi.org/10.1080/01973533.2016.1221349
- Marsman, M., Schönbrodt, F. D., Morey, R. D., Yao, Y., Gelman, A., & Wagenmakers, E.-J. (2017). A Bayesian bird’s eye view of ‘Replications of important results in social psychology’.
*Royal Society Open Science*,*4*, 160426.CrossRefPubMedPubMedCentralGoogle Scholar - Matzke, D., Nieuwenhuis, S., van Rijn, H., Slagter, H. A., van der Molen, M. W., & Wagenmakers, E.-J. (2015). The effect of horizontal eye movements on free recall: A preregistered adversarial collaboration.
*Journal of Experimental Psychology: General*,*144*, e1–e15.CrossRefGoogle Scholar - Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results.
*Social Psychology*,*45*, 137–141.CrossRefGoogle Scholar - O’Hagan, A., & Forster, J. (2004)
*Kendall’s advanced theory of statistics vol 2B: Bayesian inference*, (2nd ed.). London: Arnold.Google Scholar - Open Science Collaboration (2015). Estimating the reproducibility of psychological science.
*Science*,*349*(6251), aac4716. https://doi.org/10.1126/science.aac4716.CrossRefGoogle Scholar - Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence?
*Perspectives on Psychological Science*,*7*, 528–530.CrossRefPubMedGoogle Scholar - Rouder, J. N. (2014). Optional stopping: No problem for Bayesians.
*Psychonomic Bulletin & Review*,*21*, 301–308.CrossRefGoogle Scholar - Scheibehenne, B., Jamil, T., & Wagenmakers, E.-J. (2016). Bayesian evidence synthesis can reconcile seemingly inconsistent results: The case of hotel towel reuse.
*Psychological Science*,*27*(7), 1043–1046. https://doi.org/10.1177/0956797616644081 CrossRefPubMedGoogle Scholar - Scheibehenne, B., Gronau, Q. F., Jamil, T., & Wagenmakers, E.-J. (2017). Fixed or random? A resolution through model-averaging. Reply to Carlsson, Schimmack, Williams, and Burkner.
*Psychological Science*,*28*(11), 1698–1701. https://doi.org/10.1177/0956797617724426.CrossRefPubMedGoogle Scholar - Schönbrodt, F. D., & Wagenmakers, E.-J. (2018). Bayes factor design analysis: Planning for compelling evidence.
*Psychonomic Bulletin & Review*,*25*(1), 128–142.CrossRefGoogle Scholar - Silber, J. H., Satopää, V. A., Mukherjee, N., Rockova, V., Wang, W., Hill, A. S., ..., George, E. I. (2016). Improving Medicare’s hospital compare mortality model.
*Health Services Research*,*51*, 1229–1247.CrossRefPubMedPubMedCentralGoogle Scholar - van Doorn, J., Ly, A., Marsman, M., & Wagenmakers, E.-J (2016). Bayesian inference for Kendall’s rank correlation coefficient.
*The American Statistician*. https://doi.org/10.1080/00031305.2016.1264998. - Verhagen, A. J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a replication attempt.
*Journal of Experimental Psychology: General*,*143*(4), 1457–1475.CrossRefGoogle Scholar - Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method.
*Cognitive Psychology*,*60*, 158–189.CrossRefPubMedGoogle Scholar - Wagenmakers, E.-J., Wetzels, R., Borsboom, D., Kievit, R., & van der Maas, H. L. J. (2015). A skeptical eye on psi. In May, E., Marwaha, S. (Eds.),
*Extrasensory perception: support, skepticism, and science*(pp. 153–176). ABC-CLIO.Google Scholar - Wagenmakers, E.-J., Morey, R. D., & Lee, M. D. (2016a). Bayesian benefits for the pragmatic researcher.
*Current Directions in Psychological Science*,*25*, 169–176.CrossRefGoogle Scholar - Wagenmakers, E.-J., Verhagen, A. J., & Ly, A. (2016b). How to quantify the evidence for the absence of a correlation.
*Behavior Research Methods*,*2*, 413–426. https://doi.org/10.3758/s13428-015-0593-0.CrossRefGoogle Scholar - Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, A. J., Love, J., & Morey, R. D. (2018a). Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications.
*Psychonomic Bulletin & Review*,*25*(1), 35–57. https://doi.org/10.3758/s13423-017-1343-3 CrossRefGoogle Scholar - Wagenmakers, E.-J., Love, J., Marsman, M., Jamil, T., Ly, A., Verhagen, A.J., ..., Morey, R. D. (2018b). Bayesian inference for psychology. Part II: Example applications with JASP.
*Psychonomic Bulletin & Review*,*25*(1), 58–76. https://doi.org/10.3758/s13423-017-1323-7.CrossRefGoogle Scholar - Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, M. B. (2017). Making replication mainstream.
*Behavioral and Brain Sciences*, 1–50.Google Scholar

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.