A tutorial on a practical Bayesian alternative to null-hypothesis significance testing
Abstract
Null-hypothesis significance testing remains the standard inferential tool in cognitive science despite its serious disadvantages. Primary among these is the fact that the resulting probability value does not tell the researcher what he or she usually wants to know: How probable is a hypothesis, given the obtained data? Inspired by developments presented by Wagenmakers (Psychonomic Bulletin & Review, 14, 779–804, 2007), I provide a tutorial on a Bayesian model selection approach that requires only a simple transformation of sum-of-squares values generated by the standard analysis of variance. This approach generates a graded level of evidence regarding which model (e.g., effect absent [null hypothesis] vs. effect present [alternative hypothesis]) is more strongly supported by the data. This method also obviates admonitions never to speak of accepting the null hypothesis. An Excel worksheet for computing the Bayesian analysis is provided as supplemental material.
Keywords
Bayesian analysis · Null-hypothesis significance testing

The widespread use of null-hypothesis significance testing (NHST) in psychological research has withstood numerous rounds of debate (e.g., Chow, 1998; Cohen, 1994; Hagen, 1997; Krueger, 2001; Nickerson, 2000) and continues to be the field's most widely applied method for evaluating evidence. There are, however, clear shortcomings attached to standard applications of NHST. In this article, I briefly review the most important of these problems and then present a summary of a Bayesian alternative developed by Wagenmakers (2007). My goal is to provide readers with a highly accessible, practical tutorial on the Bayesian approach that Wagenmakers presented. Motivation for such a tutorial comes from the fact that in the time since publication of Wagenmakers's article, very little appears to have changed with respect to the reliance on NHST methods in the behavioral sciences. I suspect that one reason that Bayesian methods have not been more readily adopted is that researchers have failed to recognize the availability of a sufficiently simple means of computing and interpreting a Bayesian analysis. This tutorial is intended as a remedy for that obstacle.
Serious problems with NHST
The p value generated by NHST is misleading, particularly in the sense that it fails to provide the information that a researcher actually wants to have. That is, the NHST p value is a conditional probability that indicates the likelihood of an observed result (or any more extreme result), given that the null hypothesis is correct: p(D|H_{0}). Researchers draw inferences from this conditional probability regarding the status of the null hypothesis, reasoning that if the p value is very low, the null hypothesis can be rejected. But what are the grounds for this rejection? One possibility is that a low p value signals that the null hypothesis is unlikely to be true. This conclusion, however, does not follow directly from a low p value (see Cohen, 1994). The probability of the null hypothesis being true, given the obtained data, would be expressed as p(H_{0}|D), which is not equivalent to p(D|H_{0}). The only way to translate p(D|H_{0}), which is readily found by applying NHST methods, into p(H_{0}|D) is by applying Bayes' theorem. Trafimow and Rice (2009) did just that in a Monte Carlo simulation to investigate the extent to which p(D|H_{0}) and p(H_{0}|D) are correlated, as they must be if one is to justify decisions about the null hypothesis under NHST. Their results indicated a correlation of only .396 between these two measures. When they dichotomized the p(D|H_{0}) values as significant or not significant, on the basis of the typical cutoff of .05, the correlation between the dichotomized category and the p(H_{0}|D) values dropped to .289. Surprisingly, this correlation fell even more when more stringent cutoff values (.01 and .001) were applied. Thus, it is far from safe to assume that the magnitude of p(H_{0}|D) can be directly inferred from p(D|H_{0}). The Bayesian approach proposed by Wagenmakers (2007) eliminates this problem by directly computing p(H_{0}|D).
A second concern with NHST stems from the fact that researchers are permitted to make one of two decisions regarding the null hypothesis: reject or fail to reject. This approach to hypothesis testing does not provide a means of quantifying evidence in favor of the null hypothesis and even prohibits the concept of accepting the null hypothesis (Wilkinson & the Task Force on Statistical Inference, 1999). Thus, researchers are typically left with little to say when their statistical test fails to reject the null hypothesis. The Bayesian approach, because it can directly evaluate the relative strength of evidence for the null and alternative hypotheses, provides clear quantification of the degree to which the data support either hypothesis.
Fundamentals of a Bayesian alternative to NHST
The first point to note about the Bayesian approach to hypothesis testing is that, unlike NHST, this approach does not rest on a binary decision about a single (null) hypothesis. Rather, the Bayesian approach is essentially a model selection procedure that computes a comparison between competing hypotheses or models. For most applications in which NHST is currently the method of choice, the Bayesian alternative consists of a comparison between two models, which we can continue to characterize as the null and alternative hypotheses, or which can be two competing models that assume different configurations of effects. Non-Bayesian methods involving a similar model comparison approach have also been developed and have their own merits. For example, Dixon (2003) and Glover and Dixon (2004) proposed using likelihood ratios as a means of determining which of two models is more likely given a set of data. The likelihood ratio is equal to the likelihood of the observed data given one model, relative to the likelihood of the data given a competing model. As with the Bayesian method, likelihood ratios provide graded evidence concerning which model is more strongly supported by the data, rather than a thresholded binary decision. In the method advocated by Glover and Dixon (2004), the Akaike information criterion (Akaike, 1974) may be used to adjust the likelihood ratio to take into account the different number of parameters on which competing models are based. Glover and Dixon (2004) also pointed out that one could use the Bayesian information criterion (BIC; described more fully below), instead of the AIC, in generating a likelihood ratio that takes into account differences in the number of free parameters. In this sense, the Glover and Dixon (2004) formulation anticipates the approach developed by Wagenmakers (2007), which can be seen as a special case of the more general likelihood-ratio method proposed by Glover and Dixon (2004). 
Unlike the Bayesian method, evaluating models on the basis of a likelihood ratio does not involve computing a posterior probability such as p(H_{0}|D). Rather, the ratio itself is taken as a measure of the strength of evidence favoring one model over another. The likelihood ratio can be computed for various experimental designs as readily as the Bayesian analysis described here (Bortolussi & Dixon, 2003; Dixon, 2003; Dixon & O'Reilly, 1999; Glover & Dixon, 2004), and therefore, it constitutes another practical alternative to NHST.
The posterior odds, p(H_{0}|D)/p(H_{1}|D) (left side of Eq. 3), then, are determined by the product of what is called the Bayes factor, p(D|H_{0})/p(D|H_{1}) (first term on the right side of the equation), and the prior odds, p(H_{0})/p(H_{1}). The posterior odds give the relative strength of evidence in favor of H_{0} relative to H_{1}.
Furthermore, if it is assumed that the prior odds equal 1 (i.e., the two hypotheses are deemed equally likely before the data are collected), the posterior odds are equal to the Bayes factor. It is clear, then, that the Bayes factor plays a crucial role in establishing the relative evidential support for the null and alternative hypotheses. One might argue that it is not reasonable to hold that H_{0} and H_{1} are equally plausible a priori when one of the models is the null hypothesis, which specifies a precise effect size of 0, whereas the competing model allows a range of possible effect sizes. Moreover, it seems unlikely, for example, that two populations would have identical means, so the exact value specified by H_{0} is unlikely ever to be literally correct. There are a number of possible responses to this concern. First, the Bayesian analysis holds either for the case where H_{0} specifies an effect size of exactly 0 or for a distribution of very small effect sizes, centered at 0, that would require a very large sample size to detect (Berger & Delampady, 1987; Iverson, Wagenmakers, & Lee, 2010), which usually is what one has in mind when arguing that H_{0} is never precisely correct (e.g., Cohen, 1994). Second, theoretical models tested in experiments often predict precisely no effect, setting up the classic H_{0} as a viable model with a precise effect size of 0 (Iverson et al., 2010). Third, one has the option to specify prior odds other than 1 and to apply simple algebra to adjust the posterior odds accordingly before converting those odds to posterior probabilities. Finally, one could forego computation of the posterior probabilities and simply take the Bayes factor as a measure of the relative adequacy of the two competing models, which would be analogous to the method of using likelihood ratios discussed earlier (e.g., Glover & Dixon, 2004).
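The "simple algebra" for prior odds other than 1, mentioned in the third response above, amounts to multiplying the Bayes factor by the chosen prior odds and then normalizing. A minimal sketch (the function name is mine, not part of any published worksheet):

```python
def posterior_from_bf(bayes_factor, prior_odds=1.0):
    """Convert a Bayes factor favoring H0 into posterior probabilities.

    Posterior odds = Bayes factor * prior odds; the posterior
    probabilities follow by normalizing the odds.
    """
    post_odds = bayes_factor * prior_odds
    p_h0 = post_odds / (1 + post_odds)
    return p_h0, 1 - p_h0
```

With prior odds of 1, this reduces to p(H_{0}|D) = BF/(1 + BF); for example, a Bayes factor of 0.1782 yields p(H_{0}|D) ≈ .151 and p(H_{1}|D) ≈ .849.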
The difficulty we face in computing the Bayes factor is that although the probability of obtaining the observed data, given the null hypothesis, can be computed rather easily (akin to what is done in NHST), the corresponding conditional probability based on the alternative hypothesis is another matter. Unlike the null hypothesis, the alternative hypothesis does not specify one particular a priori value for the effect in question. Rather, the alternative hypothesis is associated with a distribution of possible effect sizes, and the value of the Bayes factor depends on the nature of that distribution. Therefore, exact computation of the Bayes factor quickly becomes complex, involving integration over the space of possible effect sizes using procedures such as Markov chain Monte Carlo methods (e.g., Kass & Raftery, 1995). These are not methods that most experimental psychologists are readily equipped to apply.
Before describing a practical method for computing BIC values, or at least ΔBIC (the critical element in Eq. 5), and the resulting posterior probabilities, an important caveat is in order. Computation of the Bayes factor depends on the specification of a prior distribution for the effect size parameter that distinguishes the alternative hypothesis from the null hypothesis. That is, assuming (as per the alternative hypothesis) that there is some nonzero effect size, what are the probable values of this effect size? These values cover some distribution whose characteristics influence the eventual posterior odds. The method of estimating the Bayes factor implemented in Eq. 5 is consistent with a prior distribution of possible effect size parameter values known as the unit information prior (Kass & Wasserman, 1995). This distribution is the standard normal distribution centered at the value of the effect size observed in the data and extending over the distribution of observed data (Raftery, 1999). Because the unit information prior is normal in shape, its coverage emphasizes the plausible values of the effect size parameter without being excessively spread out (i.e., very little likelihood is associated with the more extreme values of effect size). As Raftery (1999) pointed out, this is a reasonable prior distribution in the sense that a researcher likely has some idea in advance of the general range in which the observed effect size is likely to fall and so will not put much prior probability outside that range. The application of Eq. 5 to estimate the Bayes factor, then, implicitly assumes the unit information prior as the distribution of the effect size parameter under the alternative hypothesis. It should be noted, however, that the unit information prior is more spread out than other potential prior distributions that could be informed by more specific a priori knowledge of the likely effect size. 
Prior distributions with greater spread tend to favor the null hypothesis more than do prior distributions with less spread. In this sense, the BIC estimate of the Bayesian posterior probabilities should be considered somewhat conservative with respect to providing evidence for the alternative hypothesis (Raftery, 1999).
Computation of posterior probabilities
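The computation this section describes requires only a few lines of code. The sketch below (in Python rather than the Excel worksheet provided as supplemental material; function and variable names are mine) implements the BIC transformation of sum-of-squares values described by Wagenmakers (2007), ΔBIC = n ln(SSE_{1}/SSE_{0}) + (k_{1} – k_{0}) ln n, estimates the Bayes factor as e^{ΔBIC/2}, and assumes prior odds of 1:

```python
import math

def bic_posteriors(n, k_diff, ss_effect, ss_error):
    """Estimate Bayesian posterior probabilities from ANOVA sums of squares.

    n         -- number of independent observations
    k_diff    -- k1 - k0, difference in free parameters between models
    ss_effect -- sum of squares for the effect of interest
    ss_error  -- sum of squares for the associated error term (SSE_1)
    """
    sse1 = ss_error               # unexplained variability under H1
    sse0 = ss_error + ss_effect   # unexplained variability under H0
    delta_bic = n * math.log(sse1 / sse0) + k_diff * math.log(n)
    bf = math.exp(delta_bic / 2)  # Bayes factor favoring H0 over H1
    p_h0 = bf / (1 + bf)          # assumes prior odds of 1
    return delta_bic, bf, p_h0, 1 - p_h0
```

A negative ΔBIC (Bayes factor below 1) favors the alternative hypothesis; a positive ΔBIC favors the null.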
Example applications of the Bayesian method
Example 1
Mean proportions correct and ANOVA summary table from Breuer et al. (2009, Experiment 2A)
Study Condition | Mean | Source | SS | df | MS | F | p |
---|---|---|---|---|---|---|---|
Original | .700 | Subjects | 0.668 | 39 | | | |
Mirror image | .621 | Item | 0.357 | 2 | 0.178 | 12.90 | < .001 |
Unprimed | .567 | Item × Subjects | 1.078 | 78 | 0.014 | | |
 | | Total | 2.103 | 119 | | | |
These posterior probability values indicate that the data very clearly favor the alternative hypothesis over the null hypothesis. I have implemented in Excel a routine for computing SSE_{0}, ΔBIC, the Bayes factor, and the posterior probabilities for the null and alternative hypotheses from input consisting of n (number of independent observations), k_{1} – k_{0} (the difference between the two models with respect to number of free parameters), sum of squares for the effect of interest, and sum of squares for the error term associated with the effect of interest (SSE_{1}). The Excel worksheet is available as supplementary material for this article.
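The worksheet's computation for this example can be reproduced in a few lines. The sketch below (variable names are mine) uses the Table 1 values, taking n = 120 independent observations (40 subjects × 3 study conditions) and k_{1} – k_{0} = 2 for the three-level item effect; these input choices are my reading of the example rather than values quoted in this passage:

```python
import math

# Inputs mirroring the Excel worksheet described in the text
n = 120            # independent observations (40 subjects x 3 conditions)
k_diff = 2         # k1 - k0: the item effect adds two free parameters
ss_effect = 0.357  # sum of squares for the item effect (Table 1)
sse1 = 1.078       # error term for the effect (Item x Subjects)
sse0 = sse1 + ss_effect  # unexplained variability under the null model

delta_bic = n * math.log(sse1 / sse0) + k_diff * math.log(n)
bf = math.exp(delta_bic / 2)   # Bayes factor favoring H0
p_h0 = bf / (1 + bf)
p_h1 = 1 - p_h0
```

Under these assumptions ΔBIC is strongly negative, so the posterior probability for the alternative hypothesis is very close to 1, consistent with the verdict stated above.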
Descriptive terms for strength of evidence corresponding to ranges of p_{BIC} values, as suggested by Raftery (1995)
p_{BIC}(H_{i}|D) | Evidence |
---|---|
.50–.75 | weak |
.75–.95 | positive |
.95–.99 | strong |
> .99 | very strong |
Note that, in this test, only two scores per subject are considered, so the number of independent observations is equal to the number of subjects. The resulting Bayes factor is e^{–3.45/2} = 0.1782, and the posterior probabilities are p_{BIC}(H_{0}|D) = .151 and p_{BIC}(H_{1}|D) = .849. This outcome provides positive evidence, using Raftery's (1995) classification, that changing an item's orientation between study and test reduces priming.
The Bayes factor in this case is .377, which produces a posterior probability for the linear trend model of p_{BIC}(H_{1}|D) = .726, which qualifies only as weak evidence in favor of that model over the model that assumes priming only in the original condition.
Example 2
Mean response times (in milliseconds) and ANOVA summary table from Bub and Masson (2010, Experiment 1)
Alignment | Delay 0 ms | Delay 195 ms | Source | SS | df | MS | F | p |
---|---|---|---|---|---|---|---|---|
Aligned | 523 | 497 | Subjects | 234,406 | 47 | | | |
Not aligned | 524 | 509 | Alignment | 1,906 | 1 | 1,906 | 7.40 | .009 |
 | | | Align. × Subj. | 12,103 | 47 | 258 | | |
 | | | Delay | 19,380 | 1 | 19,380 | 50.71 | < .001 |
 | | | Delay × Subj. | 17,964 | 47 | 382 | | |
 | | | Align. × Delay | 1,536 | 1 | 1,536 | 16.33 | < .001 |
 | | | Align. × Del. × Subj. | 4,421 | 47 | 94 | | |
 | | | Total | 291,716 | 191 | | | |
Example 3
Mean corrected recognition scores (95% confidence interval), ANOVA results, and Bayesian analysis for Kantner and Lindsay's (2010) experiments
Experiment | Feedback | Control | n | SS_{effect} | SS_{error} | F | BF | p_{BIC}(H_{0}|D) |
---|---|---|---|---|---|---|---|---|
1 | .510 (±.05) | .524 (±.05) | 46 | 0.002 | 0.570 | < 1 | 6.26 | .862 |
2: 75% old | .594 (±.07) | .643 (±.07) | 36 | 0.022 | 0.645 | 1.15 | 3.28 | .766 |
2: 25% old | .600 (±.08) | .579 (±.07) | 35 | 0.004 | 0.750 | < 1 | 5.39 | .844 |
3 | .667 (±.07) | .688 (±.06) | 43 | 0.005 | 0.922 | < 1 | 5.84 | .854 |
4: CRR | .480 (±.07) | .410 (±.06) | 45 | 0.054 | 0.929 | 2.52 | 1.88 | .653 |
4: SR | .332 (±.09) | .351 (±.05) | 29 | 0.002 | 0.452 | < 1 | 5.05 | .835 |
All exps. | .537 (±.03) | .538 (±.03) | 234 | 0.00013 | 7.060 | < 1 | 15.27 | .938 |
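As a check, the BF and p_{BIC}(H_{0}|D) columns of the table above can be regenerated directly from the n, SS_{effect}, and SS_{error} columns. A sketch, assuming k_{1} – k_{0} = 1 for these two-condition comparisons (the function name is mine):

```python
import math

def bf_h0(n, ss_effect, ss_error, k_diff=1):
    """BIC-based Bayes factor favoring H0, from ANOVA sums of squares."""
    sse0 = ss_error + ss_effect
    delta_bic = n * math.log(ss_error / sse0) + k_diff * math.log(n)
    return math.exp(delta_bic / 2)

# Experiment 1 row: n = 46, SS_effect = 0.002, SS_error = 0.570
bf = bf_h0(46, 0.002, 0.570)
p_h0 = bf / (1 + bf)
# bf is approximately 6.26 and p_h0 approximately .862,
# matching the table entries for Experiment 1
```

The same function reproduces the aggregated row ("All exps."), where the tiny effect sum of squares combined with the large sample yields a Bayes factor of about 15.3 in favor of the null hypothesis.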
In the context of NHST, one can compute power estimates to help interpret data when the null hypothesis is not rejected. To compare this approach to the Bayesian method for the aggregated data shown in Table 4, I computed a power analysis using G*Power 3 (Faul, Erdfelder, Lang, & Buchner, 2007), assuming three different effect sizes, equal to Cohen's (1988) benchmark values. Although the aggregated data had substantial power to detect medium (d = .5, power = .97) or large (d = .8; power > .99) effect sizes, power to detect a small effect size (d = .2) was very low at .33. This collapse of power as assumed effect size shrinks reflects the stricture within the NHST framework against accepting the null hypothesis. When using NHST, the best one can do is to claim that there is good evidence that the true effect size lies below some upper bound.
Implications of using the Bayesian method
For any researcher new to the Bayesian method proposed by Wagenmakers (2007), questions regarding the relationship between the Bayesian posterior probabilities and the more familiar NHST p values are bound to arise. I address this general issue in three ways. First, I present a plot showing how Bayesian posterior probabilities vary as a function of sample size and effect size (i.e., how well the data are described by the alternative vs. the null hypothesis). Second, a plot is provided that compares the posterior probabilities for the null hypothesis with the p values generated by NHST for the sample size and effect size combinations shown in the first plot. Finally, I present a plot (which is a variant of one used by Wagenmakers, 2007) to show how Bayesian posterior probabilities diverge from the NHST p value when effect size varies but sample size is adjusted to keep the NHST p value constant. All of these results are based on a repeated measures design with one factor having two levels, but the relationships revealed by these plots should generally hold for other designs as well.
As the goodness of fit of the alternative model (i.e., the effect size) decreases, the posterior probability for the alternative hypothesis decreases. Then, as the null model begins to be favored over the alternative model, the posterior probability for the alternative hypothesis approaches its lower asymptotic value. Note that the maximum value for the ratio of the two SSE terms is 1.0 (denoting an effect size of zero), so Fig. 1 indicates that the lower asymptotic value for p_{BIC}(H_{1}|D) varies as a function of sample size. In particular, with smaller samples, the lower asymptote for p_{BIC}(H_{1}|D) is greater than when sample size is larger. The implication of this fact is that sample size limits the strength of evidence in favor of the null hypothesis (evidence is stronger with larger sample sizes). Figure 1 also shows that with larger sample sizes, p_{BIC}(H_{1}|D) retains a relatively large value as the SSE ratio increases, until a critical value is reached (about .75 for n = 40), at which point p_{BIC}(H_{1}|D) drops sharply and the Bayesian analysis begins to favor the null hypothesis.
Why do the Bayesian and NHST approaches to evaluating the null hypothesis diverge to the striking degree shown in Fig. 3? The crucial element here is the fact that to maintain a constant p value under NHST, the effect size (represented by \( \eta_{p}^{2} \)) must shrink as sample size increases. The difficulty this situation poses for NHST is that the method is based on a decision-making process that considers only the plausibility of the null hypothesis. But the drop in effect size also has implications for how strongly the alternative model is supported by the data. As the effect size shrinks, so too does the value added by the extra parameter (nonzero effect size) carried by the alternative hypothesis. In the Bayesian analysis, the reduced contribution of that additional parameter in accounting for variability in the data shows up as a liability when the penalty for this parameter is taken into account (the last term in Eq. 10). The critical advantage of the Bayesian approach is that it consists of a comparative evaluation of two models or hypotheses, rather than driving toward a binary decision about a single (null) hypothesis, as in NHST.
Conclusion
The BIC approximation of Bayesian posterior probabilities introduced by Wagenmakers (2007) offers significant advantages over the NHST method. Foremost among these is the fact that the Bayesian approach provides exactly the information that researchers often seek—namely, the probability that a hypothesis or model should be preferred, given the obtained data. In addition, the Bayesian approach generates graded evidence regarding both the alternative and the null hypotheses, including the degree of support favoring the null (in contrast with NHST, which proscribes acceptance of the null hypothesis), and provides a simple means of aggregating evidence across replications. Researchers who wish to consider using this Bayesian method may be reluctant to break new ground. Why give any additional cause for reviewers or editors to react negatively to a manuscript during peer review? Any such reticence is understandable. Nevertheless, my goal is to encourage researchers to overcome this concern. Studies are beginning to appear that include Bayesian analyses of data, sometimes presented alongside NHST p values to give readers some sense of how the Bayesian analysis compares with NHST results (e.g., Winkel, Wijnen, Ridderinkhof, Groen, Derrfuss, Danielmeier, & Forstmann, 2009). Moreover, the computation of estimated posterior probabilities and the Bayes factor is relatively straightforward, as has been shown here, and BIC values associated with specific models or hypotheses can be computed in the open-source statistical package R (R Development Core Team, 2010).
In my roles as action editor, reviewer, and author, I have found that participants in the peer review process generally are surprisingly receptive to alternative methods of data analysis. Various studies have appeared over the years that report analyses of data using techniques in place of NHST, such as likelihood ratios (e.g., Glover & Dixon, 2001; Glover, Dixon, Castiello, & Rushworth, 2005) and confidence intervals (e.g., Bernstein, Loftus, & Meltzoff, 2005; Loftus & Harley, 2005; Loftus & Irwin, 1998). The BIC approximation to Bayesian posterior probabilities is an excellent and highly practical candidate for inclusion among these alternatives.
Author Note
Preparation of this article was supported by a discovery grant to the author from the Natural Sciences and Engineering Research Council of Canada. I am grateful to Jo-Anne Lefevre for suggesting that an article of this nature would be useful, to Peter Dixon and Geoffrey Loftus for helpful comments on an earlier version of the manuscript, and especially to E.-J. Wagenmakers for valuable suggestions and guidance. I also thank Justin Kantner and Stephen Lindsay for making their raw data available. Correspondence regarding this article should be sent to Michael Masson, Department of Psychology, University of Victoria, P.O. Box 3050 STN CSC, Victoria, British Columbia V8W 3P5, Canada (e-mail: mmasson@uvic.ca).