A Bayes factor meta-analysis of Bem’s ESP claim
Abstract
In recent years, statisticians and psychologists have argued that p-values do not capture the evidence afforded by data and are, consequently, ill suited for analysis in scientific endeavors. The issue is particularly salient in the assessment of the recent evidence provided for ESP by Bem (2011) in the mainstream Journal of Personality and Social Psychology. Wagenmakers, Wetzels, Borsboom, and van der Maas (Journal of Personality and Social Psychology, 100, 426–432, 2011) have provided an alternative Bayes factor assessment of Bem’s data, but their assessment was limited to examining each experiment in isolation. We show here that the variant of the Bayes factor employed by Wagenmakers et al. is inappropriate for making assessments across multiple experiments, and cannot be used to gain an accurate assessment of the total evidence in Bem’s data. We develop a meta-analytic Bayes factor that describes how researchers should update their prior beliefs about the odds of hypotheses in light of data across several experiments. We find the evidence that people can feel the future with erotic and neutral stimuli to be slight, with Bayes factors of 3.23 and 1.57, respectively. There is some evidence, however, for the hypothesis that people can feel the future with emotionally valenced nonerotic stimuli, with a Bayes factor of about 40. Although this value is certainly noteworthy, we believe it is orders of magnitude lower than what is required to overcome appropriate skepticism of ESP.
Keywords
Statistics, Statistical inference, ESP
Bem (2011) has claimed that people can feel or sense salient events in the future that could not otherwise be anticipated. For example, in his Experiment 2, Bem presented participants with two rather ordinary pictures and asked them to indicate which one would be chosen subsequently by a random number generator. If a participant correctly anticipated the random choice, he or she was rewarded with a brief display of a positively valenced picture. Conversely, if a participant incorrectly anticipated the random choice, he or she was punished with a negatively valenced picture. Bem claimed that people could indeed feel these future reward and punishment events and, consequently, were able to anticipate the random choice at a rate deemed statistically above chance. Bem presented a sequence of similar experiments and results and, on this basis, concluded that people can feel the future. This phenomenon and others like it in which people can show seemingly impossible awareness of events are termed psi phenomena, or, more colloquially, extrasensory perception (ESP).
If ESP is substantiated, it would be among the most important findings in the history of psychology. The existence of ESP would force us to revise not only our theories of psychology, but also those of biology and physics. In our view, seemingly implausible claims made with conventional methods provide an ideal moment to reexamine those methods. The conventional approach used by Bem (2011) has two properties. First, as is typical in many empirical investigations, Bem presented a sequence of experiments, each targeting the same basic phenomenon from a slightly different angle. Second, Bem employed null-hypothesis significance testing in which p-values are reported as evidence and evaluated against a fixed criterion to reach judgments. In previous work, we joined a growing consensus that conventional inference by significance testing overstates the evidence for an effect (see Berger & Sellke, 1987; Edwards, Lindman, & Savage, 1963; Wagenmakers, 2007, among several others), and proposed a Bayes factor replacement for the t-test (Rouder, Speckman, Sun, Morey, & Iverson, 2009).
This Bayes factor quantifies the evidence in data for competing hypotheses from a single experiment or, more precisely, for a single comparison. Unfortunately, while this Bayes factor is appropriate for assessing evidence for a single contrast, it is ill suited for meta-analytically combining evidence across several experiments. Herein, we develop a meta-analytic version of the Bayes factor t-test and use it to assess the evidence across Bem’s experiments. We find some support for ESP; the combined data are about 40 times more probable under an ESP alternative than under a no-ESP null. This evaluation differs from that of Bem, who, in our opinion, overstated the evidence. It also differs from that of Wagenmakers, Wetzels, Borsboom, and van der Maas (2011), who found no support for ESP. Our interpretation of this Bayes factor is that while it is noteworthy, it is insufficient in magnitude to sway the beliefs of an appropriately skeptical reader.
The evidence from p-values and Bayes factors
There is a well-known asymmetry in significance testing: Researchers can reject the null hypothesis but can never accept it. This asymmetry works against the goals of scientific inquiry, because null hypotheses often correspond to theoretically useful statements of invariance and constraint (Gallistel, 2009; Kass, 1992; Rouder et al., 2009). For Bem’s (2011) case, the null hypothesis is the theoretically attractive, reasonable, and highly interpretable constraint that ESP does not exist. In order to fairly assess the evidence for ESP, it is necessary to be able to state the evidence for or against the null provided by the data. Yet, with significance testing, we may only accept ESP and never reject it.
The logic behind significance testing is a form of argument by contradiction. If observed data (or data more extreme) are improbable under the null, then the null is contradicted, and presumably, there is some alternative under which the data are more probable. It is reasonable to ask, then, about the factor by which the observed data are more probable under some alternative than under the null. This factor serves as a measure of evidence for the alternative, relative to the null. Suppose that a data set with a sample size of 50 yields a p-value in the interval between .04 and .05. Figure 1b shows the distributions of p-values under the null and under the alternative (effect size = .2) around this interval; the probabilities are the shaded areas under the curves. The probabilities of observing a p-value in this interval under the null and the alternative are .01 and .04, respectively. Therefore, the alternative fares four times better than the null. Although such a ratio constitutes evidence for the alternative, it is not as substantial as might be inferred from such a small p-value.
Figure 1c shows a similar plot for the null and alternative (effect size = .2) for a large sample size of 500. For this effect size and sample size, very small p-values are the norm. As a result, a p-value between .04 and .05 is about 10 times more likely under the null than under the alternative. More generally, a p-value at any one point (say, .05) constitutes increasing evidence for the null in the large-sample limit. This paradoxical behavior of significance testing, in which researchers reject the null even though the evidence overwhelmingly favors it, is known as Lindley’s paradox (Lindley, 1957) and is a primary critique of inference by p-values in the statistical literature.
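The two ratios above can be approximated with a few lines of code. The following is a minimal sketch, not the computation behind Figure 1: it assumes a one-sided test and replaces the t distribution with a normal approximation, and the function name `prob_p_in_interval` is ours, chosen for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used as an approximation to t

def prob_p_in_interval(p_low, p_high, effect_size, n):
    """Probability that a one-sided p-value falls in (p_low, p_high).

    Assumption: the test statistic is normal with mean effect_size * sqrt(n)
    and unit variance (a large-sample stand-in for the noncentral t).
    """
    shift = effect_size * math.sqrt(n)
    z_for_p_low = STD_NORMAL.inv_cdf(1.0 - p_low)    # statistic giving p = p_low
    z_for_p_high = STD_NORMAL.inv_cdf(1.0 - p_high)  # statistic giving p = p_high
    return STD_NORMAL.cdf(z_for_p_low - shift) - STD_NORMAL.cdf(z_for_p_high - shift)

for n in (50, 500):
    under_null = prob_p_in_interval(0.04, 0.05, 0.0, n)
    under_alt = prob_p_in_interval(0.04, 0.05, 0.2, n)
    print(f"n={n}: null={under_null:.4f} alt={under_alt:.4f} "
          f"alt/null={under_alt / under_null:.2f}")
```

Under these assumptions the ratio favors the alternative at the small sample size and the null at the large one, which is the reversal described in the text.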
We can examine the evidence from Bem’s (2011) data for various alternatives, relative to the null. In Experiment 1, for example, participants needed to anticipate which of two erotic pictures they would be shown. The average performance across 100 naive subjects was .531, and this level was significantly different from the at-chance baseline of .5, t(99) = 2.51, p = .007. Figure 1d shows the evidence for various alternatives. The probability ratios on the y-axis are the probability of the observed p-value under a specific alternative, relative to that under the null. Not surprisingly, these ratios vary greatly with the choice of alternative. Alternatives that are very near the null of .5 (say, .525) are preferred over the null (filled circle in Fig. 1d). Alternatives further from .5 (say, .58; filled square) are definitely not preferred over the null. Note that even though the null is rejected at p = .007, there is only a small range of alternatives where the probability ratio exceeds 10, and for no alternative does it exceed 25, much less 100 (as might naïvely be inferred from the p-value). We see that the null may be rejected by p-values even when the evidence for every specific point alternative is more modest.
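The shape of this curve can be illustrated with a rough calculation. The sketch below is not the procedure used for Figure 1d. It assumes a normal approximation to the sampling distribution of t, and it backs out the sample standard deviation from the reported mean and t value, a quantity not reported directly here. The name `probability_ratio` is ours.

```python
import math

N = 100            # subjects in Bem's Experiment 1
T_OBS = 2.51       # reported t(99)
MEAN_OBS = 0.531   # reported mean hit rate
CHANCE = 0.5       # null value

# Assumption: recover the sample SD from t = (mean - chance) * sqrt(N) / SD.
SD = (MEAN_OBS - CHANCE) * math.sqrt(N) / T_OBS

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def probability_ratio(alt_mean):
    """Approximate density of the observed t under alt_mean, relative to the null."""
    expected_t = (alt_mean - CHANCE) / SD * math.sqrt(N)
    return normal_pdf(T_OBS - expected_t) / normal_pdf(T_OBS)

for alt in (0.51, 0.525, 0.531, 0.55, 0.58):
    print(f"alternative {alt:.3f}: ratio {probability_ratio(alt):.3f}")
```

In this approximation the ratio is largest for an alternative equal to the observed mean and falls off quickly on either side, so alternatives far from .5 are disfavored relative to the null.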
To compute Bayes factors, researchers must choose the prior distribution f. Fortunately, there is ample guidance in the literature about how to do so for linear models, including the t-test (Gönen, Johnson, Lu, & Westfall, 2005; Liang, Paulo, Molina, Clyde, & Berger, 2008; Zellner, 1986; Zellner & Siow, 1980). We advocate a prior that serves as a generic default broadly applicable for scientific use. This prior was proposed by Jeffreys (1961), was developed for linear models by Zellner and Siow, among several others, and was termed the JZS prior by Bayarri and Garcia-Donato (2007). The JZS prior, along with the resulting JZS Bayes factor, is presented in the Appendix. The JZS Bayes factor has a number of advantages: It makes intuitive sense, it has beneficial theoretical properties,^{1} it is not dependent on the measurement scale of the dependent variable, and it can be conveniently computed.^{2} Further details are provided in Rouder et al. (2009).
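For readers who wish to see how such a computation might look, here is a minimal sketch of a one-sample JZS Bayes factor. It assumes the one-sample form in Rouder et al. (2009): a Cauchy prior of scale 1 on effect size, expressed as a normal prior with variance g and an inverse-gamma prior on g. The function name, the integration limits, and the use of Simpson's rule are our choices for illustration, not part of the original development.

```python
import math

def jzs_bayes_factor(t, n, steps=4000):
    """Two-sided JZS Bayes factor (alternative over null) for a one-sample t-test.

    Assumptions: unit-scale Cauchy prior on effect size; one-dimensional
    integral over g evaluated numerically on u = log(g).
    """
    nu = n - 1
    null_term = (1.0 + t * t / nu) ** (-(nu + 1) / 2.0)

    def integrand(u):
        g = math.exp(u)
        shrink = 1.0 + n * g
        likelihood = shrink ** -0.5 * (1.0 + t * t / (shrink * nu)) ** (-(nu + 1) / 2.0)
        prior = (2.0 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1.0 / (2.0 * g))
        return likelihood * prior * g  # trailing g is the Jacobian of g = exp(u)

    # Composite Simpson's rule; the integrand is negligible outside [-12, 40].
    lo, hi = -12.0, 40.0
    h = (hi - lo) / steps
    total = integrand(lo) + integrand(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    alt_term = total * h / 3.0
    return alt_term / null_term

# Bem's Experiment 1: t(99) = 2.51 with N = 100.
print(round(jzs_bayes_factor(2.51, 100), 2))
```

The result for Experiment 1 may be compared with the corresponding entry in Table 1.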
Wagenmakers et al.’s (2011) analysis of ESP
Table 1. Wagenmakers et al.’s (2011) assessment of Bem’s evidence
Experiment | Bayes Factor (Alt/Null) | Direction Predicted |
---|---|---|
1 | 1.64 | Yes |
2 | 1.05 | Yes |
3 | 1.82 | Yes |
4 | .58 | Yes |
5 | .88 | Yes |
6 | .32 | Yes |
6 | .30 | Yes |
7 | .13 | Yes |
8 | .47 | Yes |
9 | 5.9 | Yes |
The meta-analysis problem
Table 2. JZS Bayes factors across four replicate experiments
Analysis | N | \( \hat{\delta} \) | t | B |
---|---|---|---|---|
Experiment 1 | 100 | .18 | 2.16 | 0.75 |
Experiment 2 | 100 | .12 | 1.25 | 0.17 |
Experiment 3 | 100 | .29 | 2.80 | 3.29 |
Experiment 4 | 100 | .14 | 1.44 | 0.22 |
Data pooled | 400 | .18 | 3.83 | 54.1 |
Product of Bayes factors | | | | 0.092 |
Meta-analytic Bayes factor | | | | 49 |
This seeming contradiction comes about because JZS Bayes factors respect the resolution of data (Rouder et al., 2009). When the sample size is small, small effects may be considered evidence for the null because the null is the more parsimonious description given the resolution provided by the data. As the sample size grows, however, the resolution provided by the data is finer, and small effects are more concordant with the alternative. An appropriate analogy may be a criminal court trial in which each of several witnesses provides only partial information as to the guilt of a defendant who has committed a crime. If the jury is forced to assess the odds after hearing the testimony of any single witness, these odds may all favor innocence, since no one witness may be compelling enough in isolation to provide evidence for guilt. However, if the jury considers the totality of all testimonies, the weight will assuredly shift toward guilt.
It is reasonable to wonder whether the constant effect size model underlying the meta-analytic Bayes factor is warranted. We chose this approach because it is tractable when researchers have access to the test statistics, rather than the raw data. Alternative models that posit variation in effect size across experiments are possible (Utts, Norris, Suess, & Johnson, 2010), although analysis may require access to the raw data. These variable effect-size alternatives are certainly more complex than the constant effect-size model, and if the true effects are about the same size, they may be at a competitive disadvantage. Because Bem (2011) reports nearly constant effect sizes across the experiments, we believe that the constant effect size model is a convenient and appropriate alternative to the null model.
To illustrate this meta-analytic Bayes factor, we applied it to the four replicate experiments in Table 2. The value is about 49:1 in favor of an effect, which is quite close to the value of 54 from pooling the data. The reason these values differ slightly is that the meta-analytic Bayes factor posits a separate variance (σ^{2}) for each experiment, while the JZS Bayes factor on pooled data assumes a common, single variance.
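A computation of this kind can be sketched from the t values and sample sizes alone. The code below is a minimal illustration under stated assumptions, not a definitive implementation. It assumes a common effect size δ with a unit-scale Cauchy prior and a separate variance for each experiment, so that each experiment contributes the ratio of a noncentral to a central t density. That ratio is evaluated through a standard integral representation of the noncentral t density. The function names, grid sizes, and integration range are ours, and the `one_sided` option (a half-Cauchy prior on positive effects) is included as an assumption.

```python
import math

def chi_grid(nu, points=161, half_width=6.0):
    """Grid and normalized weights for the density proportional to y**nu * exp(-y*y/2)."""
    center = math.sqrt(nu)
    lo = max(center - half_width, 1e-6)
    step = (center + half_width - lo) / (points - 1)
    ys = [lo + k * step for k in range(points)]
    logs = [nu * math.log(y) - 0.5 * y * y for y in ys]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return ys, [w / total for w in weights]

def meta_bayes_factor(experiments, one_sided=False, steps=2000, limit=3.0):
    """Bayes factor (effect over null) from (t, n) pairs sharing one effect size.

    Assumption: |delta| > limit contributes negligibly, which holds when
    sample sizes are moderately large.
    """
    grids = []
    for t, n in experiments:
        nu = n - 1
        ys, ws = chi_grid(nu)
        slope = t * math.sqrt(n) / math.sqrt(t * t + nu)
        grids.append((n, slope, ys, ws))

    def integrand(delta):
        log_ratio = 0.0  # log of prod_i f(t_i | delta) / f(t_i | 0)
        for n, slope, ys, ws in grids:
            shifted = [slope * delta * y for y in ys]
            peak = max(shifted)
            tilt = sum(w * math.exp(s - peak) for s, w in zip(shifted, ws))
            log_ratio += -0.5 * n * delta * delta + peak + math.log(tilt)
        cauchy = 1.0 / (math.pi * (1.0 + delta * delta))
        return math.exp(log_ratio) * cauchy

    lo = 0.0 if one_sided else -limit
    h = (limit - lo) / steps
    total = integrand(lo) + integrand(limit)
    for k in range(1, steps):  # composite Simpson's rule over delta
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    value = total * h / 3.0
    return 2.0 * value if one_sided else value  # half-Cauchy has twice the density

# (t, N) pairs for the four replicate experiments in Table 2.
table2 = [(2.16, 100), (1.25, 100), (2.80, 100), (1.44, 100)]
print(round(meta_bayes_factor(table2), 1))
```

With a single (t, N) pair the function reduces to a one-sample JZS Bayes factor, which offers a simple consistency check on the sketch.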
The evidence in Bem’s (2011) data
Table 3. Bayes factors for three feeling-the-future hypotheses
Stimuli | Bem’s experiments | Sample sizes | t-values | Bayes factor |
---|---|---|---|---|
Erotic stimuli | 1 | 100 | 2.51 | 3.23 |
Negative or positive stimuli | 1, 2, 3, 4 | 100, 150, 97, 99 | −0.15, 2.39, 2.42, 2.43 | 38.7 |
Neutral stimuli | 1, 8, 9 | 100, 100, 50 | −0.15, 1.92, 2.96 | 1.57 |
We have not included results from Bem’s (2011) Experiments 5, 6, and 7 in our meta-analysis because we are unconvinced that these are interpretable. These three experiments are retroactive mere-exposure effect experiments in which the influence of future events purportedly affects the current preference for items. The main difficulty in interpreting these experiments is forming an expectation about the direction of an effect, and this difficulty has consequential ramifications. In the vast majority of conventional mere-exposure effect studies, participants prefer previously viewed stimuli (Bornstein, 1989). Bem observed this pattern for negative stimuli, but the opposite pattern, novelty preference, for positive stimuli. Bem claimed that this crossover was anticipated by the findings of Dijksterhuis and Smith (2002), who documented that participants habituate to emotional stimuli. Accordingly, previously encountered negative stimuli are judged less negative and previously encountered positive stimuli are judged less positive. We, however, remain unconvinced that Dijksterhuis and Smith’s emotional habituation is applicable here because of methodological differences. Dijksterhuis and Smith, for example, used six subliminal presentations to achieve habituation, and it is unclear whether habituation will follow from a single presentation. What is sorely missing is the analogous conventional mere-exposure experiment with the same negative, positive, neutral, and erotic stimuli to firmly establish expectations. In fact, Bem took this approach with his retroactive priming experiments (Bem’s Experiments 3 and 4), and the inclusion of conventional priming studies to establish firm expectations greatly increases the interpretability of those results. Without these control experiments to establish the direction of mere-exposure effects with emotional and evocative stimuli, the most judicious course is to exclude Experiments 5, 6, and 7 from analysis.
Table 3 reveals that there is relatively little support for the claim that people can feel the future with erotic or neutral events. The Bayes factor does offer some support for a retroactive effect of emotionally valenced, nonerotic stimuli: The evidence for an effect provided by Experiments 2, 3, and 4 outweighs the evidence against an effect provided by Experiment 1. In Experiment 2, participants were rewarded with brief presentations of positive pictures and punished with brief presentations of negative ones when they anticipated or failed to anticipate, respectively, the future state of a random-number generator. In Experiments 3 and 4, participants identified an emotionally valenced target stimulus more quickly when a subsequently presented prime matched the valence of the target.
General discussion
The publication of Bem’s (2011) report on ESP provides an ideal opportunity to discuss how evidence should be assessed and reported in experimental studies. We argue here that inference by p-values not only precludes stating evidence for theoretically useful null hypotheses, but also overstates the evidence against them. A suitable alternative is the Bayes factor, the relative probability of observing the data under two competing hypotheses. To use the Bayes factor, it is necessary to specify a prior against which evidence is calibrated. We recommend the JZS prior as a suitable generic default because the resulting Bayes factor is invariant to changes in measurement scale and has beneficial theoretical properties (see note 1). One of the drawbacks of our previous development (Rouder et al., 2009) was that it did not provide a means of combining data across multiple experiments, making meta-analysis difficult. Herein, we extend the JZS default Bayesian t-test to multiple experiments and use this new development to analyze the data in Bem. Our Bayes factor analyses of Bem’s data, which Bem offered as evidence of ESP, show that the data support more modest claims. The data yield no substantial support for ESP effects of erotic or neutral stimuli. For emotionally valenced nonerotic stimuli, however, we found a Bayes factor of about 40, and this is the factor by which readers should increase their odds.
We caution readers against interpreting this Bayes factor as the posterior odds that ESP is true. On the contrary, posterior odds should reflect the context provided by prior odds, as discussed previously. In the present case, there are two relevant sources of context for prior odds: past studies of ESP, and the plausibility of mechanisms underlying ESP. Bem (2011) follows in a line of parapsychological research that extends from the 1930s. In a recent meta-analysis, Storm, Tressoldi, and Di Risio (2010) reported a sizable degree of statistical support for ESP for certain classes of experiments. For example, among the 63 studies that used a four-choice procedure, participants responded correctly on a total of 1,326 out of 4,442 trials, a rate of almost 30% (as compared with a 25% baseline). We worry, however, about the frequency of unreported studies. To us, the more relevant context in setting prior odds is the lack of a plausible mechanism for ESP. ESP seems contradicted by well-substantiated theories in physics and biology. Consequently, it is reasonable to have low prior odds on ESP. In our view, while the evidence provided by Bem is certainly worthy of notice, it should not be sufficient to sway an appropriately skeptical reader. We remain unconvinced of the viability of ESP.
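The relation between the Bayes factor and posterior odds is simple multiplication, and a short worked example may make the point concrete. The prior odds below are hypothetical values chosen only for illustration; they are not estimates we endorse, and the function name is ours.

```python
from fractions import Fraction

def posterior_odds(prior_odds, bayes_factor):
    """Posterior odds equal the Bayes factor times the prior odds."""
    return prior_odds * bayes_factor

BAYES_FACTOR = 40  # approximate value for emotionally valenced nonerotic stimuli

# Hypothetical prior odds for ESP (assumptions for illustration only).
for prior in (Fraction(1, 1), Fraction(1, 1000), Fraction(1, 10**6)):
    post = posterior_odds(prior, BAYES_FACTOR)
    probability = float(post / (1 + post))
    print(f"prior odds {prior}: posterior odds {post}, probability {probability:.6f}")
```

A reader who starts from even odds ends well in favor of ESP, whereas a reader who starts from very low prior odds remains skeptical after the same update.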
^{1} The theoretical properties of the JZS Bayes factor are as follows. First, the Bayes factor is always finite for finite data. Second, the Bayes factor is consistent; as sample size is increased, B grows to infinity if the null is false and shrinks to zero if it is true. This consistency may be contrasted with p-values, which do not converge in the limit when the null is true (see Fig. 1). Finally, for any sample size, the Bayes factor grows to infinity as t grows to infinity.
^{2} Web applets to compute Bayes factors for paired and grouped t-tests may be found at pcl.missouri.edu/bayesfactor.