Just as no experimental psychologist would believe the findings of an experiment that reported data only from those subjects who showed a desired effect, the presence of a publication bias means that the studies of psi and verbal overshadowing tell us nothing scientifically useful about the phenomena, because critical data are missing from the results. The effects may be real, or they may be entirely due to bias. The set of studies is simply not scientific. Even worse, the publication bias test generally cannot identify which specific experimental results, if any, are valid, because it only tracks statistics across the entire set. Thus, although some researchers may report their experimental findings fully and properly, those experiments can be rendered scientifically useless by the poor reporting practices of other researchers.
It might be feared that the publication bias test is so stringent that almost every set of studies would demonstrate a publication bias. In fact, the test used here is strongly inclined not to report a publication bias, because reported effect sizes tend to overestimate reality (Ioannidis, 2008). Moreover, when the publication bias test indicates suppression of null or negative findings, the true effect size of the phenomenon being studied is probably smaller than the biased set of reports suggests. Thus, many of the estimated power values that form the basis of the test are larger than they should be, which means that the expected number of rejections is overestimated. Once evidence for bias is found, it is likely that the bias is even more pronounced than the test indicates.
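To make the arithmetic of this argument concrete, consider the small numerical sketch below (in Python, with power values invented purely for illustration; they are not estimates from the psi or verbal overshadowing studies). The probability that every experiment in a set rejects the null hypothesis is the product of the individual power values, and that product shrinks further when the true power is lower than the biased reports imply.

```python
# Illustrative only: invented power values, not estimates from any study discussed here.
import numpy as np

estimated_powers = np.array([0.80, 0.70, 0.60, 0.75, 0.65])  # implied by (biased) reported effect sizes
true_powers = estimated_powers - 0.15                         # suppose the true effects are weaker

# Probability that all five independent experiments reject the null hypothesis
print(np.prod(estimated_powers))  # ~0.16 under the inflated power estimates
print(np.prod(true_powers))       # ~0.05 under the smaller, more realistic values
```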
When evidence of a publication bias is presented, many people think of the file drawer problem: the idea that a researcher runs many different experiments but publishes only the ones that tend to show evidence for an effect. This kind of bias could be due to a deliberate intention of the author to mislead the field, or to an inability to get null findings approved by reviewers and editors. Such a problem surely exists, and the test described above can detect its influence.
A closely related bias, with a similar result, occurs when an experimenter measures many different variables but then selectively reports only the findings that reject the null hypothesis. Simmons, Nelson, and Simonsohn (2011) demonstrated how such an approach (in combination with some other tricks) can suggest evidence for truly outlandish effects. Moreover, seemingly innocuous decisions at many different levels of research can produce a publication bias. Given the high frequency of errors in reporting statistical findings (Bakker & Wicherts, 2011; Wicherts, Bakker, & Molenaar, 2011), researchers can introduce a publication bias by carefully double-checking results that are contrary to what was expected but not double-checking results that agree with their beliefs or expectations. Likewise, data from a subject who happens to perform poorly (relative to the experimenter's hopes) might be thrown out if there is some external cause, such as noise in a neighboring room or a mistake in the computer program, whereas data from a subject who happens to perform well under the same circumstances might be kept (and even interpreted as evidence of the effect's strength despite distractions).
One form of publication bias is related to experiment stopping rules. When gathering experimental data, it is easy to compute statistical tests at various times and to explore whether the experiment is likely to work. Such checks (sometimes called data peeking) introduce a bias if the experimenter allows that information to influence whether to continue the experiment (Berger & Berry, 1988; Kruschke, 2010; Wagenmakers, 2007). In a similar way, when an experiment ends but the data do not quite reach statistical significance, it is common for researchers to add more subjects in order to get a definitive answer. Although gathering more data seems like good science, this practice is inconsistent with the traditional frequentist approach to hypothesis testing, and both it and data peeking inflate the Type I error rate (Strube, 2006).
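The inflation produced by data peeking is easy to demonstrate by simulation. The sketch below is a minimal illustration, not a model of any particular study: the sample sizes, peek points, and use of a one-sample t test are arbitrary choices. Each simulated experiment has no true effect, yet testing the accumulating data at several interim points and stopping at the first significant result yields "effects" far more often than the nominal 5% of the time.

```python
# Minimal simulation of data peeking under the null hypothesis.
# All settings (sample sizes, peek points, test) are arbitrary illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims = 10_000                       # simulated experiments with no true effect
peek_points = [20, 40, 60, 80, 100]   # interim looks at the accumulating data

false_positives = 0
for _ in range(n_sims):
    data = rng.normal(loc=0.0, scale=1.0, size=peek_points[-1])  # null is true
    for n in peek_points:
        if stats.ttest_1samp(data[:n], popmean=0.0).pvalue < 0.05:
            false_positives += 1      # stop and declare an effect at the first significant peek
            break

print(f"Type I error rate with peeking: {false_positives / n_sims:.3f}")
# A single fixed test would yield about .05; repeated peeks push the rate well above that.
```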
The publication bias test cannot distinguish between the myriad ways for bias to appear, but since it provides evidence that the studies of psi and verbal overshadowing contain bias, one need not propose radical characteristics of the universe (Bem, 2011) or limits to the scientific method (Schooler, 2011) in order to explain the properties of those studies. The simpler explanation is that, as a set, the studies either are not reporting all of the relevant information or have rejected the null hypothesis more frequently than they should have because they were run improperly. Either way, these studies are at best anecdotal, and as such they need no scientific explanation at all.
Perhaps the most striking characteristic of both the Bem (2011) study and the set of studies reporting verbal overshadowing is that they meet the current standards of experimental psychology. The implication is that it is the standards and practices of the field that are not operating properly. Clearly, these standards and practices need to be improved to ensure that the frequency with which null findings are reported is consistent with the power of the experiments. Such improvements are long overdue. Sterling (1959) noted that 97.3% of the statistical tests for major findings reported in four psychology journals rejected the null hypothesis. A follow-up analysis by Sterling, Rosenbaum, and Weinkam (1995) found a similar result, with a null hypothesis rejection rate of 95.56%. Such high percentages would require that the power of the experiments (if the alternative hypotheses were true) generally be well above .9, even though power values in the range of .25 to .85 are more common in psychology (Hedges, 1984). Formalizing this discrepancy between the observed and expected proportions of rejections of the null hypothesis is the core of the publication bias test developed by Ioannidis and Trikalinos (2007) and used above.
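The size of this discrepancy is easy to appreciate with a back-of-the-envelope binomial calculation, sketched below. The numbers are illustrative rather than drawn from the Sterling surveys: if 100 published findings each tested a true effect with power .6 (near the middle of the range Hedges reports), about 60 rejections of the null hypothesis would be expected, and observing 96 or more would be extraordinarily unlikely.

```python
# Back-of-the-envelope calculation with illustrative numbers (not the Sterling data).
from scipy import stats

n_studies = 100
power = 0.6                 # assumed power of each study, if its alternative hypothesis is true
observed_rejections = 96    # roughly the rejection rate reported in the surveys, per 100 studies

expected_rejections = n_studies * power
p_at_least_observed = stats.binom.sf(observed_rejections - 1, n_studies, power)  # P(X >= 96)

print(f"Expected rejections: {expected_rejections:.0f}")
print(f"P(at least {observed_rejections} rejections): {p_at_least_observed:.1e}")
```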
When publication bias has been reported in other fields (e.g., by Munafò & Flint, 2010), there is often a call to create a registry of planned experiments and to require researchers to describe the outcomes of the experiments, regardless of the findings (Schooler, 2011). Such a project would be an expensive undertaking, would require complex paperwork for every experiment, and would be difficult to enforce in a field like experimental psychology, where many investigations are exploratory rather than planned. A much simpler, although partial, solution to the problem would be the adoption of Bayesian data analysis.
A Bayesian approach has two features that mitigate the appearance of publication bias. First, in addition to finding evidence in support of an alternative hypothesis, a Bayesian analysis can find evidence in support of a null hypothesis. Thus, an experiment that finds convincing evidence for a null hypothesis provides publishable scientific information about a phenomenon in a way that a failure to reject the null hypothesis does not. Second, a Bayesian analysis is largely insensitive to the effects of data peeking and multiple tests (Berger & Berry, 1988; Kruschke, 2010; Wagenmakers, 2007), so some of the methodological approaches that inflate Type I error rates and introduce a publication bias would be rendered inert. Bayesian data analysis is only a partial solution because the file drawer problem might still result from the actions of researchers, reviewers, and editors who deliberately or unintentionally choose to suppress experimental findings that go against their hopes. Such remaining cases could be identified by the publication bias test described in this article, or by a Bayesian equivalent.
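As a concrete illustration of the first point, consider the sketch below: a simple Bayesian comparison of a chance baseline against a uniform alternative for binomial data. The counts and the choice of prior are purely illustrative and are not an analysis of any study discussed here. With 515 hits in 1,000 trials, a conventional test merely fails to reject the 50% baseline, whereas the Bayes factor quantifies positive evidence in its favor.

```python
# Minimal sketch of a Bayes factor for binomial data; the uniform prior under H1
# and the example counts are arbitrary illustrative choices.
from scipy import stats

def bf01_binomial(k, n):
    # P(k | H0: theta = 0.5) is the binomial probability at theta = 0.5.
    # P(k | H1: theta ~ Uniform(0, 1)) integrates the binomial likelihood
    # over the prior, which works out to 1 / (n + 1).
    return stats.binom.pmf(k, n, 0.5) * (n + 1)

print(bf01_binomial(515, 1000))  # roughly 16, i.e., the data favor the chance baseline
```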
The findings of the publication bias test in experimental psychology demonstrate that the care so rigorously exercised at the level of an individual experiment is sometimes absent at the next level up, when research findings are reported. Such publication bias is inappropriate for an objective science, and the field must improve its methods and reporting practices to avoid it.