We report the prevalence of reporting inconsistencies at different levels. We document the general prevalence of NHST results and present the percentage of articles that use NHST per journal and over the years. Because only the five APA journals provided HTML versions for all years from 1985 to 2013, the overall trends are reported for APA journals only and do not include results from Psychological Science, PLoS, and Frontiers, which cover only recent years. Reporting inconsistencies are presented both at the level of the article and at the level of the individual p-value: the percentage of articles with at least one inconsistency and the average percentage of p-values within an article that are inconsistent, respectively. We also describe differences between journals and trends over time.
Percentage of articles with null-hypothesis significance testing (NHST) results
Overall, statcheck detected NHST results in 54.4 % of the articles, but this percentage differed per journal. The percentage of articles with at least one detected NHST result ranged from 24.1 % in PLoS to 85.1 % in JPSP (see Table 1). This can reflect a difference in the number of null-hypothesis significance tests performed, but it could also reflect a difference in the rigor with which the APA reporting standards are followed or in how often tables are used to report results. Figure 1 shows the percentage of downloaded articles that contained NHST results over the years, averaged over all APA journals (DP, JCCP, JEPG, JPSP, and JAP; dark gray panel), and split up per journal (light gray panels for the APA journals and white panels for the non-APA journals). All journals showed an increase in the percentage of articles with APA-reported NHST results over the years, except for DP, in which this rate remained constant, and FP, in which it declined. Appendix B lists the number of articles with NHST results over the years per journal.
Number of published NHST results over the years
We inspected the development of the average number of APA-reported NHST results per article, given that the article contained at least one detectable NHST result (see Fig. 2). Note that in 1985 the APA manual already required statistics to be reported in the manner that statcheck can read (American Psychological Association, 1983). Hence, any change in retrieved NHST results over time should reflect the actual change in the number of NHST results reported in articles rather than any change in the capability of statcheck to detect results.
Across all APA journals, the number of NHST results per article has increased over the period of 29 years (b = .25, R2 = .68), with the strongest increases in JEPG and JPSP. These journals went from an average of around 10–15 NHST results per article in 1985 to as many as around 30 results per article on average in 2013. The mean number of NHST results per article remained relatively stable in DP, JCCP, and JAP; over the years, the articles with NHST results in these journals contained an average of ten NHST results. It is hard to say anything definite about trends in PS, FP, and PLoS, since we have only a limited number of years for these journals (the earliest years we have information for are 2003, 2010, and 2004, respectively). Both the increase in the percentage of articles that report NHST results and the increase in the number of NHST results per article show that NHST is increasingly popular in psychology. It is therefore important that the results of these tests are reported correctly.
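As an aside for readers who want to reproduce this kind of trend analysis, the following minimal sketch shows how such a regression of yearly means on publication year could be computed in Python; the yearly means below are hypothetical placeholders, not the actual journal data.

```python
# Sketch of the trend analysis reported above: regress the mean number of
# NHST results per article on publication year. The data points are
# hypothetical placeholders; the paper reports b = .25 and R2 = .68.
from scipy.stats import linregress

years = [1985, 1990, 1995, 2000, 2005, 2010, 2013]
mean_nhst_per_article = [10.2, 12.1, 14.5, 16.8, 19.3, 22.0, 24.1]  # hypothetical

fit = linregress(years, mean_nhst_per_article)
print(f"b = {fit.slope:.2f}, R2 = {fit.rvalue**2:.2f}")
```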
General prevalence of inconsistencies
Across all journals and years, 49.6 % of the articles with NHST results contained at least one inconsistency (8,273 of the 16,695 articles), and 12.9 % (2,150) of the articles with NHST results contained at least one gross inconsistency. Furthermore, overall, 9.7 % (24,961) of the p-values were inconsistent and 1.4 % (3,581) of the p-values were grossly inconsistent. We also calculated the percentage of inconsistent p-values per article and averaged these percentages over all articles. We call this the "(gross) inconsistency rate." Across journals, the inconsistency rate was 10.6 % and the gross inconsistency rate was 1.6 %.
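To make these three measures concrete, here is a minimal Python sketch with made-up inconsistency flags (not the study's data or statcheck's actual code) showing how the overall percentage of inconsistent p-values, the percentage of articles with at least one inconsistency, and the inconsistency rate are computed and why they can diverge:

```python
# Three prevalence measures over hypothetical articles. Each inner list holds
# one article's p-values, flagged True if the p-value is inconsistent.
from statistics import mean

articles = [
    [False, True, False, False],  # 1 of 4 p-values inconsistent
    [False, False],               # no inconsistencies
    [True, True, False],          # 2 of 3 p-values inconsistent
]

all_flags = [flag for article in articles for flag in article]
pct_pvalues_inconsistent = 100 * sum(all_flags) / len(all_flags)
pct_articles_with_inconsistency = 100 * mean(any(a) for a in articles)
inconsistency_rate = mean(100 * sum(a) / len(a) for a in articles)

print(pct_pvalues_inconsistent)         # 33.3: share of all p-values
print(pct_articles_with_inconsistency)  # 66.7: share of articles affected
print(inconsistency_rate)               # 30.6: per-article share, averaged
```

Note that the inconsistency rate weights every article equally, whereas the overall p-value percentage weights articles with many p-values more heavily.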
Prevalence of inconsistencies per journal
We calculated the prevalence of inconsistencies per journal at two levels. First, we calculated the percentage of articles with NHST results per journal that contained at least one (gross) inconsistency. Second, we calculated the inconsistency rate per journal. The top panel of Fig. 3 shows the average percentage of articles with at least one (gross) inconsistency, per journal. The journals are ordered from the highest to the lowest percentage of articles with an inconsistency. JPSP showed the highest prevalence of articles with at least one inconsistency (57.6 %), followed by JEPG (54.8 %). The journals with the lowest percentages of articles with an inconsistency were PS and JAP (39.7 % and 33.6 %, respectively). JPSP also had the highest percentage of articles with at least one gross inconsistency (15.8 %), this time followed by DP (15.2 %). PS had the lowest percentage of articles with gross inconsistencies (6.5 %).
The inconsistency rate shows a different pattern than the percentage of articles with at least one inconsistency. PLoS showed the highest percentage of inconsistent p-values per article overall, followed by FP (14.0 % and 12.8 %, respectively). Furthermore, whereas JPSP was the journal with the highest percentage of articles with inconsistencies, it had one of the lowest probabilities that any given p-value in an article was inconsistent (9.0 %). This discrepancy is caused by a difference between journals in the number of p-values per article: articles in JPSP contain many p-values (see Table 1, right column). Hence, notwithstanding the low probability that a single p-value in an article was inconsistent, the probability that an article contained at least one inconsistent p-value was relatively high. The gross inconsistency rate was quite similar across journals, except in JAP, in which it was relatively high (2.5 %).
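Assuming, purely for illustration, that each p-value in an article has the same fixed and independent probability q of being inconsistent, the probability that an article with n p-values contains at least one inconsistency is 1 − (1 − q)^n, which grows quickly with n. A small sketch with made-up numbers:

```python
# Why a journal with a low per-p-value inconsistency probability can still
# have a high per-article prevalence. Independence across p-values is an
# assumption made here for illustration only.
q = 0.09  # per-p-value inconsistency probability (e.g., JPSP's 9.0 %)
for n in (5, 10, 30):
    print(n, round(1 - (1 - q) ** n, 2))
# n = 5  -> 0.38
# n = 10 -> 0.61
# n = 30 -> 0.94, even though q itself is low
```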
Prevalence of inconsistencies over the years
If gross inconsistencies are indicative of QRPs and QRPs have increased over the years, we would expect an increase in gross inconsistencies over the years (see also Leggett et al., 2013). To study this, we inspected the gross inconsistency rate in journals over time. The results are shown in Fig. 4.
The number of (gross) inconsistencies has decreased or remained stable over the years across the APA journals. In DP, JCCP, JEPG, and JPSP, the percentage of all inconsistencies in an article has decreased over the years. For JAP there is a positive (but very small) regression coefficient for year, indicating an increasing error rate, but the R2 is close to zero. The same pattern held for the prevalence of gross inconsistencies over the years: DP, JCCP, and JPSP have shown a decrease in gross inconsistencies, whereas in JEPG and JAP the R2 is very small and the prevalence seems to have remained practically stable. The trends for PS, FP, and PLoS are hard to interpret, given the limited number of years of coverage. Overall, it seems that, contrary to the evidence suggesting that the use of QRPs could be on the rise (Fanelli, 2012; Leggett et al., 2013), neither the inconsistencies nor the gross inconsistencies have shown an increase over time. If anything, the current results reflect a decrease in reporting error prevalence over the years.
We also looked at the development of inconsistencies at the article level. More specifically, we looked at the percentage of articles with at least one inconsistency over the years, averaged over all APA journals (DP, JCCP, JEPG, JPSP, and JAP; dark gray panel in Fig. 5) and split up per journal (light gray panels for the APA journals and white panels for the non-APA journals in Fig. 5). The results show that the percentage of articles with NHST results that contain at least one inconsistency has increased in JEPG and JPSP, which is again associated with the increase in the number of NHST results per article in these journals (see Fig. 2). In DP and JCCP, there was a decrease in articles with an inconsistency. For JAP there is no clear trend; the R2 is close to zero. A more general trend is evident in the prevalence of articles with gross inconsistencies: in all journals except PS and PLoS, the percentage of articles with NHST results that contain at least one gross inconsistency has been decreasing. Note that the trends for PS, FP, and PLoS are unstable due to the limited number of years we have data for. Overall, it seems that, even though the prevalence of articles with inconsistencies has increased in some journals, the prevalence of articles with gross inconsistencies has declined over the studied period.
Prevalence of gross inconsistencies in results reported as significant and nonsignificant
We inspected the gross inconsistencies in more detail by comparing the percentage of gross inconsistencies in p-values reported as significant with that in p-values reported as nonsignificant. Of all p-values reported as significant, 1.56 % were grossly inconsistent, whereas only .97 % of all p-values reported as nonsignificant were grossly inconsistent, indicating that a p-value reported as significant is more likely to be grossly inconsistent than a p-value reported as nonsignificant. We also inspected the prevalence of gross inconsistencies in significant and nonsignificant p-values per journal (see Fig. 6). In all journals, the prevalence of gross inconsistencies was higher in significant p-values than in nonsignificant p-values (except for FP, in which the prevalence was equal in the two types of p-values). This difference in prevalence was highest in JCCP (1.03 percentage points), JAP (.97 percentage points), and JPSP (.83 percentage points), followed by JEPG (.51 percentage points) and DP (.26 percentage points), and smallest in PLoS (.19 percentage points) and FP (.00 percentage points).
It is hard to interpret the percentages of gross inconsistencies in significant and nonsignificant p-values substantively, since they depend on several factors, such as the specific p-value: a p-value of .06 seems more likely to be misreported as smaller than .05 than a p-value of .78 is. Because journals may differ in the distribution of specific p-values, we should also be careful in comparing gross inconsistencies in p-values reported as significant across journals. Furthermore, without the raw data it is impossible to determine whether it is the p-value that is erroneous, or the test statistic or degrees of freedom. As an example of the latter case, a simple typographical error such as "F(2,56) = 1.203, p < .001" instead of "F(2,56) = 12.03, p < .001" produces a gross inconsistency, without the p-value being incorrect. Although we cannot interpret the absolute percentages and their differences, the finding that gross inconsistencies are more likely in p-values presented as significant than in p-values presented as nonsignificant could indicate a systematic bias and is reason for concern.
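To illustrate the kind of check that produces these counts, the sketch below recomputes the p-value for the F-test example above and flags a gross inconsistency when the recomputed p-value contradicts the reported significance decision at α = .05. This is a simplified Python illustration, not statcheck's actual implementation, and the parsing of the reported p-value is deliberately crude.

```python
# Recompute the p-value from the reported test statistic and degrees of
# freedom, then check it against the reported significance decision.
from scipy.stats import f

def is_gross_inconsistency(f_value, df1, df2, reported="p < .001"):
    """Return (flag, recomputed p): flag is True when the recomputed
    p-value contradicts the reported decision at alpha = .05."""
    computed_p = f.sf(f_value, df1, df2)    # upper-tail p of the F test
    reported_significant = "<" in reported  # crude parse for this example
    return (computed_p >= .05) == reported_significant, computed_p

# The typo F = 1.203 (instead of 12.03) makes the reported "p < .001"
# grossly inconsistent, even though the underlying p-value was correct.
print(is_gross_inconsistency(1.203, 2, 56))  # (True, ~0.31): flagged
print(is_gross_inconsistency(12.03, 2, 56))  # (False, ~5e-05): consistent
```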
Figure 7 shows the prevalence of gross inconsistencies in significant (solid line) and nonsignificant (dotted line) p-values over time, averaged over all journals. The size of the circles represents the total number of significant (open circles) and nonsignificant (solid circles) p-values in that particular year. Note that we only have information for PS, FP, and PLoS since 2003, 2010, and 2004, respectively. The prevalence of gross inconsistencies in significant p-values seems to decline slightly over the years (b = −.04, R2 = .65), whereas the prevalence of gross inconsistencies in nonsignificant p-values does not show any change (b = .00, R2 = .00). In short, the potential systematic bias leading to more gross inconsistencies in significant results seems to be present in all journals except FP, but there is no evidence that this bias is increasing over the years.
To investigate the consequence of these gross inconsistencies, we compared the percentage of significant results in the reported p-values with the percentage of significant results in the computed p-values. Averaged over all journals and years, 76.6 % of all reported p-values were significant. However, only 74.4 % of all computed p-values were significant, which means that the percentage of significant findings in the investigated literature is overestimated by 2.2 percentage points due to gross inconsistencies.
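The arithmetic behind this estimate is a simple difference between two proportions; the toy sketch below (with hypothetical significance flags standing in for the study's data) shows the computation:

```python
# Compare the share of p-values reported as significant with the share that
# is significant after recomputation. Flags are made-up stand-ins.
reported_significant = [True, True, True, False]   # 75.0 % reported significant
computed_significant = [True, True, False, False]  # 50.0 % computed significant

def pct(flags):
    return 100 * sum(flags) / len(flags)

overestimate = pct(reported_significant) - pct(computed_significant)
print(f"overestimate: {overestimate:.1f} percentage points")  # 25.0 here; 2.2 in the study
```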
Prevalence of inconsistencies as found by other studies
Our study can be considered a large replication of several previous studies (Bakker & Wicherts, 2011; Bakker & Wicherts, 2014; Berle & Starcevic, 2007; Caperos & Pardo, 2013; Garcia-Berthou & Alcaraz, 2004; Veldkamp et al., 2014; Wicherts et al., 2011). Table 2 shows the prevalence of inconsistent p-values as determined by our study and previous studies.
Table 2 Prevalence of inconsistencies in the current study and in earlier studies
Table 2 shows that the estimated percentage of inconsistent results varies considerably between studies, ranging from 4.3 % of the results (Wicherts et al., 2011) to 14.3 % of the results (Berle & Starcevic, 2007). The median rate of inconsistent results is 11.1 % (1.4 percentage points higher than the 9.7 % in the current study). The percentage of gross inconsistencies ranged from .4 % (Garcia-Berthou & Alcaraz, 2004) to 2.3 % (Caperos & Pardo, 2013), with a median of 1.1 % (.3 percentage points lower than the 1.4 % found in the current study). The percentage of articles with at least one inconsistency ranged from as low as 10.1 % (Berle & Starcevic, 2007) to as high as 63.0 % (Veldkamp et al., 2014), with a median of 46.7 % (2.9 percentage points lower than the estimated 49.6 % in the current study). Finally, the percentage of articles with at least one gross inconsistency ranged from 2.6 % (Berle & Starcevic, 2007) to 20.5 % (Veldkamp et al., 2014), with a median of 14.3 % (1.4 percentage points higher than the 12.9 % found in the current study).
Some of the differences in prevalences could be caused by differences in inclusion criteria. For instance, Bakker and Wicherts (2011) included only t, F, and χ² values; Wicherts et al. (2011) included only t, F, and χ² values of which the reported p-value was smaller than .05; Berle and Starcevic (2007) included only exactly reported p-values; and Bakker and Wicherts (2014) included only completely reported t and F values. Furthermore, two studies evaluated p-values in the medical field (Garcia-Berthou & Alcaraz, 2004) and in psychiatry (Berle & Starcevic, 2007) instead of in psychology. Lastly, there can be differences in which p-values are counted as inconsistent. For instance, the current study counts p = .000 as incorrect, whereas this was not the case in, for example, Wicherts et al. (2011; see also Appendix A).
Based on Table 2 we conclude that our study corroborates earlier findings. The prevalence of reporting inconsistencies is high: almost all studies find that roughly one in ten results is erroneously reported. Even though the percentage of results that is grossly inconsistent is lower, the studies show that a substantial percentage of published articles contain at least one gross inconsistency, which is reason for concern.