The prevalence of statistical reporting errors in psychology (1985–2013)
Abstract
This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.
Keywords
Reporting errors, p-values, Significance, False positives, NHST, Questionable research practices, Publication bias
Most conclusions in psychology are based on the results of null hypothesis significance testing (NHST; Cumming et al., 2007; Hubbard & Ryan, 2000; Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995). Therefore, it is important that NHST is performed correctly and that NHST results are reported accurately. However, there is evidence that many reported p-values do not match their accompanying test statistic and degrees of freedom (Bakker & Wicherts, 2011; Bakker & Wicherts, 2014; Berle & Starcevic, 2007; Caperos & Pardo, 2013; Garcia-Berthou & Alcaraz, 2004; Veldkamp, Nuijten, Dominguez-Alvarez, Van Assen, & Wicherts, 2014; Wicherts, Bakker, & Molenaar, 2011). These studies highlighted that roughly half of all published empirical psychology articles using NHST contained at least one inconsistent p-value and that around one in seven articles contained a gross inconsistency, in which the reported p-value was significant and the computed p-value was not, or vice versa.
This alarmingly high error rate can have large consequences. Reporting inconsistencies could affect whether an effect is perceived to be significant or not, which can influence substantive conclusions. If a result is inconsistent it is often impossible (in the absence of raw data) to determine whether the test statistic, the degrees of freedom, or the p-value was incorrectly reported. If the test statistic is incorrect and it is used to calculate the effect size for a meta-analysis, this effect size will be incorrect as well, which could affect the outcome of the meta-analysis (Bakker & Wicherts, 2011; in fact, the misreporting of all kinds of statistics is a problem for meta-analyses; Gotzsche, Hrobjartsson, Maric, & Tendal, 2007; Levine & Hullett, 2002). Incorrect p-values could affect the outcome of tests that analyze the distribution of p-values, such as the p-curve (Simonsohn, Nelson, & Simmons, 2014) and p-uniform (Van Assen, Van Aert, & Wicherts, 2014). Moreover, Wicherts et al. (2011) reported that a higher prevalence of reporting errors was associated with a failure to share data upon request.
Even though reporting inconsistencies can be honest mistakes, they have also been categorized as one of several fairly common questionable research practices (QRPs) in psychology (John, Loewenstein, & Prelec, 2012). Interestingly, psychologists’ responses to John et al.’s survey fitted a Guttman scale reasonably well. This suggests that a psychologist’s admission to a QRP that is less often admitted to by others usually implies his or her admission to QRPs with a higher admission rate in the entire sample. Given that rounding down p-values close to .05 was one of the QRPs with relatively low admission rates, the frequency of misreported p-values could provide information on the frequency of the use of more common QRPs. The results of John et al. would therefore imply that a high prevalence of reporting errors (or more specifically, incorrect rounding down of p-values to be below .05) can be seen as an indicator of the use of other QRPs, such as the failure to report all dependent variables, collecting more data after seeing whether results are significant, failing to report all conditions, and stopping data collection after achieving the desired result. Contrary to many other QRPs in John et al.’s list, misreported p-values that bear on significance can be readily detected on the basis of the articles’ text.
Previous research found a decrease in negative results (Fanelli, 2012) and an increase in reporting inconsistencies (Leggett, Thomas, Loetscher, & Nicholls, 2013), suggesting that QRPs are on the rise. On the other hand, it has been found that the number of published corrections to the literature did not change over time, suggesting no change in QRPs over time (Fanelli, 2013, 2014). Studying the prevalence of misreported p-values over time could shed light on possible changes in the prevalence of QRPs.
Besides possible changes in QRPs over time, some evidence suggests that the prevalence of QRPs may differ between subfields of psychology. Leggett et al. (2013) recently studied reporting errors in two main psychology journals in 1965 and 2005. They found that the increase in reporting inconsistencies over the years was higher in the Journal of Personality and Social Psychology (JPSP), the flagship journal of social psychology, than in the Journal of Experimental Psychology: General (JEPG). This is in line with the finding of John et al. (2012) that social psychologists admit to more QRPs, find them more applicable to their field, and find them more defensible as compared to other subgroups in psychology (but see also Fiedler & Schwarz, 2015, on this issue). However, the number of journals and test results in Leggett et al.’s study was rather limited and so it is worthwhile to consider more data before drawing conclusions with respect to differences in QRPs between subfields in psychology.
The current evidence for reporting inconsistencies is based on relatively small samples of articles and p-values. The goal of our current study was to evaluate reporting errors in a large sample of more than a quarter million p-values retrieved from eight flagship journals covering the major subfields in psychology. Manually checking errors is time-consuming work; therefore, we present and validate an automated procedure in the R package statcheck (Epskamp & Nuijten, 2015). The validation of statcheck is described in Appendix A.
We used statcheck to investigate the overall prevalence of reporting inconsistencies and compare our findings to findings in previous studies. Furthermore, we investigated whether there has been an increase in inconsistencies over the period 1985 to 2013, and, on a related note, whether there has been any increase in the number of NHST results in general and per paper. We also documented any differences in the prevalence and increase of reporting errors between journals. Specifically, we studied whether articles in social psychology contain more inconsistencies than articles in other subfields of psychology.
Method
“statcheck”

Step 1: First, statcheck converts a PDF or HTML file to a plain text file. The conversion from PDF to plain text can sometimes be problematic, because some journal publishers use images of signs such as “<”, “>”, or “=”, instead of the actual character. These images are not converted to the text file. HTML files do not have such problems and typically render accurate plain text files.

Step 2: From the plain text file, statcheck extracts t, F, r, χ^{2}, and Z statistics, with the accompanying degrees of freedom (df) and p-value. Since statcheck is an automated procedure, it can only search for prespecified strings of text. Therefore, we chose to let statcheck search for results that are reported completely and exactly in APA style (American Psychological Association, 2010). A general example would be “test statistic (df1, df2) =/</> …, p =/</> …”. Two more specific examples are: “t(37) = −4.93, p < .001”, “χ^{2}(1, N = 226) = 6.90, p < .01.” statcheck takes different spacing into account, and also reads results that are reported as nonsignificant (ns). On the other hand, it does not read results that deviate from the APA template. For instance, statcheck overlooks cases in which a result includes an effect size estimate in between the test statistic and the p-value (e.g., “F(2, 70) = 4.48, MSE = 6.61, p < .02”) or when two results are combined into one sentence (e.g., “F(1, 15) = 19.9 and 5.16, p < .001 and p < .05, respectively”). These restrictions usually also imply that statcheck will not read results in tables, since these are often incompletely reported (see Appendix A for a more detailed overview of what statcheck can and cannot read).

Step 3: statcheck uses the extracted test statistics and degrees of freedom to recalculate the p-value. By default all tests are assumed to be two-tailed. We compared p-values recalculated by statcheck in R version 3.1.2 and Microsoft Office Excel 2013 and found that the results of both programs were consistent up to the tenth decimal place. This indicates that the underlying algorithms used to approximate the distributions are not specific to the R environment.

Step 4: Finally, statcheck compares the reported and recalculated p-value. Whenever the reported p-value is inconsistent with the recalculated p-value, the result is marked as an inconsistency. If the reported p-value is inconsistent with the recalculated p-value and the inconsistency changes the statistical conclusion (assuming α = .05), the result is marked as a gross inconsistency. To take into account one-sided tests, statcheck scans the whole text of the article for the words “one-tailed,” “one-sided,” or “directional.” If a result is initially marked as inconsistent, but the article mentions one of these words and the result would have been consistent if it were one-sided, then the result is marked as consistent. Note that statcheck does not take into account p-values that are adjusted for multiple testing (e.g., a Bonferroni correction). P-values adjusted for multiple comparisons that are higher than the recalculated p-value can therefore erroneously be marked as inconsistent. However, when we automatically searched our sample of 30,717 articles, we found that only 96 articles reported the string “Bonferroni” (0.3 %) and nine articles reported the string “Huynh-Feldt” or “Huynh Feldt” (0.03 %). We conclude from this that corrections for multiple testing are rarely used and will not significantly distort conclusions in our study.
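The template search of Step 2 and the decision rule of Step 4 can be sketched in Python. This is an illustration only, not statcheck's own code (statcheck is an R package): the pattern `APA_RESULT` covers just t, F, and r results, all names are hypothetical, the classifier handles only results reported as “p < x”, and it takes the recalculated two-tailed p-value from Step 3 as given.

```python
import re

# Hypothetical pattern for a subset of the APA template, e.g.
# "t(37) = -4.93, p < .001" or "F(2, 70) = 4.48, p < .02".
APA_RESULT = re.compile(
    r"\b(?P<stat>[tFr])\s*"                                # test statistic
    r"\(\s*(?P<df1>\d+)\s*(?:,\s*(?P<df2>\d+)\s*)?\)\s*"   # degrees of freedom
    r"[=<>]\s*(?P<value>[-\u2212]?\d*\.?\d+)\s*,\s*"       # reported value
    r"p\s*(?P<p_cmp>[=<>])\s*(?P<p>\d?\.\d+)"              # reported p-value
)

# Words whose presence anywhere in the article triggers the one-tailed check.
ONE_TAILED = re.compile(r"one-?tailed|one-?sided|directional", re.IGNORECASE)


def extract_apa_results(text):
    """Return every result in `text` that matches the template exactly."""
    results = []
    for m in APA_RESULT.finditer(text):
        results.append({
            "stat": m.group("stat"),
            "df1": int(m.group("df1")),
            "df2": int(m.group("df2")) if m.group("df2") else None,
            # Normalize a typographic minus sign before converting.
            "value": float(m.group("value").replace("\u2212", "-")),
            "p_cmp": m.group("p_cmp"),
            "p": float(m.group("p")),
        })
    return results


def classify_less_than(reported_bound, computed_p, article_text, alpha=0.05):
    """Classify a result reported as 'p < reported_bound'.

    `computed_p` is the recalculated two-tailed p-value.
    Returns "consistent", "inconsistent", or "gross inconsistency".
    """
    if computed_p < reported_bound:
        return "consistent"
    # One-tailed check: halve the p-value if the article mentions such a test.
    if ONE_TAILED.search(article_text) and computed_p / 2 < reported_bound:
        return "consistent"
    # Assumption: a bound at or below alpha means "reported as significant".
    reported_significant = reported_bound <= alpha
    computed_significant = computed_p < alpha
    if reported_significant != computed_significant:
        return "gross inconsistency"
    return "inconsistent"
```

Under this sketch, a result with an effect size between the test statistic and the p-value (such as “F(2, 70) = 4.48, MSE = 6.61, p < .02”) is skipped, mirroring the restriction described in Step 2.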
Similar to Bakker and Wicherts (2011), statcheck takes numeric rounding into account. Consider the following example: t(28) = 2.0, p < .05. The recalculated p-value that corresponds to a t-value of 2.0 with 28 degrees of freedom is .055, which appears to be inconsistent with the reported p-value of < .05. However, a reported t-value of 2.0 could correspond to any rounded value between 1.95 and 2.05, with a corresponding range of p-values between .0498 and .0613, which means that the reported p < .05 is not considered inconsistent.
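The rounding example can be reproduced with a short Python sketch. It is not statcheck's implementation: the function names are hypothetical, and the p-value is obtained by numerically integrating the Student t density with Simpson's rule, a stand-in chosen here to keep the example free of external libraries.

```python
import math


def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value for a t statistic.

    Integrates the Student t density from 0 to |t| with Simpson's rule
    (`steps` must be even); adequate for moderate |t|, not for extreme tails.
    """
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def density(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    central_area = total * h / 3  # P(0 < T < |t|)
    return 1.0 - 2.0 * central_area


def p_range_for_rounded_t(t_text, df):
    """p-value range implied by a t statistic reported with limited decimals."""
    decimals = len(t_text.split(".")[1]) if "." in t_text else 0
    half_unit = 0.5 * 10 ** (-decimals)
    t = abs(float(t_text))
    lowest_t, highest_t = max(t - half_unit, 0.0), t + half_unit
    # A larger |t| gives a smaller p-value, so the bounds swap.
    return t_two_tailed_p(highest_t, df), t_two_tailed_p(lowest_t, df)


p_low, p_high = p_range_for_rounded_t("2.0", 28)
# "p < .05" is not flagged as long as some value in the range lies below .05.
reported_is_consistent = p_low < 0.05
```

Passing the statistic as text rather than as a number preserves the reported number of decimals, which determines the width of the rounding interval.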
Furthermore, statcheck considers p-values reported as p = .05 as significant. We inspected 10 % of the 2,473 instances in our sample in which a result was reported as “p = .05” and inspected whether these p-values were interpreted as significant. In the cases where multiple p-values from the same article were selected, we only included the p-value that was drawn first to avoid dependencies in the data. Our final sample consisted of 236 instances where “p = .05” was reported and of these p-values 94.3 % were interpreted as being significant. We therefore decided to count p-values reported as “p = .05” as indicating that the authors presented the result as significant.
The main advantage of statcheck is that it enables searching for reporting errors in very large samples, which would be infeasible by hand. Furthermore, manual checking is subject to human error, which statcheck eliminates. The disadvantage of statcheck is that it is not as comprehensive as a manual procedure, because it will miss results that deviate from standard reporting and results in tables, and it does not take into account adjustments on p-values. Consequently, statcheck will miss some reported results and will incorrectly earmark some correct p-values as a reporting error. Even though it is not feasible to create an automated procedure that is as accurate as a manual search in verifying correctness of the results, it is important to exclude the possibility that statcheck yields a biased depiction of the true inconsistency rate. To avoid bias in the prevalence of reporting errors, we performed a validity study of statcheck, in which we compared statcheck’s results with the results of Wicherts, Bakker, and Molenaar (2011), who performed a manual search for and verification of reporting errors in a sample of 49 articles.
The validity study showed that statcheck read 67.5 % of the results that were manually extracted. Most of the results that statcheck missed were either reported with an effect size between the test statistic and the p-value (e.g., F(2, 70) = 4.48, MSE = 6.61, p < .02; 201 instances in total) or reported in a table (150 instances in total). Furthermore, Wicherts et al. found that 49 of 1,148 p-values were inconsistent (4.3 %) and ten of 1,148 p-values were grossly inconsistent (.9 %), whereas statcheck (with automatic one-tailed test detection) found that 56 of 775 p-values were inconsistent (7.2 %) and eight of 775 p-values were grossly inconsistent (1.0 %). The higher inconsistency rate found by statcheck was mainly due to our decision to count p = .000 as incorrect (a p-value cannot exactly be zero), whereas this was counted as correct by Wicherts et al. If we do not include these 11 inconsistencies due to p = .000, statcheck finds an inconsistency percentage of 5.8 % (45 of 775 results), 1.5 percentage points higher than in Wicherts et al. This difference was due to the fact that statcheck did not take into account 11 corrections for multiple testing and Wicherts et al. did. The interrater reliability in this scenario between the manual coding in Wicherts et al. and the automatic coding in statcheck was .76 for the inconsistencies and .89 for the gross inconsistencies. Since statcheck slightly overestimated the prevalence of inconsistencies in this sample of papers, we conclude that statcheck can render slightly different inconsistency rates than a search by hand. Therefore, the results of statcheck should be interpreted with care. For details of the validity study and an explanation of all discrepancies between statcheck and Wicherts et al., see Appendix A.
Sample
Table 1  Specifications of the years from which HTML articles were available, the number of downloaded articles per journal, the number of articles with APA-reported null-hypothesis significance testing (NHST) results, the number of APA-reported NHST results, and the median number of APA-reported NHST results per article
Journal  Subfield  Years included  No. of articles  No. of articles with NHST results (%)  No. of NHST results  Median no. of NHST results per article with NHST results
PLOS  General  2000–2013  10,299  2,487 (24.1 %)  31,539  9
JPSP  Social  1985–2013  5,108  4,346 (85.1 %)  101,621  19
JCCP  Clinical  1985–2013  3,519  2,413 (68.6 %)  27,429  8
DP  Developmental  1985–2013  3,379  2,607 (77.2 %)  37,658  11
JAP  Applied  1985–2013  2,782  1,638 (58.9 %)  15,134  6
PS  General  2003–2013  2,307  1,681 (72.9 %)  15,654  8
FP  General  2010–2013  2,139  702 (32.8 %)  10,149  10
JEPG  Experimental  1985–2013  1,184  821 (69.3 %)  18,921  17
Total  –  –  30,717  16,695 (54.4 %)  258,105  11
Statistical analyses
Our population of interest is all APA-reported NHST results in the full text of the articles from the eight selected flagship journals in psychology from 1985 until 2013. Our sample includes this entire population. We therefore made no use of inferential statistics, since inferential statistics are only necessary to draw conclusions about populations from much smaller samples. We restricted ourselves to descriptive statistics; every documented difference or trend entails a difference between or trend in the entire population or subpopulations based on journals. For linear trends we report regression weights and percentages of variance explained to aid interpretation.
Results
We report the prevalence of reporting inconsistencies at different levels. We document general prevalence of NHST results and present percentages of articles that use NHST per journal and over the years. Because only the five APA journals provided HTMLs for all years from 1985 to 2013, the overall trends are reported for APA journals only, and do not include results from Psychological Science, PLOS, and Frontiers, which only cover recent years. Reporting inconsistencies are presented both at the level of article and at the level of the individual p-value, i.e., the percentage of articles with at least one inconsistency and the average percentage of p-values within an article that is inconsistent, respectively. We also describe differences between journals and trends over time.
Percentage of articles with null-hypothesis significance testing (NHST) results
Number of published NHST results over the years
Across all APA journals, the number of NHST results per article has increased over the period of 29 years (b = .25, R^{2} = .68), with the strongest increases in JEPG and JPSP. These journals went from an average of around 10–15 NHST results per article in 1985 to as many as around 30 results per article on average in 2013. The mean number of NHST results per article remained relatively stable in DP, JCCP, and JAP; over the years, the articles with NHST results in these journals contained an average of ten NHST results. It is hard to say anything definite about trends in PS, FP, and PLOS, since we have only a limited number of years for these journals (the earliest years we have information for are 2003, 2010, and 2004, respectively). Both the increase in the percentage of articles that report NHST results and the increased number of NHST results per article show that NHST is increasingly popular in psychology. It is therefore important that the results of these tests are reported correctly.
General prevalence of inconsistencies
Across all journals and years 49.6 % of the articles with NHST results contained at least one inconsistency (8,273 of the 16,695 articles) and 12.9 % (2,150) of the articles with NHST results contained at least one gross inconsistency. Furthermore, overall 9.7 % (24,961) of the p-values were inconsistent, and 1.4 % (3,581) of the p-values were grossly inconsistent. We also calculated the percentage of inconsistencies per article and averaged these percentages over all articles. We call this the “(gross) inconsistency rate.” Across journals, the inconsistency rate was 10.6 % and the gross inconsistency rate was 1.6 %.
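The difference between the pooled percentage of inconsistent p-values and the per-article “inconsistency rate” can be made concrete with a small Python sketch. The three articles below are made-up illustrative data, and the function names are hypothetical.

```python
def pooled_percentage(articles):
    """Percentage of all p-values that are inconsistent, pooled over articles."""
    total_inconsistent = sum(bad for bad, _ in articles)
    total_results = sum(n for _, n in articles)
    return 100.0 * total_inconsistent / total_results


def inconsistency_rate(articles):
    """Mean of the per-article percentages of inconsistent p-values."""
    per_article = [100.0 * bad / n for bad, n in articles]
    return sum(per_article) / len(per_article)


# Made-up data: (inconsistent p-values, total p-values) for three articles.
toy_articles = [(1, 2), (0, 10), (2, 38)]
pooled = pooled_percentage(toy_articles)  # weights every p-value equally
rate = inconsistency_rate(toy_articles)   # weights every article equally
```

Because the pooled figure weights every p-value equally while the rate weights every article equally, articles with few results pull the rate more strongly than they pull the pooled percentage.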
Prevalence of inconsistencies per journal
The inconsistency rate shows a different pattern than the percentage of articles with at least one inconsistency. PLOS showed the highest percentage of inconsistent p-values per article overall, followed by FP (14.0 % and 12.8 %, respectively). Furthermore, whereas JPSP was the journal with the highest percentage of articles with inconsistencies, it had one of the lowest probabilities that a p-value in an article was inconsistent (9.0 %). This discrepancy is caused by a difference between journals in the number of p-values per article: the articles in JPSP contain many p-values (see Table 1, right column). Hence, notwithstanding a low probability of a single p-value in an article being inconsistent, the probability that an article contained at least one inconsistent p-value was relatively high. The gross inconsistency rate was quite similar over all journals except JAP, in which the gross inconsistency rate was relatively high (2.5 %).
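Why many p-values per article raise the article-level prevalence can be illustrated with a simple calculation. The sketch assumes, purely for illustration, that every p-value in an article is inconsistent independently and with the same probability; real articles need not behave this way, so the outputs are not estimates of the observed prevalences. The function name is hypothetical.

```python
def prob_at_least_one(p_single, n_results):
    """P(at least one inconsistency) among n independent results."""
    return 1.0 - (1.0 - p_single) ** n_results


# 9.0 % per p-value, with the median number of results per article from
# Table 1: 19 for JPSP versus, for comparison, 6 for JAP.
many_results = prob_at_least_one(0.09, 19)
few_results = prob_at_least_one(0.09, 6)
```

With the same per-result probability, the article with 19 results is far more likely to contain at least one inconsistency than the article with 6 results.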
Prevalence of inconsistencies over the years
The number of (gross) inconsistencies has decreased or remained stable over the years across the APA journals. In DP, JCCP, JEPG, and JPSP the percentage of all inconsistencies in an article has decreased over the years. For JAP there is a positive (but very small) regression coefficient for year, indicating an increasing error rate, but the R^{2} is close to zero. The same pattern held for the prevalence of gross inconsistencies over the years. DP, JCCP, and JPSP have shown a decrease in gross inconsistencies; in JEPG and JAP the R^{2} is very small, and the prevalence seems to have remained practically stable. The trends for PS, FP, and PLOS are hard to interpret given the limited number of years of coverage. Overall, it seems that, contrary to the evidence suggesting that the use of QRPs could be on the rise (Fanelli, 2012; Leggett et al., 2013), neither the inconsistencies nor the gross inconsistencies have shown an increase over time. If anything, the current results reflect a decrease in the prevalence of reporting errors over the years.
Prevalence of gross inconsistencies in results reported as significant and nonsignificant
It is hard to interpret the percentages of inconsistencies in significant and nonsignificant p-values substantively, since they depend on several factors, such as the specific p-value: it seems more likely that a p-value of .06 is reported as smaller than .05 than a p-value of .78. That is, because journals may differ in the distribution of specific p-values we should also be careful in comparing gross inconsistencies in p-values reported as significant across journals. Furthermore, without the raw data it is impossible to determine whether it is the p-value that is erroneous, or the test statistic or degrees of freedom. As an example of the latter case, a simple typographical error such as “F(2,56) = 1.203, p < .001” instead of “F(2,56) = 12.03, p < .001” produces a gross inconsistency, without the p-value being incorrect. Although we cannot interpret the absolute percentages and their differences, the finding that gross inconsistencies are more likely in p-values presented as significant than in p-values presented as nonsignificant could indicate a systematic bias and is reason for concern.
To investigate the consequence of these gross inconsistencies, we compared the percentage of significant results in the reported p-values with the percentage of significant results in the computed p-values. Averaged over all journals and years, 76.6 % of all reported p-values were significant. However, only 74.4 % of all computed p-values were significant, which means that the percentage of significant findings in the investigated literature is overestimated by 2.2 percentage points due to gross inconsistencies.
Prevalence of inconsistencies as found by other studies
Table 2  Prevalence of inconsistencies in the current study and in earlier studies
Study  Field  No. of articles  No. of results  Inconsistencies (%)  Gross inconsistencies (%)  Articles with at least one inconsistency (%)  Articles with at least one gross inconsistency (%)
Current study  Psychology  30,717  258,105  9.7  1.4  49.6^{2}  12.9^{2}
Garcia-Berthou and Alcaraz (2004)  Medical  44  244^{4}  11.5  0.4  31.5  –
Berle and Starcevic (2007)  Psychiatry  345  5,464  14.3  –  10.1  2.6
Wicherts et al. (2011)  Psychology  49  1,148^{1}  4.3  0.9  53.1  14.3
Bakker and Wicherts (2011)  Psychology  333  4,248^{3}  11.9  1.3  45.4  12.4
Caperos and Pardo (2013)  Psychology  186  1,212^{3}  12.2  2.3  48.0^{2}  17.6^{2}
Bakker and Wicherts (2014)  Psychology  153^{5}  2,667  6.7  1.1  45.1  15.0
Veldkamp et al. (2014)  Psychology  697  8,105  10.6  0.8  63.0  20.5
Table 2 shows that the estimated percentage of inconsistent results can vary considerably between studies, ranging from 4.3 % of the results (Wicherts et al., 2011) to 14.3 % of the results (Berle & Starcevic, 2007). The median rate of inconsistent results is 11.1 % (1.4 percentage points higher than the 9.7 % in the current study). The percentage of gross inconsistencies ranged from .4 % (Garcia-Berthou & Alcaraz, 2004) to 2.3 % (Caperos & Pardo, 2013), with a median of 1.1 % (.3 percentage points lower than the 1.4 % found in the current study). The percentage of articles with at least one inconsistency ranged from as low as 10.1 % (Berle & Starcevic, 2007) to as high as 63.0 % (Veldkamp et al., 2014), with a median of 46.7 % (2.9 percentage points lower than the estimated 49.6 % in the current study). Finally, the lowest percentage of articles with at least one gross inconsistency is 2.6 % (Berle & Starcevic, 2007) and the highest is 20.5 % (Veldkamp et al., 2014), with a median of 14.3 % (1.4 percentage points higher than the 12.9 % found in the current study).
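The medians quoted above can be recomputed from the percentages in Table 2 with a few lines of Python. One observation from doing so: the quoted values appear to be reproduced when the current study's own row is included alongside the seven earlier studies, which is how the lists below are set up. The variable and function names are hypothetical.

```python
from statistics import median

# Percentages from Table 2, current study first; None marks an empty cell.
inconsistent_results = [9.7, 11.5, 14.3, 4.3, 11.9, 12.2, 6.7, 10.6]
gross_results = [1.4, 0.4, None, 0.9, 1.3, 2.3, 1.1, 0.8]
articles_any = [49.6, 31.5, 10.1, 53.1, 45.4, 48.0, 45.1, 63.0]
articles_gross = [12.9, None, 2.6, 14.3, 12.4, 17.6, 15.0, 20.5]


def median_of_reported(values):
    """Median over the studies that report a value for this column."""
    return median(v for v in values if v is not None)


medians = {
    "inconsistent results": median_of_reported(inconsistent_results),
    "gross inconsistencies": median_of_reported(gross_results),
    "articles with any inconsistency": median_of_reported(articles_any),
    "articles with a gross inconsistency": median_of_reported(articles_gross),
}
```

For the columns with eight reported values the median is the mean of the two middle values, which is why the first entry falls between 10.6 and 11.5.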
Some of the differences in prevalences could be caused by differences in inclusion criteria. For instance, Bakker and Wicherts (2011) included only t, F, and χ^{2} values; Wicherts et al. (2011) included only t, F, and χ^{2} values of which the reported p-value was smaller than .05; Berle and Starcevic (2007) included only exactly reported p-values; Bakker and Wicherts (2014) only included completely reported t and F values. Furthermore, two studies evaluated p-values in the medical field (Garcia-Berthou & Alcaraz, 2004) and in psychiatry (Berle & Starcevic, 2007) instead of in psychology. Lastly, there can be differences in which p-values are counted as inconsistent. For instance, the current study counts p = .000 as incorrect, whereas this was not the case in, for example, Wicherts et al. (2011; see also Appendix A).
Based on Table 2 we conclude that our study corroborates earlier findings. The prevalence of reporting inconsistencies is high: almost all studies find that roughly one in ten results is erroneously reported. Even though the percentage of results that is grossly inconsistent is lower, the studies show that a substantial percentage of published articles contain at least one gross inconsistency, which is reason for concern.
Discussion
In this paper we investigated the prevalence of reporting errors in eight major journals in psychology using the automated R package statcheck (Epskamp & Nuijten, 2015). Over half of the articles in the eight flagship journals reported NHST results that statcheck was able to retrieve. Notwithstanding the many debates on the downsides of NHST (see e.g., Fidler & Cumming, 2005; Wagenmakers, 2007), the use of NHST in psychology appears to have increased from 1985 to 2013 (see Figs. 1 and 2), although this increase can also reflect an increase in adherence to APA reporting standards. Our findings show that in general the prevalence of reporting inconsistencies in eight flagship psychology journals is substantial. Roughly half of all articles with NHST results contained at least one inconsistency and about 13 % contained a gross inconsistency that may have affected the statistical conclusion. At the level of individual p-values we found that on average 10.6 % of the p-values in an article were inconsistent, whereas 1.6 % of the p-values were grossly inconsistent.
Contrary to what one would expect based on the suggestion that QRPs have been on the rise (Leggett et al., 2013), we found no general increase in the prevalence of inconsistent p-values in the studied journals from 1985 to 2013. When focusing on inconsistencies at the article level, we only found an increase in the percentage of articles with NHST results that showed at least one inconsistency for JEPG and JPSP. Note this was associated with clear increases in the number of reported NHST results per article in these journals. Furthermore, we did not find an increase in gross inconsistencies in any of the journals. If anything, we saw that the prevalence of articles with gross inconsistencies has been decreasing since 1985, albeit only slightly. We also found no increase in the prevalence of gross inconsistencies in p-values that were reported as significant as compared to gross inconsistencies in p-values reported as nonsignificant. This is at odds with the notion that QRPs in general and reporting errors in particular have been increasing in the last decades. On the other hand, the stability or decrease in reporting errors is in line with research showing no trend in the proportion of published errata, which implies that there is also no trend in the proportion of articles with (reporting) errors (Fanelli, 2013).
Furthermore, we found no evidence that inconsistencies are more prevalent in JPSP than in other journals. The (gross) inconsistency rate was not the highest in JPSP. The prevalence of (gross) inconsistencies has been declining in JPSP, as it did in other journals. We did find that JPSP showed a higher prevalence of articles with at least one inconsistency than other journals, but this was associated with the higher number of NHST results per article in JPSP. Hence our findings are not in line with the previous findings that JPSP shows a higher (increase in) inconsistency rate (Leggett et al., 2013). Since statcheck cannot distinguish between p-values pertaining to core hypotheses and p-values pertaining to, for example, manipulation checks, it is hard to interpret the differences in inconsistencies between fields and the implications of these differences. To warrant such a conclusion the inconsistencies would have to be manually analyzed within the context of the papers containing the inconsistencies.
We also found that gross inconsistencies are more prevalent in p-values reported as significant than in p-values reported as nonsignificant. This could suggest a systematic bias favoring significant results, potentially leading to an excess of false positives in the literature. The higher prevalence of gross inconsistencies in significant p-values versus nonsignificant p-values was highest in JCCP, JAP, and JPSP, and lowest in PLOS and FP. Note again that we do not know the hypotheses underlying these p-values. It is possible that in some cases a nonsignificant p-value would be in line with a hypothesis and thus in line with the researcher’s predictions. Our data do not speak to the causes of this overrepresentation of significant results. Perhaps these p-values are intentionally rounded down (a practice that 20 % of the surveyed psychological researchers admitted to; John et al., 2012) to convince the reviewers and other readers of an effect. Or perhaps researchers fail to double-check significantly reported p-values, because they are in line with their expectations, hence leaving such reporting errors more likely to remain undetected. It is also possible that the cause of the overrepresentation of falsely significant results lies with publication bias: perhaps researchers report significant p-values as nonsignificant just as often as vice versa, but in the process of publication, only the (accidentally) significant p-values get published.
There are two main limitations in our study. Firstly, by using the automated procedure statcheck to detect reporting inconsistencies, our sample did not include NHST results that were not reported exactly according to APA format or results reported in tables. However, based on the validity study and on earlier results (Bakker & Wicherts, 2011), we conclude that there does not seem to be a difference in the prevalence of reporting inconsistencies between results in APA format and results that are not exactly in APA format (see Appendix A). The validity study did suggest, however, that statcheck might slightly overestimate the number of inconsistencies. One reason could be that statcheck cannot correctly evaluate p-values that were adjusted for multiple testing. However, we found that these adjustments are rarely used. Notably, the term “Bonferroni” was mentioned in a meager 0.3 % of the 30,717 papers. This finding is interesting in itself; with a median number of 11 NHST results per paper, most papers report multiple p-values. Without any correction for multiple testing, this suggests that overall Type I error rates in the eight psychology journals are already higher than the nominal level of .05. Nevertheless, the effect of adjustments of p-values on the error estimates from statcheck is expected to be small. We therefore conclude that, as long as the results are interpreted with care, statcheck provides a good method to analyze vast amounts of literature to locate reporting inconsistencies. Future developments of statcheck could focus on taking into account corrections for multiple testing and results reported in tables or with effect sizes reported between the test statistic and p-value.
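The remark about Type I error rates can be illustrated with a textbook calculation. The sketch assumes, for illustration only, that all 11 tests in a paper are independent and that every null hypothesis is true; neither assumption is claimed for the articles in the sample, and the function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one false positive) across independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


# Median of 11 NHST results per paper, each tested at the nominal .05 level.
uncorrected = familywise_error_rate(0.05, 11)
per_test_alpha = bonferroni_threshold(0.05, 11)
corrected = familywise_error_rate(per_test_alpha, 11)
```

Under these assumptions the uncorrected familywise rate is well above .05, whereas testing each result at the Bonferroni threshold keeps it at or below the nominal level.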
The second limitation of our study is that we chose to limit our sample to only a selection of flagship journals from several subdisciplines of psychology. It is possible that the prevalence of inconsistencies in these journals is not representative of the psychological literature. For instance, it has been found that journals with lower impact factors have a higher prevalence of reporting inconsistencies than high-impact journals (Bakker & Wicherts, 2011). In this study we avoid conclusions about psychology in general, but treat the APA-reported NHST results in the full text of the articles from the journals we selected as the population of interest (which made statistical inference superfluous). All conclusions in this paper therefore hold for the APA-reported NHST results in the eight selected journals. Nevertheless, the relatively high impact factors of these journals attest to the relevance of the current study.
There are several possible solutions to the problem of reporting inconsistencies. Firstly, researchers can check their own papers before submitting, either by hand or with the R package statcheck. Editors and reviewers could also make use of statcheck to quickly flag possible reporting inconsistencies in a submission, after which the flagged results can be checked by hand. This should reduce erroneous conclusions caused by gross inconsistencies. Checking articles with statcheck can also prevent such inconsistencies from distorting meta-analyses or analyses of p-value distributions (Simonsohn et al., 2014; Van Assen et al., 2014). This solution would be in line with the notion of Analytic Review (Sakaluk, Williams, & Biernat, 2014), in which a reviewer receives the data file and syntax of a manuscript to check if the reported analyses were actually conducted and reported correctly. One of the main concerns about Analytic Review is that it would take reviewers a lot of additional work. The use of statcheck in Analytic Review could reduce this workload substantially.
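As a rough illustration of the kind of check described above, the following Python sketch recomputes a two-tailed p-value from a reported t statistic and its degrees of freedom, and then flags a mismatch with the reported p-value. It is not statcheck's implementation (statcheck is an R package). The function names, the numerical integration, and the simple rounding rule are assumptions made for this example. The sketch handles only exactly reported p-values ("p = ...") and ignores one-tailed tests, rounding of the test statistic, and corrections for multiple testing.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value, using Simpson's rule on [0, |t|] (steps must be even)."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def check_result(t, df, reported_p, decimals=2, alpha=0.05):
    """Compare a reported p-value with the recomputed one.

    inconsistent: the recomputed p-value, rounded to the reported number
                  of decimals, differs from the reported p-value.
    gross:        the mismatch also changes the conclusion at level alpha.
    """
    computed = two_tailed_p(t, df)
    inconsistent = round(computed, decimals) != round(reported_p, decimals)
    gross = inconsistent and ((reported_p < alpha) != (computed < alpha))
    return computed, inconsistent, gross


# Hypothetical reported results for the same test, t(14) = 2.0.
for reported in (0.07, 0.09, 0.04):
    computed, inconsistent, gross = check_result(2.0, 14, reported)
    print(f"t(14) = 2.0, reported p = {reported:.2f}: "
          f"recomputed p = {computed:.4f}, "
          f"inconsistent = {inconsistent}, gross = {gross}")
```

Under these assumptions, a reported p = .04 for t(14) = 2.0 would be flagged as a gross inconsistency, because the recomputed two-tailed p-value is about .065 and therefore not significant at the .05 level. A reported p = .09 would be flagged as inconsistent but not gross.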
Secondly, the prevalence of inconsistencies might decrease if co-authors check each other’s work, a so-called “co-pilot model” (Wicherts, 2011). In recent research (Veldkamp et al., 2014) this idea has been investigated by relating the probability that a p-value was inconsistent to six different co-piloting activities (e.g., multiple authors conducting the statistical analyses). Veldkamp et al. did not find direct evidence for a relation between co-piloting and reduced prevalence of reporting errors. However, the investigated co-pilot activities did not explicitly include the actual checking of each other’s p-values, hence we do not rule out the possibility that reporting errors would decrease if co-authors double checked p-values.
Thirdly, it has been found that reporting errors are related to reluctance to share data (Wicherts et al., 2011). Although no causal relation can be established, a solution might be to require open data by default, allowing exceptions only when explicit reasons are available for not sharing. Researchers would then know that their data could be checked and may feel inclined to double check the Results section before publishing the paper. Besides a possible reduction in reporting errors, sharing data has many other advantages. Sharing data, for instance, facilitates aggregating data for better effect size estimates, enables reanalysis of published articles, and increases the credibility of scientific findings (see also Nosek, Spies, & Motyl, 2012; Sakaluk et al., 2014; Wicherts, 2013; Wicherts & Bakker, 2012). The APA already requires data to be available for verification purposes (American Psychological Association, 2010, p. 240), many journals explicitly encourage data sharing in their policies, and the journal Psychological Science has started to award badges to papers whose data are publicly available. Despite these policies and encouragements, raw data are still rarely available (Alsheikh-Ali, Qureshi, Al-Mallah, & Ioannidis, 2011). One objection that has been raised is that due to privacy concerns data cannot be made publicly available (see e.g., Finkel, Eastwick, & Reis, 2015). Even though this can be a legitimate concern for some studies with particularly sensitive data, these are exceptions; the data of most psychology studies could be published without risks (Nosek et al., 2012).
To find a successful solution to the substantial prevalence of reporting errors, more research is needed on how reporting errors arise. It is important to know whether reporting inconsistencies are mere sloppiness or whether they are intentional. We found that the large majority of inconsistencies were not gross inconsistencies around p = .05, but inconsistencies that did not directly influence any statistical conclusion. Rounding a p-value of, say, .38 down to .37 does not seem to be in the direct interest of the researcher, suggesting that the majority of inconsistencies are accidental. On the other hand, we did find that the large majority of grossly inconsistent p-values were nonsignificant p-values that were presented as significant, instead of vice versa. This seems to indicate a systematic bias that causes an overrepresentation of significant results in the literature. Whatever the cause of this overrepresentation might be, there seems to be too much focus on getting “perfect,” significant results (see also Giner-Sorolla, 2012). Considering that the ubiquitous significance level of .05 is arbitrary, and that there is a vast amount of critique on NHST in general (see e.g., Cohen, 1994; Fidler & Cumming, 2005; Krueger, 2001; Rozeboom, 1960; Wagenmakers, 2007), it should be clear that it is more important that p-values are accurately reported than that they are below .05.
There are many more interesting aspects of the collected 258,105 p-values that could be investigated, but this is beyond the scope of this paper. In another paper, the nonsignificant test results from this dataset are investigated for false negatives (Hartgerink, van Assen, & Wicherts, 2015). That paper uses a method to detect false negatives, and its results indicate that two out of three papers with nonsignificant test results might contain false-negative results. This is only one of the many possibilities, and we publicly share the anonymized data on our Open Science Framework page (https://osf.io/gdr4q/) to encourage further research.
Our study illustrates that science is done by humans, and humans easily make mistakes. However, the prevalence of inconsistent p-values in eight major journals in psychology has generally been stable over the years, or even declining. Hopefully, statcheck can contribute to further reducing the prevalence of reporting inconsistencies in psychology.
Footnotes
 1. We note there is a minor difference in the number of search results from the webpage and the package due to default specifications in the rplos package. See also https://github.com/ropensci/rplos/issues/75
 2. The only one-tailed test that is still counted by statcheck as inconsistent is a result that is reported as one-tailed and has a rounded test statistic: t(14) = 2.0, p < .03. The correct rounding of test statistics is not incorporated in the automatic one-tailed test detection, but this will be incorporated in the next version. For now, this should not bias the results much, since such cases are rare.
 3. Note that the APA advises any p-value smaller than .001 to be reported as p < .001. These cases could be considered as exactly reported. Our analysis does not take this into account. Furthermore, statements like “all tests >.05” are not included in our analysis.
Notes
Author note
The preparation of this article was supported by The Innovational Research Incentives Scheme Vidi (no. 45211004) from the Netherlands Organization for Scientific Research.
References
 Alsheikh-Ali, A. A., Qureshi, W., Al-Mallah, M. H., & Ioannidis, J. P. A. (2011). Public availability of published research data in high-impact journals. PLoS One, 6(9), e24357. doi: 10.1371/journal.pone.0024357
 American Psychological Association. (1983). Publication Manual of the American Psychological Association (3rd ed.). Washington, DC: American Psychological Association.
 American Psychological Association. (2010). Publication Manual of the American Psychological Association (6th ed.). Washington, DC: American Psychological Association.
 Bakker, M., & Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology journals. Behavior Research Methods, 43, 666–678. doi: 10.3758/s1342801100895
 Bakker, M., & Wicherts, J. M. (2014). Outlier removal and the relation with reporting errors and quality of research. Manuscript submitted for publication.
 Berle, D., & Starcevic, V. (2007). Inconsistencies between reported test statistics and p-values in two psychiatry journals. International Journal of Methods in Psychiatric Research, 16(4), 202–207. doi: 10.1002/mpr.225
 Caperos, J. M., & Pardo, A. (2013). Consistency errors in p-values reported in Spanish psychology journals. Psicothema, 25(3), 408–414.
 Chamberlain, S., Boettiger, C., & Ram, K. (2014). rplos: Interface to PLoS Journals search API. R package version 0.4.0. http://CRAN.R-project.org/package=rplos
 Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.
 Cumming, G., Fidler, F., Leonard, M., Kalinowski, P., Christiansen, A., Kleinig, A., . . . Wilson, S. (2007). Statistical reform in psychology: Is anything changing? Psychological Science, 18(3), 230–232.
 Epskamp, S., & Nuijten, M. B. (2015). statcheck: Extract statistics from articles and recompute p values. R package version 1.0.1. http://CRAN.R-project.org/package=statcheck
 Fanelli, D. (2012). Negative results are disappearing from most disciplines and countries. Scientometrics, 90(3), 891–904. doi: 10.1007/s1119201104947
 Fanelli, D. (2013). Why Growing Retractions Are (Mostly) a Good Sign. PLoS Medicine, 10(12). doi: 10.1371/journal.pmed.1001563
 Fanelli, D. (2014). Rise in retractions is a signal of integrity. Nature, 509(7498), 33.
 Fidler, F., & Cumming, G. (2005). Teaching confidence intervals: Problems and potential solutions. Proceedings of the 55th International Statistics Institute Session.
 Fiedler, K., & Schwarz, N. (2015). Questionable Research Practices Revisited.
 Finkel, E. J., Eastwick, P. W., & Reis, H. T. (2015). Best research practices in psychology: Illustrating epistemological and pragmatic considerations with the case of relationship science. Journal of Personality and Social Psychology, 108(2), 275–297.
 Garcia-Berthou, E., & Alcaraz, C. (2004). Incongruence between test statistics and P values in medical papers. BMC Medical Research Methodology, 4, 13. doi: 10.1186/14712288413
 Giner-Sorolla, R. (2012). Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science. Perspectives on Psychological Science, 7, 562–571. doi: 10.1177/1745691612457576
 Gotzsche, P. C., Hrobjartsson, A., Maric, K., & Tendal, B. (2007). Data extraction errors in meta-analyses that use standardized mean differences. JAMA-Journal of the American Medical Association, 298(4), 430–437.
 Hartgerink, C. H. J., van Assen, M. A. L. M., & Wicherts, J. M. (2015). Too Good to be False: Non-Significant Results Revisited. Retrieved from osf.io/qpfnw.
 Hubbard, R., & Ryan, P. A. (2000). The historical growth of statistical significance testing in psychology - and its future prospects. Educational and Psychological Measurement, 60, 661–681.
 John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23, 524–532. doi: 10.1177/0956797611430953
 Krueger, J. (2001). Null hypothesis significance testing - On the survival of a flawed method. American Psychologist, 56(1), 16–26.
 Leggett, N. C., Thomas, N. A., Loetscher, T., & Nicholls, M. E. (2013). The life of p: “Just significant” results are on the rise. The Quarterly Journal of Experimental Psychology, 66(12), 2303–2309.
 Levine, T. R., & Hullett, C. R. (2002). Eta squared, partial eta squared, and misreporting of effect size in communication research. Human Communication Research, 28(4), 612–625. doi: 10.1111/j.14682958.2002.tb00828.x
 Nosek, B. A., Spies, J., & Motyl, M. (2012). Scientific Utopia: II - Restructuring Incentives and Practices to Promote Truth Over Publishability. Perspectives on Psychological Science, 7, 615–631. doi: 10.1177/1745691612459058
 R Core Team. (2014). R: A Language and Environment for Statistical Computing. http://www.R-project.org/
 Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416–428.
 Sakaluk, J., Williams, A., & Biernat, M. (2014). Analytic Review as a Solution to the Misreporting of Statistical Results in Psychological Science. Perspectives on Psychological Science, 9(6), 652–660.
 Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534–547.
 Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance - Or vice versa. Journal of the American Statistical Association, 54, 30–34.
 Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited - The effect of the outcome of statistical tests on the decision to publish and vice-versa. American Statistician, 49(1), 108–112.
 Van Assen, M. A. L. M., Van Aert, R. C. M., & Wicherts, J. M. (2014). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods.
 Veldkamp, C. L. S., Nuijten, M. B., Dominguez-Alvarez, L., Van Assen, M. A. L. M., & Wicherts, J. M. (2014). Statistical reporting errors and collaboration on statistical analyses in psychological science. PLoS One.
 Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804. doi: 10.3758/BF03194105
 Wicherts, J. M. (2011). Psychology must learn a lesson from fraud case. Nature, 480, 7. doi: 10.1038/480007a
 Wicherts, J. M. (2013). Science revolves around the data. Journal of Open Psychology Data, 1(1), e1.
 Wicherts, J. M., & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish your data too? Intelligence. doi: 10.1016/j.intell.2012.01.004
 Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS One, 6(11), e26828.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.