Background

Statistics is a difficult topic to teach and learn, and there is ample evidence that its application is often faulty in medicine [1–6] as well as in many other scientific disciplines. Errors span design, analysis, reporting, and interpretation. Although there has recently been considerable effort to improve and standardise the reporting of medical research (e.g., the CONSORT statement for randomised controlled trials [7]), there is almost no literature demonstrating the incorrect computation or reporting of results beyond general deficiencies of computer packages [8, 9] or some well-scrutinised data sets such as Mendel's [10]. Beyond deficiencies of software, such numerical errors may also originate in the transcription of results from computer outputs to reports and manuscripts, in incorrect rounding of results, or in uncorrected typesetting errors. We investigated this question by checking the statistical results reported in all the papers of volumes 409–412 of Nature (2001) and in some papers of volumes 322–323 of BMJ (2001). We show that the occurrence of errors is very high, and we review ways to improve current practice.

Methods

Given an observed test statistic and its degrees of freedom (df), one may compute the corresponding P value or significance level (or vice versa) with most statistical packages. We were thus able to check the congruence of results consisting of the test statistic, the df, and a precise P value. We could not check results consisting only of a P value, or with no precise P value (e.g., P < 0.05 instead of P = 0.023), and these were therefore not considered in our review. Note that the latter are poor practices, and reporting both the observed test statistic and the "exact" P value has been recommended [3]. We did not check the congruence of confidence intervals and other statistics because doing so would generally be impossible without access to the raw data.
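For illustration, this check can be reproduced with any modern statistics library; the following is a minimal sketch using Python's SciPy (a stand-in here, not one of the packages used in this study):

```python
# Recomputing P values from reported test statistics and df, and vice versa.
from scipy import stats

# chi-squared example: reported as "chi2 = 1.2, df = 2"
print(f"P = {stats.chi2.sf(1.2, df=2):.3f}")          # 0.549

# F example: reported as "F(2,14) = 10.89"
print(f"P = {stats.f.sf(10.89, dfn=2, dfd=14):.4f}")  # 0.0014

# vice versa: the statistic implied by a P value (inverse survival function)
print(f"chi2 = {stats.chi2.isf(0.3, df=1):.2f}")      # 1.07
```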

We checked all the statistical results (consisting of a test statistic, df, and a precise P value) reported in all the papers of volumes 409–412 of Nature (2001) and in 12 randomly selected papers from volumes 322–323 of BMJ (2001). We checked the results with three different packages: SPSS for Windows 10.1, STATISTICA '98 for Windows, and the freeware NCSS Probability Calculator for Windows. The results of the three packages were identical to at least the fourth decimal place. All the results checked and the errors detected are detailed in Table 1 for BMJ (see Additional file 1) and Table 2 for Nature (see Additional file 2).

We determined that a result was in error only when the incongruence could not be due to rounding in the original paper. For instance, the result "χ2 = 1.7, df = 1, P = 0.30" in vol. 322, p. 769–770 of BMJ cannot be due to correct rounding of the test statistic and P value, given the following precise results: χ2 = 1.65, df = 1, P = 0.199; χ2 = 1.70, df = 1, P = 0.192; χ2 = 1.75, df = 1, P = 0.186. If the statistic was really χ2 = 1.7, the P value should have been much lower than 0.3. In fact, a χ2 of 1.07 with 1 df yields a P value of 0.3, suggesting a reporting error. In contrast, the result "χ2 = 1.2, df = 2, P = 0.54" in vol. 322, p. 336–342 is congruent after rounding with the following precise results: χ2 = 1.15, df = 2, P = 0.563; χ2 = 1.20, df = 2, P = 0.549; χ2 = 1.25, df = 2, P = 0.535.
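This rounding-aware rule can be made explicit in code. The sketch below (a hypothetical helper written for the chi-squared case only; the function name and tolerances are ours, not taken from any of the packages we used) treats a statistic reported to one decimal as the interval of values that round to it and asks whether the reported P value is compatible with any statistic in that interval:

```python
# Hypothetical rounding-aware congruence check (chi-squared case only).
from scipy import stats

def congruent(stat_reported, df, p_reported, half_ulp=0.05, p_decimals=2):
    """Is p_reported compatible with any statistic that rounds to stat_reported?"""
    lo, hi = stat_reported - half_ulp, stat_reported + half_ulp
    # chi2.sf decreases in the statistic, so the attainable P interval is [sf(hi), sf(lo)]
    p_min, p_max = stats.chi2.sf(hi, df), stats.chi2.sf(lo, df)
    tol = 0.5 * 10 ** -p_decimals  # allow for rounding of the reported P itself
    return p_min - tol <= p_reported <= p_max + tol

print(congruent(1.7, 1, 0.30))  # False: the BMJ result judged an error above
print(congruent(1.2, 2, 0.54))  # True: compatible with rounding
```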

We also used the Kolmogorov-Smirnov test (in SPSS for Windows 10.1) to test whether the frequencies of the last digit of the P values found, and of an additional random sample of 610 statistics (Table 3, see Additional file 3) from the same volumes 409–412 of Nature, deviated significantly from the uniform distribution. For leading digits, Benford's law (i.e., that the distribution of first digits follows a logarithmic pattern, with probability decreasing from 1 to 9) is usually observed. Benford's law states that the probability of a 1 as first digit is 30.1%, while the probability of a 9 is 4.6% [11]. However, the distribution flattens out progressively for subsequent digits: for the second digit the probabilities are only 12.0% for 0 and 8.5% for 9 (and 10.2% and 9.8%, respectively, for the third digit). As the statistics analysed were usually reported to 3–4 significant figures, a uniform distribution (i.e., equally probable digits) should rather be expected for the last digit. Similar analyses of the equiprobability of last digits have been performed in a variety of medical contexts to detect digit preference and to check the accuracy of databases [12–16].
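For reference, the digit probabilities quoted above follow directly from the general form of Benford's law; the sketch below reproduces them (it mirrors our SPSS analysis only in spirit, and the observed digit data are not reproduced here):

```python
# Benford's law for leading digits and its progressive flattening for later
# positions; the uniform distribution is then the natural null model for
# last digits of statistics reported to 3-4 significant figures.
import math

def benford(position, digit):
    """Benford probability of `digit` at a given position (1 = leading digit)."""
    if position == 1:
        return math.log10(1 + 1 / digit)                 # digit in 1..9
    lo, hi = 10 ** (position - 2), 10 ** (position - 1)  # preceding digit blocks
    return sum(math.log10(1 + 1 / (10 * k + digit)) for k in range(lo, hi))

print(f"1st digit: P(1) = {benford(1, 1):.3f}, P(9) = {benford(1, 9):.3f}")  # 0.301, 0.046
print(f"2nd digit: P(0) = {benford(2, 0):.3f}, P(9) = {benford(2, 9):.3f}")  # 0.120, 0.085
print(f"3rd digit: P(0) = {benford(3, 0):.3f}, P(9) = {benford(3, 9):.3f}")  # 0.102, 0.098

# The uniformity test itself could be run with, e.g.,
#   scipy.stats.kstest(digits, scipy.stats.randint(0, 10).cdf)
# where `digits` holds the observed last digits; note the KS test is only
# approximate for discrete data.
```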

Results and discussion

We found that a surprising 11.6% (21 of 181) of the checked results in Nature were incongruent (Table 2, see Additional file 2). A less exhaustive check in BMJ yielded a very similar percentage (11.1%, 7 of 63) (Table 1, see Additional file 1). At least one such error appeared in 38% (12 of 32) of the Nature papers and 25% (3 of 12) of the BMJ papers, indicating that these errors are widespread and not concentrated in a few papers. For instance, in vol. 411, p. 88 of Nature, "F 2,14 = 10.89, P = 0.014" was reported, while the congruent P value is 0.0014, suggesting a transcription error. Another transcription error is "F 7,79 = 7.09, P = 0.0094" in vol. 412, p. 74, in which the P value corresponds to an F statistic with 1 and 79 degrees of freedom.

Many errors are probably due to incorrect rounding, e.g., "r = 0.30, N = 21, P = 0.20" (congruent P = 0.186) in vol. 411, p. 297 of Nature, or "χ2 = 0.01, df = 1, P = 1.00" (congruent P = 0.92) in vol. 322, p. 336–342 of BMJ. Some authors report P = 0.001 when they should report P < 0.001 or P << 0.001.

These incongruences are probably due to inaccurate rounding or transcription. Software deficiencies are usually orders of magnitude less important [8, 9] and would be restricted to the specific papers using a given statistical package, which is inconsistent with our finding of errors in over 25% of the papers. Most typesetting errors are probably caught during authors' proof corrections; errors introduced at earlier stages of manuscript preparation are likely more frequent and more difficult to detect.

Interestingly, independent evidence of rounding misuse stems from digit preference. We collected 610 test statistics from the same Nature volumes and counted the frequencies of the last digit reported (see Fig. 1 and Additional file 3). The counts deviate significantly from the expected uniform distribution (Kolmogorov-Smirnov test, Z = 2.7, P < 0.0005) and show that authors tend to round more frequently, inconsistently, and sometimes wrongly when the last digit is high (as expected for psychological reasons) and when it is 4, 6, or 9. The counts of the last digit of the P values also deviate significantly from the uniform distribution (Kolmogorov-Smirnov Z = 1.4, P = 0.043), with 0, 4, and 9 less common than expected (see Fig. 2 and Additional file 2). A similar avoidance of digits adjacent to multiples of 5 (such as 4 or 9) has also been noted in other studies of digit preference [12, 13] and suggests that authors do not round in a consistent manner (e.g., to 3–4 significant figures).

Figure 1

Histogram of the last digit of 610 test statistics (see Additional file 3) in volumes 409–412 of Nature. The reference line corresponds to the mean count (61).

Figure 2

Histogram of the last digit of 181 P values (see Additional file 2) in volumes 409–412 of Nature. The reference line corresponds to the mean count (18.1).

Our estimate of 11–12% incongruent statistical results is a conservative one, since cases that might have been caused by rounding were not counted as errors. Without access to the raw data we do not know the correct results, so the real importance of these errors cannot be ascertained. Apparently, the conclusion would change from significant to nonsignificant, using the arbitrary 5% level, in only about 4% (1/27) of the errors (one error reporting "1,9" df for a t statistic was not considered). However, the median of the relative bias (the absolute difference between the reported and congruent P values, divided by the congruent P value) was 38%, and in 12% of the cases the relative bias was larger than 1000%, showing that the significance level might change by one or more orders of magnitude.
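As a worked illustration of this relative bias, take the Nature result "F 2,14 = 10.89, P = 0.014" discussed above, whose congruent P value is 0.0014 (again using SciPy as a stand-in):

```python
# Relative bias = |reported P - congruent P| / congruent P
from scipy import stats

p_congruent = stats.f.sf(10.89, dfn=2, dfd=14)  # ~0.0014
p_reported = 0.014
relative_bias = abs(p_reported - p_congruent) / p_congruent
print(f"relative bias = {relative_bias:.0%}")   # ~900%, about an order of magnitude
```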

Although these errors may leave the conclusions of a study unchanged, and other kinds of error might be more harmful, they are indicative of poor practice. Our concern is that such errors are probably present in all kinds of numerical results (e.g., means, percentages, confidence intervals) and at all steps of scientific research, with potentially important practical consequences. Moreover, poor presentation provides clues that there may be serious errors elsewhere [17]. Our findings confirm that the quality of research and of scientific papers needs improvement, and that papers should be more carefully checked and evaluated in these days of high publication pressure [18–20].

Conclusions

Several detailed guidelines on the practice and reporting of statistics in medical papers are available [3, 7, 21, 22]. There is considerable consensus on the most desirable practices; some of their suggestions are:

1) In medical research, confidence intervals are often more appropriate than hypothesis testing. If hypothesis testing is used, it is desirable to report not only the P values but also the observed values of test statistics and the degrees of freedom.

2) Exact P values (to no more than two significant figures) should be given, rather than statements such as P > 0.05 or P < 0.01. It is unnecessary to specify P values lower than 0.0001 (a formatting sketch follows this list).

3) Spurious precision adds no value to a paper and even detracts from its readability and credibility. Results need to be rounded [23–25].
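As a minimal sketch of suggestions 2) and 3), a hypothetical formatting helper might look like this (the name format_p and the exact cut-offs are illustrative, not prescribed by the guidelines):

```python
# Hypothetical helper: report exact P values to two significant figures,
# flooring at P < 0.0001 as recommended above.
def format_p(p):
    if p < 0.0001:
        return "P < 0.0001"
    return f"P = {p:.2g}"

print(format_p(0.000013))  # P < 0.0001
print(format_p(0.03217))   # P = 0.032
print(format_p(0.549))     # P = 0.55
```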

To this we need to add that:

1) Numerical results should be correctly rounded. The problem of introducing bias by rounding digits ending in five [26] is a trivial one compared to the misuses reported in our paper.

2) The preparation and editing of manuscripts should be more carefully checked. Greater use of statistical reviewers by medical journals [1, 17], and unlimited publication of correspondence on the web [2], may help to improve the quality of papers.

3) In principle, the authors of research papers (including systematic reviews) should make their raw data freely available on the Internet, and journals should implement and encourage this practice. The main benefits of this recent practice are that further analyses not directly addressed by the primary researchers become possible [27, 28], including effective systematic review and meta-analysis [29] and the estimation of adequate sample sizes (power analysis) [30]; that other researchers can check whether the results are correct and the conclusions justified [29, 30]; and that fraud and sloppiness may be more easily detected and are thus discouraged [27].

4) The software used (including its version) should also be stated, since this gives many hints about the methods used.

Altman and coauthors, among others, give details of many other ways to improve the practice and reporting of statistics in medicine, and their suggestions are widely applicable to other research fields [1, 3, 5, 17].