Should significance testing be abandoned in machine learning?

  • Daniel Berrar
  • Werner Dubitzky
Regular Paper


Significance testing has become a mainstay in machine learning, with the p value being firmly embedded in the current research practice. Significance tests are widely believed to lend scientific rigor to the interpretation of empirical findings; however, their problems have received only scant attention in the machine learning literature so far. Here, we investigate one particular problem, the Jeffreys–Lindley paradox. This paradox describes a statistical conundrum: the p value can be close to zero, convincing us that there is overwhelming evidence against the null hypothesis. At the same time, however, the posterior probability of the null hypothesis being true can be close to 1, convincing us of the exact opposite. In experiments with synthetic data sets and a subsequent thought experiment, we demonstrate that this paradox can have severe repercussions for the comparison of multiple classifiers over multiple benchmark data sets. Our main result suggests that significance tests should not be used in such comparative studies. We caution that the reliance on significance tests might lead to a situation that is similar to the reproducibility crisis in other fields of science. We offer for debate four avenues that might alleviate the looming crisis.
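
The paradox described above can be illustrated with a small numerical sketch. It is not taken from the paper's experiments: it assumes the textbook setting of a point null hypothesis (mean equal to 0) against an alternative with a normal prior on the mean, known standard deviation `sigma`, prior standard deviation `tau`, and equal prior odds. The function names and default values are hypothetical and chosen only for illustration. The test statistic is held fixed at z = 3 while the sample size n grows, so the two-sided p value stays near 0.0027 while the posterior probability of the null hypothesis approaches 1.

```python
import math


def two_sided_p_value(z):
    """Two-sided p value of a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def posterior_null(z, n, tau=1.0, sigma=1.0, prior_null=0.5):
    """Posterior probability of H0: mean = 0, given z = sqrt(n) * xbar / sigma.

    Assumed model (illustrative only): under H1 the mean has a N(0, tau^2)
    prior, so the sample mean is N(0, tau^2 + sigma^2 / n) under H1 and
    N(0, sigma^2 / n) under H0.
    """
    r = n * tau**2 / sigma**2
    # Log Bayes factor in favour of H0: BF01 = sqrt(1 + r) * exp(-z^2 r / (2 (1 + r)))
    log_bf01 = 0.5 * math.log1p(r) - 0.5 * z**2 * r / (1.0 + r)
    prior_odds = prior_null / (1.0 - prior_null)
    posterior_odds = prior_odds * math.exp(log_bf01)
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    z = 3.0  # fixed "significant" test statistic
    print(f"p value for z = {z}: {two_sided_p_value(z):.4f}")
    for n in (10, 1_000, 100_000, 10_000_000, 1_000_000_000):
        print(f"n = {n:>13,d}  P(H0 | data) = {posterior_null(z, n):.4f}")
```

In this sketch the p value does not depend on n, whereas the Bayes factor grows roughly with the square root of n. This is why a result that looks highly significant can coincide with a posterior that favours the null hypothesis once the sample is large.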


Jeffreys–Lindley paradox · p Value · Significance test · Bayesian test · Classification



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Data Science Laboratory, Department of Information and Communications Engineering, Tokyo Institute of Technology, Tokyo, Japan
  2. Research Unit Scientific Computing, German Research Center for Environmental Health, Helmholtz Zentrum München, Munich, Germany
