Machine Learning, Volume 106, Issue 6, pp 911–949

Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers



Null hypothesis significance testing is routinely used for comparing the performance of machine learning algorithms. Here, we provide a detailed account of the major underrated problems that this common practice entails. For example, omnibus tests, such as the widely used Friedman test, are not appropriate for the comparison of multiple classifiers over diverse data sets. In contrast to the view that significance tests are essential to a sound and objective interpretation of classification results, our study suggests that no such tests are needed. Instead, greater emphasis should be placed on the magnitude of the performance difference and the investigator’s informed judgment. As an effective tool for this purpose, we propose confidence curves, which depict nested confidence intervals at all levels for the performance difference. These curves enable us to assess the compatibility of an infinite number of null hypotheses with the experimental results. We benchmarked several classifiers on multiple data sets and analyzed the results with both significance tests and confidence curves. Our conclusion is that confidence curves effectively summarize the key information needed for a meaningful interpretation of classification results while avoiding the intrinsic pitfalls of significance tests.
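To make the idea of nested confidence intervals concrete, the following is a minimal sketch rather than the construction used in the paper. It assumes that the per-data-set differences in accuracy between two classifiers are independent and that a normal approximation for their mean is acceptable. The function names `confidence_curve` and `p_value`, and the numbers in `diffs`, are hypothetical and serve only as illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_curve(diffs, levels):
    """Return (level, lower, upper) tuples: one normal-approximation
    confidence interval for the mean difference per confidence level."""
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))  # standard error of the mean
    curve = []
    for level in levels:
        # Two-sided critical value for this confidence level.
        z = NormalDist().inv_cdf(0.5 + level / 2.0)
        curve.append((level, d_bar - z * se, d_bar + z * se))
    return curve


def p_value(diffs, null_value=0.0):
    """Two-sided p value (normal approximation) for the null hypothesis
    that the true mean difference equals null_value."""
    se = stdev(diffs) / sqrt(len(diffs))
    z = (mean(diffs) - null_value) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical accuracy differences (classifier A minus classifier B),
# one value per data set; these are not results from the paper.
diffs = [0.02, 0.01, 0.03, -0.01, 0.02, 0.04, 0.00, 0.01]
levels = [0.50, 0.80, 0.95, 0.99]

curve = confidence_curve(diffs, levels)
for level, lower, upper in curve:
    print(f"{level:.0%} interval: [{lower:+.4f}, {upper:+.4f}]")
```

Plotting the lower and upper bounds against the confidence level gives the curve. Under this approximation, any null value lying on the boundary of the 95% interval has a two-sided p value of 0.05, which is how a single curve relates to the whole family of null hypotheses about the performance difference.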


Keywords: Confidence curve, Significance test, p value, Multiple comparisons, Performance evaluation



Copyright information

© The Author(s) 2016

Authors and Affiliations

  1. College of Engineering, Shibaura Institute of Technology, Saitama, Japan
