Beware the Null Hypothesis: Critical Value Tables for Evaluating Classifiers

  • George Forman
  • Ira Cohen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3720)


Scientists regularly decide the statistical significance of their findings by determining whether they can, with sufficient confidence, rule out the possibility that their findings could be attributed to random variation—the ‘null hypothesis.’ For this, they rely on tables with critical values pre-computed for the normal distribution, the t-distribution, etc. This paper provides such tables (and methods for generating them) for the performance metrics of binary classification: accuracy, F-measure, area under the ROC curve (AUC), and true positives in the top ten. Given a test set of a certain size, the tables provide the critical value for accepting or rejecting the null hypothesis that the score of the best classifier would be consistent with taking the best of a set of random classifiers. The tables are appropriate to consult when a researcher, practitioner or contest manager selects the best of many classifiers measured against a common test set. The risk of the null hypothesis is especially high when there is a shortage of positives or negatives in the testing set (irrespective of the training set size), as is the case for many medical and industrial classification tasks with highly skewed class distributions.
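For the accuracy metric, the kind of critical value the paper tabulates can be sketched analytically: a uniform random guesser's number of correct predictions on a test set of n items follows Binomial(n, 0.5), and the best of k independent random classifiers stays at or below a threshold c with probability F(c)^k, where F is the binomial CDF. The following is a minimal illustrative sketch under that uniform-guessing assumption, not the authors' table-generation code; the function names are invented for this example:

```python
import math

def binom_cdf(x, n, p=0.5):
    """P(X <= x) for X ~ Binomial(n, p), computed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x + 1))

def critical_accuracy(n_test, n_classifiers, alpha=0.05):
    """Smallest accuracy c such that the best of n_classifiers uniform
    random guessers exceeds c with probability at most alpha.

    Scores above this critical value let us reject the null hypothesis
    at confidence level 1 - alpha."""
    target = 1 - alpha
    for correct in range(n_test + 1):
        # Best of k classifiers is <= c with probability F(c)^k.
        if binom_cdf(correct, n_test) ** n_classifiers >= target:
            return correct / n_test
    return 1.0
```

Note how the critical value grows with the number of classifiers compared: selecting the best of 100 random guessers demands a noticeably higher accuracy than beating a single one. This sketch also ignores class skew; a majority-class guesser on a skewed test set already scores well above 0.5, which is one reason the paper's tables cover skewed distributions and metrics such as F-measure and AUC.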


Keywords: null hypothesis, cumulative distribution function, performance metrics, random classifier, inverse cumulative distribution function (keywords added by machine, not by the authors)



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • George Forman, Hewlett-Packard Labs, Palo Alto, USA
  • Ira Cohen, Hewlett-Packard Labs, Palo Alto, USA
