Knowledge and Information Systems, Volume 35, Issue 1, pp 131–152

On measuring the performance of binary classifiers

  • Charles Parker
Regular Paper


If one is given two binary classifiers and a set of test data, it should be straightforward to determine which of the two classifiers is superior. Recent work, however, has called into question many of the methods heretofore accepted as standard for this task. In this paper, we analyze seven ways of determining whether one classifier is better than another, given the same test data. Five of these are long established, and two are relative newcomers. We review and extend work showing that one of these methods is clearly inappropriate and then conduct an empirical analysis with a large number of datasets to evaluate the real-world implications of our theoretical analysis. Both our empirical and theoretical results converge strongly toward one of the newer methods.
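
As a rough illustration of why the comparison is less straightforward than it sounds, the following Python sketch scores two classifiers on the same test data with two different measures. The choice of accuracy and AUC, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration; the abstract does not name the seven methods it analyzes, and this is not the paper's experimental procedure.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the true label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs that are ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical test set: labels and scores are made up for this example.
labels = [1, 1, 1, 0, 0, 0]

# Classifier A: two positives scored confidently, one positive ranked last.
scores_a = [0.9, 0.8, 0.1, 0.2, 0.3, 0.4]

# Classifier B: ranks every positive above every negative,
# but all of its scores fall below the 0.5 threshold.
scores_b = [0.4, 0.35, 0.3, 0.2, 0.1, 0.05]

for name, scores in (("A", scores_a), ("B", scores_b)):
    print(f"classifier {name}: "
          f"accuracy={accuracy(labels, scores):.3f}, "
          f"AUC={auc(labels, scores):.3f}")
```

On this toy data, accuracy prefers classifier A (5/6 against 1/2) while AUC prefers classifier B (1 against 2/3), so the answer to "which classifier is better" depends on the measure chosen.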


Performance measures, Binary classification, Supervised learning, Evaluation



Copyright information

© Springer-Verlag London Limited 2012

Authors and Affiliations

  1. BigML, Inc., Corvallis, USA
