BenchMetrics: a systematic benchmarking method for binary classification performance metrics

  • Original Article
  • Published in Neural Computing and Applications

Abstract

This paper proposes a systematic benchmarking method called BenchMetrics to analyze and compare the robustness of binary classification performance metrics based on the confusion matrix for a crisp classifier. BenchMetrics, which introduces new concepts such as meta-metrics (metrics about metrics) and metric space, has been tested on fifteen well-known metrics, including balanced accuracy, normalized mutual information, Cohen’s Kappa, and Matthews correlation coefficient (MCC), along with two recently proposed metrics in the literature, optimized precision and index of balanced accuracy. The method formally presents a pseudo-universal metric space in which all permutations of confusion matrix elements yielding the same sample size are calculated. It evaluates the metrics and metric spaces in a two-stage benchmark based on eighteen newly proposed criteria and finally ranks the metrics by aggregating the criteria results. The first, mathematical evaluation stage analyzes the metrics’ equations, specific confusion matrix variations, and the corresponding metric spaces. The second stage, which includes seven novel meta-metrics, evaluates the robustness aspects of the metric spaces. We interpreted each benchmarking result and comparatively assessed the effectiveness of BenchMetrics against the limited number of comparison studies in the literature. The results demonstrate that widely used metrics have significant robustness issues and that MCC is the most robust and recommended metric for binary classification performance evaluation.
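
To make the metric space idea concrete, here is a minimal sketch (an illustrative reconstruction, not the authors' released code; the function names mcc and metric_space are ours) that enumerates every confusion matrix (TP, FN, FP, TN) with a fixed sample size and evaluates MCC over the whole space:

```python
# Illustrative sketch of the "metric space" concept (not the authors'
# implementation): enumerate all confusion matrices (TP, FN, FP, TN) that
# sum to a fixed sample size n and evaluate a metric, here MCC, over the
# entire space.
import math
from itertools import product

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient; None where the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return None
    return (tp * tn - fp * fn) / denom

def metric_space(n, metric):
    """Yield (tp, fn, fp, tn, value) for every confusion matrix with tp + fn + fp + tn == n."""
    for tp, fn, fp in product(range(n + 1), repeat=3):
        tn = n - tp - fn - fp
        if tn >= 0:
            yield tp, fn, fp, tn, metric(tp, fn, fp, tn)

space = list(metric_space(25, mcc))
print(len(space))  # 3276 confusion matrices for n = 25 (cf. Note 2)
```

The size of such a space is the stars-and-bars count C(n + 3, 3); for n = 25 this gives the 3,276 permutations listed in Note 2.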

Availability of data and material

The authors confirm that all data and materials support their published claims and comply with field standards (see Appendix A).

Code availability

The authors confirm that all software applications and custom code support their published claims and comply with field standards (see Appendix A).

Notes

  1. Note that ‘performance metrics’ in [0, 1] or [− 1, 1] directly represent the success of a classifier (e.g., Accuracy or True Positive Rate). These metrics are the instruments published in the literature to report, evaluate, and compare classifiers. In contrast, ‘performance measures’, which are usually not published, represent other aspects such as dataset or classifier-output characteristics (e.g., PREV is the ratio of positive examples in a dataset and BIAS is the ratio of positive outcomes of a classifier). Instruments indicating performance on an unbounded interval [0, ∞) or (− ∞, ∞) are also ‘measures’ that are not suitable for publishing and comparing classification performances in the literature (e.g., Odds Ratio or Discriminant Power) because of their limited interpretability. A small illustration of this distinction follows after these notes.

  2. Sample sizes (permutations/metric space sizes): Sn = 25 (3,276); Sn = 50 (23,426); Sn = 75 (76,076); Sn = 100 (176,851); Sn = 125 (341,376); Sn = 150 (585,276); Sn = 175 (924,176); Sn = 200 (1,373,701); Sn = 250 (2,667,126). These counts are verified in the check after these notes.

  3. For the ten negative samples (e.g., i = 1, …, 10): ci = 0 and, for example, pi = 0.49, so |ci − pi| = 0.49. For the remaining ten positive samples (e.g., i = 11, …, 20): ci = 1 and, for example, pi = 0.51, so |ci − pi| = 0.49. Hence, MAE = 0.49 (a worked version follows after these notes).
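
As a companion to Note 1, the following sketch (our own illustration; the function name instruments and the example counts are assumptions, and the odds ratio uses the usual TP·TN/(FP·FN) formula) contrasts bounded performance metrics with performance measures computed from one confusion matrix:

```python
# Companion to Note 1 (our own illustration): bounded performance metrics
# versus performance measures derived from a single confusion matrix.
def instruments(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    return {
        # performance metrics, bounded in [0, 1]
        "ACC": (tp + tn) / n,                         # accuracy
        "TPR": tp / (tp + fn) if tp + fn else None,   # true positive rate
        # performance measures (dataset / classifier-output characteristics)
        "PREV": (tp + fn) / n,                        # ratio of positive examples
        "BIAS": (tp + fp) / n,                        # ratio of positive outcomes
        # unbounded measure with limited interpretability for reporting
        "OddsRatio": (tp * tn) / (fp * fn) if fp * fn else float("inf"),
    }

print(instruments(tp=40, fn=10, fp=5, tn=45))
# e.g. ACC = 0.85, TPR = 0.80, PREV = 0.50, BIAS = 0.45, OddsRatio = 36.0
```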
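
The metric space sizes listed in Note 2 agree with the stars-and-bars count C(Sn + 3, 3) of non-negative (TP, FN, FP, TN) tuples summing to Sn; the short check below (our own verification, assuming this counting convention) reproduces the listed values:

```python
# The metric space sizes in Note 2 equal the stars-and-bars count
# C(Sn + 3, 3) of non-negative (TP, FN, FP, TN) tuples summing to Sn.
from math import comb  # Python 3.8+

for sn in (25, 50, 75, 100, 125, 150, 175, 200, 250):
    print(sn, comb(sn + 3, 3))
# 25 3276, 50 23426, 75 76076, 100 176851, ..., 250 2667126
```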
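
The MAE arithmetic in Note 3 can be reproduced directly; this is a minimal sketch using the class labels and example scores given in the note:

```python
# Reproduce the MAE example in Note 3: ten negatives (c = 0) scored
# p = 0.49 and ten positives (c = 1) scored p = 0.51 give MAE = 0.49.
c = [0] * 10 + [1] * 10          # true classes c_i
p = [0.49] * 10 + [0.51] * 10    # example predicted scores p_i
mae = sum(abs(ci - pi) for ci, pi in zip(c, p)) / len(c)
print(round(mae, 2))  # 0.49
```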

References

  1. Luque A, Carrasco A, Martín A, de las Heras A (2019) The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognit 91:216–231. https://doi.org/10.1016/j.patcog.2019.02.023

  2. Staartjes VE, Schröder ML (2018) Letter to the Editor. Class imbalance in machine learning for neurosurgical outcome prediction: are our models valid? J Neurosurg Spine 29:611–612. https://doi.org/10.3171/2018.5.SPINE18543

  3. Brown JB (2018) Classifiers and their metrics quantified. Mol Inform 37:1–11. https://doi.org/10.1002/minf.201700127

  4. Sokolova M (2006) Assessing invariance properties of evaluation measures. In: Proceedings of the Workshop on Testing of Deployable Learning and Decision Systems, 19th Neural Information Processing Systems Conference (NIPS 2006), pp 1–6

  5. Ranawana R, Palade V (2006) Optimized precision—a new measure for classifier performance evaluation. In: 2006 IEEE International Conference on Evolutionary Computation. IEEE, Vancouver, BC, Canada, pp 2254–2261

  6. Garcia V, Mollineda RA, Sanchez JS (2010) Theoretical analysis of a performance measure for imbalanced data. In: 2010 International Conference on Pattern Recognition (ICPR). IEEE, pp 617–620. https://doi.org/10.1109/ICPR.2010.156

  7. Boughorbel S, Jarray F, El-Anbari M (2017) Optimal classifier for imbalanced data using Matthews correlation coefficient metric. PLoS ONE 12:1–17. https://doi.org/10.1371/journal.pone.0177678

  8. Ferri C, Hernández-Orallo J, Modroiu R (2009) An experimental comparison of performance measures for classification. Pattern Recognit Lett 30:27–38. https://doi.org/10.1016/j.patrec.2008.08.010

  9. Seliya N, Khoshgoftaar TM, Van Hulse J (2009) Aggregating performance metrics for classifier evaluation. In: IEEE International Conference on Information Reuse and Integration, IRI. pp 35–40

  10. Liu Y, Zhou Y, Wen S, Tang C (2016) A strategy on selecting performance metrics for classifier evaluation. Int J Mob Comput Multimed Commun 6:20–35. https://doi.org/10.4018/ijmcmc.2014100102

  11. Brzezinski D, Stefanowski J, Susmaga R, Szczȩch I (2018) Visual-based analysis of classification measures and their properties for class imbalanced problems. Inf Sci (Ny) 462:242–261. https://doi.org/10.1016/j.ins.2018.06.020

  12. Hu B-G, Dong W-M (2014) A study on cost behaviors of binary classification measures in class-imbalanced problems. Comput Res Repos abs/1403.7

  13. Sokolova M, Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Inf Process Manag 45:427–437. https://doi.org/10.1016/j.ipm.2009.03.002

  14. Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17:299–310. https://doi.org/10.1109/TKDE.2005.50

  15. Forbes A (1995) Classification-algorithm evaluation: five performance measures based on confusion matrices. J Clin Monit Comput 11:189–206. https://doi.org/10.1007/BF01617722

  16. Pereira RB, Plastino A, Zadrozny B, Merschmann LHC (2018) Correlation analysis of performance measures for multi-label classification. Inf Process Manag 54:359–369. https://doi.org/10.1016/j.ipm.2018.01.002

  17. Straube S, Krell MM (2014) How to evaluate an agent’s behavior to infrequent events? Reliable performance estimation insensitive to class distribution. Front Comput Neurosci 8:1–6. https://doi.org/10.3389/fncom.2014.00043

  18. Hossin M, Sulaiman MN (2015) A review on evaluation metrics for data classification evaluations. Int J Data Min Knowl Manag Process 5:1–11. https://doi.org/10.5121/ijdkp.2015.5201

  19. Tharwat A (2020) Classification assessment methods. Appl Comput Informatics (ahead of print):1–13. https://doi.org/10.1016/j.aci.2018.08.003

  20. Chicco D, Jurman G (2020) The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. https://doi.org/10.1186/s12864-019-6413-7

  21. Brzezinski D, Stefanowski J, Susmaga R, Szczech I (2020) On the dynamics of classification measures for imbalanced and streaming data. IEEE Trans Neural Networks Learn Syst 31:1–11. https://doi.org/10.1109/TNNLS.2019.2899061

  22. Baldi P, Brunak S, Chauvin Y et al (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16:412–424. https://doi.org/10.1093/bioinformatics/16.5.412

  23. Hu B-G, He R, Yuan X-T (2012) Information-theoretic measures for objective evaluation of classifications. Acta Autom Sin 38:1169–1182. https://doi.org/10.1016/S1874-1029(11)60289-9

  24. Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27:861–874. https://doi.org/10.1016/j.patrec.2005.10.010

  25. Valverde-Albacete FJ, Peláez-Moreno C (2014) 100% classification accuracy considered harmful: the normalized information transfer factor explains the accuracy paradox. PLoS ONE 9:1–10. https://doi.org/10.1371/journal.pone.0084217

  26. Shepperd M (2013) Assessing the predictive performance of machine learners in software defect prediction function. In: The 24th CREST Open Workshop (COW), on Machine Learning and Search Based Software Engineering (ML&SBSE). Centre for Research on Evolution, Search and Testing (CREST), London, pp 1–16

  27. Schröder G, Thiele M, Lehner W (2011) Setting goals and choosing metrics for recommender system evaluations. In: UCERSTI 2 Workshop at the 5th ACM Conference on Recommender Systems. Chicago, Illinois, pp 1–8

  28. Delgado R, Tibau XA (2019) Why Cohen’s kappa should be avoided as performance measure in classification. PLoS ONE 14:1–26. https://doi.org/10.1371/journal.pone.0222916

  29. Ma J, Zhou S (2020) Metric learning-guided k nearest neighbor multilabel classifier. Neural Comput Appl. https://doi.org/10.1007/s00521-020-05134-9

  30. Fatourechi M, Ward RK, Mason SG, et al (2008) Comparison of evaluation metrics in classification applications with imbalanced datasets. In: 7th International Conference on Machine Learning and Applications (ICMLA), pp 777–782

  31. Seliya N, Khoshgoftaar TM, Van Hulse J (2009) A study on the relationships of classifier performance metrics. In: 21st IEEE International Conference on Tools with Artificial Intelligence, ICTAI, pp 59–66

  32. Joshi MV (2002) On evaluating performance of classifiers for rare classes. In: Proceedings IEEE International Conference on Data Mining. IEEE, pp 641–644

  33. Caruana R, Niculescu-Mizil A (2004) Data mining in metric space: an empirical analysis of supervised learning performance criteria. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 69–78

  34. Huang J, Ling CX (2007) Constructing new and better evaluation measures for machine learning. IJCAI Int Jt Conf Artif Intell 859–864

  35. Japkowicz N, Shah M (2011) Evaluating learning algorithms: a classification perspective. Cambridge University Press, Cambridge

  36. Contreras-Reyes JE (2020) An asymptotic test for bimodality using the Kullback-Leibler divergence. Symmetry (Basel) 12:1–13. https://doi.org/10.3390/SYM12061013

  37. Shi L, Campbell G, Jones WD et al (2010) The Microarray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol 28:827–838. https://doi.org/10.1038/nbt.1665

  38. Rohani A, Mamarabadi M (2019) Free alignment classification of dikarya fungi using some machine learning methods. Neural Comput Appl 31:6995–7016. https://doi.org/10.1007/s00521-018-3539-5

  39. Azar AT, El-Said SA (2014) Performance analysis of support vector machines classifiers in breast cancer mammography recognition. Neural Comput Appl 24:1163–1177. https://doi.org/10.1007/s00521-012-1324-4

  40. Canbek G, Sagiroglu S, Taskaya Temizel T, Baykal N (2017) Binary classification performance measures/metrics: a comprehensive visualized roadmap to gain new insights. In: 2017 International Conference on Computer Science and Engineering (UBMK). IEEE, Antalya, Turkey, pp 821–826

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Contributions

GC: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing—original draft, Writing—review and editing, Visualization. TTT: Validation, Writing—review and editing, Supervision. SS: Conceptualization, Validation, Writing—review & editing, Supervision.

Corresponding author

Correspondence to Gürol Canbek.

Ethics declarations

Conflicts of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Developed online research tools and data

Appendix B Binary classification performance instrument list

Table 16 lists the instruments with their abbreviations and equations; more information can be found in [40]. See Table 12 for the equations of the recently proposed metrics.

Table 16 Performance measures and metrics (names, alternative names, abbreviations, and equations)

About this article

Cite this article

Canbek, G., Taskaya Temizel, T. & Sagiroglu, S. BenchMetrics: a systematic benchmarking method for binary classification performance metrics. Neural Comput & Applic 33, 14623–14650 (2021). https://doi.org/10.1007/s00521-021-06103-6
