Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation

  • Marina Sokolova
  • Nathalie Japkowicz
  • Stan Szpakowicz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4304)


Abstract

Different evaluation measures assess different characteristics of machine learning algorithms. The empirical evaluation of algorithms and classifiers is a matter of ongoing debate among researchers. Most measures in use today focus on a classifier's ability to identify classes correctly. We note other useful properties, such as failure avoidance or class discrimination, and we suggest measures to evaluate such properties. These measures (Youden's index, likelihood, discriminant power) are used in medical diagnosis. We show that they are interrelated, and we apply them to a case study from the field of electronic negotiations. We also list other learning problems which may benefit from the application of these measures.
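The three measures named in the abstract are defined in the medical-diagnosis literature in terms of a classifier's sensitivity and specificity. A minimal sketch in Python (the function name and dictionary keys are our own; the formulas follow the standard definitions: Youden's index = sensitivity + specificity − 1, LR+ = sensitivity/(1 − specificity), LR− = (1 − sensitivity)/specificity, and discriminant power DP = (√3/π)(ln X + ln Y) with X = sensitivity/(1 − sensitivity) and Y = specificity/(1 − specificity)):

```python
import math

def discriminant_measures(tp, fn, fp, tn):
    """Youden's index, likelihood ratios and discriminant power for a
    binary confusion matrix. Assumes an imperfect classifier on both
    classes, so no denominator below is zero."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    x = sensitivity / (1 - sensitivity)   # odds of detecting the positive class
    y = specificity / (1 - specificity)   # odds of detecting the negative class
    return {
        "youden": sensitivity + specificity - 1,
        "lr_plus": sensitivity / (1 - specificity),
        "lr_minus": (1 - sensitivity) / specificity,
        "dp": (math.sqrt(3) / math.pi) * (math.log(x) + math.log(y)),
    }

# Example: 80 true positives, 20 false negatives,
# 10 false positives, 90 true negatives
m = discriminant_measures(tp=80, fn=20, fp=10, tn=90)
```

On this example the classifier has sensitivity 0.8 and specificity 0.9, giving a Youden's index of 0.7 and a positive likelihood ratio of 8. As a rough rule of thumb in the medical literature, a discriminant power below 1 indicates poor class discrimination and above 3 good discrimination.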





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Marina Sokolova (1)
  • Nathalie Japkowicz (2)
  • Stan Szpakowicz (3)
  1. DIRO, Université de Montréal, Montreal, Canada
  2. SITE, University of Ottawa, Ottawa, Canada
  3. SITE, University of Ottawa, Ottawa, Canada; ICS, Polish Academy of Sciences, Warsaw, Poland
