Optimal Thresholding of Classifiers to Maximize F1 Measure

  • Zachary C. Lipton
  • Charles Elkan
  • Balakrishnan Naryanaswamy
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8725)


This paper provides new insight into maximizing F1 measures in the context of binary classification and also in the context of multilabel classification. The harmonic mean of precision and recall, the F1 measure is widely used to evaluate the success of a binary classifier when one class is rare. Micro average, macro average, and per instance average F1 measures are used in multilabel classification. For any classifier that produces a real-valued output, we derive the relationship between the best achievable F1 value and the decision-making threshold that achieves this optimum. As a special case, if the classifier outputs are well-calibrated conditional probabilities, then the optimal threshold is half the optimal F1 value. As another special case, if the classifier is completely uninformative, then the optimal behavior is to classify all examples as positive. When the actual prevalence of positive examples is low, this behavior can be undesirable. As a case study, we discuss the results, which can be surprising, of maximizing F1 when predicting 26,853 labels for Medline documents.


supervised learning text classification evaluation methodology F score F1 measure multilabel learning binary classification 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Akay, M.F.: Support vector machines combined with feature selection for breast cancer diagnosis. Expert Systems with Applications 36(2), 3240–3247 (2009)CrossRefGoogle Scholar
  2. 2.
    Capen, E.C., Clapp, R.V., Campbell, W.M.: Competitive bidding in high-risk situations. Journal of Petroleum Technology 23(6), 641–653 (1971)CrossRefGoogle Scholar
  3. 3.
    Cover, T.M., Thomas, J.A.: Elements of information theory. John Wiley & Sons (2012)Google Scholar
  4. 4.
    del Coz, J.J., Diez, J., Bahamonde, A.: Learning nondeterministic classifiers. Journal of Machine Learning Research 10, 2273–2293 (2009)zbMATHGoogle Scholar
  5. 5.
    Dembczynski, K., Kotłowski, W., Jachnik, A., Waegeman, W., Hüllermeier, E.: Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In: ICML (2013)Google Scholar
  6. 6.
    Dembczyński, K., Waegeman, W., Cheng, W., Hüllermeier, E.: An exact algorithm for F-measure maximization. In: Neural Information Processing Systems (2011)Google Scholar
  7. 7.
    Elkan, C.: The foundations of cost-sensitive learning. In: International Joint Conference on Artificial Intelligence, pp. 973–978 (2001)Google Scholar
  8. 8.
    Jansche, M.: A maximum expected utility framework for binary sequence labeling. In: Annual Meeting of the Association for Computational Linguistics, p. 736 (2007)Google Scholar
  9. 9.
    Lewis, D.D.: Evaluating and optimizing autonomous text classification systems. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 246–254. ACM (1995)Google Scholar
  10. 10.
    Madsen, R., Kauchak, D., Elkan, C.: Modeling word burstiness using the Dirichlet distribution. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 545–552 (August 2005)Google Scholar
  11. 11.
    Manning, C., Raghavan, P., Schütze, H.: Introduction to information retrieval, vol. 1. Cambridge University Press (2008)Google Scholar
  12. 12.
    Menon, A., Jiang, X., Vembu, S., Elkan, C., Ohno-Machado, L.: Predicting accurate probabilities with a ranking loss. In: Proceedings of the International Conference on Machine Learning (ICML) (June 2012)Google Scholar
  13. 13.
    Mozer, M.C., Dodier, R.H., Colagrosso, M.D., Guerra-Salcedo, C., Wolniewicz, R.H.: Prodding the ROC curve: Constrained optimization of classifier performance. In: NIPS, pp. 1409–1415 (2001)Google Scholar
  14. 14.
    Musicant, D.R., Kumar, V., Ozgur, A., et al.: Optimizing F-measure with support vector machines. In: FLAIRS Conference, pp. 356–360 (2003)Google Scholar
  15. 15.
    Sokolova, M., Lapalme, G.: A systematic analysis of performance measures for classification tasks. Information Processing and Management 45, 427–437 (2009)CrossRefGoogle Scholar
  16. 16.
    Suzuki, J., McDermott, E., Isozaki, H.: Training conditional random fields with multivariate evaluation measures. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pp. 217–224. Association for Computational Linguistics (2006)Google Scholar
  17. 17.
    Tan, S.: Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Systems with Applications 28, 667–671 (2005)CrossRefGoogle Scholar
  18. 18.
    Tsoumakas, G., Katakis, I.: Multi-label classification: An overview. International Journal of Data Warehousing and Mining 3(3), 1–13 (2007)CrossRefGoogle Scholar
  19. 19.
    Ye, N., Chai, K.M., Lee, W.S., Chieu, H.L.: Optimizing F-measures: A tale of two approaches. In: Proceedings of the International Conference on Machine Learning (2012)Google Scholar
  20. 20.
    Zhao, M.J., Edakunni, N., Pocock, A., Brown, G.: Beyond Fano’s inequality: Bounds on the optimal F-score, BER, and cost-sensitive risk and their implications. Journal of Machine Learning Research 14(1), 1033–1090 (2013)zbMATHMathSciNetGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Zachary C. Lipton
    • 1
  • Charles Elkan
    • 1
  • Balakrishnan Naryanaswamy
    • 1
  1. 1.University of CaliforniaSan Diego, La JollaUSA

Personalised recommendations