Optimal Rates for Nonparametric F-Score Binary Classification via Post-Processing

Abstract

This work studies the problem of binary classification with the F-score as the performance measure. We propose a post-processing algorithm for this problem which fits a threshold for any score-based classifier so as to yield a high F-score. The post-processing step involves only unlabeled data and can be performed in logarithmic time. We derive a general finite-sample post-processing bound for the proposed procedure and show that the procedure is minimax rate optimal when the underlying distribution satisfies classical nonparametric assumptions. This result improves upon previously known rates for F-score classification and bridges the gap between standard classification risk and the F-score. Finally, we discuss the generalization of this approach to set-valued classification.
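The threshold-fitting idea described above can be sketched in a few lines. The sketch below is an illustrative plug-in construction, not the paper's exact algorithm: the function name `fit_f1_threshold` and the estimator \(\widehat{F}(\theta) = 2\,\mathbb{E}[s\,\mathbf{1}\{s\ge\theta\}] / (\mathbb{P}(s\ge\theta) + \mathbb{E}[s])\) are assumptions based on the standard plug-in F-score identity, with the scores \(s\) standing in for estimates of the regression function \(\eta(\mathbf{x})\) on unlabeled data.

```python
import numpy as np

def fit_f1_threshold(scores):
    """Fit a threshold for a [0, 1]-valued score function using
    unlabeled data only, via the plug-in F-score estimate
    F(theta) ~= 2 E[s 1{s>=theta}] / (P(s>=theta) + E[s]).
    Candidate thresholds are the observed scores themselves."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]  # descending order
    # After sorting, cumulative sums give N * E[s 1{s >= s[k]}] and
    # N * P(s >= s[k]) for every candidate theta = s[k] in O(N) total.
    tp = np.cumsum(s)                      # N * E[s 1{s >= s[k]}]
    pp = np.arange(1, len(s) + 1)          # N * P(s >= s[k])
    ap = s.sum()                           # N * E[s]
    f1 = 2.0 * tp / (pp + ap)
    return float(s[np.argmax(f1)])

# Synthetic unlabeled scores in (0, 1), just for illustration.
rng = np.random.default_rng(0)
theta = fit_f1_threshold(rng.beta(2, 5, size=1000))
print(round(theta, 3))
```

The exhaustive scan above costs \(O(N\log N)\) for the sort; the logarithmic-time claim in the abstract refers to the paper's own search procedure, which this sketch does not reproduce.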

Notes

  1. The (weighted) harmonic mean of any real number and zero is defined as zero.

  2. It is assumed that \(\mathbf{X}_{n+1},\ldots,\mathbf{X}_{n+N}\) are independent of \((\mathbf{X}_{1},Y_{1}),\ldots,(\mathbf{X}_{n},Y_{n}),(\mathbf{X},Y)\).

  3. For this discussion let us assume that \(n\) is even.

  4. We only assumed that \(\mathbb{P}(Y=1)>0\), which is hardly restrictive in this context.

  5. Again it is assumed that the (weighted) harmonic mean of any real number and zero is zero.

  6. Indeed, note that \(R(\theta)=H_{1}(\theta)+H_{2}(\theta)\) with \(H_{1}(\theta)=b^{2}\theta\) being strictly increasing and \(H_{2}(\theta)=-\sum_{k=1}^{K}\mathbb{E}(\eta_{k}(\mathbf{x})-\theta)_{+}\) being non-decreasing.
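The decomposition in note 6 can be checked numerically: each map \(\theta \mapsto -(\eta_k - \theta)_{+}\) is non-decreasing, so adding the strictly increasing \(b^{2}\theta\) makes \(R\) strictly increasing. In the sketch below, the constant `b` and the distribution of the \(\eta_k\) are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

# Illustrative check: R(theta) = b^2 * theta - sum_k E(eta_k(x) - theta)_+
# is strictly increasing, being the sum of the strictly increasing
# H1(theta) = b^2 * theta and the non-decreasing H2(theta).
rng = np.random.default_rng(1)
b = 0.7                                        # arbitrary positive constant
eta = rng.uniform(0.0, 1.0, size=(5, 10_000))  # K = 5 classes, Monte Carlo draws

def R(theta):
    h1 = b**2 * theta
    # (eta_k - theta)_+ via clipping at zero; expectation via sample mean.
    h2 = -np.sum(np.mean(np.clip(eta - theta, 0.0, None), axis=1))
    return h1 + h2

thetas = np.linspace(0.0, 1.0, 201)
values = np.array([R(t) for t in thetas])
assert np.all(np.diff(values) > 0)             # strictly increasing on [0, 1]
```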


Corresponding author

Correspondence to Evgenii Chzhen.

About this article

Cite this article

Chzhen, E. Optimal Rates for Nonparametric F-Score Binary Classification via Post-Processing. Math. Meth. Stat. 29, 87–105 (2020). https://doi.org/10.3103/S1066530720020027


Keywords:

  • F-score
  • universal consistency
  • minimax rate
  • margin assumption
  • nonparametric classification