Skip to main content

Handling Class Imbalance in k-Nearest Neighbor Classification by Balancing Prior Probabilities

  • Conference paper
  • First Online:
Similarity Search and Applications (SISAP 2021)

Abstract

It is well known that recall rather than precision is the performance measure to optimize in imbalanced classification problems, yet most existing methods that adjust for class imbalance do not particularly address the optimization of recall. Here we propose an elegant and straightforward variation of the k-nearest neighbor classifier to balance imbalanced classification problems internally in a probabilistic interpretation and show how this relates to the optimization of the recall. We evaluate this novel method against popular k-nearest neighbor-based class imbalance handling algorithms and compare them to general oversampling and undersampling techniques. We demonstrate that the performance of the proposed method is on par with SMOTE yet our method is much simpler and outperforms several competitors over a large selection of real-world and synthetic datasets and parameter choices while having the same complexity as the regular k-nearest neighbor classifier.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 64.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 84.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple Valued Log. Soft Comput. 17(2–3), 255–287 (2011)

    Google Scholar 

  2. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor. 6(1), 20–29 (2004)

    Google Scholar 

  3. Bellinger, C., Drummond, C., Japkowicz, N.: Beyond the boundaries of SMOTE. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds.) ECML PKDD 2016. LNCS (LNAI), vol. 9851, pp. 248–263. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46128-1_16

    Chapter  Google Scholar 

  4. Bellinger, C., Sharma, S., Japkowicz, N., Zaïane, O.R.: Framework for extreme imbalance classification: SWIM - sampling with the majority class. Knowl. Inf. Syst. 62(3), 841–866 (2020)

    Article  Google Scholar 

  5. Chapelle, O., Schölkopf, B., Zien, A.: Introduction to semi-supervised learning. In: Semi-Supervised Learning, pp. 1–12. The MIT Press (2006)

    Google Scholar 

  6. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)

    Article  Google Scholar 

  7. Cover, T.M., Hart, P.E.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967)

    Article  Google Scholar 

  8. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: Cohen, W.W., Moore, A.W. (eds.) Proceedings of ICML, pp. 233–240 (2006). https://doi.org/10.1145/1143844.1143874

  9. Domingos, P.M.: MetaCost: a general method for making classifiers cost-sensitive. In: KDD, pp. 155–164. ACM (1999)

    Google Scholar 

  10. Dubey, H., Pudi, V.: Class based weighted K-nearest neighbor over imbalance dataset. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013. LNCS (LNAI), vol. 7819, pp. 305–316. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37456-2_26

    Chapter  Google Scholar 

  11. Elkan, C.: The foundations of cost-sensitive learning. In: IJCAI, pp. 973–978. Morgan Kaufmann (2001)

    Google Scholar 

  12. Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 878–887. Springer, Heidelberg (2005). https://doi.org/10.1007/11538059_91

    Chapter  Google Scholar 

  13. Hand, D.J.: Measuring classifier performance: a coherent alternative to the area under the ROC curve. Mach. Learn. 77(1), 103–123 (2009). https://doi.org/10.1007/s10994-009-5119-5

    Article  MATH  Google Scholar 

  14. He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: IJCNN, pp. 1322–1328. IEEE (2008)

    Google Scholar 

  15. Japkowicz, N.: The class imbalance problem: significance and strategies. In: Proceedings of the International Conference on Artificial Intelligence, vol. 56 (2000)

    Google Scholar 

  16. Japkowicz, N.: Assessment metrics for imbalanced learning. In: He, H., Ma, Y. (eds.) Imbalanced Learning: Foundations, algorithms, and applications, chap. 8, pp. 187–206. Wiley, Hoboken (2013)

    Google Scholar 

  17. Kriminger, E., Príncipe, J.C., Lakshminarayan, C.: Nearest neighbor distributions for imbalanced classification. In: IJCNN, pp. 1–5. IEEE (2012)

    Google Scholar 

  18. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: One-sided selection. In: ICML, pp. 179–186. Morgan Kaufmann (1997)

    Google Scholar 

  19. Lemaitre, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18, 17:1–17:5 (2017)

    Google Scholar 

  20. Liu, W., Chawla, S.: Class confidence weighted kNN algorithms for imbalanced data sets. In: Huang, J.Z., Cao, L., Srivastava, J. (eds.) PAKDD 2011. LNCS (LNAI), vol. 6635, pp. 345–356. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20847-8_29

    Chapter  Google Scholar 

  21. Ntoutsi, E., et al.: Bias in data-driven artificial intelligence systems - an introductory survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 10(3), e1356 (2020)

    Article  Google Scholar 

  22. Pedregosa, F.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

    MathSciNet  MATH  Google Scholar 

  23. Qin, Z., Wang, A.T., Zhang, C., Zhang, S.: Cost-sensitive classification with k-nearest neighbors. In: Wang, M. (ed.) KSEM 2013. LNCS (LNAI), vol. 8041, pp. 112–131. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39787-5_10

    Chapter  Google Scholar 

  24. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, Burlington (1993)

    Google Scholar 

  25. Rodieck, R.W.: The density recovery profile: a method for the analysis of points in the plane applicable to retinal studies. Vis. Neurosci. 6(2), 95–111 (1991)

    Article  Google Scholar 

  26. Siddappa, N.G., Kampalappa, T.: Imbalance data classification using local Mahalanobis distance learning based on nearest neighbor. SN Comput. Sci. 1(2), 76 (2020)

    Article  Google Scholar 

  27. Song, Y., Huang, J., Zhou, D., Zha, H., Giles, C.L.: IKNN: informative K-nearest neighbor pattern classification. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 248–264. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74976-9_25

    Chapter  Google Scholar 

  28. Zhang, S.: Cost-sensitive KNN classification. Neurocomputing 391, 234–242 (2020)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Jonatan Møller Nuutinen Gøttcke or Arthur Zimek .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Gøttcke, J.M.N., Zimek, A. (2021). Handling Class Imbalance in k-Nearest Neighbor Classification by Balancing Prior Probabilities. In: Reyes, N., et al. Similarity Search and Applications. SISAP 2021. Lecture Notes in Computer Science(), vol 13058. Springer, Cham. https://doi.org/10.1007/978-3-030-89657-7_19

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-89657-7_19

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89656-0

  • Online ISBN: 978-3-030-89657-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics