Class Confidence Weighted kNN Algorithms for Imbalanced Data Sets

  • Wei Liu
  • Sanjay Chawla
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6635)

Abstract

In this paper, a novel k-nearest neighbor (kNN) weighting strategy is proposed for handling the problem of class imbalance. When dealing with highly imbalanced data, a salient drawback of existing kNN algorithms is that the majority class tends to dominate the neighborhood of a test instance regardless of the distance measure used, which leads to suboptimal classification performance on the minority class. To solve this problem, we propose CCW (class confidence weights), which uses the probability of attribute values given class labels to weight prototypes in kNN. The main advantage of CCW is that it corrects the inherent bias toward the majority class in existing kNN algorithms under any distance measure. Theoretical analysis and comprehensive experiments confirm our claims.
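The abstract states that prototypes are weighted by the probability of their attribute values given their class labels, but does not fix a particular estimator. As a rough illustration of that weighting idea only, here is a minimal Python sketch (ours, not the authors' implementation): each of the k nearest prototypes votes with weight p(x | y), estimated here with a naive per-attribute Gaussian model. The class name CCWkNN, the estimator, and the toy data are assumptions for illustration.

```python
import numpy as np
from collections import Counter

class CCWkNN:
    """Minimal sketch of class-confidence-weighted kNN.

    Each of the k nearest prototypes votes with weight p(x_i | y_i),
    the likelihood of its attribute values given its class label.
    The per-attribute Gaussian estimator below is an assumption made
    for illustration; the abstract leaves the estimator open.
    """

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        # Per-class, per-attribute mean and std for the naive
        # Gaussian likelihood p(x | c).
        self.stats = {}
        for c in np.unique(self.y):
            Xc = self.X[self.y == c]
            self.stats[c] = (Xc.mean(axis=0), Xc.std(axis=0) + 1e-9)
        return self

    def _log_likelihood(self, x, c):
        mu, sd = self.stats[c]
        # log p(x | c) under attribute independence (constants dropped).
        return float(np.sum(-0.5 * ((x - mu) / sd) ** 2 - np.log(sd)))

    def predict_one(self, x):
        x = np.asarray(x, dtype=float)
        dist = np.linalg.norm(self.X - x, axis=1)   # plain Euclidean distance
        neighbors = np.argsort(dist)[: self.k]      # k nearest prototypes
        votes = Counter()
        for i in neighbors:
            c = self.y[i]
            # CCW idea: a prototype's vote is scaled by how typical it
            # is of its own class, so frequent-but-atypical majority
            # points no longer dominate the neighborhood.
            votes[c] += np.exp(self._log_likelihood(self.X[i], c))
        return votes.most_common(1)[0][0]

# Toy usage on 9:1 imbalanced 2-D data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)
print(CCWkNN(k=5).fit(X, y).predict_one([2.0, 2.0]))
```

Because each prototype's vote is scaled by its typicality within its own class rather than by class frequency, a majority-class neighbor that is atypical of its class contributes little, which is how the weighting counteracts the majority bias described above.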


Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Wei Liu (1)
  • Sanjay Chawla (1)
  1. School of Information Technologies, University of Sydney, Australia
