Pattern Analysis and Applications, Volume 11, Issue 3–4, pp 269–280

On the k-NN performance in a challenging scenario of imbalance and overlapping

Theoretical Advances

Abstract

A two-class data set is said to be imbalanced when one (minority) class is heavily under-represented with respect to the other (majority) class. In the presence of significant class overlap, learning from imbalanced data can be a very difficult problem. Moreover, if the overall imbalance ratio differs from the local imbalance ratios in the overlap regions, the task becomes a major challenge. This paper explains the behaviour of the k-nearest neighbour (k-NN) rule when learning from such a complex scenario. This local model is compared to other machine learning algorithms, examining how their behaviour depends on a number of data complexity features (global imbalance, size of the overlap region, and its local imbalance). As a result, several conclusions useful for classifier design are drawn.
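To make the scenario concrete, the sketch below (not taken from the paper; it simply assumes Python with NumPy and scikit-learn, and all parameter values are arbitrary) builds a synthetic two-class problem whose overall imbalance ratio differs from the local ratio inside the overlap region, and then scores a k-NN classifier for several values of k.

```python
# Illustrative sketch only, not the authors' experimental setup: a synthetic
# two-class problem where the overall imbalance ratio (10:1) differs from the
# local imbalance ratio inside the overlap region, evaluated with k-NN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)

# Majority class: 1000 samples centred at 0; minority class: 100 samples
# centred at 2, so the tails of the two Gaussians overlap.
X_maj = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
X_min = rng.normal(loc=2.0, scale=1.0, size=(100, 2))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 1000 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# In the overlap region (roughly between the two class centres) the local
# majority/minority ratio can differ markedly from the overall 10:1 ratio.
overlap = (X_tr[:, 0] > 0.5) & (X_tr[:, 0] < 1.5)
local_ratio = (y_tr[overlap] == 0).sum() / max((y_tr[overlap] == 1).sum(), 1)
print(f"local majority/minority ratio in overlap region: {local_ratio:.1f}")

# k-NN is a local model: its decision in the overlap region is driven by the
# local class proportions rather than the overall imbalance.
for k in (1, 3, 7):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, balanced_accuracy_score(y_te, clf.predict(X_te)))
```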

Keywords

Imbalanced data · Nearest neighbour rule · Class overlap · Local and global learning · Overall imbalance ratio · Local imbalance ratio

Copyright information

© Springer-Verlag London Limited 2007

Authors and Affiliations

  1. Laboratorio de Reconocimiento de Patrones, Instituto Tecnológico de Toluca, Av. Tecnológico s/n, Metepec, México
  2. Departament de Llenguatges i Sistemes Informàtics, Universitat Jaume I, Av. Vicent Sos Baynat s/n, Castelló, Spain
