
On the k-NN performance in a challenging scenario of imbalance and overlapping

Theoretical Advances · Pattern Analysis and Applications

Abstract

A two-class data set is said to be imbalanced when one (minority) class is heavily under-represented with respect to the other (majority) class. In the presence of significant overlap between the classes, learning from imbalanced data can be a very difficult problem. Additionally, if the overall imbalance ratio differs from the local imbalance ratios in the overlap regions, the task can become a major challenge. This paper explains the behaviour of the k-nearest neighbour (k-NN) rule when learning from such a complex scenario. This local model is compared to other machine learning algorithms, examining how their behaviour depends on a number of data complexity features (global imbalance, size of the overlap region, and its local imbalance). As a result, several conclusions useful for classifier design are drawn.
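The following is a minimal sketch, not the authors' experimental setup, of the scenario the abstract describes: two overlapping Gaussian classes are generated with a global imbalance ratio, and a k-NN classifier is evaluated on each class separately for a small and a large neighbourhood size. All class sizes, distributions, and values of k are illustrative assumptions.

# Illustrative sketch only: synthetic two-class problem with global imbalance
# and class overlap; the local class ratio inside the overlap region differs
# from the global one. Parameter values are arbitrary assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Majority class: 1000 samples centred at the origin.
X_maj = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))

# Minority class: 100 samples, shifted so its support partially overlaps
# the majority class (global imbalance ratio 10:1).
X_min = rng.normal(loc=1.5, scale=1.0, size=(100, 2))

X = np.vstack([X_maj, X_min])
y = np.hstack([np.zeros(len(X_maj)), np.ones(len(X_min))])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Compare a very local neighbourhood (k=1) with a smoother one (k=15).
for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    pred = knn.predict(X_te)
    acc_min = (pred[y_te == 1] == 1).mean()   # minority-class accuracy
    acc_maj = (pred[y_te == 0] == 0).mean()   # majority-class accuracy
    print(f"k={k:2d}  minority acc={acc_min:.2f}  majority acc={acc_maj:.2f}")

On data of this kind, larger values of k tend to favour whichever class locally dominates the overlap region, which is the kind of behaviour the paper analyses.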



Acknowledgments

This work has been partially supported by grants DPI2006-15542 from the Spanish CICYT, CSD2007-00018 from the Spanish Ministry of Science and Education and SEP-2003-C02-44225 from the Mexican CONACyT.

Author information

Corresponding author

Correspondence to V. García.


About this article

Cite this article

García, V., Mollineda, R.A. & Sánchez, J.S. On the k-NN performance in a challenging scenario of imbalance and overlapping. Pattern Anal Applic 11, 269–280 (2008). https://doi.org/10.1007/s10044-007-0087-5
