
An analysis of how training data complexity affects the nearest neighbor classifiers

Theoretical Advances · Pattern Analysis and Applications

Abstract

The k-nearest neighbors (k-NN) classifier is one of the most popular supervised classification methods. It is simple, intuitive, and accurate in a great variety of real-world domains. Nevertheless, despite its simplicity and effectiveness, its practical use has historically been limited by high storage requirements and computational cost. Moreover, the performance of this classifier appears to be strongly sensitive to the complexity of the training data. In this context, we use several problem difficulty measures to characterize the behavior of the k-NN rule under particular conditions. More specifically, the present analysis focuses on data complexity measures that describe class overlap, feature space dimensionality, and class density, and on uncovering their relation to the practical accuracy of this classifier.




Acknowledgments

This work has been partially supported by grants DPI2006-15542 from the Spanish CICYT, GV04A-705 from Generalitat Valenciana and P1-1B2004-08 from Fundació Caixa Castelló-Bancaixa.

Author information


Corresponding author

Correspondence to J. S. Sánchez.


About this article

Cite this article

Sánchez, J.S., Mollineda, R.A. & Sotoca, J.M. An analysis of how training data complexity affects the nearest neighbor classifiers. Pattern Anal Applic 10, 189–201 (2007). https://doi.org/10.1007/s10044-007-0061-2

