Abstract
The \(k\)-NN classifier is a widely used classification algorithm. However, exhaustively searching the whole dataset for the nearest neighbors is prohibitive for large datasets because of its high computational cost. This paper proposes an efficient model for fast and accurate nearest neighbor classification. The model consists of a non-parametric, cluster-based preprocessing algorithm that constructs a two-level speed-up data structure, together with algorithms that access this structure to perform the classification. Furthermore, the paper demonstrates how the proposed model can improve performance on reduced sets built by various data reduction techniques. The proposed classification model was evaluated on eight real-life datasets and compared to known speed-up methods. The experimental results show that it is a fast and accurate classifier that, in addition, involves low preprocessing computational cost.
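To make the two-level idea concrete, the sketch below shows one simple way such a structure can work: level one holds cluster centroids, level two holds the items of each cluster, and a query is classified by exact \(k\)-NN restricted to its nearest cluster. This is a minimal illustration only, not the authors' algorithm: it uses plain k-means with a fixed number of clusters (the paper's preprocessing is non-parametric and builds homogeneous clusters), and all names and parameters here are illustrative.

```python
# Minimal sketch of a two-level cluster-based k-NN speed-up structure.
# NOT the paper's algorithm: uses fixed-k k-means and searches only the
# single nearest cluster, trading exactness for speed.
import numpy as np
from collections import Counter

def build_index(X, y, n_clusters=16, n_iter=20, seed=0):
    """X: (n, d) float array, y: (n,) labels.
    Returns level-1 centroids and level-2 per-cluster (items, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # assign every item to its nearest centroid (squared Euclidean)
        assign = np.argmin(
            ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centroids[c] = X[assign == c].mean(axis=0)
    # drop empty clusters so every centroid has members at level 2
    nonempty = [c for c in range(n_clusters) if np.any(assign == c)]
    clusters = [(X[assign == c], y[assign == c]) for c in nonempty]
    return centroids[nonempty], clusters

def classify(q, centroids, clusters, k=3):
    # level 1: find the nearest centroid; level 2: exact k-NN in its cluster
    c = int(np.argmin(((centroids - q) ** 2).sum(-1)))
    Xc, yc = clusters[c]
    nearest = np.argsort(((Xc - q) ** 2).sum(-1))[:min(k, len(Xc))]
    return Counter(yc[nearest].tolist()).most_common(1)[0][0]

# toy usage
X = np.random.default_rng(1).random((1000, 4))
y = (X.sum(axis=1) > 2).astype(int)
centroids, clusters = build_index(X, y)
print(classify(X[0], centroids, clusters, k=3))
```

Restricting the search to a single cluster avoids scanning the whole dataset but can miss true neighbors near cluster boundaries; avoiding that accuracy loss without giving up the speed-up is precisely what motivates the paper's more careful, homogeneous-cluster construction.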
Notes
Data reduction techniques (DRTs) can be viewed from two perspectives: (1) item reduction and (2) dimensionality reduction. We consider them from the first perspective.
Detailed experimental results are available at http://users.uom.gr/~stoug/AIRJ_experiments.zip.
Additional information
Stefanos Ougiaroglou is supported by the Greek State Scholarships Foundation (IKY).
Cite this article
Ougiaroglou, S., Evangelidis, G. Efficient \(k\)-NN classification based on homogeneous clusters. Artif Intell Rev 42, 491–513 (2014). https://doi.org/10.1007/s10462-013-9411-1