Pattern Analysis and Applications, Volume 19, Issue 1, pp 93–109

RHC: a non-parametric cluster-based data reduction for efficient \(k\)-NN classification

  • Stefanos Ougiaroglou
  • Georgios Evangelidis
Theoretical Advances


Although the \(k\)-NN classifier is a popular classification method, it suffers from high computational cost and storage requirements. This paper proposes two effective cluster-based data reduction algorithms for efficient \(k\)-NN classification. Both have low preprocessing cost and can achieve high data reduction rates while maintaining \(k\)-NN classification accuracy at high levels. The first proposed algorithm, reduction through homogeneous clusters (RHC), is based on a fast preprocessing clustering procedure that creates homogeneous clusters. The centroids of these clusters constitute the reduced training set. The second proposed algorithm is a dynamic version of RHC that retains all its properties and, in addition, can manage datasets that cannot fit in main memory and is appropriate for dynamic environments where new training data become available gradually. Experimental results on fourteen datasets show that both algorithms are faster and achieve higher reduction rates than four known methods, while maintaining high classification accuracy.
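The abstract states only that RHC builds homogeneous clusters and keeps their centroids as the reduced training set. The following Python sketch illustrates one way such a procedure could look. It assumes that a non-homogeneous cluster is split by running k-means seeded with the mean of each class present in it; that splitting rule, the fallback for clusters that cannot be split, and all function names (`rhc`, `kmeans`, `predict_1nn`) are assumptions made for illustration and are not taken from the abstract.

```python
from collections import defaultdict


def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, seeds, max_iter=100):
    """Plain Lloyd k-means; returns the cluster index of each point."""
    centers = list(seeds)
    assign = None
    for _ in range(max_iter):
        new_assign = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean(members)
    return assign


def rhc(X, y):
    """Reduce (X, y) to one centroid per homogeneous cluster (sketch)."""
    pending = [(list(X), list(y))]
    protos, labels = [], []
    while pending:
        pts, lbs = pending.pop()
        if len(set(lbs)) == 1:
            # Homogeneous cluster: its centroid becomes a prototype.
            protos.append(mean(pts))
            labels.append(lbs[0])
            continue
        # Assumed splitting rule: k-means seeded with the class means.
        by_class = defaultdict(list)
        for p, l in zip(pts, lbs):
            by_class[l].append(p)
        assign = kmeans(pts, [mean(v) for v in by_class.values()])
        groups = defaultdict(list)
        for p, l, a in zip(pts, lbs, assign):
            groups[a].append((p, l))
        if len(groups) == 1:
            # Assumed fallback: the cluster cannot be split (e.g. identical
            # points with different labels), so keep one mean per class.
            for l, v in by_class.items():
                protos.append(mean(v))
                labels.append(l)
            continue
        for g in groups.values():
            pending.append(([p for p, _ in g], [l for _, l in g]))
    return protos, labels


def predict_1nn(x, protos, labels):
    """Classify x by its nearest prototype."""
    best = min(range(len(protos)), key=lambda i: sq_dist(x, protos[i]))
    return labels[best]
```

Calling `rhc(X, y)` on a list of feature tuples and a parallel list of labels returns the prototypes and their labels, which `predict_1nn` can then search in place of the full training set. The sketch keeps everything in memory, so it does not cover the dynamic variant described above.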


\(k\)-NN classification, Clustering, Data reduction, Prototypes



We are grateful to the anonymous reviewers for their valuable comments on the original form of the paper.



Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  1. Department of Applied Informatics, School of Information Sciences, University of Macedonia, Thessaloniki, Greece
