Efficient editing and data abstraction by finding homogeneous clusters

  • Stefanos Ougiaroglou
  • Georgios Evangelidis


The efficiency of the k-Nearest Neighbour classifier depends on the size of the training set as well as the level of noise in it. Large datasets with a high level of noise lead to less accurate classifiers with high computational cost and storage requirements. The goal of editing is to improve accuracy by improving the quality of the training datasets. To obtain such datasets, editing removes noise and mislabeled data and smooths the decision boundaries between the discrete classes. Prototype abstraction, on the other hand, aims to reduce the computational cost and the storage requirements of classifiers by condensing the training data. This paper proposes an editing algorithm called Editing through Homogeneous Clusters (EHC). It then extends the idea by introducing a prototype abstraction algorithm that integrates the EHC mechanism and is capable of creating a small, noise-free representative set of the initial training data. This algorithm is called Editing and Reduction through Homogeneous Clusters (ERHC). Both are based on a fast and parameter-free iterative execution of k-means clustering that forms homogeneous clusters. Both treat clusters consisting of a single item as noise and remove them. In addition, ERHC summarizes the items of each remaining cluster by storing its mean item in the representative set. EHC and ERHC are tested on several datasets. The results show that both run very fast and achieve high accuracy. In addition, ERHC achieves high reduction rates.
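The mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how k-means is initialised or how k is chosen; the sketch assumes k equals the number of classes present in a cluster and seeds k-means with the class means. The function names (`kmeans`, `homogeneous_clusters`, `ehc`, `erhc`) and the fallback used when k-means cannot split a cluster are likewise assumptions made for illustration.

```python
from collections import deque

import numpy as np


def kmeans(X, centroids, max_iter=100):
    """Plain Lloyd's k-means from given seeds; returns a cluster index per row."""
    centroids = centroids.copy()
    assign = None
    for _ in range(max_iter):
        # Squared Euclidean distance of every item to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break
        assign = new_assign
        for j in range(len(centroids)):
            members = X[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return assign


def homogeneous_clusters(X, y):
    """Yield (item indices, class label) for each single-class cluster."""
    queue = deque([np.arange(len(X))])
    while queue:
        idx = queue.popleft()
        labels = np.unique(y[idx])
        if len(labels) == 1:
            yield idx, labels[0]
            continue
        # Assumption: one seed per class present, placed at the class mean.
        seeds = np.array([X[idx][y[idx] == c].mean(axis=0) for c in labels])
        assign = kmeans(X[idx], seeds)
        parts = [idx[assign == j] for j in range(len(labels))]
        parts = [p for p in parts if len(p)]
        if len(parts) == 1:
            # Safeguard (assumption, not from the source): if k-means cannot
            # split the cluster, split it by class label so the loop ends.
            parts = [idx[y[idx] == c] for c in labels]
        queue.extend(parts)


def ehc(X, y):
    """Editing: return indices of items not lying in single-item clusters."""
    keep = [idx for idx, _ in homogeneous_clusters(X, y) if len(idx) > 1]
    if not keep:
        return np.array([], dtype=int)
    return np.sort(np.concatenate(keep))


def erhc(X, y):
    """Editing and reduction: one mean prototype per remaining cluster."""
    kept = [(X[idx].mean(axis=0), label)
            for idx, label in homogeneous_clusters(X, y) if len(idx) > 1]
    prototypes = np.array([p for p, _ in kept])
    labels = np.array([label for _, label in kept])
    return prototypes, labels
```

Calling `ehc(X, y)` on a float feature matrix `X` and a label vector `y` returns the indices of the retained items, so the edited training set is `X[keep], y[keep]`. `erhc(X, y)` returns the condensed prototypes together with their class labels. Note that under these assumptions a correctly labeled item that ends up isolated next to a mislabeled one may also be removed as a single-item cluster.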


k-NN classification · Clustering · Data reduction · Data abstraction · Editing · Noise · Prototypes





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Department of Applied Informatics, School of Information Sciences, University of Macedonia, Thessaloniki, Greece
