
Efficient editing and data abstraction by finding homogeneous clusters

Annals of Mathematics and Artificial Intelligence

Abstract

The efficiency of the k-Nearest Neighbour classifier depends on the size of the training set as well as on the level of noise in it. Large datasets with a high level of noise lead to less accurate classifiers with high computational cost and storage requirements. The goal of editing is to improve classification accuracy by improving the quality of the training dataset: editing removes noisy and mislabeled items and smooths the decision boundaries between the discrete classes. Prototype abstraction, on the other hand, aims to reduce the computational cost and storage requirements of the classifier by condensing the training data. This paper proposes an editing algorithm called Editing through Homogeneous Clusters (EHC) and then extends the idea with a prototype abstraction algorithm that integrates the EHC mechanism and builds a small, noise-free representative set of the initial training data. The latter algorithm is called Editing and Reduction through Homogeneous Clusters (ERHC). Both algorithms rest on a fast, parameter-free iterative execution of k-means clustering that forms homogeneous clusters, and both treat clusters consisting of a single item as noise and remove them. In addition, ERHC summarizes each remaining cluster by storing its mean item in the representative set. EHC and ERHC are evaluated on several datasets. The results show that both run very fast and achieve high accuracy; ERHC additionally achieves high reduction rates.
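The abstract describes the shared clustering mechanism concretely enough to sketch in code. The following is a minimal, hypothetical Python rendering of that description, not the authors' implementation: the function name erhc, the queue-based traversal, the choice of k as the number of classes present in a cluster, and the use of scikit-learn's KMeans are all illustrative assumptions layered on top of the abstract.

```python
from collections import deque

import numpy as np
from sklearn.cluster import KMeans  # assumed available; any k-means would do


def erhc(X, y, abstraction=True, random_state=0):
    """Edit (EHC) or edit-and-abstract (ERHC) a training set (X, y).

    Returns (X_out, y_out). With abstraction=False the surviving items
    themselves are returned (EHC-style); with abstraction=True each
    surviving cluster is replaced by its mean item (ERHC-style).
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    queue = deque([np.arange(len(X))])  # start with one cluster: the whole set
    kept, protos, proto_labels = [], [], []

    while queue:
        idx = queue.popleft()
        classes = np.unique(y[idx])
        if len(classes) == 1:            # homogeneous cluster
            if len(idx) == 1:            # singleton clusters are noise: drop
                continue
            if abstraction:              # ERHC: keep only the mean item
                protos.append(X[idx].mean(axis=0))
                proto_labels.append(classes[0])
            else:                        # EHC: keep the items themselves
                kept.extend(idx)
            continue
        # Non-homogeneous: split with k-means, k = number of classes present
        # (an assumption; the abstract only says "iterative k-means").
        km = KMeans(n_clusters=len(classes), n_init=10,
                    random_state=random_state).fit(X[idx])
        for c in range(len(classes)):
            sub = idx[km.labels_ == c]
            if len(sub):
                queue.append(sub)

    if abstraction:
        return np.array(protos), np.array(proto_labels)
    kept = np.array(kept, dtype=int)
    return X[kept], y[kept]
```

Under these assumptions, erhc(X_train, y_train, abstraction=False) would return an edited training set for a k-NN classifier, while abstraction=True would yield a much smaller set of mean prototypes; every cluster either becomes homogeneous (and is kept or summarized) or shrinks with each split, so the loop terminates.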



Author information


Corresponding author

Correspondence to Stefanos Ougiaroglou.

Rights and permissions

Reprints and permissions

About this article


Cite this article

Ougiaroglou, S., Evangelidis, G. Efficient editing and data abstraction by finding homogeneous clusters. Ann Math Artif Intell 76, 327–349 (2016). https://doi.org/10.1007/s10472-015-9472-8

