Abstract
The efficiency of the k-Nearest Neighbour classifier depends on the size of the training set as well as on the level of noise in it. Large, noisy training sets lead to less accurate classifiers with high computational cost and storage requirements. The goal of editing is to improve accuracy by improving the quality of the training data. To obtain such data, editing removes noise and mislabeled items and smooths the decision boundaries between the discrete classes. Prototype abstraction, on the other hand, aims to reduce the computational cost and the storage requirements of classifiers by condensing the training data. This paper proposes an editing algorithm called Editing through Homogeneous Clusters (EHC). It then extends the idea by introducing a prototype abstraction algorithm that integrates the EHC mechanism and is capable of creating a small, noise-free representative set of the initial training data. This algorithm is called Editing and Reduction through Homogeneous Clusters (ERHC). Both are based on a fast, parameter-free, iterative execution of k-means clustering that forms homogeneous clusters. Both treat clusters consisting of a single item as noise and remove them. In addition, ERHC summarizes the items of each remaining cluster by storing its mean item in the representative set. EHC and ERHC are tested on several datasets. The results show that both run very fast and achieve high accuracy. In addition, ERHC achieves high reduction rates.
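The mechanism the abstract describes can be sketched roughly as follows. This is a minimal Python illustration, not the authors' implementation: the choice of seeding each k-means call with the per-class mean items, with k equal to the number of classes present in the cluster, is assumed from the authors' related RHC work, and all function names are ours.

```python
import numpy as np

def kmeans(X, seeds, iters=10):
    """Basic Lloyd's k-means starting from the given seed centroids."""
    centroids = seeds.copy()
    for _ in range(iters):
        # assign every item to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):          # skip empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

def homogeneous_clusters(X, y):
    """Iteratively split non-homogeneous clusters with k-means until every
    cluster contains items of a single class (the EHC/ERHC mechanism)."""
    queue = [np.arange(len(X))]
    done = []
    while queue:
        idx = queue.pop()
        classes = np.unique(y[idx])
        if len(classes) == 1:
            done.append(idx)                 # homogeneous: finalize
            continue
        # seed with per-class mean items; k = number of classes present
        seeds = np.array([X[idx][y[idx] == c].mean(axis=0) for c in classes])
        labels = kmeans(X[idx], seeds)
        parts = [idx[labels == j] for j in range(len(seeds))]
        parts = [p for p in parts if len(p) > 0]
        if len(parts) == 1:
            # degenerate split (e.g. identical seeds): separate by class
            parts = [idx[y[idx] == c] for c in classes]
        queue.extend(parts)
    return done

def ehc(X, y):
    """EHC: keep the original items, dropping single-item clusters as noise."""
    keep = np.concatenate([c for c in homogeneous_clusters(X, y) if len(c) > 1])
    return X[keep], y[keep]

def erhc(X, y):
    """ERHC: additionally summarize each kept cluster by its mean item."""
    protos, labels = [], []
    for c in homogeneous_clusters(X, y):
        if len(c) > 1:                       # singletons are removed as noise
            protos.append(X[c].mean(axis=0))
            labels.append(y[c[0]])
    return np.array(protos), np.array(labels)
```

On two well-separated classes with one mislabeled item planted among the other class, `ehc` returns the data minus that item, while `erhc` collapses each class to a single mean prototype, illustrating why ERHC achieves high reduction rates on top of EHC's noise removal.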
Ougiaroglou, S., Evangelidis, G. Efficient editing and data abstraction by finding homogeneous clusters. Ann Math Artif Intell 76, 327–349 (2016). https://doi.org/10.1007/s10472-015-9472-8