Abstract
In this paper we present a clustering technique called DenClust that produces high-quality initial seeds through a deterministic process, without requiring user input on the number of clusters k or the radius of the clusters r. These seeds are then given to K-Means as its set of initial seeds to produce the final clusters. DenClust uses a density-based approach for initial seed selection. It calculates the density of each record, where the density of a record is the number of records that have their minimum distance with that record. This approach is expected to produce high-quality initial seeds for K-Means, resulting in high-quality clusters from a dataset. The performance of DenClust is compared with five existing techniques, namely CRUDAW, AGCUK, Simple K-Means (SK), Basic Farthest Point Heuristic (BFPH) and New Farthest Point Heuristic (NFPH), in terms of three external cluster evaluation criteria, namely F-Measure, Entropy and Purity, and two internal cluster evaluation criteria, namely Xie-Beni Index (XB) and Sum of Square Error (SSE). We use three natural datasets obtained from the UCI Machine Learning Repository. DenClust performs better than all five existing techniques in terms of all five evaluation criteria on all three datasets used in this study.
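The density idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes numerical attributes only, Euclidean distance, and reads "density" as the number of records whose nearest other record is the record in question. The greedy seed-picking rule, the `min_density` stopping threshold and all function names are hypothetical, introduced only for illustration; the abstract does not specify how DenClust turns densities into seeds.

```python
import numpy as np


def nearest_neighbor_density(X):
    """Return (density, nearest) under the assumed reading of 'density':
    density[j] counts the records whose nearest other record is j."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)          # a record is not its own neighbour
    nearest = dist.argmin(axis=1)           # ties resolved by lowest index
    density = np.bincount(nearest, minlength=len(X))
    return density, nearest


def select_seeds(X, min_density=2):
    """Hypothetical deterministic rule: repeatedly take the densest remaining
    record as a seed, then drop it and every record that points to it.
    Returns an empty list if no record reaches min_density."""
    density, nearest = nearest_neighbor_density(X)
    remaining = np.ones(len(X), dtype=bool)
    seeds = []
    while remaining.any():
        masked = np.where(remaining, density, -1)
        best = int(masked.argmax())
        if masked[best] < min_density:      # illustrative stopping rule only
            break
        seeds.append(best)
        remaining[best] = False
        remaining[nearest == best] = False
    return seeds


def kmeans(X, initial_centers, n_iter=100):
    """Plain Lloyd iterations started from the supplied seeds."""
    centers = np.asarray(initial_centers, dtype=float).copy()
    for _ in range(n_iter):
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Toy data: two small, well-separated groups of records.
X = np.array([[0, 0], [1, 0], [0, 1],
              [10, 10], [11, 10], [10, 11]], dtype=float)
seed_idx = select_seeds(X)
labels, centers = kmeans(X, X[seed_idx])
sse = float(((X - centers[labels]) ** 2).sum())   # Sum of Square Error
print(seed_idx, labels.tolist(), sse)
```

Because every step is deterministic, repeated runs on the same data give the same seeds and the same final clusters, which is the property the abstract emphasises. On larger or noisier data this simplified rule may return many seeds, so it should be read as an illustration of the idea rather than a substitute for the published method.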
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Rahman, M.A., Islam, M.Z., Bossomaier, T. (2014). DenClust: A Density Based Seed Selection Approach for K-Means. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2014. Lecture Notes in Computer Science(), vol 8468. Springer, Cham. https://doi.org/10.1007/978-3-319-07176-3_68
DOI: https://doi.org/10.1007/978-3-319-07176-3_68
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07175-6
Online ISBN: 978-3-319-07176-3
eBook Packages: Computer Science, Computer Science (R0)