Abstract
Clustering is one of the fundamental data mining tasks. Many clustering paradigms have been developed over the years, including partitional, hierarchical, mixture-model-based, density-based, spectral, and subspace methods. This paper focuses on full-dimensional clusters of arbitrary shape. Existing methods for this problem suffer in either memory or time complexity (quadratic or even cubic), which has restricted them to datasets of moderate size. We propose SPARCL, a simple and scalable algorithm for finding clusters of arbitrary shape and size with linear space and time complexity. SPARCL consists of two stages: the first runs a carefully initialized version of the Kmeans algorithm to generate many small seed clusters, and the second iteratively merges these seed clusters to obtain the final shape-based clusters. Experiments on a variety of datasets highlight the effectiveness, efficiency, and scalability of the approach. On large datasets, SPARCL is an order of magnitude faster than the best existing approaches.
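The two-stage structure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how Kmeans is initialized or how seed clusters are compared for merging, so the choices below are assumptions made purely for illustration and should not be read as SPARCL itself: farthest-first seeding stands in for the careful initialization, and the merge step simply joins the groups whose seed centers are closest. All function names are hypothetical, and this simplified version makes no attempt to reproduce the linear complexity claimed for SPARCL.

```python
import math
import random


def farthest_first(points, k, rng):
    """Pick k spread-out starting centers (assumed stand-in for 'careful' seeding)."""
    centers = [points[rng.randrange(len(points))]]
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        i = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[i])
        nearest = [min(d, math.dist(p, points[i])) for d, p in zip(nearest, points)]
    return centers


def assign(points, centers):
    """Index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def seed_clusters(points, k, iters=10, seed=0):
    """Stage 1: Lloyd iterations producing many small seed clusters."""
    centers = farthest_first(points, k, random.Random(seed))
    for _ in range(iters):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty seed keeps its previous center
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assign(points, centers)


def merge_seeds(centers, n_clusters):
    """Stage 2 (assumed criterion): merge groups with the closest seed centers."""
    group = list(range(len(centers)))  # group id of each seed cluster
    pairs = sorted((math.dist(a, b), i, j)
                   for i, a in enumerate(centers)
                   for j, b in enumerate(centers) if i < j)
    remaining = len(centers)
    for _, i, j in pairs:
        if remaining == n_clusters:
            break
        if group[i] != group[j]:
            old, new = group[j], group[i]
            group = [new if g == old else g for g in group]
            remaining -= 1
    return group


def two_stage_cluster(points, n_seeds, n_clusters):
    """Run both stages and return one final cluster id per point."""
    centers, seed_of_point = seed_clusters(points, n_seeds)
    group = merge_seeds(centers, n_clusters)
    return [group[s] for s in seed_of_point]


# Toy data: two elongated, well-separated line segments of 40 points each.
points = [(x * 0.5, y) for y in (0.0, 30.0) for x in range(40)]
labels = two_stage_cluster(points, n_seeds=8, n_clusters=2)
```

In this toy run each segment is first covered by several small seed clusters, and the merge step then chains the seeds along a segment into one final cluster. The pairwise comparison in `merge_seeds` is quadratic in the number of seeds rather than in the number of points, which is one reason a sketch like this keeps the seed count small relative to the dataset.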
Additional information
This work was supported in part by NSF Grants EMT-0829835 and CNS-0103708, and by NIH Grant 1R01EB0080161-01A1.
Cite this article
Chaoji, V., Al Hasan, M., Salem, S. et al. SPARCL: an effective and efficient algorithm for mining arbitrary shape-based clusters. Knowl Inf Syst 21, 201–229 (2009). https://doi.org/10.1007/s10115-009-0216-0
DOI: https://doi.org/10.1007/s10115-009-0216-0