SPARCL: an effective and efficient algorithm for mining arbitrary shape-based clusters

Chaoji, Vineet; Al Hasan, Mohammad; Salem, Saeed; Zaki, Mohammed J.

doi:10.1007/s10115-009-0216-0

Regular Paper
Published: 03 June 2009

Volume 21, pages 201–229 (2009)
Knowledge and Information Systems

Abstract

Clustering is one of the fundamental data mining tasks. Many clustering paradigms have been developed over the years, including partitional, hierarchical, mixture-model-based, density-based, spectral, and subspace methods. This paper focuses on full-dimensional, arbitrarily shaped clusters. Existing methods for this problem suffer from high memory or time complexity (quadratic or even cubic), which has restricted them to datasets of moderate size. We propose SPARCL, a simple and scalable algorithm for finding clusters of arbitrary shapes and sizes with linear space and time complexity. SPARCL consists of two stages: the first runs a carefully initialized version of the K-means algorithm to generate many small seed clusters, and the second iteratively merges the generated clusters to obtain the final shape-based clusters. Experiments on a variety of datasets highlight the effectiveness, efficiency, and scalability of the approach. On large datasets, SPARCL is an order of magnitude faster than the best existing approaches.
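
The two-stage structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the initialization or the merging criterion, so this sketch assumes farthest-first seeding for stage 1 and a single-linkage merge over seed centers for stage 2. All function names (`seed_centers`, `seed_clusters`, `merge_seeds`, `two_stage_cluster`) and parameters are hypothetical. The merge step here is quadratic in the number of seeds, so the sketch makes no claim to the linear complexity reported for SPARCL.

```python
import math
import random


def seed_centers(points, k, rng):
    """Farthest-first seeding (assumed stand-in for the paper's initialization)."""
    centers = [points[rng.randrange(len(points))]]
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[far])
        nearest = [min(d, math.dist(p, points[far])) for d, p in zip(nearest, points)]
    return centers


def seed_clusters(points, k, rng, iters=10):
    """Stage 1: K-means with many centers; returns (labels, centers)."""
    centers = seed_centers(points, k, rng)
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def merge_seeds(centers, n_clusters):
    """Stage 2 (assumed criterion): merge closest groups of seed centers first."""
    parent = list(range(len(centers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = sorted(
        (math.dist(centers[i], centers[j]), i, j)
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )
    groups = len(centers)
    for _, i, j in pairs:
        if groups == n_clusters:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            groups -= 1
    return [find(i) for i in range(len(centers))]


def two_stage_cluster(points, n_seeds, n_clusters, seed=0):
    """Run both stages and map every point to its merged group id."""
    rng = random.Random(seed)
    labels, centers = seed_clusters(points, n_seeds, rng)
    group_of = merge_seeds(centers, n_clusters)
    return [group_of[lab] for lab in labels]


# Toy data: two long parallel segments, 200 points each, 60 units apart.
lower = [(float(x), 0.0) for x in range(200)]
upper = [(float(x), 60.0) for x in range(200)]
result = two_stage_cluster(lower + upper, n_seeds=100, n_clusters=2)
print(len(set(result[:200])), len(set(result[200:])), result[0] != result[200])
```

For this toy layout the K-means objective with k = 2 is lower for a cut across both segments than for the intended one-group-per-segment split, which is why the sketch oversegments first and merges afterwards. With the settings above, each segment is expected to come back as a single group.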

Author information

Authors and Affiliations

Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY, 12180, USA
Vineet Chaoji, Mohammad Al Hasan, Saeed Salem & Mohammed J. Zaki

Corresponding author

Correspondence to Vineet Chaoji.

Additional information

This work was supported in part by NSF Grants EMT-0829835 and CNS-0103708, and by NIH Grant 1R01EB0080161-01A1.

About this article

Cite this article

Chaoji, V., Al Hasan, M., Salem, S. et al. SPARCL: an effective and efficient algorithm for mining arbitrary shape-based clusters. Knowl Inf Syst 21, 201–229 (2009). https://doi.org/10.1007/s10115-009-0216-0

Received: 30 December 2008
Revised: 28 February 2009
Accepted: 03 April 2009
Published: 03 June 2009
Issue Date: November 2009
DOI: https://doi.org/10.1007/s10115-009-0216-0
