SPARCL: an effective and efficient algorithm for mining arbitrary shape-based clusters

  • Vineet Chaoji
  • Mohammad Al Hasan
  • Saeed Salem
  • Mohammed J. Zaki
Regular Paper

Abstract

Clustering is one of the fundamental data mining tasks. Many clustering paradigms have been developed over the years, including partitional, hierarchical, mixture-model-based, density-based, spectral, and subspace methods. This paper focuses on full-dimensional, arbitrarily shaped clusters. Existing methods for this problem suffer from high memory or time complexity (quadratic or even cubic), which has restricted them to datasets of moderate size. We propose SPARCL, a simple and scalable algorithm for finding clusters of arbitrary shapes and sizes that has linear space and time complexity. SPARCL consists of two stages: the first runs a carefully initialized version of the Kmeans algorithm to generate many small seed clusters, and the second iteratively merges the generated clusters to obtain the final shape-based clusters. Experiments on a variety of datasets highlight the effectiveness, efficiency, and scalability of our approach. On large datasets, SPARCL is an order of magnitude faster than the best existing approaches.
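
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the initialization scheme or the merge criterion, so the sketch assumes farthest-first seeding for the Kmeans stage and a single-link gap between seed clusters for the merge stage. The function names (`seed_clusters`, `merge_clusters`) are hypothetical, and this naive version does not reproduce the linear time and space complexity claimed for SPARCL.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def seed_clusters(points, k, iters=10, rng=None):
    """Stage 1 (sketch): Kmeans producing up to k small seed clusters.

    Assumption: farthest-first seeding stands in for the paper's
    'careful initialization', which the abstract does not describe.
    Expects iters >= 1.
    """
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next center: the point farthest from its nearest chosen center.
        centers.append(
            max(points, key=lambda p: min(dist(p, c) for c in centers))
        )
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        groups = [g for g in groups if g]  # drop empty seed clusters
        centers = [mean(g) for g in groups]
    return groups


def merge_clusters(groups, target):
    """Stage 2 (sketch): merge the two closest clusters until `target` remain.

    Assumption: the single-link gap (smallest point-to-point distance)
    stands in for the paper's merge criterion.
    """
    groups = [list(g) for g in groups]
    while len(groups) > target:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = min(dist(p, q) for p in groups[i] for q in groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # j > i, so removing index j does not shift index i.
        groups[i].extend(groups.pop(j))
    return groups
```

Calling `merge_clusters(seed_clusters(points, k), target)` with `k` much larger than `target` mirrors the structure in the abstract: many small seed clusters first, then iterative merging into the final shape-based clusters.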

Keywords

Clustering, Spatial, Kmeans, Hierarchical, Linear time

Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  • Vineet Chaoji (1)
  • Mohammad Al Hasan (1)
  • Saeed Salem (1)
  • Mohammed J. Zaki (1)

  1. Computer Science Department, Rensselaer Polytechnic Institute, Troy, USA