Knowledge and Information Systems, Volume 28, Issue 3, pp 665–685

SimClus: an effective algorithm for clustering with a lower bound on similarity

  • Mohammad Al Hasan
  • Saeed Salem
  • Mohammed J. Zaki
Regular Paper

Abstract

Clustering algorithms generally accept a parameter k from the user, which determines the number of clusters sought. However, in many application domains, such as document categorization, social network clustering, and frequent pattern summarization, the proper value of k is difficult to guess. An alternative clustering formulation that does not require k is to impose a lower bound on the similarity between an object and its corresponding cluster representative. Such a formulation chooses exactly one representative for every cluster and minimizes the number of representatives. It has several additional benefits. For instance, it supports overlapping clusters in a natural way. Moreover, for every cluster it selects a representative object, which can be used effectively in summarization or semi-supervised classification tasks. In this work, we propose an algorithm, SimClus, for clustering with a lower bound on similarity. It achieves an O(log n) approximation bound on the number of clusters, whereas for the best previous algorithm the bound can be as poor as O(n). Experiments on real and synthetic data sets show that our algorithm produces more than 40% fewer representative objects, yet offers the same or better clustering quality. We also propose a dynamic variant of the algorithm, which can be used effectively in an on-line setting.
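
The formulation in the abstract can be illustrated with the textbook greedy set-cover heuristic, which is known to give a logarithmic approximation on the number of chosen sets. The sketch below is not the SimClus algorithm itself, whose details are not given here. It only assumes that each object can represent every object whose similarity to it is at least a threshold; the names `greedy_representatives`, `similarity`, and `beta` are hypothetical and chosen for illustration.

```python
from typing import Callable, Dict, Sequence, Set


def greedy_representatives(
    objects: Sequence,
    similarity: Callable[[object, object], float],
    beta: float,
) -> Dict[int, Set[int]]:
    """Greedy set-cover sketch (an assumption, not the paper's SimClus).

    Returns a mapping from representative index to the indices of all
    objects whose similarity to that representative is at least beta.
    """
    n = len(objects)
    # covers[i]: objects that object i could represent, itself included.
    covers = [
        {j for j in range(n)
         if i == j or similarity(objects[i], objects[j]) >= beta}
        for i in range(n)
    ]
    uncovered = set(range(n))
    clusters: Dict[int, Set[int]] = {}
    while uncovered:
        # Pick the object covering the most still-uncovered objects;
        # ties are broken in favour of the smallest index.
        best = max(range(n), key=lambda i: (len(covers[i] & uncovered), -i))
        # Keep the full neighbourhood, so clusters may overlap.
        clusters[best] = set(covers[best])
        uncovered -= covers[best]
    return clusters


if __name__ == "__main__":
    points = [0.0, 1.0, 2.0, 10.0, 11.0]
    sim = lambda a, b: 1.0 / (1.0 + abs(a - b))  # toy similarity in (0, 1]
    print(greedy_representatives(points, sim, beta=0.5))
```

Because every object covers itself, each iteration covers at least one new object and the loop terminates. Storing the full neighbourhood of each representative, rather than only the newly covered objects, is what lets an object belong to more than one cluster.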

Keywords

Dominating set, Overlapping clustering, Set cover, Star clustering

Copyright information

© Springer-Verlag London Limited 2010

Authors and Affiliations

  • Mohammad Al Hasan (1)
  • Saeed Salem (2)
  • Mohammed J. Zaki (3)

  1. Indiana University–Purdue University, Indianapolis, USA
  2. Department of Computer Science, North Dakota State University, Fargo, USA
  3. Department of Computer Science, Rensselaer Polytechnic Institute, Troy, USA
