
Clustering with Lower Bound on Similarity

  • Mohammad Al Hasan
  • Saeed Salem
  • Benjarath Pupacdi
  • Mohammed J. Zaki
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5476)

Abstract

We propose a new method, called SimClus, for clustering with a lower bound on similarity. Instead of taking k, the number of clusters to find, as input, this similarity-based approach imposes a lower bound on the similarity between each object and its cluster representative (with one representative per cluster). SimClus achieves an O(log n) approximation bound on the number of clusters, whereas the bound for the best previous algorithm can be as poor as O(n). Experiments on real and synthetic datasets show that our algorithm produces more than 40% fewer representative objects, yet offers the same or better clustering quality. We also propose a dynamic variant of the algorithm, which can be used effectively in an on-line setting.
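To make the problem setting concrete, the following is a minimal sketch of clustering under a similarity lower bound. It is not the SimClus algorithm itself, whose details the abstract does not give. It assumes the classical greedy set-cover heuristic (known to have a logarithmic approximation guarantee) as a stand-in, and the names `greedy_representatives`, `similarity`, and `beta` are hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def greedy_representatives(
    objects: Sequence[T],
    similarity: Callable[[T, T], float],
    beta: float,
) -> Dict[int, List[int]]:
    """Pick representatives so every object has similarity >= beta to one.

    Illustrative greedy set-cover heuristic, NOT the authors' SimClus.
    Returns a mapping: representative index -> indices of assigned objects.
    """
    n = len(objects)
    # covers[i] holds the indices that objects[i] could represent, i.e.
    # those at least beta-similar to it (every object covers itself).
    covers = [
        {j for j in range(n)
         if i == j or similarity(objects[i], objects[j]) >= beta}
        for i in range(n)
    ]
    uncovered = set(range(n))
    clusters: Dict[int, List[int]] = {}
    while uncovered:
        # Greedy step: choose the object covering the most uncovered objects.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        newly_covered = covers[best] & uncovered
        # Each object is assigned to the first representative that covers it.
        clusters[best] = sorted(newly_covered)
        uncovered -= newly_covered
    return clusters


if __name__ == "__main__":
    # Hypothetical 1-D data; similarity decreases linearly with distance.
    points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
    sim = lambda a, b: 1.0 - abs(a - b) / 10.0
    print(greedy_representatives(points, sim, beta=0.95))
```

Note that the number of clusters is an output here rather than an input: raising `beta` tightens the similarity requirement and generally yields more representatives.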



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Mohammad Al Hasan (1)
  • Saeed Salem (1)
  • Benjarath Pupacdi (2)
  • Mohammed J. Zaki (1)
  1. Department of Computer Science, Rensselaer Polytechnic Institute, Troy, USA
  2. Chulabhorn Research Institute, Laksi, Bangkok, Thailand
