Clustering with Lower Bound on Similarity
We propose a new method, called SimClus, for clustering with a lower bound on similarity. Instead of taking the number of clusters k as input, this similarity-based approach imposes a lower bound on the similarity between each object and its cluster representative, with one representative per cluster. SimClus achieves an O(log n) approximation bound on the number of clusters, whereas the bound for the best previous algorithm can be as poor as O(n). Experiments on real and synthetic datasets show that our algorithm produces more than 40% fewer representative objects while offering the same or better clustering quality. We also propose a dynamic variant of the algorithm, which can be used effectively in an online setting.
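To make the problem statement concrete, the following is a minimal sketch of choosing representatives under a similarity lower bound. It is not the SimClus algorithm, whose details the abstract does not give; it is the generic greedy set-cover heuristic, which is known to carry a logarithmic approximation guarantee for set cover. The function name `greedy_representatives`, the `similarity` callable, and the `threshold` parameter are assumptions introduced for illustration.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


def greedy_representatives(
    objects: Sequence[T],
    similarity: Callable[[T, T], float],
    threshold: float,
) -> Dict[int, int]:
    """Map each object index to the index of its representative.

    Hypothetical helper (not from the paper): every object ends up with a
    representative whose similarity to it is at least `threshold`.
    """
    n = len(objects)
    # covers[i] holds the objects that i could represent under the bound.
    covers = [
        {j for j in range(n) if similarity(objects[i], objects[j]) >= threshold}
        for i in range(n)
    ]
    uncovered = set(range(n))
    assignment: Dict[int, int] = {}
    while uncovered:
        # Greedy step: pick the candidate covering the most uncovered objects.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        newly_covered = covers[best] & uncovered
        if not newly_covered:
            raise ValueError("an object does not meet the bound even with itself")
        for j in newly_covered:
            assignment[j] = best
        uncovered -= newly_covered
    return assignment


# Toy 1-D data; similarity is assumed to be 1 minus absolute difference.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
assignment = greedy_representatives(points, lambda a, b: 1.0 - abs(a - b), 0.85)
representatives = sorted(set(assignment.values()))
print(representatives)  # [1, 3]
```

The number of clusters is an output here rather than an input: it follows from the threshold, so raising the threshold can only keep or increase the number of representatives needed.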