Efficient Clustering of Web-Derived Data Sets
Many data sets derived from the web are large, high-dimensional, and sparse, and have a Zipfian distribution of both classes and features. On such data sets, current scalable clustering methods such as streaming clustering suffer from two problems: fragmentation, in which large classes are incorrectly divided into many smaller clusters, and a significant drop in computational efficiency. We present a new clustering algorithm based on connected components that addresses both problems and therefore works well on web-type data.
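To make the idea of clustering by connected components concrete, the following is a minimal illustrative sketch, not the algorithm presented in the paper. It assumes each item is a sparse set of features, links two items whenever their Jaccard similarity reaches a threshold, and reports each connected component of the resulting graph as one cluster. The function names, the choice of Jaccard similarity, and the threshold value are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def connected_component_clusters(items, threshold):
    """Link every pair of items whose similarity is >= threshold, then
    return the connected components of that graph as lists of indices.

    Hypothetical helper for illustration only: the all-pairs loop is
    quadratic and is not meant to reflect a scalable implementation.
    """
    parent = list(range(len(items)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the root so output is deterministic.
            parent[max(ri, rj)] = min(ri, rj)

    for i, j in combinations(range(len(items)), 2):
        if jaccard(items[i], items[j]) >= threshold:
            union(i, j)

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy data: items 0-2 form a chain of overlapping sets, items 3-4 overlap.
docs = [
    {"a", "b", "c"},
    {"b", "c", "d"},
    {"c", "d", "e"},
    {"x", "y"},
    {"x", "y", "z"},
]
result = connected_component_clusters(docs, threshold=0.5)
print(result)
```

In this toy example, items 0 and 2 share only one feature and fall below the threshold when compared directly, yet they end up in the same cluster because both are linked to item 1. This transitive linking is what lets a component-based method keep a large class together. A version intended for web-scale data would have to avoid comparing all pairs; how that is done is not stated in the abstract.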