Efficient Clustering of Web-Derived Data Sets

  • Luís Sarmento
  • Alexander Kehlenbeck
  • Eugénio Oliveira
  • Lyle Ungar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5632)


Many data sets derived from the web are large, high-dimensional, and sparse, and have a Zipfian distribution of both classes and features. On such data sets, current scalable clustering methods such as streaming clustering suffer from two problems: fragmentation, where large classes are incorrectly divided into many smaller clusters, and a significant drop in computational efficiency. We present a new clustering algorithm based on connected components that addresses both issues and so works well on web-type data.
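The abstract does not describe the algorithm itself, so the following is only a generic sketch of what clustering by connected components can look like, not the authors' method. It assumes each item is a sparse set of features, links two items when their Jaccard similarity reaches a threshold, and reports each connected component of the resulting graph as a cluster. The names `jaccard`, `cluster_by_components`, and `docs`, the threshold value, and the toy data are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cluster_by_components(items, threshold=0.5):
    """Return the connected components of a thresholded similarity graph.

    items: dict mapping an item id to a set of features (assumed layout).
    threshold: assumed minimum similarity for linking two items.
    """
    parent = {key: key for key in items}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Brute-force pairwise comparison: acceptable for a toy example only,
    # far too slow for web-scale data.
    for a, b in combinations(items, 2):
        if jaccard(items[a], items[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for key in items:
        clusters.setdefault(find(key), set()).add(key)
    return list(clusters.values())


# Toy data, invented for illustration.
docs = {
    "d1": {"apple", "fruit", "red"},
    "d2": {"apple", "fruit", "green"},
    "d3": {"fruit", "green", "pear"},
    "d4": {"car", "engine"},
    "d5": {"car", "engine", "wheel"},
}
clusters = cluster_by_components(docs, threshold=0.5)
```

In this toy run, `d1` and `d3` are not directly similar enough to be linked, yet they end up in the same cluster because both are linked to `d2`. That transitive merging is the general property that lets a component-based approach keep a large class together instead of fragmenting it.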





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Luís Sarmento (1)
  • Alexander Kehlenbeck (2)
  • Eugénio Oliveira (1)
  • Lyle Ungar (3)

  1. Faculdade de Engenharia da Universidade do Porto - DEI - LIACC, Porto, Portugal
  2. Google Inc, New York, NY, USA
  3. University of Pennsylvania - CS, Philadelphia, USA
