Machine Learning and Data Mining in Pattern Recognition

Volume 5632 of the series Lecture Notes in Computer Science, pp. 398-412

Efficient Clustering of Web-Derived Data Sets

  • Luís Sarmento, affiliated with Faculdade de Engenharia da Universidade do Porto - DEI - LIACC
  • Alexander Kehlenbeck, affiliated with Google Inc
  • Eugénio Oliveira, affiliated with Faculdade de Engenharia da Universidade do Porto - DEI - LIACC
  • Lyle Ungar, affiliated with University of Pennsylvania - CS

Abstract

Many data sets derived from the web are large, high-dimensional, sparse and have a Zipfian distribution of both classes and features. On such data sets, current scalable clustering methods such as streaming clustering suffer from fragmentation, where large classes are incorrectly divided into many smaller clusters, and computational efficiency drops significantly. We present a new clustering algorithm based on connected components that addresses these issues and so works well on web-type data.
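
The abstract does not describe the algorithm itself, so the following is only a generic illustration of what clustering by connected components can look like, not the authors' method. It assumes items are sparse feature vectors stored as dictionaries, links two items when their cosine similarity reaches a threshold, and reports each connected component of the resulting graph as a cluster. The function names, the cosine measure, and the threshold value are all assumptions made for this sketch. The naive all-pairs comparison is quadratic and is exactly the kind of cost a scalable method would need to avoid.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def connected_component_clusters(items, threshold=0.5):
    """Group item indices into connected components of the similarity graph.

    Hypothetical helper: an edge joins i and j when cosine(i, j) >= threshold.
    Uses union-find so the graph never has to be stored explicitly.
    """
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive O(n^2) pairwise comparison; illustrative only.
    for i, j in combinations(range(len(items)), 2):
        if cosine(items[i], items[j]) >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    docs = [
        {"a": 1.0, "b": 1.0},
        {"a": 1.0, "b": 1.0, "c": 0.1},
        {"x": 1.0},
        {"x": 1.0, "y": 0.2},
        {"z": 1.0},
    ]
    print(connected_component_clusters(docs))
```

Because membership is decided by reachability rather than by distance to a single centroid, a large class whose members are linked only through chains of similar items still ends up in one cluster, which is one plausible reason such an approach could reduce the fragmentation the abstract describes.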