Machine Learning and Data Mining in Pattern Recognition
Volume 5632 of the series Lecture Notes in Computer Science, pp. 398-412
Efficient Clustering of Web-Derived Data Sets
- Luís Sarmento, affiliated with Faculdade de Engenharia da Universidade do Porto - DEI - LIACC
- Alexander Kehlenbeck, affiliated with Google Inc
- Eugénio Oliveira, affiliated with Faculdade de Engenharia da Universidade do Porto - DEI - LIACC
- Lyle Ungar, affiliated with University of Pennsylvania - CS
Abstract
Many data sets derived from the web are large, high-dimensional, sparse and have a Zipfian distribution of both classes and features. On such data sets, current scalable clustering methods such as streaming clustering suffer from fragmentation, where large classes are incorrectly divided into many smaller clusters, and computational efficiency drops significantly. We present a new clustering algorithm based on connected components that addresses these issues and so works well on web-type data.
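To make the idea of clustering by connected components concrete, here is a minimal Python sketch. It is not the authors' algorithm, whose details are not given on this page. It assumes items are sparse feature vectors stored as dicts, uses cosine similarity with a hypothetical `threshold` parameter to decide which items are linked, and treats each connected component of the resulting graph as one cluster. The function names `cosine` and `cluster_by_components` are illustrative only.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b[f] for f, w in a.items() if f in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_by_components(items, threshold=0.5):
    """Return clusters (lists of item indices) as connected components.

    Two items are linked when their cosine similarity reaches the
    threshold (an assumed parameter); union-find merges linked items.
    """
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Assumption: compare all pairs. This is quadratic and only
    # suitable for small illustrative inputs.
    for i, j in combinations(range(len(items)), 2):
        if cosine(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


items = [
    {"apple": 1, "fruit": 1},
    {"apple": 1, "fruit": 1, "red": 1},
    {"car": 1, "engine": 1},
    {"car": 1, "engine": 1, "road": 1},
]
print(cluster_by_components(items))
```

Because the sketch compares every pair of items, it does not scale to large web-derived data sets; an efficient method would need to avoid most of those comparisons.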
- Title: Efficient Clustering of Web-Derived Data Sets
- Book Title: Machine Learning and Data Mining in Pattern Recognition
- Book Subtitle: 6th International Conference, MLDM 2009, Leipzig, Germany, July 23-25, 2009. Proceedings
- Pages: pp 398-412
- Copyright: 2009
- DOI: 10.1007/978-3-642-03070-3_30
- Print ISBN: 978-3-642-03069-7
- Online ISBN: 978-3-642-03070-3
- Series Title: Lecture Notes in Computer Science
- Series Volume: 5632
- Series ISSN: 0302-9743
- Publisher: Springer Berlin Heidelberg
- Copyright Holder: Springer-Verlag Berlin Heidelberg
- Editors: Petra Perner (19)
- Editor Affiliations: 19. Institut für Bildverarbeitung und angewandte Informatik
- Authors: Luís Sarmento (20), Alexander Kehlenbeck (21), Eugénio Oliveira (20), Lyle Ungar (22)
- Author Affiliations:
  - 20. Faculdade de Engenharia da Universidade do Porto - DEI - LIACC, Rua Dr. Roberto Frias, s/n, 4200-465, Porto, Portugal
  - 21. Google Inc, New York, NY, USA
  - 22. University of Pennsylvania - CS, 504 Levine, 200 S. 33rd St, Philadelphia, PA, USA