Transformations and Selection Methods in Document Clustering

  • Kristóf Csorba
  • István Vajk

Document clustering is an important and widely researched part of information retrieval. It aims to assign natural language documents to categories based on some criterion. Here the criterion is the topic of the document: the goal is to identify the topics of documents and group similar ones together. Because many clustering methods and noise filtering techniques support this procedure, this paper focuses on composing such transformations and on comparing configurations built from subsets of them, with each transformation technique serving as a tile of the whole procedure. Altogether, five tile methods are used: term filtering, frequency quantizing, singular value decomposition (SVD), term clustering (for double clustering), and document clustering. The configurations are compared based on the maximal achieved F-measure and time consumption to find the best composition.
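The idea of building configurations from optional tiles can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the thresholds `min_docs` and `max_level`, the clipping-based quantization, and the class-weighted F-measure variant are all assumptions chosen for illustration, and the SVD and clustering tiles are omitted to keep the example dependency-free.

```python
from collections import Counter
from itertools import product

# Hypothetical tiles: each maps a document-term count matrix
# (a list of rows, one row per document) to a new matrix.

def term_filter(matrix, min_docs=2):
    """Drop terms (columns) that occur in fewer than min_docs documents."""
    if not matrix:
        return matrix
    keep = [j for j in range(len(matrix[0]))
            if sum(1 for row in matrix if row[j] > 0) >= min_docs]
    return [[row[j] for j in keep] for row in matrix]

def quantize(matrix, max_level=2):
    """Clip occurrence counts to the range 0..max_level (assumed scheme)."""
    return [[min(count, max_level) for count in row] for row in matrix]

OPTIONAL_TILES = [("term_filter", term_filter), ("quantize", quantize)]

def configurations():
    """Yield every composition in which each tile is applied or bypassed."""
    for mask in product([False, True], repeat=len(OPTIONAL_TILES)):
        yield [tile for tile, used in zip(OPTIONAL_TILES, mask) if used]

def apply_tiles(matrix, tiles):
    """Run the selected tiles in order on the matrix."""
    for _name, fn in tiles:
        matrix = fn(matrix)
    return matrix

def clustering_f_measure(true_labels, cluster_ids):
    """Class-size-weighted best F1 over clusters (one common variant)."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_labels, cluster_ids))
    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            common = overlap[(cls, clu)]
            if common == 0:
                continue
            precision = common / clu_size
            recall = common / cls_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += cls_size / n * best
    return total

if __name__ == "__main__":
    docs = [[3, 0, 1], [5, 0, 0], [0, 1, 2]]  # toy counts, not real data
    for tiles in configurations():
        print([name for name, _ in tiles], apply_tiles(docs, tiles))
```

In a fuller comparison, each configuration would end with a document clustering step, and the resulting cluster assignment would be scored with `clustering_f_measure` and timed, so that configurations can be ranked by both criteria.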


Singular Value Decomposition, Document Cluster, Tile Method, Bypass Mode, Occurrence Number
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  1. Department of Automation and Applied Informatics, Budapest University of Technology and Economics, Budapest, Hungary
