Transformations and Selection Methods in Document Clustering
Document clustering is an important and widely researched part of information retrieval. It aims to assign natural language documents to categories according to some criterion. Here, the criterion is the topic of the document: the goal is to identify the topics of the documents and group similar ones together. Many clustering methods and noise filtering techniques support this procedure, so this paper focuses on how such transformations can be composed and compares configurations built from subsets of these transformation techniques, which serve as tiles of the whole procedure. Altogether five tile methods are used: term filtering, frequency quantizing, singular value decomposition (SVD), term clustering (for double clustering), and document clustering itself. The configurations are compared on the maximal achieved F-measure and on time consumption to find the best composition.
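The idea of composing tiles can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the document-frequency threshold, the quantization levels, and the toy data are all assumptions made for illustration, and the SVD and clustering tiles are left out for brevity. The F-measure shown is one common clustering variant (for each true class, the best F1 over all clusters, weighted by class size); the abstract does not state which definition the paper uses.

```python
from collections import Counter


def filter_terms(docs, min_df=2):
    """Term filtering (assumed rule): keep terms found in at least min_df documents."""
    df = Counter(term for doc in docs for term in set(doc))
    return [[t for t in doc if df[t] >= min_df] for doc in docs]


def quantize_frequencies(docs, levels=(1, 2, 4)):
    """Frequency quantizing (assumed rule): map a raw count to the number of thresholds it reaches."""
    quantized = []
    for doc in docs:
        counts = Counter(doc)
        quantized.append({t: sum(c >= lv for lv in levels) for t, c in counts.items()})
    return quantized


def compose(*tiles):
    """Chain tiles so that the output of one becomes the input of the next."""
    def pipeline(data):
        for tile in tiles:
            data = tile(data)
        return data
    return pipeline


def clustering_f_measure(labels, clusters):
    """Class-size-weighted best F1 per true class (one common definition, assumed here)."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            overlap = joint[(cls, clu)]
            if overlap == 0:
                continue
            precision = overlap / n_clu
            recall = overlap / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["svd", "matrix", "matrix", "rank"],
    ["svd", "matrix", "noise"],
    ["cluster", "topic", "topic"],
    ["cluster", "topic", "rare"],
]
pipeline = compose(filter_terms, quantize_frequencies)
vectors = pipeline(docs)

# Score a hypothetical cluster assignment against the true topics.
score = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 1, 1])
```

Because every tile takes the output of the previous one, a different configuration is just a different argument list to `compose`, which is what makes it possible to compare configurations on quality and running time.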