A New Document Clustering Algorithm for Topic Discovering and Labeling
Conference paper
Abstract
In this paper, we introduce a new clustering algorithm for obtaining labeled document clusters that accurately identify the topics of a text collection. To determine the topics, our approach relies on both probable term pairs generated from the collection and an estimate of the topic homogeneity associated with term pair clusters. Experimental results on two benchmark text collections demonstrate the utility of this new approach.
Keywords
document clustering, topic discovery, topic descriptions
Copyright information
© Springer-Verlag Berlin Heidelberg 2008