A New Document Clustering Algorithm for Topic Discovering and Labeling

  • Henry Anaya-Sánchez
  • Aurora Pons-Porrata
  • Rafael Berlanga-Llavori
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5197)


In this paper, we introduce a new clustering algorithm for obtaining labeled document clusters that accurately identify the topics of a text collection. To determine the topics, our approach relies on both probable term pairs generated from the collection and an estimate of the topic homogeneity associated with term pair clusters. Experimental results on two benchmark text collections demonstrate the utility of this new approach.
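The abstract does not say how the probable term pairs are generated. As a rough illustration only, the sketch below assumes that a pair's probability is the fraction of documents in which both terms co-occur, and keeps the pairs that reach a threshold. The function name `probable_term_pairs`, the `min_prob` parameter, and the whitespace tokenization are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def probable_term_pairs(docs, min_prob=0.5):
    """Map each sufficiently probable term pair to the documents containing it.

    Assumption (not from the paper): a pair's probability is the fraction
    of documents in which both of its terms appear.
    """
    n_docs = len(docs)
    support = {}
    for doc_id, text in enumerate(docs):
        # Naive tokenization: lowercase and split on whitespace.
        terms = sorted(set(text.lower().split()))
        for pair in combinations(terms, 2):
            support.setdefault(pair, set()).add(doc_id)
    # Keep only the pairs whose co-occurrence fraction reaches the threshold.
    return {
        pair: ids
        for pair, ids in support.items()
        if len(ids) / n_docs >= min_prob
    }


docs = [
    "stock market falls",
    "stock market rallies",
    "football match tonight",
]
pairs = probable_term_pairs(docs, min_prob=0.5)
```

Each surviving pair carries the set of documents that support it, so a pair could serve both as a candidate label and as the seed of a document cluster. Estimating the topic homogeneity of such clusters is outside the scope of this sketch.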


document clustering, topic discovery, topic descriptions



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Henry Anaya-Sánchez (1)
  • Aurora Pons-Porrata (1)
  • Rafael Berlanga-Llavori (2)
  1. Center for Pattern Recognition and Data Mining, Universidad de Oriente, Santiago de Cuba, Cuba
  2. Department of Languages and Computer Systems, Universitat Jaume I, Castelló, Spain
