Text Document Topical Recursive Clustering and Automatic Labeling of a Hierarchy of Document Clusters

  • Xiaoxiao Li
  • Jiyang Chen
  • Osmar Zaiane
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7819)

Abstract

The overwhelming amount of textual documents available today highlights the need for information organization and discovery. Organizing documents effectively into a hierarchy of topics and subtopics makes them easier for users to browse. This paper borrows community mining from social network analysis to generate a hierarchy of topically coherent document clusters, and it focuses on giving those clusters descriptive labels. We propose using the betweenness centrality measure in networks of co-occurring terms to label the clusters. We also incorporate keyphrase extraction and automatic titling into cluster labeling. The results show that the labeling method that uses KEA to extract keyphrases from the documents generates the best labels overall compared with the other methods and baselines.
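
To make the labeling idea concrete, the following is a minimal sketch of ranking terms by betweenness centrality in a term co-occurrence network. It is not the authors' implementation: the whitespace tokenization, the rule that two terms are linked when they appear in the same document, and the helper names `cooccurrence_graph`, `betweenness`, and `label_cluster` are all assumptions made for illustration. Centrality is computed with Brandes' algorithm for unweighted graphs.

```python
from collections import defaultdict, deque
from itertools import combinations


def cooccurrence_graph(docs):
    """Undirected term graph: two terms are linked if they share a document.

    Assumption: terms are lowercase whitespace-separated tokens.
    """
    graph = defaultdict(set)
    for doc in docs:
        terms = set(doc.lower().split())
        for term in terms:
            graph[term]  # make sure isolated terms still appear as nodes
        for a, b in combinations(sorted(terms), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair of endpoints was counted from both sides.
    return {v: s / 2.0 for v, s in score.items()}


def label_cluster(docs, top_k=1):
    """Return the top_k terms by betweenness as candidate cluster labels."""
    scores = betweenness(cooccurrence_graph(docs))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


# Hypothetical toy cluster: "clustering" bridges all other terms.
cluster = [
    "clustering documents",
    "clustering labels",
    "clustering hierarchy",
]
print(label_cluster(cluster))
```

In this toy cluster the term that connects the otherwise unrelated terms receives the highest score, which is the intuition behind using betweenness for labeling. A realistic pipeline would presumably add stop-word removal, phrase detection, and edge weighting, none of which are shown here.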

Keywords

Text Mining and Web Mining; Cluster Labeling; Document Clustering

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Xiaoxiao Li (1)
  • Jiyang Chen (2)
  • Osmar Zaiane (1)
  1. Department of Computing Science, University of Alberta, Edmonton, Canada
  2. Google Canada, Kitchener, Canada
