Text Document Topical Recursive Clustering and Automatic Labeling of a Hierarchy of Document Clusters

Li, Xiaoxiao; Chen, Jiyang; Zaiane, Osmar

doi:10.1007/978-3-642-37456-2_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7819)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

The overwhelming number of text documents available today highlights the need for information organization and discovery. Effectively organizing documents into a hierarchy of topics and subtopics makes it easier for users to browse them. This paper borrows community mining from social network analysis to generate a hierarchy of topically coherent document clusters, and it focuses on giving those clusters descriptive labels. We propose using the betweenness centrality measure in networks of co-occurring terms to label the clusters. We also incorporate keyphrase extraction and automatic titling into cluster labeling. The results show that the labeling method that uses KEA to extract keyphrases from the documents generates the best labels overall compared with the other methods and baselines.
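
The idea of ranking terms by betweenness centrality in a co-occurrence network can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes two terms co-occur when they appear in the same document, treats the network as unweighted, and uses Brandes' algorithm for the centrality scores. The function names `cooccurrence_graph`, `betweenness_centrality`, and `label_candidates` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque
from itertools import combinations


def cooccurrence_graph(docs):
    """Undirected graph: terms are nodes; an edge links terms sharing a document.

    Assumption: document-level co-occurrence, no edge weights.
    """
    graph = defaultdict(set)
    for terms in docs:
        unique = sorted(set(terms))
        for term in unique:
            graph[term]  # register the node even if it has no neighbours
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def betweenness_centrality(graph):
    """Unnormalized betweenness (Brandes' algorithm, unweighted, undirected)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {node: [] for node in graph}
        n_paths = {node: 0 for node in graph}
        dist = {node: -1 for node in graph}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, starting from the farthest nodes.
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += n_paths[v] / n_paths[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was visited from both endpoints, so halve the sums.
    return {node: value / 2.0 for node, value in score.items()}


def label_candidates(docs, top_k=3):
    """Return the top_k terms by betweenness; ties are broken alphabetically."""
    scores = betweenness_centrality(cooccurrence_graph(docs))
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_k]]
```

Calling `label_candidates` on the tokenized documents of one cluster returns the terms that most often lie on shortest paths between other terms, which is one plausible reading of why such terms could serve as descriptive labels. A real pipeline would likely filter stop words and might weight edges by co-occurrence frequency; both are left out here.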

Author information

Authors and Affiliations

Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
Xiaoxiao Li & Osmar Zaiane
Google Canada, Kitchener, Ontario, Canada
Jiyang Chen

Editor information

Editors and Affiliations

School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, BC, Canada
Jian Pei
Dept. of Computer Science and Information Engineering, Institute of Medical Informatics, National Cheng Kung University, Tainan, Taiwan
Vincent S. Tseng
Faculty of Engineering and Information Technology, University of Technology Sydney, Broadway, P.O. Box 123, 2007, Sydney, NSW, Australia
Longbing Cao & Guandong Xu
Asian Office of Aerospace Research and Development (AOARD), Air Force Office of Scientific Research (AFOSR), Air Force Research Laboratory USA, Osaka University, 7-23-17 Roppongi, 106-0032, Minato-ku, Tokyo, Japan
Hiroshi Motoda

About this paper

Cite this paper

Li, X., Chen, J., Zaiane, O. (2013). Text Document Topical Recursive Clustering and Automatic Labeling of a Hierarchy of Document Clusters. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, vol 7819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37456-2_17

DOI: https://doi.org/10.1007/978-3-642-37456-2_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37455-5
Online ISBN: 978-3-642-37456-2
eBook Packages: Computer Science; Computer Science (R0)