Clustering web documents using hierarchical representation with multi-granularity

Huang, Faliang; Zhang, Shichao; He, Minghua; Wu, Xindong

doi:10.1007/s11280-012-0197-x

Published: 15 January 2013

World Wide Web, Volume 17, pages 105–126 (2014)

Abstract

Web document cluster analysis plays an important role in information retrieval by organizing large numbers of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which takes into account only two levels of knowledge granularity (document and term) and ignores the bridging paragraph granularity. This two-level granularity may lead to unsatisfactory clustering results with “false correlation”. To deal with this problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of a five-layer representation of data and a two-phase clustering process, is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problem resulting from the sparse term-paragraph matrix, an ontology-based strategy and a tolerance-rough-set-based strategy are introduced into HRMM. By using granular computing, structural knowledge hidden in documents can be captured more efficiently and effectively in HRMM, and thus web document clusters of higher quality can be generated. Extensive experiments show that HRMM, HRMM with the tolerance-rough-set strategy, and HRMM with ontology all significantly outperform VSM and a representative non-VSM-based algorithm, WFP, in terms of the F-Score.
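
The two problems named in the abstract can be illustrated with a toy sketch. It is not the authors' HRMM: the abstract does not define "false correlation" or the model's layers, so the reading below (two documents look identical at the document-term level even though no pair of their paragraphs is close) is an assumption, as are the helper names `tf_vector`, `cosine`, `doc_similarity`, and `best_paragraph_similarity` and the made-up documents.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def doc_similarity(doc_a, doc_b):
    """Two-level (document, term) view: paragraphs are merged before comparing."""
    return cosine(tf_vector(" ".join(doc_a)), tf_vector(" ".join(doc_b)))

def best_paragraph_similarity(doc_a, doc_b):
    """Paragraph-level view (illustrative only): best cosine over paragraph pairs."""
    return max(cosine(tf_vector(p), tf_vector(q)) for p in doc_a for q in doc_b)

# Hypothetical documents, each given as a list of paragraphs.
doc_a = ["apple banana", "cherry grape"]
doc_b = ["apple cherry", "banana grape"]

# Same vocabulary overall, so the merged vectors coincide,
# yet every paragraph pair shares only one of its two terms.
doc_level = doc_similarity(doc_a, doc_b)
para_level = best_paragraph_similarity(doc_a, doc_b)

# Sparse short paragraphs with related meaning but no shared term
# get a similarity of exactly zero under plain term matching.
zero_sim = cosine(tf_vector("car engine"), tf_vector("automobile motor"))
```

In this toy case the document-level similarity is 1.0 while the best paragraph-level similarity is about 0.5, and the last pair scores 0.0 despite being near-synonymous. The abstract says that an ontology-based or tolerance-rough-set-based strategy addresses such zero values. Neither strategy is implemented here.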

Author information

Authors and Affiliations

Faculty of Software, Fujian Normal University, Cangshan District, 8 Shangsan Road, Fuzhou, 350007, China
Faliang Huang
College of Computer Science and IT, Guangxi Normal University, Guilin, 541004, PR China
Shichao Zhang
Computer Science, Aston University, Birmingham, Aston Triangle, B4 7ET, United Kingdom
Minghua He
Department of Computer Science, University of Vermont, 33 Colchester Avenue, Burlington, VT, 05405, USA
Xindong Wu
Faculty of Engineering and Information Technology, UTS, PO Box 123, Broadway, NSW, 2007, Australia
Shichao Zhang

Corresponding author

Correspondence to Shichao Zhang.

About this article

Cite this article

Huang, F., Zhang, S., He, M. et al. Clustering web documents using hierarchical representation with multi-granularity. World Wide Web 17, 105–126 (2014). https://doi.org/10.1007/s11280-012-0197-x

Received: 23 May 2012
Revised: 18 September 2012
Accepted: 06 December 2012
Published: 15 January 2013
Issue Date: January 2014
DOI: https://doi.org/10.1007/s11280-012-0197-x
