Clustering web documents using hierarchical representation with multi-granularity

World Wide Web

Abstract

Web document cluster analysis plays an important role in information retrieval by organizing large numbers of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which considers only two levels of knowledge granularity (document and term) and ignores the paragraph granularity that bridges them. This two-level granularity can lead to unsatisfactory clustering results with “false correlation”. To address this problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of a five-layer representation of data and a two-phase clustering process, is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problem resulting from the sparse term-paragraph matrix, an ontology-based strategy and a tolerance-rough-set-based strategy are introduced into HRMM. By using granular computing, HRMM captures the structural knowledge hidden in documents more efficiently and effectively, and thus generates web document clusters of higher quality. Extensive experiments show that HRMM, HRMM with the tolerance-rough-set strategy, and HRMM with the ontology strategy all significantly outperform VSM and a representative non-VSM-based algorithm, WFP, in terms of the F-Score.
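The zero-valued similarity problem mentioned in the abstract can be illustrated with a small sketch. It is not the HRMM procedure, which the abstract does not specify; it only shows how a generic tolerance rough set enrichment can make two paragraphs with no shared terms comparable. The tolerance class of a term is assumed here to be the set of terms that co-occur with it in at least `theta` paragraphs, and a paragraph is replaced by its upper approximation. All function names, the toy data, and the use of binary term weights are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two paragraphs given as term sets (binary weights)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def tolerance_classes(paragraphs, theta=1):
    """Map each term to the terms co-occurring with it in >= theta paragraphs.

    Every term belongs to its own tolerance class (assumed definition).
    """
    cooc = {}
    for terms in paragraphs:
        for t, u in combinations(sorted(terms), 2):
            cooc[(t, u)] = cooc.get((t, u), 0) + 1
    vocab = set().union(*paragraphs)
    classes = {t: {t} for t in vocab}
    for (t, u), n in cooc.items():
        if n >= theta:
            classes[t].add(u)
            classes[u].add(t)
    return classes


def upper_approximation(terms, classes):
    """All vocabulary terms whose tolerance class overlaps the paragraph."""
    return {t for t, cls in classes.items() if cls & terms}


# Hypothetical toy corpus: each paragraph is a set of terms.
paragraphs = [
    {"cluster", "document"},
    {"document", "web"},
    {"web", "page"},
    {"granule", "rough"},
]

classes = tolerance_classes(paragraphs, theta=1)
upper0 = upper_approximation(paragraphs[0], classes)
upper2 = upper_approximation(paragraphs[2], classes)

print(cosine(paragraphs[0], paragraphs[2]))  # 0.0: no shared terms
print(round(cosine(upper0, upper2), 3))      # 0.667 after enrichment
```

Paragraphs 0 and 2 share no terms, so their raw similarity is zero, yet both are linked to "document" and "web" through paragraph 1. After enrichment their similarity becomes nonzero, while the unrelated paragraph 3 stays at zero against both.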


Author information

Corresponding author

Correspondence to Shichao Zhang.

About this article

Cite this article

Huang, F., Zhang, S., He, M. et al. Clustering web documents using hierarchical representation with multi-granularity. World Wide Web 17, 105–126 (2014). https://doi.org/10.1007/s11280-012-0197-x

  • DOI: https://doi.org/10.1007/s11280-012-0197-x
