Clustering Web Documents Based on Knowledge Granularity

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3841)

Abstract

We propose a new data model for Web document representation based on granular computing, named the Expanded Vector Space Model (EVSM). Traditional Web document clustering relies on two levels of knowledge granularity: document and term. This can produce “falsely relevant” clustering results. In our approach, Web documents are represented at multiple levels of knowledge granularity. Knowledge granularity expressed in sufficiently conceptual sentences helps knowledge engineers understand valuable relations hidden in the data. With granularity calculation, data can be processed more efficiently and effectively, and knowledge engineers can handle the same dataset at different knowledge levels. This provides a more reliable basis for interpreting the results of various data analysis methods. We experimentally evaluate the proposed approach and demonstrate that our algorithm is promising and efficient.
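
The abstract does not say how EVSM is built, so the following is only an illustrative sketch of representing one document at several granularity levels, not the authors' model. It assumes three hypothetical levels (document, paragraph, sentence), plain term-frequency vectors, and cosine similarity; the function names `granules`, `term_vectors`, and `cosine` are invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def granules(document):
    """Split a document into units at three assumed granularity levels.

    Paragraphs are separated by blank lines; sentences end in '.', '!' or '?'.
    These splitting rules are assumptions made for illustration only.
    """
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    sentences = [
        s.strip()
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p)
        if s.strip()
    ]
    return {"document": [document], "paragraph": paragraphs, "sentence": sentences}


def term_vectors(document):
    """Build one term-frequency vector per unit at every granularity level."""
    return {
        level: [Counter(tokenize(unit)) for unit in units]
        for level, units in granules(document).items()
    }


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


doc = (
    "Rough sets model vagueness. Granules group objects.\n\n"
    "Clustering groups similar documents."
)
vectors = term_vectors(doc)
for level, vecs in vectors.items():
    print(level, len(vecs))
```

Comparing two documents only through their document-level vectors corresponds to the two-level (document and term) view described in the abstract. Comparing paragraph-level or sentence-level vectors as well is one way shared terms that occur in unrelated contexts could be told apart, although whether EVSM does this is not stated here.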

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Huang, F., Zhang, S. (2006). Clustering Web Documents Based on Knowledge Granularity. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_9

  • DOI: https://doi.org/10.1007/11610113_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-31142-3

  • Online ISBN: 978-3-540-32437-9

  • eBook Packages: Computer Science, Computer Science (R0)
