A Review of Web Document Clustering Approaches

  • Conference paper
Text Mining and its Applications

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 138)

Abstract

The impact of the World Wide Web as a main source of information extraction is increasing dramatically. However, the web search environment is not ideal. The abundance of information, combined with the dynamic and heterogeneous nature of the Web, makes information retrieval a difficult process for the average user. There is therefore a valid need for techniques that help users effectively organize and browse the available information, with the ultimate goal of satisfying their information need. Cluster analysis, which deals with the organization of a collection of objects into cohesive groups, can play a very important role in achieving this objective. In this paper, we present an exhaustive survey of web document clustering approaches available in the literature, classified into three main categories: text-based, link-based and hybrid. Furthermore, we present a thorough comparison of the algorithms based on the various facets of their features and functionality. Finally, from the review of the different approaches we conclude that although clustering has been a topic for the scientific community for three decades, there are still many open issues that call for more research.




Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Oikonomakou, N., Vazirgiannis, M. (2004). A Review of Web Document Clustering Approaches. In: Sirmakessis, S. (ed.) Text Mining and its Applications. Studies in Fuzziness and Soft Computing, vol 138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45219-5_6

  • DOI: https://doi.org/10.1007/978-3-540-45219-5_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-05780-9

  • Online ISBN: 978-3-540-45219-5

  • eBook Packages: Springer Book Archive
