Abstract
The World Wide Web is rapidly growing in importance as a primary source of information. The web search environment, however, is far from ideal. The abundance of information, combined with the dynamic and heterogeneous nature of the Web, makes information retrieval difficult for the average user. There is therefore a clear need for techniques that help users organize and browse the available information effectively, with the ultimate goal of satisfying their information need. Cluster analysis, which organizes a collection of objects into cohesive groups, can play a very important role in achieving this objective. In this paper, we present an exhaustive survey of the web document clustering approaches available in the literature, classified into three main categories: text-based, link-based and hybrid. Furthermore, we present a thorough comparison of the algorithms based on the various facets of their features and functionality. Finally, from the review of the different approaches we conclude that, although clustering has been a topic of research in the scientific community for three decades, there are still many open issues that call for more research.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Oikonomakou, N., Vazirgiannis, M. (2004). A Review of Web Document Clustering Approaches. In: Sirmakessis, S. (eds) Text Mining and its Applications. Studies in Fuzziness and Soft Computing, vol 138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45219-5_6
DOI: https://doi.org/10.1007/978-3-540-45219-5_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-05780-9
Online ISBN: 978-3-540-45219-5
eBook Packages: Springer Book Archive