A Review of Web Document Clustering Approaches

Oikonomakou, N.; Vazirgiannis, M.

doi:10.1007/978-3-540-45219-5_6

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 138)

Abstract

The impact of the World Wide Web as a main source of information is increasing dramatically. However, the web search environment is not ideal. The abundance of information, combined with the dynamic and heterogeneous nature of the Web, makes information retrieval a difficult process for the average user. There is therefore a clear need for techniques that help users organize and browse the available information effectively, with the ultimate goal of satisfying their information need. Cluster analysis, which deals with the organization of a collection of objects into cohesive groups, can play a very important role in achieving this objective. In this paper, we present an exhaustive survey of web document clustering approaches available in the literature, classified into three main categories: text-based, link-based and hybrid. Furthermore, we present a thorough comparison of the algorithms based on the various facets of their features and functionality. Finally, from the review of the different approaches we conclude that although clustering has been a topic for the scientific community for three decades, there are still many open issues that call for more research.

Author information

Authors and Affiliations

Dept. of Informatics, Athens University of Economics & Business, Patision 76, 10434, Athens, Greece
N. Oikonomakou & M. Vazirgiannis

Editor information

Editors and Affiliations

Computer Technology Institute, Research Academic, 61 Riga Feraiou Str, 26221, Patras, Greece
Spiros Sirmakessis (Assistant Professor)

About this paper

Cite this paper

Oikonomakou, N., Vazirgiannis, M. (2004). A Review of Web Document Clustering Approaches. In: Sirmakessis, S. (eds) Text Mining and its Applications. Studies in Fuzziness and Soft Computing, vol 138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45219-5_6

DOI: https://doi.org/10.1007/978-3-540-45219-5_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-05780-9
Online ISBN: 978-3-540-45219-5
eBook Packages: Springer Book Archive
