An effective web document clustering algorithm based on bisection and merge

Lee, Ingyu; On, Byung-Won

doi:10.1007/s10462-011-9203-4

Published: 18 January 2011

Artificial Intelligence Review, volume 36, pages 69–85 (2011)

Abstract

To cluster web documents, all of which contain the same named entities, we first tried existing clustering algorithms such as K-means and spectral clustering. Unexpectedly, these algorithms turned out to be ineffective on such documents. Our investigation found that clustering these web pages is harder than general clustering problems because (1) the number of clusters in the ground truth is larger than the two or three clusters typical of general clustering problems, and (2) the cluster sizes in the data set are extremely skewed. To overcome these problems, we propose a clustering algorithm that improves the accuracy of K-means and spectral clustering. In particular, to deal with skewed cluster sizes, our algorithm performs both bisection and merge steps based on normalized cuts of the similarity graph G. Our experimental results show that our algorithm improves performance by approximately 56% compared to spectral bisection and 36% compared to K-means.
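
As a rough illustration of the quantity that the bisection and merge steps rely on, the following Python sketch computes the normalized cut of a two-way partition of a similarity graph and performs a single spectral bisection using the Fiedler vector. It is not the algorithm proposed in the article: the function names, the toy matrix W, and the choice of the symmetric normalized Laplacian are assumptions made here for illustration, and the merge step is not shown.

```python
import numpy as np


def normalized_cut(W, mask):
    """Normalized cut of the bipartition (A, B) of a weighted graph.

    W is a symmetric similarity matrix; mask[i] is True when node i is in A.
    Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B),
    where vol(S) is the sum of the degrees of the nodes in S.
    """
    mask = np.asarray(mask, dtype=bool)
    cut = W[mask][:, ~mask].sum()   # total weight crossing the partition
    vol_a = W[mask].sum()           # sum of degrees of nodes in A
    vol_b = W[~mask].sum()          # sum of degrees of nodes in B
    if vol_a == 0 or vol_b == 0:
        return float("inf")         # one side is empty or isolated
    return cut / vol_a + cut / vol_b


def spectral_bisection(W):
    """Split a graph in two by the sign of the Fiedler vector.

    Assumes every node has positive degree.
    """
    degrees = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    # Eigenvector of the second-smallest eigenvalue, rescaled by D^{-1/2}
    fiedler = eigvecs[:, 1] * d_inv_sqrt
    return fiedler > 0


# Toy similarity graph (hypothetical data): two triangles joined by one weak edge.
W = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                W[i, j] = 1.0
W[2, 3] = W[3, 2] = 0.1

mask = spectral_bisection(W)
print("partition:", mask.astype(int))
print("normalized cut:", normalized_cut(W, mask))
```

On this toy graph the bisection separates the two triangles, and the normalized cut is small because only the weak edge crosses the partition. A bisect-and-merge scheme in the spirit of the abstract would apply such a split repeatedly and then merge parts using the same measure, but the exact criteria used in the article are not given in this abstract.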

Author information

Authors and Affiliations

Sorrell College of Business, Troy University, Troy, AL, USA
Ingyu Lee
Advanced Digital Sciences Center, Singapore, Singapore
Byung-Won On

Corresponding author

Correspondence to Ingyu Lee.

About this article

Cite this article

Lee, I., On, BW. An effective web document clustering algorithm based on bisection and merge. Artif Intell Rev 36, 69–85 (2011). https://doi.org/10.1007/s10462-011-9203-4

Published: 18 January 2011
Issue Date: June 2011
DOI: https://doi.org/10.1007/s10462-011-9203-4
