Artificial Intelligence Review, Volume 36, Issue 1, pp 69–85

An effective web document clustering algorithm based on bisection and merge

Article

Abstract

To cluster web documents that all share the same name entities, we first tried existing clustering algorithms such as K-means and spectral clustering. Unexpectedly, these algorithms turned out to be ineffective for this task. Our investigation found that clustering such web pages is harder than general clustering problems for two reasons: (1) the number of clusters in the ground truth is larger than the two or three typical of general clustering problems, and (2) the cluster sizes in the data set follow extremely skewed distributions. To overcome these problems, we propose an effective clustering algorithm that boosts the accuracy of K-means and spectral clustering. In particular, to handle the skewed distribution of cluster sizes, our algorithm performs both bisection and merge steps based on normalized cuts of the similarity graph G. Our experimental results show that our algorithm improves performance by approximately 56% compared to spectral bisection and 36% compared to K-means.
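
The abstract names two building blocks, normalized cuts and spectral bisection of a similarity graph G. The sketch below illustrates the textbook versions of both with NumPy. It is not the authors' algorithm and omits their merge step. The function names `normalized_cut` and `spectral_bisect` and the toy matrix `W` are assumptions made for illustration only.

```python
import numpy as np


def normalized_cut(W, mask):
    """Normalized cut of the bipartition (mask, ~mask).

    W is a symmetric, non-negative similarity matrix; mask is a boolean
    array marking the nodes on one side of the split.
    """
    cut = W[mask][:, ~mask].sum()   # total weight crossing the split
    assoc_a = W[mask].sum()         # total weight attached to side A
    assoc_b = W[~mask].sum()        # total weight attached to side B
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")         # degenerate split: one side is empty
    return cut / assoc_a + cut / assoc_b


def spectral_bisect(W):
    """Split the nodes by the sign of the Fiedler vector.

    Uses the symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2}, whose
    second-smallest eigenvector gives a relaxed normalized-cut solution.
    """
    degrees = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degrees, 1e-12))
    laplacian = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]     # map back to the generalized problem
    return fiedler >= 0


# Toy similarity graph (hypothetical data): two 3-node cliques joined by one weak edge.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
W[2, 3] = W[3, 2] = 0.01

mask = spectral_bisect(W)
ncut = normalized_cut(W, mask)
print(mask, ncut)
```

On this toy graph the split separates the two cliques, and the normalized cut of that split is small because only the weak edge is severed. With strongly skewed cluster sizes a single bisection can separate the wrong groups, which is the situation the abstract's combination of bisection and merge steps is meant to address.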

Keywords

Clustering, Spectral bisection, Entity resolution, Data mining

Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. Sorrell College of Business, Troy University, Troy, USA
  2. Advanced Digital Sciences Center, Singapore, Singapore
