Hierarchical Clustering Algorithms for Document Datasets

Zhao, Ying; Karypis, George; Fayyad, Usama

doi:10.1007/s10618-005-0361-3

Published: March 2005

Data Mining and Knowledge Discovery, Volume 10, pages 141–168 (2005)

Abstract

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, clustering algorithms that build meaningful hierarchies out of large document collections are ideal tools for interactive visualization and exploration, as they provide data views that are consistent, predictable, and at different levels of granularity. This paper focuses on document clustering algorithms that build such hierarchical solutions and (i) presents a comprehensive study of partitional and agglomerative algorithms that use different criterion functions and merging schemes, and (ii) presents a new class of clustering algorithms called constrained agglomerative algorithms, which combine features of both partitional and agglomerative approaches, allowing them to reduce the early-stage errors made by agglomerative methods and hence improve the quality of clustering solutions. The experimental evaluation shows that, contrary to the common belief, partitional algorithms always lead to better solutions than agglomerative algorithms, making them ideal for clustering large document collections because of not only their relatively low computational requirements but also their higher clustering quality. Furthermore, the constrained agglomerative methods consistently lead to better solutions than agglomerative methods alone, and in many cases they outperform partitional methods as well.
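The abstract does not spell out how a constrained agglomerative algorithm works, so the following is only a minimal sketch of one plausible reading: a partitional step first splits the documents into k constraint clusters, agglomeration then runs inside each cluster, and the resulting subtrees are finally merged into one hierarchy. The simple k-means loop, the centroid cosine similarity used for merging, and every function name here are assumptions made for illustration, not the criterion functions or merging schemes evaluated in the paper.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def leaves(tree):
    """Document indices under a tree; a leaf is an int, a node is a pair."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def partition(docs, k, iters=10):
    """Assumed partitional step: plain k-means with cosine similarity.
    Requires 1 <= k <= len(docs); seeds are evenly spaced documents."""
    centers = [docs[i * len(docs) // k] for i in range(k)]
    assign = [0] * len(docs)
    for _ in range(iters):
        assign = [max(range(k), key=lambda c: cosine(d, centers[c]))
                  for d in docs]
        for c in range(k):
            members = [d for d, a in zip(docs, assign) if a == c]
            if members:
                centers[c] = centroid(members)
    return assign


def agglomerate(trees, docs):
    """Repeatedly merge the two trees whose centroids are most similar."""
    trees = list(trees)
    while len(trees) > 1:
        cents = [centroid([docs[i] for i in leaves(t)]) for t in trees]
        i, j = max(combinations(range(len(trees)), 2),
                   key=lambda p: cosine(cents[p[0]], cents[p[1]]))
        merged = (trees[i], trees[j])
        trees = [t for n, t in enumerate(trees) if n not in (i, j)]
        trees.append(merged)
    return trees[0]


def constrained_agglomerative(docs, k):
    """Agglomerate within each constraint cluster, then across clusters."""
    assign = partition(docs, k)
    subtrees = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            subtrees.append(agglomerate(members, docs))
    return agglomerate(subtrees, docs)


# Toy term vectors: documents 0 and 1 share one topic, 2 and 3 another.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tree = constrained_agglomerative(docs, k=2)
```

Because merges can never cross a constraint cluster until each cluster has been reduced to a single subtree, an early poor merge between documents from different partitions is ruled out by construction, which is the kind of early-stage error the abstract refers to.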

Author information

Authors and Affiliations

Department of Computer Science and Engineering and Digital Technology Center and Army HPC Research Center, University of Minnesota, Minneapolis, MN, 55455
Ying Zhao, George Karypis & Usama Fayyad (Editor)

Corresponding author

Correspondence to Ying Zhao.

About this article

Cite this article

Zhao, Y., Karypis, G. & Fayyad, U. Hierarchical Clustering Algorithms for Document Datasets. Data Min Knowl Disc 10, 141–168 (2005). https://doi.org/10.1007/s10618-005-0361-3

Received: 21 June 2003
Revised: 23 July 2004
Issue Date: March 2005
DOI: https://doi.org/10.1007/s10618-005-0361-3
