
Hierarchical Clustering Algorithms for Document Datasets

Data Mining and Knowledge Discovery

Abstract

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, clustering algorithms that build meaningful hierarchies out of large document collections are ideal tools for interactive visualization and exploration of those collections, as they provide data-views that are consistent, predictable, and at different levels of granularity. This paper focuses on document clustering algorithms that build such hierarchical solutions and (i) presents a comprehensive study of partitional and agglomerative algorithms that use different criterion functions and merging schemes, and (ii) presents a new class of clustering algorithms called constrained agglomerative algorithms, which combine features from both partitional and agglomerative approaches. This combination allows them to reduce the early-stage errors made by agglomerative methods and hence improve the quality of clustering solutions. The experimental evaluation shows that, contrary to common belief, partitional algorithms always lead to better solutions than agglomerative algorithms, making them ideal for clustering large document collections because of not only their relatively low computational requirements but also their higher clustering quality. Furthermore, the constrained agglomerative methods consistently lead to better solutions than agglomerative methods alone, and in many cases they outperform partitional methods as well.
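One way to read the constrained agglomerative idea is sketched below. It rests on assumptions the abstract does not state: the partitional step is a plain k-means using cosine similarity, the merging scheme is average-link, and all function names (`kmeans`, `agglomerate`, `constrained_agglomerative`) are hypothetical. The paper studies several criterion functions and merging schemes that are not reproduced here. The sketch first partitions the documents, then builds an agglomerative subtree inside each partition, and finally merges the subtree roots, so that early merges cannot cross partition boundaries.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(docs, k, iters=20, seed=0):
    """Assumed partitional step: k-means with cosine similarity.
    Returns the non-empty clusters as lists of document indices."""
    rng = random.Random(seed)
    centers = [docs[i] for i in rng.sample(range(len(docs)), k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, d in enumerate(docs):
            best = max(range(k), key=lambda c: cosine(d, centers[c]))
            groups[best].append(i)
        # Keep the old center when a cluster ends up empty.
        centers = [centroid([docs[i] for i in g]) if g else centers[c]
                   for c, g in enumerate(groups)]
    return [g for g in groups if g]


def leaves(tree):
    """A tree is either a document index (leaf) or a (left, right) pair."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def avg_link(docs, t1, t2):
    """Average pairwise cosine similarity between the leaves of two trees."""
    a, b = leaves(t1), leaves(t2)
    total = sum(cosine(docs[i], docs[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(docs, trees):
    """Repeatedly merge the most similar pair of trees until one remains."""
    trees = list(trees)
    while len(trees) > 1:
        pairs = ((i, j) for i in range(len(trees))
                 for j in range(i + 1, len(trees)))
        i, j = max(pairs, key=lambda p: avg_link(docs, trees[p[0]], trees[p[1]]))
        merged = (trees[i], trees[j])
        trees = [t for n, t in enumerate(trees) if n not in (i, j)] + [merged]
    return trees[0]


def constrained_agglomerative(docs, k, seed=0):
    """Partition first, build one subtree per partition, then merge the roots."""
    parts = kmeans(docs, k, seed=seed)
    subtrees = [agglomerate(docs, part) for part in parts]
    return agglomerate(docs, subtrees)


if __name__ == "__main__":
    # Toy term-weight vectors, invented purely for illustration.
    toy_docs = [[1.0, 0.0, 0.1], [0.9, 0.0, 0.2], [1.0, 0.1, 0.0],
                [0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.2]]
    print(constrained_agglomerative(toy_docs, k=2))
```

Because every partition is fully merged before any of its documents can join a document from another partition, each partition appears as an intact subtree of the final hierarchy. The pairwise search in `agglomerate` is kept deliberately simple and is not meant to reflect the computational costs discussed in the paper.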




Correspondence to Ying Zhao.


About this article

Cite this article

Zhao, Y., Karypis, G. & Fayyad, U. Hierarchical Clustering Algorithms for Document Datasets. Data Min Knowl Disc 10, 141–168 (2005). https://doi.org/10.1007/s10618-005-0361-3


  • DOI: https://doi.org/10.1007/s10618-005-0361-3
