Machine Learning, Volume 55, Issue 3, pp 311–331

Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering

  • Ying Zhao
  • George Karypis


This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves a total of seven criterion functions, three of which are introduced in this paper and four of which have been proposed in the past. We present a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there is a set of criterion functions that consistently outperform the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
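To make the notion of a criterion function concrete, the following is a minimal sketch of how one could be scored for a given partition of documents. The specific function used here is an illustrative assumption and is not claimed to be one of the seven studied: it sums, over all clusters, the cosine similarity between each document and its cluster centroid, which for unit-length document vectors equals the length of each cluster's summed vector. The names `normalize` and `criterion` are hypothetical, and document vectors are assumed to be non-zero.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length (assumed preprocessing step)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def criterion(docs, labels, k):
    """Illustrative criterion: sum over clusters of the length of the
    cluster's composite vector (the sum of its unit-length documents).
    Higher values indicate tighter clusters under cosine similarity."""
    dim = len(docs[0])
    composites = [[0.0] * dim for _ in range(k)]
    for doc, cluster in zip(docs, labels):
        for j, x in enumerate(normalize(doc)):
            composites[cluster][j] += x
    return sum(math.sqrt(sum(x * x for x in comp)) for comp in composites)

# Toy term-weight vectors: two documents near each axis.
docs = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
tight = criterion(docs, [0, 0, 1, 1], k=2)   # similar documents grouped together
mixed = criterion(docs, [0, 1, 0, 1], k=2)   # dissimilar documents grouped together
print(tight > mixed)
```

A partitional algorithm would search over label assignments to maximize (or, for other criteria, minimize) such a score; the sketch only evaluates a fixed assignment and does not perform that search.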

Keywords: partitional clustering, criterion function, data mining, information retrieval



Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Ying Zhao (1)
  • George Karypis (1)
  1. Department of Computer Science, University of Minnesota, Minneapolis, USA
