Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering

Abstract

This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves seven criterion functions, three of which are introduced in this paper and four of which have been proposed in the past. We present a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there is a set of criterion functions that consistently outperforms the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on (i) the degree to which they can operate correctly when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.

Cite this article

Zhao, Y., Karypis, G. Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering. Mach Learn 55, 311–331 (2004). https://doi.org/10.1023/B:MACH.0000027785.44527.d6

  • DOI: https://doi.org/10.1023/B:MACH.0000027785.44527.d6

  • partitional clustering
  • criterion function
  • data mining
  • information retrieval