Abstract
This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies (hard, stochastic, soft and deterministic annealing (DA) based assignments) to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher (vMF) distributions. A large variety of text collections, both with and without feature selection, are used for the study, which yields several insights, including (a) showing situations wherein the vMF-centric approaches, which are based on directional statistics, fare better than multinomial model-based methods, and (b) quantifying the trade-off between increased performance of the soft and DA assignments and their increased computational demands. We also compare all the model-based algorithms with two state-of-the-art discriminative approaches to document clustering based, respectively, on graph partitioning (CLUTO) and a spectral coclustering method. Overall, DA and CLUTO perform the best but are also the most computationally expensive. The vMF models provide good performance at low cost while the spectral coclustering algorithm fares worse than vMF-based methods for a majority of the datasets.
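The four assignment strategies named above can be illustrated with a minimal sketch. It assumes each document already has a per-cluster log score (for example, a log prior plus a log-likelihood under any of the three base models). It also assumes that the DA variant takes the common form of a temperature-scaled posterior. All function names here are hypothetical and are not taken from the paper's implementation.

```python
import math
import random


def posteriors(log_scores, temperature=1.0):
    """Temperature-scaled softmax over per-cluster log scores for one document."""
    scaled = [s / temperature for s in log_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def hard_assign(log_scores):
    """Hard assignment: index of the single highest-scoring cluster."""
    return max(range(len(log_scores)), key=lambda k: log_scores[k])


def stochastic_assign(log_scores, rng):
    """Stochastic assignment: sample one cluster index from the posterior."""
    probs = posteriors(log_scores)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def soft_assign(log_scores):
    """Soft assignment: keep the full posterior over clusters."""
    return posteriors(log_scores)


def da_assign(log_scores, temperature):
    """DA-style assignment (assumed form): posterior at a given temperature.

    A high temperature gives a near-uniform posterior; as the temperature
    is lowered toward zero the posterior approaches the hard assignment.
    """
    return posteriors(log_scores, temperature)


if __name__ == "__main__":
    scores = [-2.0, -1.0, -4.0]  # illustrative values only
    print(hard_assign(scores))
    print(stochastic_assign(scores, random.Random(0)))
    print(soft_assign(scores))
    for t in (100.0, 1.0, 0.01):
        print(t, da_assign(scores, t))
```

The sketch also hints at the cost trade-off mentioned in the abstract: hard and stochastic assignments touch one cluster per document when parameters are re-estimated, whereas soft and DA assignments carry a full weight vector per document, and DA repeats the procedure across a schedule of temperatures.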
Cite this article
Zhong, S., Ghosh, J. Generative model-based document clustering: a comparative study. Knowl Inf Syst 8, 374–384 (2005). https://doi.org/10.1007/s10115-004-0194-1
DOI: https://doi.org/10.1007/s10115-004-0194-1