Knowledge and Information Systems, Volume 8, Issue 3, pp 374–384

Generative model-based document clustering: a comparative study

Article

Abstract

This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies (hard, stochastic, soft and deterministic annealing (DA) based assignments) to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher (vMF) distributions. A large variety of text collections, both with and without feature selection, are used for the study, which yields several insights, including (a) showing situations wherein the vMF-centric approaches, which are based on directional statistics, fare better than multinomial model-based methods, and (b) quantifying the trade-off between increased performance of the soft and DA assignments and their increased computational demands. We also compare all the model-based algorithms with two state-of-the-art discriminative approaches to document clustering based, respectively, on graph partitioning (CLUTO) and a spectral coclustering method. Overall, DA and CLUTO perform the best but are also the most computationally expensive. The vMF models provide good performance at low cost while the spectral coclustering algorithm fares worse than vMF-based methods for a majority of the datasets.
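The four assignment strategies named in the abstract can be illustrated for a single document, given its log-likelihood under each cluster's model. The following is a minimal sketch, not the authors' implementation: the function names and the example log-likelihood values are hypothetical, cluster priors are omitted for brevity, and the DA step is assumed to take the common form of a temperature-scaled softmax (temperature 1 reproduces the soft assignment, and temperatures near 0 approach the hard assignment).

```python
import math
import random


def soft_assign(log_liks, temperature=1.0):
    """Soft assignment: normalized weights proportional to exp(log_lik / T).

    Assumption: equal cluster priors, so the priors cancel out.
    """
    scaled = [ll / temperature for ll in log_liks]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_assign(log_liks):
    """Hard assignment: all weight goes to the most likely cluster."""
    best = max(range(len(log_liks)), key=lambda k: log_liks[k])
    return [1.0 if k == best else 0.0 for k in range(len(log_liks))]


def stochastic_assign(log_liks, rng):
    """Stochastic assignment: sample one cluster from the soft weights."""
    weights = soft_assign(log_liks)
    chosen = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return [1.0 if k == chosen else 0.0 for k in range(len(weights))]


def da_assign(log_liks, temperature):
    """DA-style assignment (assumed form): soft weights at temperature T.

    In an annealing schedule, T would start high and be lowered gradually.
    """
    return soft_assign(log_liks, temperature)


# Hypothetical log-likelihoods of one document under three cluster models.
log_liks = [-10.0, -11.0, -14.0]

hard = hard_assign(log_liks)
soft = soft_assign(log_liks)
sampled = stochastic_assign(log_liks, random.Random(0))
hot = da_assign(log_liks, temperature=10.0)   # close to uniform
cold = da_assign(log_liks, temperature=0.05)  # close to hard
```

The sketch shows only the assignment step. In a full clustering loop each strategy would alternate with re-estimating the Bernoulli, multinomial, or vMF parameters from the resulting weights, which is where the extra cost of the soft and DA variants arises: every document contributes to every cluster's update.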

Keywords

Comparative study; Document clustering; Model-based clustering

Copyright information

© Springer-Verlag 2005

Authors and Affiliations

  1. Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, USA
  2. Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, USA
