Generative model-based document clustering: a comparative study

Abstract

This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies (hard, stochastic, soft and deterministic annealing (DA) based assignments) to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher (vMF) distributions. A large variety of text collections, both with and without feature selection, are used for the study, which yields several insights, including (a) showing situations wherein the vMF-centric approaches, which are based on directional statistics, fare better than multinomial model-based methods, and (b) quantifying the trade-off between increased performance of the soft and DA assignments and their increased computational demands. We also compare all the model-based algorithms with two state-of-the-art discriminative approaches to document clustering based, respectively, on graph partitioning (CLUTO) and a spectral coclustering method. Overall, DA and CLUTO perform the best but are also the most computationally expensive. The vMF models provide good performance at low cost while the spectral coclustering algorithm fares worse than vMF-based methods for a majority of the datasets.
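As a rough illustration of how the four assignment strategies named in the abstract differ, the sketch below computes each one for a single document from its per-cluster log-likelihoods. It is not taken from the paper: the function names, the example log-likelihood values, the equal-prior assumption, and the temperature values are all hypothetical choices made for this example. The log-likelihoods could come from any of the three base models.

```python
import math
import random


def posterior(log_liks, temperature=1.0):
    """Softmax of per-cluster log-likelihoods at a given temperature.

    Assumes equal cluster priors (a simplification for this sketch).
    temperature=1.0 gives the ordinary posterior; as temperature
    approaches 0 the result approaches a hard assignment.
    """
    scaled = [ll / temperature for ll in log_liks]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def hard_assign(log_liks):
    """Hard assignment: all weight on the most likely cluster."""
    best = max(range(len(log_liks)), key=lambda k: log_liks[k])
    return [1.0 if k == best else 0.0 for k in range(len(log_liks))]


def stochastic_assign(log_liks, rng):
    """Stochastic assignment: sample one cluster from the posterior."""
    probs = posterior(log_liks)
    chosen = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1.0 if k == chosen else 0.0 for k in range(len(probs))]


def soft_assign(log_liks):
    """Soft assignment: fractional membership equal to the posterior."""
    return posterior(log_liks)


def da_assign(log_liks, temperature):
    """DA-style assignment: posterior smoothed by a temperature that an
    outer loop (not shown) would lower gradually."""
    return posterior(log_liks, temperature)


# Hypothetical log-likelihoods of one document under three cluster models.
doc_log_liks = [-10.0, -11.0, -14.0]
rng = random.Random(0)

print("hard      ", hard_assign(doc_log_liks))
print("stochastic", stochastic_assign(doc_log_liks, rng))
print("soft      ", [round(p, 3) for p in soft_assign(doc_log_liks)])
for t in (5.0, 1.0, 0.1):
    print(f"DA T={t}  ", [round(p, 3) for p in da_assign(doc_log_liks, t)])
```

In this sketch the soft and DA variants must keep a full membership vector per document and update every cluster model with every document, which is one plausible source of the extra computational cost the abstract mentions.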

Author information

Corresponding author

Correspondence to Shi Zhong.

About this article

Cite this article

Zhong, S., Ghosh, J. Generative model-based document clustering: a comparative study. Knowl Inf Syst 8, 374–384 (2005). https://doi.org/10.1007/s10115-004-0194-1

Keywords

  • Comparative study
  • Document clustering
  • Model-based clustering