Knowledge and Information Systems, Volume 8, Issue 3, pp 374–384

Generative model-based document clustering: a comparative study

DOI: 10.1007/s10115-004-0194-1

Cite this article as:
Zhong, S. & Ghosh, J. Knowl Inf Syst (2005) 8: 374. doi:10.1007/s10115-004-0194-1

Abstract

This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies (hard, stochastic, soft and deterministic annealing (DA) based assignments) to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher (vMF) distributions. A large variety of text collections, both with and without feature selection, are used for the study, which yields several insights, including (a) showing situations wherein the vMF-centric approaches, which are based on directional statistics, fare better than multinomial model-based methods, and (b) quantifying the trade-off between increased performance of the soft and DA assignments and their increased computational demands. We also compare all the model-based algorithms with two state-of-the-art discriminative approaches to document clustering based, respectively, on graph partitioning (CLUTO) and a spectral coclustering method. Overall, DA and CLUTO perform the best but are also the most computationally expensive. The vMF models provide good performance at low cost while the spectral coclustering algorithm fares worse than vMF-based methods for a majority of the datasets.
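The four assignment strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `assign`, its parameters, and the convention that `log_lik[i, k]` holds the log of the cluster prior times the likelihood of document `i` under cluster `k` are all assumptions made for illustration. The base model (Bernoulli, multinomial, or vMF) is abstracted away behind that matrix, and deterministic annealing is shown in its common form as a posterior tempered by a temperature that an outer loop would gradually lower.

```python
import numpy as np


def assign(log_lik, strategy="soft", temperature=1.0, rng=None):
    """Return an (n_docs, n_clusters) matrix of assignment weights.

    log_lik[i, k] is assumed to be log(prior_k * p(doc_i | cluster_k)).
    The name, signature and conventions are hypothetical.
    """
    n_docs, n_clusters = log_lik.shape

    # Only the DA variant uses a temperature other than 1.
    t = temperature if strategy == "da" else 1.0

    # Tempered posterior, computed stably by subtracting the row maximum.
    scaled = log_lik / t
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    post = np.exp(scaled)
    post /= post.sum(axis=1, keepdims=True)

    if strategy in ("soft", "da"):
        # Soft: ordinary posterior. DA: posterior flattened or sharpened by t.
        return post

    if strategy == "hard":
        # Each document goes to its most probable cluster.
        labels = post.argmax(axis=1)
    elif strategy == "stochastic":
        # Each document's cluster is sampled from its posterior.
        rng = np.random.default_rng() if rng is None else rng
        labels = np.array([rng.choice(n_clusters, p=row) for row in post])
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    one_hot = np.zeros_like(post)
    one_hot[np.arange(n_docs), labels] = 1.0
    return one_hot


if __name__ == "__main__":
    # Toy log-likelihoods for two documents and two clusters.
    toy = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
    for name in ("hard", "stochastic", "soft"):
        print(name, assign(toy, name, rng=np.random.default_rng(0)))
    # High temperature gives near-uniform weights, low temperature near-hard ones.
    print("da, T=100", assign(toy, "da", temperature=100.0))
    print("da, T=0.01", assign(toy, "da", temperature=0.01))
```

In this sketch the soft and DA variants return a dense weight matrix, so a subsequent parameter update would touch every document for every cluster, whereas the hard and stochastic variants return one-hot rows. That difference is one plausible reading of the cost trade-off the abstract mentions.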

Keywords

Comparative study, Document clustering, Model-based clustering

Copyright information

© Springer-Verlag 2005

Authors and Affiliations

  1. Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, USA
  2. Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, USA