Abstract
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations where the set of labels available in the labeled data is incomplete, i.e., the unlabeled data contain new classes that are not present in the labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when the available labels are complete, whereas the feedback-based approach excels when the available labels are incomplete.
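To make the distinction between the first two approaches concrete, the following is a minimal sketch of the seeded and constrained k-means variants of Basu et al. (2002) that the damnl extensions build on. It is not the paper's implementation: it uses Euclidean k-means rather than the multinomial models and deterministic annealing studied here, and the function name and label convention (`-1` marks an unlabeled point) are illustrative choices.

```python
import numpy as np

def seeded_kmeans(X, labels, k, n_iter=20, constrained=False):
    """Semi-supervised k-means in the spirit of Basu et al. (2002).

    labels[i] >= 0 gives the known class of point i; labels[i] == -1
    marks an unlabeled point.
    Seeded:      labeled points only initialize the centroids.
    Constrained: labeled points additionally stay in their class cluster
                 throughout the iterations.
    """
    X = np.asarray(X, dtype=float)
    # Initialize each centroid as the mean of its labeled (seed) points.
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    assign = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        if constrained:
            # Constrained variant: labeled points keep their given cluster.
            mask = labels >= 0
            assign[mask] = labels[mask]
        # Recompute centroids; keep the old one if a cluster empties.
        for c in range(k):
            pts = X[assign == c]
            if len(pts) > 0:
                centroids[c] = pts.mean(axis=0)
    return assign, centroids
```

When the labeled data cover all classes, the constrained variant's hard assignments exploit them fully; when new classes appear only in the unlabeled data, such hard constraints can hurt, which motivates the feedback-based variant compared in the paper.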
References
Banerjee, A., Dhillon, I., Ghosh, J., & Merugu, S. (2004). An information theoretic analysis of maximum likelihood mixture estimation for exponential families. In Proc. 21st Int. Conf. Machine Learning (pp. 57–64). Banff, Canada.
Banerjee, A., Dhillon, I., Sra, S., & Ghosh, J. (2003). Generative model-based clustering of directional data. In Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (pp. 19–28).
Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.
Basu, S., Banerjee, A., & Mooney, R. (2002). Semi-supervised clustering by seeding. In Proc. 19th Int. Conf. Machine Learning (pp. 19–26). Sydney, Australia.
Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. In Proc. 10th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (pp. 59–68). Seattle, WA.
Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In Proc. 18th Int. Conf. Machine Learning (pp. 19–26).
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proc. 11th Annual Conf. Computational Learning Theory (pp. 92–100).
Castelli, V., & Cover, T. M. (1996). The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. Inform. Theory, 42, 2102–2117.
Chang, H., & Yeung, D.-Y. (2004). Locally linear metric adaptation for semi-supervised clustering. In Proc. 21st Int. Conf. Machine Learning (pp. 153–160). Banff, Canada.
Cohn, D., Caruana, R., & McCallum, A. (2003). Semi-supervised clustering with user feedback (Technical Report TR2003-1892). Cornell University.
Cozman, F. G., Cohen, I., & Cirelo, M. C. (2003). Semi-supervised learning of mixture models. In Proc. 20th Int. Conf. Machine Learning (pp. 106–113).
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Dhillon, I., & Guan, Y. (2003). Information theoretic clustering of sparse co-occurrence data. In Proc. IEEE Int. Conf. Data Mining (pp. 517–520). Melbourne, FL.
Dhillon, I. S., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. In Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (pp. 446–455).
Dhillon, I. S., & Modha, D. S. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143–175.
Dom, B. (2001). An information-theoretic external cluster-validity measure (Technical Report RJ10219). IBM.
Dong, A., & Bhanu, B. (2003). A new semi-supervised EM algorithm for image retrieval. In Proc. IEEE Int. Conf. Computer Vision Pattern Recognition (pp. 662–667). Madison, WI.
Friedman, N., Mosenzon, O., Slonim, N., & Tishby, N. (2001). Multivariate information bottleneck. In Proc. 17th Conf. Uncertainty in Artificial Intelligence.
Ghosh, J. (2003). Scalable clustering. In N. Ye (Ed.), Handbook of data mining (pp. 341–364). Lawrence Erlbaum Assoc.
Gondek, D., & Hofmann, T. (2003). Conditional information bottleneck clustering. IEEE Int. Conf. Data Mining Workshop on Clustering Large Datasets (pp. 36–42). Melbourne, FL.
Guerrero-Curieses, A., & Cid-Sueiro, J. (2000). An entropy minimization principle for semi-supervised terrain classification. In Proc. IEEE Int. Conf. Image Processing (pp. 312–315).
Han, E. H., Boley, D., Gini, M., Gross, R., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., & Moore, J. (1998). WebACE: A web agent for document categorization and exploration. In Proc. 2nd Int. Conf. Autonomous Agents (pp. 408–415).
He, J., Lan, M., Tan, C.-L., Sung, S. -Y., & Low, H.-B. (2004). Initialization of cluster refinement algorithms: A review and comparative study. In Proc. IEEE Int. Joint Conf. Neural Networks (pp. 297–302).
Hersh, W., Buckley, C., Leone, T. J., & Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proc. ACM SIGIR (pp. 192–201).
Jardine, N., & van Rijsbergen, C. J. (1971). The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7, 217–240.
Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proc. 16th Int. Conf. Machine Learning (pp. 200–209).
Karypis, G. (2002). CLUTO—a clustering toolkit. Dept. of Computer Science, University of Minnesota. http://www-users.cs.umn.edu/~karypis/cluto/.
Katsavounidis, I., Kuo, C., & Zhang, Z. (1994). A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1, 144–146.
Mardia, K. V. (1975). Statistics of directional data. Journal of the Royal Statistical Society, Series B (Methodological), 37, 349–393.
McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available at http://www.cs.cmu.edu/~mccallum/bow.
Meila, M., & Heckerman, D. (2001). An experimental comparison of model-based clustering methods. Machine Learning, 42, 9–29.
Nigam, K. (2001). Using unlabeled data to improve text classification. Doctoral dissertation, School of Computer Science, Carnegie Mellon University.
Nigam, K., & Ghani, R. (2000). Understanding the behavior of co-training. KDD Workshop on Text Mining.
Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.
Seeger, M. (2001). Learning with labeled and unlabeled data (Technical Report). Institute for Adaptive and Neural Computation, University of Edinburgh.
Slonim, N., & Weiss, Y. (2003). Maximum likelihood and the information bottleneck. Advances in Neural Information Processing Systems 15 (pp. 335–342). Cambridge, MA: MIT Press.
Stark, H., & Woods, J. W. (1994). Probability, random processes, and estimation theory for engineers. Englewood Cliffs, New Jersey: Prentice Hall.
Strehl, A., & Ghosh, J. (2002). Cluster ensembles—a knowledge reuse framework for combining partitions. Journal of Machine Learning Research, 3, 583–617.
Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. In Proc. 37th Annual Allerton Conf. Communication, Control and Computing (pp. 368–377).
Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained k-means clustering with background knowledge. In Proc. 18th Int. Conf. Machine Learning (pp. 577–584).
Wu, Y., & Huang, T. S. (2000). Self-supervised learning for visual tracking and recognition of human hand. In Proc. 17th National Conference on Artificial Intelligence (pp. 243–248).
Zeng, H.-J., Wang, X.-H., Chen, Z., Lu, H., & Ma, W.-Y. (2003). CBC: Clustering based text classification requiring minimal labeled data. In Proc. IEEE Int. Conf. Data Mining (pp. 443–450). Melbourne, FL.
Zhang, T., & Oles, F. (2000). A probabilistic analysis on the value of unlabeled data for classification problems. In Proc. 17th Int. Conf. Machine Learning (pp. 1191–1198).
Zhao, Y., & Karypis, G. (2001). Criterion functions for document clustering: experiments and analysis (Technical Report #01-40). Department of Computer Science, University of Minnesota.
Zhong, S. (2005). Efficient online spherical k-means clustering. In Proc. IEEE Int. Joint Conf. Neural Networks (pp. 3180–3185). Montreal, Canada.
Zhong, S., & Ghosh, J. (2003). A unified framework for model-based clustering. Journal of Machine Learning Research, 4, 1001–1037.
Zhong, S., & Ghosh, J. (2005). Generative model-based document clustering: A comparative study. Knowledge and Information Systems: An International Journal, 8, 374–384.
Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using gaussian fields and harmonic functions. In Proc. 20th Int. Conf. Machine Learning (pp. 912–919). Washington DC.
Editor: Andrew Moore
Zhong, S. Semi-supervised model-based document clustering: A comparative study. Mach Learn 65, 3–29 (2006). https://doi.org/10.1007/s10994-006-6540-7