Abstract
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations where the set of labels available in the labeled data is incomplete, i.e., the unlabeled data contain new classes that are not present in the labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when the available labels are complete, whereas the feedback-based approach excels when the available labels are incomplete.
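To make the distinction between the first two approaches concrete, the following is a minimal sketch of the seeded and constrained k-means variants of Basu et al. (2002) that the damnl extensions build on. It is not the paper's implementation: it uses Euclidean k-means rather than the multinomial models and deterministic annealing studied here, and the function name and label convention (`-1` marks an unlabeled point) are illustrative choices.

```python
import numpy as np

def seeded_kmeans(X, labels, k, n_iter=20, constrained=False):
    """Semi-supervised k-means in the spirit of Basu et al. (2002).

    labels[i] >= 0 gives the known class of point i; labels[i] == -1
    marks an unlabeled point.
    Seeded:      labeled points only initialize the centroids.
    Constrained: labeled points additionally stay in their class cluster
                 throughout the iterations.
    """
    X = np.asarray(X, dtype=float)
    # Initialize each centroid as the mean of its labeled (seed) points.
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    assign = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        if constrained:
            # Constrained variant: labeled points keep their given cluster.
            mask = labels >= 0
            assign[mask] = labels[mask]
        # Recompute centroids; keep the old one if a cluster empties.
        for c in range(k):
            pts = X[assign == c]
            if len(pts) > 0:
                centroids[c] = pts.mean(axis=0)
    return assign, centroids
```

When the labeled data cover all classes, the constrained variant's hard assignments exploit them fully; when new classes appear only in the unlabeled data, such hard constraints can hurt, which motivates the feedback-based variant compared in the paper.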
References
Banerjee, A., Dhillon, I., Ghosh, J., & Merugu, S. (2004). An information theoretic analysis of maximum likelihood mixture estimation for exponential families. In Proc. 21st Int. Conf. Machine Learning (pp. 57–64). Banff, Canada.
Banerjee, A., Dhillon, I., Sra, S., & Ghosh, J. (2003). Generative model-based clustering of directional data. In Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (pp. 19–28).
Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.
Basu, S., Banerjee, A., & Mooney, R. (2002). Semi-supervised clustering by seeding. In Proc. 19th Int. Conf. Machine Learning (pp. 19–26). Sydney, Australia.
Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. In Proc. 10th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (pp. 59–68). Seattle, WA.
Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In Proc. 18th Int. Conf. Machine Learning (pp. 19–26).
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proc. 11th Annual Conf. Computational Learning Theory (pp. 92–100).
Castelli, V., & Cover, T. M. (1996). The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. Inform. Theory, 42, 2102–2117.
Chang, H., & Yeung, D.-Y. (2004). Locally linear metric adaptation for semi-supervised clustering. In Proc. 21st Int. Conf. Machine Learning (pp. 153–160). Banff, Canada.
Cohn, D., Caruana, R., & McCallum, A. (2003). Semi-supervised clustering with user feedback (Technical Report TR2003-1892). Cornell University.
Cozman, F. G., Cohen, I., & Cirelo, M. C. (2003). Semi-supervised learning of mixture models. In Proc. 20th Int. Conf. Machine Learning (pp. 106–113).
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Dhillon, I., & Guan, Y. (2003). Information theoretic clustering of sparse co-occurrence data. In Proc. IEEE Int. Conf. Data Mining (pp. 517–520). Melbourne, FL.
Dhillon, I. S., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. In Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (pp. 446–455).
Dhillon, I. S., & Modha, D. S. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143–175.
Dom, B. (2001). An information-theoretic external cluster-validity measure (Technical Report RJ10219). IBM.
Dong, A., & Bhanu, B. (2003). A new semi-supervised EM algorithm for image retrieval. In Proc. IEEE Int. Conf. Computer Vision Pattern Recognition (pp. 662–667). Madison, WI.
Friedman, N., Mosenzon, O., Slonim, N., & Tishby, N. (2001). Multivariate information bottleneck. In Proc. 17th Conf. Uncertainty in Artificial Intelligence.
Ghosh, J. (2003). Scalable clustering. In N. Ye (Ed.), Handbook of data mining (pp. 341–364). Lawrence Erlbaum Assoc.
Gondek, D., & Hofmann, T. (2003). Conditional information bottleneck clustering. IEEE Int. Conf. Data Mining Workshop on Clustering Large Datasets (pp. 36–42). Melbourne, FL.
Guerrero-Curieses, A., & Cid-Sueiro, J. (2000). An entropy minimization principle for semi-supervised terrain classification. In Proc. IEEE Int. Conf. Image Processing (pp. 312–315).
Han, E. H., Boley, D., Gini, M., Gross, R., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., & Moore, J. (1998). WebACE: A web agent for document categorization and exploration. In Proc. 2nd Int. Conf. Autonomous Agents (pp. 408–415).
He, J., Lan, M., Tan, C.-L., Sung, S. -Y., & Low, H.-B. (2004). Initialization of cluster refinement algorithms: A review and comparative study. In Proc. IEEE Int. Joint Conf. Neural Networks (pp. 297–302).
Hersh, W., Buckley, C., Leone, T. J., & Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proc. ACM SIGIR (pp. 192–201).
Jardine, N., & van Rijsbergen, C. J. (1971). The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7, 217–240.
Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proc. 16th Int. Conf. Machine Learning (pp. 200–209).
Karypis, G. (2002). CLUTO—a clustering toolkit. Dept. of Computer Science, University of Minnesota. http://www-users.cs.umn.edu/~karypis/cluto/.
Katsavounidis, I., Kuo, C., & Zhang, Z. (1994). A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1, 144–146.
Mardia, K. V. (1975). Statistics of directional data. Journal of the Royal Statistical Society, Series B (Methodological), 37, 349–393.
McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available at http://www.cs.cmu.edu/~mccallum/bow.
Meila, M., & Heckerman, D. (2001). An experimental comparison of model-based clustering methods. Machine Learning, 42, 9–29.
Nigam, K. (2001). Using unlabeled data to improve text classification. Doctoral dissertation, School of Computer Science, Carnegie Mellon University.
Nigam, K., & Ghani, R. (2000). Understanding the behavior of co-training. KDD Workshop on Text Mining.
Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.
Seeger, M. (2001). Learning with labeled and unlabeled data (Technical Report). Institute for Adaptive and Neural Computation, University of Edinburgh.
Slonim, N., & Weiss, Y. (2003). Maximum likelihood and the information bottleneck. Advances in Neural Information Processing Systems 15 (pp. 335–342). Cambridge, MA: MIT Press.
Stark, H., & Woods, J. W. (1994). Probability, random processes, and estimation theory for engineers. Englewood Cliffs, New Jersey: Prentice Hall.
Strehl, A., & Ghosh, J. (2002). Cluster ensembles—a knowledge reuse framework for combining partitions. Journal of Machine Learning Research, 3, 583–617.
Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. In Proc. 37th Annual Allerton Conf. Communication, Control and Computing (pp. 368–377).
Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained k-means clustering with background knowledge. In Proc. 18th Int. Conf. Machine Learning (pp. 577–584).
Wu, Y., & Huang, T. S. (2000). Self-supervised learning for visual tracking and recognition of human hand. In Proc. 17th National Conference on Artificial Intelligence (pp. 243–248).
Zeng, H.-J., Wang, X.-H., Chen, Z., Lu, H., & Ma, W.-Y. (2003). CBC: Clustering based text classification requiring minimal labeled data. In Proc. IEEE Int. Conf. Data Mining (pp. 443–450). Melbourne, FL.
Zhang, T., & Oles, F. (2000). A probabilistic analysis on the value of unlabeled data for classification problems. In Proc. 17th Int. Conf. Machine Learning (pp. 1191–1198).
Zhao, Y., & Karypis, G. (2001). Criterion functions for document clustering: experiments and analysis (Technical Report #01-40). Department of Computer Science, University of Minnesota.
Zhong, S. (2005). Efficient online spherical k-means clustering. In Proc. IEEE Int. Joint Conf. Neural Networks (pp. 3180–3185). Montreal, Canada.
Zhong, S., & Ghosh, J. (2003). A unified framework for model-based clustering. Journal of Machine Learning Research, 4, 1001–1037.
Zhong, S., & Ghosh, J. (2005). Generative model-based document clustering: A comparative study. Knowledge and Information Systems: An International Journal, 8, 374–384.
Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using gaussian fields and harmonic functions. In Proc. 20th Int. Conf. Machine Learning (pp. 912–919). Washington DC.
Editor: Andrew Moore
Zhong, S. Semi-supervised model-based document clustering: A comparative study. Mach Learn 65, 3–29 (2006). https://doi.org/10.1007/s10994-006-6540-7