Semi-supervised model-based document clustering: A comparative study
Abstract
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when the set of labels available in labeled data is not complete, i.e., unlabeled data contain new classes that are not present in labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when available labels are complete, whereas the feedback-based approach excels when available labels are incomplete.
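The seeded deterministic-annealing idea described above can be sketched as follows. This is a minimal illustration only: it uses squared-Euclidean distance as a stand-in for the paper's multinomial model, and the function name, parameter defaults, and cooling schedule are hypothetical, not the authors' damnl implementation.

```python
import numpy as np

def seeded_da_clustering(X, y_seed, k, T0=5.0, cooling=0.7, T_min=0.01):
    """Sketch of seeded clustering with deterministic annealing.

    Labeled points (y_seed >= 0) seed the initial centroids; unlabeled
    points carry y_seed == -1.  Assignments are softened by a temperature
    T that is gradually lowered (annealed) toward hard assignments.
    """
    # Seeding (Basu et al. 2002 style): each class needs at least one labeled point.
    centers = np.stack([X[y_seed == c].mean(axis=0) for c in range(k)])
    T = T0
    while T > T_min:
        # E-step: soft assignments via a Gibbs distribution at temperature T
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        logits = -dist / T
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted centroid update
        centers = (P.T @ X) / P.sum(axis=0)[:, None]
        T *= cooling  # anneal: lower temperature -> harder assignments
    # Final hard assignment with the annealed centroids
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1), centers
```

At high temperature every point is spread across all clusters, which helps avoid poor local optima; as T drops the soft assignments harden toward the k-means solution, which is the mechanism the paper's damnl variants exploit.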
Keywords
Semi-supervised clustering · Seeded clustering · Constrained clustering · Clustering with feedback · Model-based clustering · Deterministic annealing
References
- Banerjee, A., Dhillon, I., Ghosh, J., & Merugu, S. (2004). An information theoretic analysis of maximum likelihood mixture estimation for exponential families. In Proc. 21st Int. Conf. Machine Learning (pp. 57–64). Banff, Canada.
- Banerjee, A., Dhillon, I., Sra, S., & Ghosh, J. (2003). Generative model-based clustering of directional data. In Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (pp. 19–28).
- Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.
- Basu, S., Banerjee, A., & Mooney, R. (2002). Semi-supervised clustering by seeding. In Proc. 19th Int. Conf. Machine Learning (pp. 19–26). Sydney, Australia.
- Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. In Proc. 10th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (pp. 59–68). Seattle, WA.
- Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In Proc. 18th Int. Conf. Machine Learning (pp. 19–26).
- Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In The 11th Annual Conf. Computational Learning Theory (pp. 92–100).
- Castelli, V., & Cover, T. M. (1996). The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. Inform. Theory, 42, 2102–2117.
- Chang, H., & Yeung, D.-Y. (2004). Locally linear metric adaptation for semi-supervised clustering. In Proc. 21st Int. Conf. Machine Learning (pp. 153–160). Banff, Canada.
- Cohn, D., Caruana, R., & McCallum, A. (2003). Semi-supervised clustering with user feedback (Technical Report TR2003-1892). Cornell University.
- Cozman, F. G., Cohen, I., & Cirelo, M. C. (2003). Semi-supervised learning of mixture models. In Proc. 20th Int. Conf. Machine Learning (pp. 106–113).
- Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
- Dhillon, I., & Guan, Y. (2003). Information theoretic clustering of sparse co-occurrence data. In Proc. IEEE Int. Conf. Data Mining (pp. 517–520). Melbourne, FL.
- Dhillon, I. S., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. In Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (pp. 446–455).
- Dhillon, I. S., & Modha, D. S. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143–175.
- Dom, B. (2001). An information-theoretic external cluster-validity measure (Technical Report RJ10219). IBM.
- Dong, A., & Bhanu, B. (2003). A new semi-supervised EM algorithm for image retrieval. In Proc. IEEE Int. Conf. Computer Vision Pattern Recognition (pp. 662–667). Madison, WI.
- Friedman, N., Mosenzon, O., Slonim, N., & Tishby, N. (2001). Multivariate information bottleneck. In Proc. 17th Conf. Uncertainty in Artificial Intelligence.
- Ghosh, J. (2003). Scalable clustering. In N. Ye (Ed.), Handbook of data mining (pp. 341–364). Lawrence Erlbaum Assoc.
- Gondek, D., & Hofmann, T. (2003). Conditional information bottleneck clustering. In IEEE Int. Conf. Data Mining Workshop on Clustering Large Datasets (pp. 36–42). Melbourne, FL.
- Guerrero-Curieses, A., & Cid-Sueiro, J. (2000). An entropy minimization principle for semi-supervised terrain classification. In Proc. IEEE Int. Conf. Image Processing (pp. 312–315).
- Han, E. H., Boley, D., Gini, M., Gross, R., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., & Moore, J. (1998). WebACE: A web agent for document categorization and exploration. In Proc. 2nd Int. Conf. Autonomous Agents (pp. 408–415).
- He, J., Lan, M., Tan, C.-L., Sung, S.-Y., & Low, H.-B. (2004). Initialization of cluster refinement algorithms: A review and comparative study. In Proc. IEEE Int. Joint Conf. Neural Networks (pp. 297–302).
- Hersh, W., Buckley, C., Leone, T. J., & Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proc. ACM SIGIR (pp. 192–201).
- Jardine, N., & van Rijsbergen, C. J. (1971). The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7, 217–240.
- Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proc. 16th Int. Conf. Machine Learning (pp. 200–209).
- Karypis, G. (2002). CLUTO—a clustering toolkit. Dept. of Computer Science, University of Minnesota. http://www-users.cs.umn.edu/~karypis/cluto/.
- Katsavounidis, I., Kuo, C., & Zhang, Z. (1994). A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1, 144–146.
- Mardia, K. V. (1975). Statistics of directional data. J. Royal Statistical Society, Series B (Methodological), 37, 349–393.
- McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available at http://www.cs.cmu.edu/~mccallum/bow.
- Meila, M., & Heckerman, D. (2001). An experimental comparison of model-based clustering methods. Machine Learning, 42, 9–29.
- Nigam, K. (2001). Using unlabeled data to improve text classification. Doctoral dissertation, School of Computer Science, Carnegie Mellon University.
- Nigam, K., & Ghani, R. (2000). Understanding the behavior of co-training. In KDD Workshop on Text Mining.
- Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
- Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.
- Seeger, M. (2001). Learning with labeled and unlabeled data (Technical Report). Institute for Adaptive and Neural Computation, University of Edinburgh.
- Slonim, N., & Weiss, Y. (2003). Maximum likelihood and the information bottleneck. In Advances in Neural Information Processing Systems 15 (pp. 335–342). Cambridge, MA: MIT Press.
- Stark, H., & Woods, J. W. (1994). Probability, random processes, and estimation theory for engineers. Englewood Cliffs, New Jersey: Prentice Hall.
- Strehl, A., & Ghosh, J. (2002). Cluster ensembles—a knowledge reuse framework for combining partitions. Journal of Machine Learning Research, 3, 583–617.
- Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. In Proc. 37th Annual Allerton Conf. Communication, Control and Computing (pp. 368–377).
- Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained k-means clustering with background knowledge. In Proc. 18th Int. Conf. Machine Learning (pp. 577–584).
- Wu, Y., & Huang, T. S. (2000). Self-supervised learning for visual tracking and recognition of human hand. In Proc. 17th National Conference on Artificial Intelligence (pp. 243–248).
- Zeng, H.-J., Wang, X.-H., Chen, Z., Lu, H., & Ma, W.-Y. (2003). CBC: Clustering based text classification requiring minimal labeled data. In Proc. IEEE Int. Conf. Data Mining (pp. 443–450). Melbourne, FL.
- Zhang, T., & Oles, F. (2000). A probabilistic analysis on the value of unlabeled data for classification problems. In Proc. 17th Int. Conf. Machine Learning (pp. 1191–1198).
- Zhao, Y., & Karypis, G. (2001). Criterion functions for document clustering: experiments and analysis (Technical Report #01-40). Department of Computer Science, University of Minnesota.
- Zhong, S. (2005). Efficient online spherical k-means clustering. In Proc. IEEE Int. Joint Conf. Neural Networks (pp. 3180–3185). Montreal, Canada.
- Zhong, S., & Ghosh, J. (2003). A unified framework for model-based clustering. Journal of Machine Learning Research, 4, 1001–1037.
- Zhong, S., & Ghosh, J. (2005). Generative model-based document clustering: A comparative study. Knowledge and Information Systems, 8, 374–384.
- Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. 20th Int. Conf. Machine Learning (pp. 912–919). Washington DC.