Machine Learning, Volume 65, Issue 1, pp 3–29

Semi-supervised model-based document clustering: A comparative study

Abstract

Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations where the set of labels available in the labeled data is not complete, i.e., the unlabeled data contain new classes that are not present in the labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for the multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach performs best when the available labels are complete, whereas the feedback-based approach excels when the available labels are incomplete.
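To make the distinction between the seeded and constrained variants concrete, below is a minimal illustrative sketch (not the implementation used in the paper) of deterministic-annealing EM for a multinomial mixture. The seeded variant uses labeled documents only to initialize the cluster parameters, while the constrained variant additionally clamps the posteriors of labeled documents in every E-step. The function name and the `mode`, `temperature`, and `cooling` parameters are illustrative assumptions, not names from the paper.

```python
# Minimal sketch of deterministic-annealing EM for a multinomial mixture,
# with hypothetical "seeded" vs. "constrained" handling of labeled documents.
import numpy as np

def da_multinomial_cluster(X, n_clusters, labels=None, mode="seeded",
                           temperature=2.0, cooling=0.9, n_iter=50, seed=0):
    """X: (n_docs, n_words) term-count matrix.
    labels: int array of length n_docs, class index for labeled docs, -1 otherwise (or None)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize word distributions; seed a cluster's parameters from its labeled docs if any.
    theta = rng.dirichlet(np.ones(d), size=n_clusters)
    if labels is not None:
        for k in range(n_clusters):
            docs = X[labels == k]
            if len(docs):
                theta[k] = (docs.sum(0) + 1.0) / (docs.sum() + d)  # Laplace-smoothed estimate
    prior = np.full(n_clusters, 1.0 / n_clusters)

    for _ in range(n_iter):
        # E-step at temperature T: softened posteriors p(k|x) proportional to
        # (prior_k * p(x|k))^(1/T), computed in log space for stability.
        loglik = X @ np.log(theta).T + np.log(prior)          # shape (n, K)
        scaled = loglik / temperature
        scaled -= scaled.max(1, keepdims=True)
        post = np.exp(scaled)
        post /= post.sum(1, keepdims=True)
        # Constrained variant: clamp posteriors of labeled docs to their given class.
        if mode == "constrained" and labels is not None:
            lab = labels >= 0
            post[lab] = 0.0
            post[lab, labels[lab]] = 1.0
        # M-step: re-estimate mixing weights and smoothed multinomial parameters.
        prior = post.mean(0)
        theta = post.T @ X + 1.0
        theta /= theta.sum(1, keepdims=True)
        temperature = max(1.0, temperature * cooling)          # anneal toward T = 1 (standard EM)
    return post.argmax(1)
```

Starting at a high temperature flattens the posteriors so that early iterations are less sensitive to initialization; cooling toward T = 1 recovers ordinary EM, which is the intuition behind the deterministic annealing extension discussed in the abstract.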

Keywords

Semi-supervised clustering · Seeded clustering · Constrained clustering · Clustering with feedback · Model-based clustering · Deterministic annealing

References

  1. Banerjee, A., Dhillon, I., Ghosh, J., & Merugu, S. (2004). An information theoretic analysis of maximum likelihood mixture estimation for exponential families. In Proc. 21st Int. Conf. Machine Learning (pp. 57–64). Banff, Canada.
  2. Banerjee, A., Dhillon, I., Sra, S., & Ghosh, J. (2003). Generative model-based clustering of directional data. In Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (pp. 19–28).
  3. Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.
  4. Basu, S., Banerjee, A., & Mooney, R. (2002). Semi-supervised clustering by seeding. In Proc. 19th Int. Conf. Machine Learning (pp. 19–26). Sydney, Australia.
  5. Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. In Proc. 10th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (pp. 59–68). Seattle, WA.
  6. Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In Proc. 18th Int. Conf. Machine Learning (pp. 19–26).
  7. Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proc. 11th Annual Conf. Computational Learning Theory (pp. 92–100).
  8. Castelli, V., & Cover, T. M. (1996). The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. Inform. Theory, 42, 2102–2117.
  9. Chang, H., & Yeung, D.-Y. (2004). Locally linear metric adaptation for semi-supervised clustering. In Proc. 21st Int. Conf. Machine Learning (pp. 153–160). Banff, Canada.
  10. Cohn, D., Caruana, R., & McCallum, A. (2003). Semi-supervised clustering with user feedback (Technical Report TR2003-1892). Cornell University.
  11. Cozman, F. G., Cohen, I., & Cirelo, M. C. (2003). Semi-supervised learning of mixture models. In Proc. 20th Int. Conf. Machine Learning (pp. 106–113).
  12. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
  13. Dhillon, I., & Guan, Y. (2003). Information theoretic clustering of sparse co-occurrence data. In Proc. IEEE Int. Conf. Data Mining (pp. 517–520). Melbourne, FL.
  14. Dhillon, I. S., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. In Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (pp. 446–455).
  15. Dhillon, I. S., & Modha, D. S. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143–175.
  16. Dom, B. (2001). An information-theoretic external cluster-validity measure (Technical Report RJ10219). IBM.
  17. Dong, A., & Bhanu, B. (2003). A new semi-supervised EM algorithm for image retrieval. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition (pp. 662–667). Madison, WI.
  18. Friedman, N., Mosenzon, O., Slonim, N., & Tishby, N. (2001). Multivariate information bottleneck. In Proc. 17th Conf. Uncertainty in Artificial Intelligence.
  19. Ghosh, J. (2003). Scalable clustering. In N. Ye (Ed.), Handbook of data mining (pp. 341–364). Lawrence Erlbaum Assoc.
  20. Gondek, D., & Hofmann, T. (2003). Conditional information bottleneck clustering. In IEEE Int. Conf. Data Mining Workshop on Clustering Large Datasets (pp. 36–42). Melbourne, FL.
  21. Guerrero-Curieses, A., & Cid-Sueiro, J. (2000). An entropy minimization principle for semi-supervised terrain classification. In Proc. IEEE Int. Conf. Image Processing (pp. 312–315).
  22. Han, E. H., Boley, D., Gini, M., Gross, R., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., & Moore, J. (1998). WebACE: A web agent for document categorization and exploration. In Proc. 2nd Int. Conf. Autonomous Agents (pp. 408–415).
  23. He, J., Lan, M., Tan, C.-L., Sung, S.-Y., & Low, H.-B. (2004). Initialization of cluster refinement algorithms: A review and comparative study. In Proc. IEEE Int. Joint Conf. Neural Networks (pp. 297–302).
  24. Hersh, W., Buckley, C., Leone, T. J., & Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proc. ACM SIGIR (pp. 192–201).
  25. Jardine, N., & van Rijsbergen, C. J. (1971). The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7, 217–240.
  26. Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proc. 16th Int. Conf. Machine Learning (pp. 200–209).
  27. Karypis, G. (2002). CLUTO—a clustering toolkit. Dept. of Computer Science, University of Minnesota. http://www-users.cs.umn.edu/~karypis/cluto/.
  28. Katsavounidis, I., Kuo, C., & Zhang, Z. (1994). A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1, 144–146.
  29. Mardia, K. V. (1975). Statistics of directional data. Journal of the Royal Statistical Society, Series B (Methodological), 37, 349–393.
  30. McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available at http://www.cs.cmu.edu/~mccallum/bow.
  31. Meila, M., & Heckerman, D. (2001). An experimental comparison of model-based clustering methods. Machine Learning, 42, 9–29.
  32. Nigam, K. (2001). Using unlabeled data to improve text classification. Doctoral dissertation, School of Computer Science, Carnegie Mellon University.
  33. Nigam, K., & Ghani, R. (2000). Understanding the behavior of co-training. In KDD Workshop on Text Mining.
  34. Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
  35. Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.
  36. Seeger, M. (2001). Learning with labeled and unlabeled data (Technical Report). Institute for Adaptive and Neural Computation, University of Edinburgh.
  37. Slonim, N., & Weiss, Y. (2003). Maximum likelihood and the information bottleneck. In Advances in Neural Information Processing Systems 15 (pp. 335–342). Cambridge, MA: MIT Press.
  38. Stark, H., & Woods, J. W. (1994). Probability, random processes, and estimation theory for engineers. Englewood Cliffs, NJ: Prentice Hall.
  39. Strehl, A., & Ghosh, J. (2002). Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583–617.
  40. Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. In Proc. 37th Annual Allerton Conf. Communication, Control and Computing (pp. 368–377).
  41. Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained k-means clustering with background knowledge. In Proc. 18th Int. Conf. Machine Learning (pp. 577–584).
  42. Wu, Y., & Huang, T. S. (2000). Self-supervised learning for visual tracking and recognition of human hand. In Proc. 17th National Conference on Artificial Intelligence (pp. 243–248).
  43. Zeng, H.-J., Wang, X.-H., Chen, Z., Lu, H., & Ma, W.-Y. (2003). CBC: Clustering based text classification requiring minimal labeled data. In Proc. IEEE Int. Conf. Data Mining (pp. 443–450). Melbourne, FL.
  44. Zhang, T., & Oles, F. (2000). A probabilistic analysis on the value of unlabeled data for classification problems. In Proc. 17th Int. Conf. Machine Learning (pp. 1191–1198).
  45. Zhao, Y., & Karypis, G. (2001). Criterion functions for document clustering: Experiments and analysis (Technical Report #01-40). Department of Computer Science, University of Minnesota.
  46. Zhong, S. (2005). Efficient online spherical k-means clustering. In Proc. IEEE Int. Joint Conf. Neural Networks (pp. 3180–3185). Montreal, Canada.
  47. Zhong, S., & Ghosh, J. (2003). A unified framework for model-based clustering. Journal of Machine Learning Research, 4, 1001–1037.
  48. Zhong, S., & Ghosh, J. (2005). Generative model-based document clustering: A comparative study. Knowledge and Information Systems, 8, 374–384.
  49. Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. 20th Int. Conf. Machine Learning (pp. 912–919). Washington, DC.

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL