Tri-training and Data Editing Based Semi-supervised Clustering Algorithm

  • Chao Deng
  • Mao Zu Guo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4293)

Abstract

Seed-based semi-supervised clustering algorithms typically use a seeds set, consisting of a small amount of labeled data, to initialize cluster centroids and thereby improve clustering performance over the whole data set. Research indicates that both the scale and the quality of the seeds set strongly constrain the performance of semi-supervised clustering. This paper proposes a novel semi-supervised clustering algorithm named DE-Tri-training semi-supervised K-means. In the new algorithm, before the cluster centroids are initialized, the training process of the semi-supervised classification approach Tri-training is used to label unlabeled data and add them to the initial seeds set, enlarging its scale. Meanwhile, to improve the quality of the enlarged seeds set, a Nearest Neighbor Rule based data editing technique named Depuration is introduced into the Tri-training process to remove or correct noisy and mislabeled data among the enlarged seeds. Experiments show that the new algorithm can effectively improve the initialization of cluster centroids and enhance clustering performance.
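
Algorithm Sketch

The following is a minimal Python sketch of the pipeline described in the abstract: enlarge the seeds set with a simplified Tri-training step, edit it with Depuration, and initialize K-means from the per-class centroids of the edited seeds. It assumes NumPy and scikit-learn; the decision-tree base classifiers, the all-three-agree labeling rule, and the (k = 3, k' = 2) editing thresholds are illustrative assumptions, not the authors' exact implementation, and the error-rate conditions of the full Tri-training procedure are omitted for brevity.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    def depuration(X, y, k=3, k_prime=2):
        """NN-Rule based editing (Depuration): relabel a sample to the class
        held by at least k' of its k nearest neighbors; discard it otherwise."""
        keep, labels = [], []
        for i in range(len(X)):
            others = np.delete(np.arange(len(X)), i)
            dists = np.linalg.norm(X[others] - X[i], axis=1)
            nn = y[others[np.argsort(dists)[:k]]]
            classes, counts = np.unique(nn, return_counts=True)
            if counts.max() >= k_prime:
                keep.append(i)
                labels.append(classes[counts.argmax()])
        return X[keep], np.array(labels)

    def tri_train_enlarge(X_seed, y_seed, X_unl, rounds=5):
        """Simplified Tri-training step: three classifiers are trained on
        bootstrap replicates of the seeds; unlabeled points whose label all
        three agree on are moved into the enlarged seeds set."""
        clfs = [DecisionTreeClassifier(random_state=r).fit(
                    *resample(X_seed, y_seed, random_state=r))
                for r in range(3)]
        X_enl, y_enl = X_seed, y_seed
        for _ in range(rounds):
            if len(X_unl) == 0:
                break
            preds = np.array([c.predict(X_unl) for c in clfs])
            agree = (preds[0] == preds[1]) & (preds[1] == preds[2])
            if not agree.any():
                break
            X_enl = np.vstack([X_enl, X_unl[agree]])
            y_enl = np.concatenate([y_enl, preds[0][agree]])
            X_unl = X_unl[~agree]
            clfs = [c.fit(X_enl, y_enl) for c in clfs]
        return X_enl, y_enl

    def de_tri_training_kmeans(X_seed, y_seed, X_unl, X_all):
        # Enlarge the seeds with Tri-training, clean them with Depuration,
        # then initialize K-means from the per-class seed centroids.
        X_enl, y_enl = tri_train_enlarge(X_seed, y_seed, X_unl)
        X_ed, y_ed = depuration(X_enl, y_enl)
        classes = np.unique(y_ed)
        init = np.array([X_ed[y_ed == c].mean(axis=0) for c in classes])
        return KMeans(n_clusters=len(classes), init=init, n_init=1).fit(X_all)

Given NumPy arrays X_seed, y_seed (the labeled seeds), X_unl (unlabeled data), and X_all (the whole data set), de_tri_training_kmeans(X_seed, y_seed, X_unl, X_all) returns a fitted K-means model whose centroids were initialized from the enlarged and edited seeds rather than at random.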

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Chao Deng (1)
  • Mao Zu Guo (1)

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, P.R. China
