SETRED: Self-training with Editing

  • Ming Li
  • Zhi-Hua Zhou
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3518)


Self-training is a semi-supervised learning algorithm in which a learner keeps on labeling unlabeled examples and retraining itself on an enlarged labeled training set. Since the self-training process may erroneously label some unlabeled examples, sometimes the learned hypothesis does not perform well. In this paper, a new algorithm named Setred is proposed, which utilizes a specific data editing method to identify and remove the mislabeled examples from the self-labeled data. In detail, in each iteration of the self-training process, the local cut edge weight statistic is used to help estimate whether a newly labeled example is reliable or not, and only the reliable self-labeled examples are used to enlarge the labeled training set. Experiments show that the introduction of data editing is beneficial, and the learned hypotheses of Setred outperform those learned by the standard self-training algorithm.


Generalization Ability Unlabeled Data Base Learner Data Editing Computational Learn Theory 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory, New York, NY, pp. 92–100 (1998)Google Scholar
  2. 2.
    Cohen, I., Cozman, F.G., Sebe, N., Cirelo, M.C., Huang, T.S.: Semisupervised learning of classifier: theory, algorithms, and their application to human-computer interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 1553–1567 (2004)CrossRefGoogle Scholar
  3. 3.
    Jiang, Y., Zhou, Z.-H.: Editing training data for kNN classifiers with neural network ensemble. In: Yin, F.-L., Wang, J., Guo, C. (eds.) ISNN 2004. LNCS, vol. 3173, pp. 356–361. Springer, Heidelberg (2004)CrossRefGoogle Scholar
  4. 4.
    Joachims, T.: Transductive inference for text classification using support vector machines. In: Proceedings of the 16th International Conference on Machine Learning, San Francisco, CA, pp. 200–209 (1999)Google Scholar
  5. 5.
    Lewis, D., Gale, W.: A sequential algorithm for training text classifiers. In: Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval, Dublin, Ireland, pp. 3–12 (1994)Google Scholar
  6. 6.
    McCallum, A., Nigam, K.: Employing EM in pool-based active learning for text classification. In: Proceedings of the 15th International Conference on Machine Learning, Madison, WI, pp. 359–367 (1998)Google Scholar
  7. 7.
    Muhlenbach, F., Lallich, S., Zighed, D.A.: Identifying and handling mislabelled instances. Journal of Intelligent Information Systems 39, 89–109 (2004)CrossRefGoogle Scholar
  8. 8.
    Muslea, I., Minton, S., Knoblock, C.A.: Selective sampling with redundant views. In: Proceeding of the 17th International Conference on Machine Learning, Stanford, CA, pp. 621–626 (2000)Google Scholar
  9. 9.
    Muslea, I., Minton, S., Knoblock, C.A.: Active + semi-supervised learning = robust multi-view learning. In: Proceeding of the 19th International Conference on Machine Learning, Sydney, Australia, pp. 435–442 (2002)Google Scholar
  10. 10.
    Nigam, K., Ghani, R.: Analyzing the effectiveness and applicabilbity of co-training. In: Proceedings of the 9th International Conference on Information and Knowledge Management, Washington, DC, pp. 86–93 (2000)Google Scholar
  11. 11.
    Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Machine Learning 39, 103–134 (2000)zbMATHCrossRefGoogle Scholar
  12. 12.
    Sarkar, A.: Applying co-training methods to statistical parsing. In: Proceedings of the 2nd Annual Meeting of the North American Chapter of the Association of Computational Linguistics, Pittsburgh, PA, pp. 95–102 (2001)Google Scholar
  13. 13.
    Seeger, M.: Learning with labeled and unlabeled data. Technical Report, University of Edinburgh, Edinburgh, UK (2001)Google Scholar
  14. 14.
    Seuong, H., Opper, M., Sompolinski, H.: Query by committee. In: Proceedings of the 5th ACM Workshop on Computational Learning Theory, Pittsburgh, PA, pp. 287–294 (1992)Google Scholar
  15. 15.
    Wilson, D.R.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics 2, 408–421 (1972)zbMATHCrossRefGoogle Scholar
  16. 16.
    Zhou, Z.-H., Chen, K.-J., Jiang, Y.: Exploiting unlabeled data in content-based image retrieval. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 525–536. Springer, Heidelberg (2004)CrossRefGoogle Scholar
  17. 17.
    Zighed, D.A., Lallich, S., Muhlenbach, F.: Separability index in supervised learning. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) PKDD 2002. LNCS (LNAI), vol. 2431, pp. 475–487. Springer, Heidelberg (2002)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Ming Li
    • 1
  • Zhi-Hua Zhou
    • 1
  1. 1.National Laboratory for Novel Software TechnologyNanjing UniversityNanjingChina

Personalised recommendations