Semi-supervised Document Classification with a Mislabeling Error Model

  • Anastasia Krithara
  • Massih R. Amini
  • Jean-Michel Renders
  • Cyril Goutte
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4956)


This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model [6] for text classification where the training set is only partially labeled. The proposed approach iteratively labels the unlabeled documents and estimates the probabilities of its own labeling errors. These probabilities are then taken into account when estimating the new model parameters before the next round. Our approach outperforms an earlier semi-supervised extension of PLSA, introduced in [9], which is based on the use of fake labels, while remaining simple and able to solve multiclass problems. In addition, it gives valuable information about the most uncertain and difficult classes to label. We perform experiments on the 20Newsgroups, WebKB and Reuters document collections and show the effectiveness of our approach compared with two other semi-supervised algorithms applied to these text classification problems.
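
The loop described in the abstract (label the unlabeled documents, estimate the mislabeling probabilities, re-estimate the model with those probabilities) can be illustrated with a minimal sketch. This is not the paper's PLSA-based model, whose update equations are not given on this page: a smoothed multinomial classifier stands in for PLSA, and the function names, the matrix `beta` (an estimate of P(assigned label = k | true label = h)), and the particular weighting rule are all assumptions made for illustration only.

```python
import math
from collections import Counter


def train(docs, weights, n_classes, vocab, alpha=1.0):
    """Fit smoothed multinomial class models from soft per-document class weights."""
    prior = [alpha] * n_classes
    word = [{w: alpha for w in vocab} for _ in range(n_classes)]
    for doc, wts in zip(docs, weights):
        counts = Counter(doc)
        for k in range(n_classes):
            prior[k] += wts[k]
            for w, c in counts.items():
                word[k][w] += wts[k] * c
    z = sum(prior)
    log_prior = [math.log(p / z) for p in prior]
    log_word = []
    for k in range(n_classes):
        total = sum(word[k].values())
        log_word.append({w: math.log(v / total) for w, v in word[k].items()})
    return log_prior, log_word


def posterior(doc, model, n_classes):
    """Class posterior for a document; every word must be in the training vocabulary."""
    log_prior, log_word = model
    scores = [log_prior[k] + sum(log_word[k][w] for w in doc)
              for k in range(n_classes)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def fit_with_error_model(labeled, labels, unlabeled, n_classes, rounds=5):
    """Illustrative loop: label, estimate mislabeling probabilities, refit."""
    vocab = {w for d in labeled + unlabeled for w in d}
    one_hot = [[1.0 if k == y else 0.0 for k in range(n_classes)] for y in labels]
    model = train(labeled, one_hot, n_classes, vocab)
    beta = None
    for _ in range(rounds):
        # Step 1: assign each unlabeled document its most probable class.
        posts = [posterior(d, model, n_classes) for d in unlabeled]
        guessed = [max(range(n_classes), key=p.__getitem__) for p in posts]
        # Step 2 (assumed estimator): beta[k][h] ~ P(assigned = k | true = h),
        # using the model posteriors as a proxy for the unknown true labels.
        beta = [[1e-3] * n_classes for _ in range(n_classes)]
        for p, k in zip(posts, guessed):
            for h in range(n_classes):
                beta[k][h] += p[h]
        for h in range(n_classes):
            col = sum(beta[k][h] for k in range(n_classes))
            for k in range(n_classes):
                beta[k][h] /= col
        # Step 3: reweight each unlabeled document by posterior * error probability,
        # then refit on labeled and unlabeled documents together.
        soft = []
        for p, k in zip(posts, guessed):
            w = [p[h] * beta[k][h] for h in range(n_classes)]
            z = sum(w)
            soft.append([x / z for x in w])
        model = train(labeled + unlabeled, one_hot + soft, n_classes, vocab)
    return model, beta
```

Calling `fit_with_error_model` on a handful of tokenized documents returns the fitted model and the estimated `beta` matrix. In this sketch, large off-diagonal entries of `beta` point to classes that are often confused with one another, which is one plausible reading of the abstract's remark about identifying the most uncertain classes.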


Keywords (added by machine, not by the authors): Unlabeled Data; Latent Variable Model; Latent Topic; Aspect Model; Label Error




References

  1. Amini, M.R., Gallinari, P.: The use of unlabeled data to improve supervised learning for text summarization. In: SIGIR, pp. 105–112 (2002)
  2. Amini, M.R., Gallinari, P.: Semi-supervised learning with an explicit label-error model for misclassified data. In: Proceedings of the 18th IJCAI, pp. 555–560 (2003)
  3. Blum, A., Mitchell, T.M.: Combining labeled and unlabeled data with co-training. In: COLT 1998, pp. 92–100 (1998)
  4. Collins, M., Singer, Y.: Unsupervised models for named entity classification. In: EMNLP/VLC (1999)
  5. Gaussier, E., Goutte, C.: Learning from partially labelled data - with confidence. In: Learning from Partially Classified Training Data - Proceedings of the ICML 2005 workshop (2005)
  6. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd ACM SIGIR, pp. 50–57 (1999)
  7. Joachims, T.: Transductive inference for text classification using support vector machines. In: Proceedings of ICML 1999, 16th International Conference on Machine Learning, pp. 200–209. Morgan Kaufmann Publishers, San Francisco (1999)
  8. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the European Conference on Machine Learning (1998)
  9. Krithara, A., Goutte, C., Renders, J.M., Amini, M.R.: Reducing the annotation burden in text classification. In: Proceedings of the 1st International Conference on Multidisciplinary Information Sciences and Technologies (InSciT 2006), Merida, Spain (October 2006)
  10. Lewis, D., Ringuette, M.: A comparison of two learning algorithms for text categorization. In: Proceedings of SDAIR 1994, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, US, pp. 81–93 (1994)
  11. McLernon, B., Kushmerick, N.: Transductive pattern learning for information extraction. In: Proc. Workshop Adaptive Text Extraction and Mining, Conf. European Association for Computational Linguistics (2006)
  12. Miller, D.J., Uyar, H.S.: A mixture of experts classifier with learning based on both labelled and unlabeled data. In: Proc. of NIPS (1997)
  13. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.M.: Text classification from labeled and unlabeled documents using EM. Machine Learning 39(2/3), 103–134 (2000)
  14. Saul, L., Pereira, F.: Aggregate and mixed-order Markov models for statistical language processing. In: Proc. of 2nd ICEMNLP (1997)
  15. Si, L., Callan, J.: A semi-supervised learning method to merge search engine results. ACM Transactions on Information Systems 24(4), 457–491 (2003)
  16. Slonim, N., Friedman, N., Tishby, N.: Unsupervised document classification using sequential information maximization. In: SIGIR, pp. 129–136 (2002)
  17. Zhang, T.: The value of unlabeled data for classification problems. In: ICML (2000)

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Anastasia Krithara (1)
  • Massih R. Amini (2)
  • Jean-Michel Renders (1)
  • Cyril Goutte (3)

  1. Xerox Research Centre Europe, chemin de Maupertuis, Meylan, France
  2. University Pierre et Marie Curie, Paris, France
  3. National Research Council Canada, Gatineau, Canada
