Text Classification from Partially Labeled Distributed Data

  • Catarina Silva
  • Bernardete Ribeiro
Conference paper


Two of the main problems facing text classification systems are the scarcity of labeled data and the cost of labeling unlabeled data [1]. There is therefore growing interest in combining labeled and unlabeled data, i.e., partially labeled data [2], to improve text classification performance. The ready availability of such data in most applications makes it an appealing source of information.

Because the data is distributed by nature and usually available online, the problem is well suited to distributed computing tools such as those delivered by emerging GRID computing environments.

We evaluate the advantages obtained by blending supervised and unsupervised learning in a support vector machine automatic text classifier. We further evaluate the possibility of learning actively and propose a method for choosing the samples to be learned.
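The abstract does not state how unlabeled samples are chosen, so the following is only a minimal sketch of one common heuristic, not the authors' method. It assumes a linear classifier standing in for a trained support vector machine, and uses distance to the decision boundary in the spirit of margin-based active learning [3]. Unlabeled samples far from the boundary are pseudo-labeled with the predicted class, and the samples closest to the boundary are proposed for manual labeling. The names `decision_value`, `split_unlabeled`, `confident_margin` and `n_queries` are hypothetical and introduced here for illustration only.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def decision_value(weights: Vector, bias: float, x: Vector) -> float:
    """Signed score w.x + b of a linear classifier (stand-in for a trained SVM)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def split_unlabeled(
    scores: Sequence[float],
    confident_margin: float = 1.0,
    n_queries: int = 1,
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Split unlabeled samples by their distance to the decision boundary.

    Returns (pseudo_labeled, to_query):
      pseudo_labeled: (index, label) pairs for samples with |score| >= confident_margin,
                      labeled +1 or -1 according to the sign of the score.
      to_query:       indices of the n_queries remaining samples with the smallest
                      |score|, i.e. the ones the classifier is least sure about.
    """
    pseudo_labeled = [
        (i, 1 if s > 0 else -1)
        for i, s in enumerate(scores)
        if abs(s) >= confident_margin
    ]
    # Everything inside the margin is a candidate for manual labeling.
    uncertain = [i for i, s in enumerate(scores) if abs(s) < confident_margin]
    uncertain.sort(key=lambda i: abs(scores[i]))
    return pseudo_labeled, uncertain[:n_queries]


if __name__ == "__main__":
    # Assumed toy model and toy document vectors, for illustration only.
    weights, bias = [1.0, -1.0], 0.0
    unlabeled = [[2.0, 0.0], [0.1, 0.0], [0.0, 1.5], [0.5, 0.9]]
    scores = [decision_value(weights, bias, x) for x in unlabeled]
    pseudo, query = split_unlabeled(scores, confident_margin=1.0, n_queries=1)
    print("pseudo-labeled:", pseudo)
    print("ask an annotator about:", query)
```

In a full loop, the pseudo-labeled and newly annotated samples would be added to the training set and the classifier retrained. The threshold and the number of queries per round are tuning choices that this sketch leaves open.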


Keywords (added by machine, not by the authors): Text Classification; Unlabeled Data; Grid Infrastructure; Small Split; Improve Classification Performance




References

  [1] S. Kiritchenko and S. Matwin, “Email Classification with Co-Training”, 2001 Conference of the Centre for Advanced Studies on Collaborative Research, pp. 8.
  [2] M. Szummer, “Learning from Partially Labeled Data”, PhD Thesis, Massachusetts Institute of Technology, 2002.
  [3] G. Schohn and D. Cohn, “Less is more: Active Learning with Support Vector Machines”, ICML, 2000, pp. 839–846.
  [4] M. Seeger, “Learning with Labeled and Unlabeled Data”, Technical Report, Institute for Adaptive and Neural Computation, Univ. Edinburgh, 2001.
  [5] C. Silva and B. Ribeiro, “On the Evaluation of Text Processing in Inductive Categorization”, ICMLA, 2003, pp. 121–127.
  [6] J. Hong and S. Cho, “Incremental Support Vector Machine for Unlabeled Data Classification”, ICONIP, 2002, pp. 1403–1407.
  [7] B. Liu, Y. Dai, X. Li, W. Lee and P. Yu, “Building Text Classifiers Using Positive and Unlabeled Examples”, ICDM, 2003, pp. 179.
  [8] S. Zelikovitz and H. Hirsh, “Improving Text Classification with LSI Using Background Knowledge”, ICIKM, 2001.
  [9] T. Joachims, “Transductive Inference for Text Classification using Support Vector Machines”, ICML, 1999, pp. 200–209.
  [10] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
  [11] C. Silva and B. Ribeiro, “Labeled and Unlabeled Data in Text Categorization”, IEEE IJCNN, 2004.
  [12] Y. Baram, R. El-Yaniv and K. Luz, “Online Choice of Active Learning Algorithms”, ICML, 2003, pp. 255–291.

Copyright information

© Springer-Verlag/Wien 2005

Authors and Affiliations

  • Catarina Silva (1, 2)
  • Bernardete Ribeiro (2)
  1. Escola Superior de Tecnologia e Gestão, Instituto Politécnico de Leiria, Portugal
  2. Departamento de Engenharia Informática, Centro de Informática e Sistemas, Universidade de Coimbra, Portugal
