Text Classification from Partially Labeled Distributed Data
One of the main problems with text classification systems is the lack of labeled data, together with the cost of labeling unlabeled data. Thus, there is growing interest in exploring the combination of labeled and unlabeled data, i.e., partially labeled data, as a way to improve text classification performance. The ready availability of this kind of data in most applications makes it an appealing source of information.
Because the data are distributed and usually available online, the problem is well suited to distributed computing tools, such as those delivered by emerging GRID computing environments.
We evaluate the advantages obtained by blending supervised and unsupervised learning in an automatic text classifier based on support vector machines. We further evaluate the possibility of learning actively and propose a method for choosing the samples to be learned.
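The abstract does not spell out how unlabeled samples are used or how samples are chosen for active learning, so the following is only a minimal sketch of one common margin-based scheme, not the authors' method. It assumes a bias-free linear SVM trained by plain subgradient descent, a confidence threshold of 1.0 on the decision value, and made-up two-term count vectors; the names `train_linear_svm` and `split_pool` are hypothetical. Unlabeled samples that fall far from the decision boundary receive the predicted label (the semi-supervised step), while the sample closest to the boundary is proposed for manual labeling (the active step).

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, epochs=100, eta=0.1, lam=0.01):
    """Fit a bias-free linear SVM by subgradient descent on the
    L2-regularized hinge loss. Labels must be -1 or +1."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * dot(w, xi) < 1.0          # inside the margin?
            w = [(1.0 - eta * lam) * wj for wj in w]  # L2 shrinkage
            if violated:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w


def split_pool(w, pool, confident=1.0):
    """Score an unlabeled pool with the current classifier.

    Returns (pseudo_labels, query_index):
      pseudo_labels maps pool index -> predicted label for samples whose
        absolute decision value is at least `confident` (assumed threshold);
      query_index is the sample closest to the decision boundary, i.e. the
        one a human annotator would be asked to label.
    """
    scores = [dot(w, x) for x in pool]
    pseudo = {i: (1 if s > 0 else -1)
              for i, s in enumerate(scores) if abs(s) >= confident}
    query = min(range(len(pool)), key=lambda i: abs(scores[i]))
    return pseudo, query


# Made-up toy data: [count of term A, count of term B] per document.
X_labeled = [[3, 0], [2, 0], [0, 3], [0, 2]]
y_labeled = [1, 1, -1, -1]
X_pool = [[4, 1], [1, 4], [2, 2]]  # unlabeled documents

w = train_linear_svm(X_labeled, y_labeled)
pseudo_labels, query_index = split_pool(w, X_pool)

# Semi-supervised step: retrain with the confident pseudo-labels added.
X_aug = X_labeled + [X_pool[i] for i in pseudo_labels]
y_aug = y_labeled + [pseudo_labels[i] for i in pseudo_labels]
w_aug = train_linear_svm(X_aug, y_aug)

print("pseudo-labeled pool indices:", pseudo_labels)
print("pool index to query:", query_index)
```

In a distributed setting, scoring the pool is the step that parallelizes most naturally, since each node can evaluate its own share of the unlabeled documents against the same weight vector.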
Keywords: Text Classification, Unlabeled Data, Grid Infrastructure, Small Split, Improve Classification Performance