Building a Text Classifier by a Keyword and Wikipedia Knowledge

  • Qiang Qiu
  • Yang Zhang
  • Junping Zhu
  • Wei Qu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5678)


Traditional approaches to building text classifiers usually require a large number of labeled documents, which are expensive to obtain. In this paper, we propose a new text classification approach based on a single keyword and Wikipedia knowledge, which avoids labeling documents manually. First, we retrieve a set of Wikipedia documents related to the keyword. Then, with the help of these related Wikipedia pages, we extract more positive documents from the unlabeled documents. Finally, we train a text classifier on these positive documents and the unlabeled documents. Experimental results on the 20 Newsgroups dataset show that the proposed approach performs competitively with NB-SVM, a PU (positive and unlabeled) learner, and NB (naive Bayes), a supervised learner.
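The three steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how Wikipedia pages are retrieved, how positive documents are selected, or which classifier is trained. The sketch assumes the related Wikipedia pages are already available as plain strings, selects positives by cosine similarity to a term-frequency centroid with a hypothetical `threshold`, and trains a small naive Bayes model that treats the remaining unlabeled documents as negatives, a common PU-learning simplification. All function names, the stopword list, and the threshold value are illustrative assumptions.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "is", "on", "in", "and", "with", "from", "for", "of", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic non-stopword tokens."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_positives(wiki_pages, unlabeled, threshold=0.2):
    """Step 2 (assumed form): unlabeled documents close to the Wikipedia
    centroid become positives; the rest stay unlabeled."""
    centroid = Counter()
    for page in wiki_pages:
        centroid.update(tokenize(page))
    positives, rest = [], []
    for doc in unlabeled:
        sim = cosine(centroid, Counter(tokenize(doc)))
        (positives if sim >= threshold else rest).append(doc)
    return positives, rest


def train_nb(positives, rest):
    """Step 3 (assumed form): multinomial naive Bayes, with the remaining
    unlabeled documents treated as negatives. Assumes both lists are non-empty."""
    counts = {1: Counter(), 0: Counter()}
    for label, docs in ((1, positives), (0, rest)):
        for doc in docs:
            counts[label].update(tokenize(doc))
    vocab = set(counts[0]) | set(counts[1])
    total_docs = len(positives) + len(rest)
    priors = {1: len(positives) / total_docs, 0: len(rest) / total_docs}
    return counts, vocab, priors


def predict(model, doc):
    """Return 1 if the document is classified as on-topic, else 0."""
    counts, vocab, priors = model
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log(priors[label])
        for tok in tokenize(doc):
            if tok in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothing
                score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0


# Step 1 is mocked: pages that a keyword search for "hockey" might return.
wiki_pages = [
    "ice hockey is a team sport played on ice with sticks and a puck",
    "a hockey goalie stops the puck from entering the net",
]
unlabeled = [
    "the goalie stopped the puck in the hockey game",
    "the team played hockey on ice last night",
    "the new graphics card renders images quickly",
    "install the driver for the graphics card on windows",
]

positives, rest = extract_positives(wiki_pages, unlabeled)
model = train_nb(positives, rest)
print(predict(model, "hockey puck on ice"))
print(predict(model, "graphics driver update"))
```

A real system would replace the mocked page list with an actual Wikipedia retrieval step and would likely use a stronger PU learner than this baseline.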


Keywords: text classification, keyword, unlabeled document, Wikipedia





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Qiang Qiu (1)
  • Yang Zhang (1)
  • Junping Zhu (1)
  • Wei Qu (1)
  1. College of Information Engineering, Northwest A&F University, Yangling
