Semi-supervised learning in large scale text categorization

  • Zewen Xu (许泽文)
  • Jianqiang Li (李建强)
  • Bo Liu (刘 博)
  • Jing Bi (毕 敏)
  • Rong Li (李 蓉)
  • Rui Mao (毛 睿)


The rapid development of the Internet has produced a wide variety of raw information, including text and audio. Because of the sheer volume of this information, however, it is difficult to find the most useful knowledge quickly and accurately. Automatic text classification based on machine learning can assign large numbers of natural language documents to the corresponding subject categories according to their semantics, which makes the text information easier to grasp. A traditional supervised classifier for text categorization (TC) is obtained by learning from a set of hand-labeled documents, but labeling all of the data by hand is labor intensive and time consuming. To address this problem, some scholars have proposed semi-supervised learning methods for training the classifier; these remain infeasible for the variety and volume of Web data because they still need some hand-labeled data. In 2012, Li et al. proposed a fully automatic categorization approach for text (FACT) based on supervised learning, in which no manual labeling effort is required. However, automatically labeling all of the data can introduce noise, so the results may fail to meet the accuracy requirement. We put forward a new idea: a portion of the data is automatically tagged with high accuracy based on the semantics of the category names, a semi-supervised method is then used to train the classifier on both labeled and unlabeled data, and ultimately a precise classification of massive text data can be achieved. Empirical experiments show that the method outperforms the supervised support vector machine (SVM) in terms of both F1 performance and classification accuracy in most cases, which demonstrates the effectiveness of the semi-supervised algorithm in automatic TC.
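The two-stage idea in the abstract (tag a confident subset automatically, then train on labeled and unlabeled data together) can be illustrated with a minimal sketch. This is not the authors' implementation. Plain keyword matching stands in for the semantic analysis of category names, and a nearest-centroid self-training loop stands in for the semi-supervised SVM. The function names, the `category_terms` dictionary, the thresholds, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter
import math


def tokenize(text):
    # Lowercased whitespace tokens; a real system would also stem and drop stop words.
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def auto_label(docs, category_terms, min_hits=2):
    # Stage 1 (assumed rule): tag a document only when one category matches
    # at least min_hits of its terms and no other category matches at all.
    labels = {}
    for i, doc in enumerate(docs):
        tokens = set(tokenize(doc))
        hits = {c: len(tokens & terms) for c, terms in category_terms.items()}
        best = max(hits, key=hits.get)
        others = [h for c, h in hits.items() if c != best]
        if hits[best] >= min_hits and all(h == 0 for h in others):
            labels[i] = best
    return labels


def centroids(docs, labels):
    # Sum the term counts of all documents currently labeled with each category.
    cents = {}
    for i, c in labels.items():
        cents.setdefault(c, Counter()).update(tokenize(docs[i]))
    return cents


def self_train(docs, labels, threshold=0.3, max_rounds=10):
    # Stage 2 (stand-in for the semi-supervised SVM): repeatedly move
    # unlabeled documents that are close enough to a centroid into the labeled set.
    labels = dict(labels)
    for _ in range(max_rounds):
        cents = centroids(docs, labels)
        added = {}
        for i, doc in enumerate(docs):
            if i in labels:
                continue
            vec = Counter(tokenize(doc))
            scores = {c: cosine(vec, cent) for c, cent in cents.items()}
            best = max(scores, key=scores.get)
            if scores[best] >= threshold:
                added[i] = best
        if not added:
            break
        labels.update(added)
    return labels


# Toy usage with made-up documents and category terms.
category_terms = {
    "sports": {"sports", "football", "match"},
    "finance": {"finance", "bank", "stock"},
}
docs = [
    "football match tonight sports fans cheer",
    "bank stock prices finance report",
    "fans cheer the team goal",
    "prices report shows market growth",
    "weather is sunny today",
]
seed = auto_label(docs, category_terms)
final = self_train(docs, seed)
print(seed)
print(final)
```

In this toy run only the first two documents are tagged by the keyword rule; the self-training loop then labels the two documents that share vocabulary with them and leaves the unrelated one unlabeled. A stricter seed rule trades coverage for precision, which mirrors the motivation given in the abstract for tagging only part of the data.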


text data mining, semi-supervised, automatic tagging, classifier

CLC number

TP 391.1, TP 311




  [1] LI J Q, ZHAO Y, LIU B. Exploiting semantic resources for large scale text categorization [J]. Journal of Intelligent Information Systems, 2012, 39(3): 763–788.
  [2] MIYATO T, DAI A M, GOODFELLOW I. Virtual adversarial training for semi-supervised text classification [EB/OL]. (2016-07-22).
  [3] YIN C Y, XIANG J, ZHANG H, et al. A new SVM method for short text classification based on semisupervised learning [C]//2015 4th International Conference on Advanced Information Technology and Sensor Application. Dubai, UAE: IEEE, 2015: 100–103.
  [4] JOHNSON R, ZHANG T. Semi-supervised convolutional neural networks for text categorization via region embedding [J]. Advances in Neural Information Processing Systems, 2015, 28: 919–927.
  [5] JOHNSON R, ZHANG T. Supervised and semisupervised text categorization using LSTM for region embeddings [C]//Proceedings of the 33rd International Conference on Machine Learning. New York, USA: JMLR W&CP, 2016: 1–9.
  [6] SEBASTIANI F. Machine learning in automated text categorization [J]. ACM Computing Surveys, 2002, 34(1): 1–47.
  [7] JOACHIMS T. Transductive inference for text classification using support vector machines [C]//Proceedings of the 16th International Conference on Machine Learning. Bled, Slovenia: [s.n.], 1999: 200–209.
  [8] SIOLAS G, D’ALCHÉ-BUC F. Support vector machines based on a semantic kernel for text categorization [C]//Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. Washington, USA: IEEE, 2000: 205–209.
  [9] BASILI R, CAMMISA M, MOSCHITTI A. Effective use of WordNet semantics via kernel-based learning [C]//Proceedings of the 9th Conference on Computational Natural Language Learning. Ann Arbor, USA: Association for Computational Linguistics, 2005: 1–8.
  [10] GABRILOVICH E, MARKOVITCH S. Feature generation for text categorization using world knowledge [C]//International Joint Conference on Artificial Intelligence. [s.l.]: Morgan Kaufmann Publishers Inc, 2005: 1048–1053.
  [11] WANG P, DOMENICONI C. Building semantic kernels for text classification using Wikipedia [C]//ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA: ACM, 2008: 713–721.
  [12] CHAPELLE O, SCHÖLKOPF B, ZIEN A. Semisupervised learning [M]. London, England: MIT Press, 2006.
  [13] SINDHWANI V, KEERTHI S S. Large scale semisupervised linear SVMs [C]//International ACM SIGIR Conference on Research and Development in Information Retrieval. Washington, USA: ACM, 2006: 477–484.
  [14] SINDHWANI V, KEERTHI S S. Newton methods for fast solution of semi-supervised linear SVMs [EB/OL]. (2016-07-22). http: // viewdoc/download.
  [15] LI C H, YANG J C, PARK S C. Text categorization algorithms using semantic approaches, corpus-based thesaurus and WordNet [J]. Expert Systems with Applications, 2012, 39: 765–772.
  [16] FOX-ROBERTS P, ROSTEN E. Unbiased generative semi-supervised learning [J]. Journal of Machine Learning Research, 2014, 15: 367–443.
  [17] SHANG F H, JIAO L C, LIU Y Y, et al. Semisupervised learning with nuclear norm regularization [J]. Pattern Recognition, 2013, 46(8): 2323–2336.
  [18] WANG J, JEBARA T, CHANG S F. Semi-supervised learning using greedy max-cut [J]. Journal of Machine Learning Research, 2013, 14: 729–758.
  [19] CHENG S, SHI Y H, QIN Q D. Particle swarm optimization based semi-supervised learning on Chinese text categorization [C]//Proceedings of the 2012 IEEE Congress on Evolutionary Computation. Brisbane, Australia: IEEE, 2012: 1–8.
  [20] LENG Y, XU X Y, QI G H. Combining active learning and semi-supervised learning to construct SVM classifier [J]. Knowledge-Based Systems, 2013, 44(1): 121–131.
  [21] LI J Q, LIU C C, LIU B, et al. Diversity-aware retrieval of medical records [J]. Computers in Industry, 2015, 69(1): 81–91.
  [22] YANG J M, LIU Y N, ZHU X D, et al. A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization [J]. Information Processing and Management, 2012, 48(4): 741–754.
  [23] BREVE F, ZHAO L, QUILES M, et al. Particle competition and cooperation in networks for semisupervised learning [J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 24(9): 1686–1698.
  [24] LI J Q, WANG F. Semi-supervised learning via mean field methods [J]. Neurocomputing, 2016, 177: 385–393.

Copyright information

© Shanghai Jiaotong University and Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  • Zewen Xu (许泽文) 1, 2
  • Jianqiang Li (李建强) 1, 2, 3, 4
  • Bo Liu (刘 博) 1
  • Jing Bi (毕 敏) 1
  • Rong Li (李 蓉) 1
  • Rui Mao (毛 睿) 3, 4

  1. School of Software Engineering, Beijing University of Technology, Beijing, China
  2. Beijing Engineering Research Center for IoT Software and Systems, Beijing University of Technology, Beijing, China
  3. Guangdong Key Laboratory of Popular High Performance Computers, Shenzhen University, Guangdong, China
  4. Shenzhen Key Laboratory of Service Computing and Applications, Shenzhen University, Guangdong, China
