SVM based adaptive learning method for text classification from positive and unlabeled documents

  • Regular Paper
Knowledge and Information Systems

Abstract

Automatic text classification is one of the most important tools in information retrieval. This paper presents a novel text classifier that learns from positive and unlabeled (PU) examples. The primary challenge, compared with the classical text classification problem, is that the training set contains no labeled negative documents. First, we identify many more reliable negative documents, at a very low error rate, using an improved 1-DNF algorithm. Second, we build a set of classifiers by iteratively applying the SVM algorithm to a training set that is augmented at each iteration. Third, unlike previous PU-oriented text classification work, we construct the final classifier as a weighted vote of all classifiers generated during the iterations, rather than selecting a single one of them. Finally, we discuss an approach based on PSO (Particle Swarm Optimization) for setting the weights of this vote, which can discover the best combination of weights. In addition, to evaluate our method, we built a focused crawler based on link contexts and guided it with different classifiers. Comprehensive experiments were conducted on the Reuters data set and on thousands of web pages. The results show that our method improves performance (F1-measure) over PEBL, and that a focused web crawler guided by our PSO-based classifier outperforms several other classifiers in both harvest rate and target recall.
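As an illustration of the weighted vote mentioned in the abstract, the following minimal sketch combines the outputs of several classifiers into one decision and scores predictions with the F1-measure. It is not the authors' implementation: the function names `weighted_vote` and `f1_measure`, the toy threshold classifiers standing in for the iteratively trained SVMs, and the fixed weights (which in the paper would be searched by PSO) are all assumptions made for this example.

```python
from typing import Callable, Sequence

# A classifier maps a feature vector to +1 (positive) or -1 (negative).
Classifier = Callable[[Sequence[float]], int]


def weighted_vote(classifiers: Sequence[Classifier],
                  weights: Sequence[float],
                  x: Sequence[float]) -> int:
    """Return the sign of the weighted sum of the individual votes."""
    score = sum(w * clf(x) for clf, w in zip(classifiers, weights))
    return 1 if score > 0 else -1


def f1_measure(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 of the positive class (+1); returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t != 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p != 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical stand-ins for the classifiers produced at each SVM iteration.
toy_classifiers = [
    lambda x: 1 if x[0] > 0.5 else -1,
    lambda x: 1 if x[1] > 0.5 else -1,
    lambda x: 1 if x[0] + x[1] > 1.0 else -1,
]
# Hypothetical weights; the paper searches for these with PSO.
toy_weights = [0.2, 0.3, 0.5]

label = weighted_vote(toy_classifiers, toy_weights, (0.9, 0.2))
```

In this sketch, a weight search such as PSO would call `f1_measure` on validation predictions as its fitness function and adjust `toy_weights` accordingly.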

References

  1. Bing L, Wee SL, Philip S Yu, Xiaoli L (2002) Partially supervised classification of text documents. The nineteenth international conference on machine learning (ICML), Sydney, Australia, pp 384–397

  2. Bing L, Yang D, Xiaoli L, Wee SL, Philip S Yu (2003) Building text classifiers using positive and unlabeled examples. In: Proceedings of the 3rd IEEE international conference on data mining (ICDM 2003), Melbourne, Florida, USA, pp 179–188

  3. Carlisle A, Dozier G (2001) An Off-The-Shelf PSO. In: Proceedings of the workshop on particle swarm optimization, Indianapolis, pp 1–6

  4. Craven M, DiPasquo D, Freitag D, McCallum A, Mitchell T, Nigam K, Slattery S (1998) Learning to extract symbolic knowledge from the World Wide Web. In: Proceedings of the fifteenth national conference on artificial intelligence (AAAI-98), Madison, USA, pp 509–516

  5. De Falco I, Della Cioppa A, Iazzetta A and Tarantino E (2005). An evolutionary approach for automatically extracting intelligible classification rules. Knowl Inf Syst 7(2): 179–201

  6. Denis F (1998) PAC learning from positive statistical queries. In: Proceedings of the 9th international conference on algorithmic learning theory. Lecture Notes in Artificial Intelligence. vol 1501, Springer, Heidelberg, pp 112–126

  7. Denis F, Gilleron R, Tommasi M (2002) Text classification from positive and unlabeled examples. Conference on information processing and management of uncertainty in knowledge-based systems (IPMU), Annecy, France, pp 1927–1934

  8. DeComite F, Denis F, Gilleron R (1999) Positive and unlabeled examples help learning. In: Proceedings of the 10th international conference on algorithmic learning theory, Tokyo, Japan, pp 219–230

  9. Eberhart RC, Shi Y (2000) Comparing inertia weights and constriction factors in particle swarm optimization. In: Proceedings of the 2000 congress on evolutionary computation. Washington, DC, vol 1, pp 84–88

  10. Hwanjo Y, Jiawei H, Kevin Chen-Chuan Chang (2002) PEBL: Positive example based learning for Web page classification using SVM. In: Proceedings 8th international conference on knowledge discovery and data mining (KDD’02), Edmonton, Canada, pp 239–248

  11. Han ES, Karypis G, Kumar V (2001) Text categorization using weight adjusted k-nearest neighbor classification. In: Proceedings of the 5th Pacific-Asia conference on knowledge discovery and data mining, Hong Kong, pp 53–65

  12. Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings of the IEEE international conference on neural networks, Perth, Australia, vol 4, pp 1942–1948

  13. Lin WY and Kuo IC (2004). A genetic selection algorithm for OLAP data cubes. Knowl Inf Syst 6(1): 83–102

  14. Manevitz LM and Yousef M (2001). One-Class SVMs for document classification. J Mach Learn Res 2: 139–154

  15. Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. In: SIGIR ’94: Proceedings of the seventeenth annual international ACM SIGIR conference on research and development in information retrieval, Dublin, Ireland, pp 3–12

  16. Lang K (1995) NewsWeeder: Learning to Filter Netnews. In: Machine Learning: Proceedings of the twelfth international conference (ICML ’95), San Francisco, CA, USA, pp 331–339

  17. Lewis DD, Ringuette M (1994) A comparison of two learning algorithms for text categorization. In: Third annual symposium on document analysis and information retrieval, Las Vegas, US, pp 81–93

  18. Letouzey F, Denis F, Gilleron R (2000) Learning from positive and unlabeled examples. In: Proceedings of the 11th international conference on algorithmic learning theory, Sydney, Australia, pp 71–85

  19. Mukherjea S (2004). Discovering and analyzing World Wide Web collections. Knowl Inf Syst 6(2): 230–241

  20. Muslea I, Minton S, Knoblock CA (2002) Active + semi-supervised learning = robust multi-view learning. In: Proceedings of the nineteenth international conference on machine learning, Morgan Kaufmann Publishers Inc, San Francisco, USA, pp 435–442

  21. Merwe VD, Engelbrecht AP (2003) Data clustering using particle swarm optimisation. In: Proceedings of IEEE congress on evolutionary computation (CEC 2003), Canberra, Australia, pp 215–220

  22. Omran M, Salman A, Engelbrecht AP (2002) Image classification using particle swarm optimization. In: Proceedings of the 4th Asia-Pacific conference on simulated evolution and learning (SEAL 2002), Singapore, pp 370–374

  23. Pazzani MJ, Muramatsu J, Billsus D (1996) Syskill & Webert: Identifying interesting Web sites. In: Proceedings of the thirteenth national conference on artificial intelligence (AAAI-96), Portland, USA. AAAI Press/MIT Press, Cambridge, MA, pp 54–61

  24. Srinivasan P, Menczer F and Pant G (2005). A general evaluation framework for topical crawlers. Inf Retr 8(3): 417–447

  25. Salton G and Buckley C (1988). Term weighting approaches in automatic text retrieval. Inf Process Manage 24(5): 513–523

  26. Thorsten Joachims (1998) Text Categorization with support vector machines: learning with many relevant features. In: Proceedings of ECML-98, 10th European conference on machine learning, Heidelberg, Germany, pp 137–142

  27. Xiaoli L, Bing L (2003) Learning to classify text using positive and unlabeled data. In: Proceedings of eighteenth international joint conference on artificial intelligence (IJCAI-03), Acapulco, Mexico, pp 587–594

  28. Yang Y, Pedersen JP (1997) Feature selection in statistical learning of text categorization. In: Proceedings of the fourteenth international conference on machine learning, Nashville, Tennessee, USA, pp 412–420

  29. Zhihua Z, Shifu C and Zhaoqian C (2000). FANNC: A fast adaptive neural network classifier. Knowl Inf Syst 2(1): 115–129

Author information

Corresponding author

Correspondence to Tao Peng.

About this article

Cite this article

Peng, T., Zuo, W. & He, F. SVM based adaptive learning method for text classification from positive and unlabeled documents. Knowl Inf Syst 16, 281–301 (2008). https://doi.org/10.1007/s10115-007-0107-1

  • DOI: https://doi.org/10.1007/s10115-007-0107-1