Abstract
Automatic text classification is one of the most important tools in Information Retrieval. This paper presents a novel text classifier that learns from positive and unlabeled (PU) examples. The primary challenge of this problem, compared with the classical text classification problem, is that no labeled negative documents are available in the training set. First, we identify many more reliable negative documents using an improved 1-DNF algorithm with a very low error rate. Second, we build a set of classifiers by iteratively applying the SVM algorithm to a training data set that is augmented at each iteration. Third, unlike previous PU-oriented text classification work, we construct the final classifier from the weighted vote of all classifiers generated during the iterations, instead of choosing a single one of them. Finally, we discuss a PSO (Particle Swarm Optimization) based approach to evaluating this weighted vote, which can discover the best combination of weights. In addition, we built a focused crawler based on link-contexts, guided by different classifiers, to evaluate our method. Several comprehensive experiments were conducted using the Reuters data set and thousands of web pages. Experimental results show that our method improves performance (F1-measure) compared with PEBL, and that a focused web crawler guided by our PSO-based classifier outperforms several other classifiers in both harvest rate and target recall.
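The idea of combining the classifiers from all iterations by a weighted vote, with the weights found by PSO, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy scores, the PSO parameters (inertia, c1, c2), the restriction of weights to [0, 1], and the use of F1 on a small labeled validation set as the fitness function are all assumptions made for illustration. Each row of `scores` is assumed to hold the signed decision values that the individual classifiers assign to one document.

```python
import random

def weighted_vote(doc_scores, weights):
    """Label a document +1 or -1 from the weighted sum of classifier scores."""
    total = sum(w * s for w, s in zip(weights, doc_scores))
    return 1 if total >= 0 else -1

def f1_score(y_true, y_pred):
    """F1-measure for the positive class (+1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def fitness(weights, score_matrix, y_true):
    """Assumed fitness: F1 of the weighted vote on a validation set."""
    preds = [weighted_vote(row, weights) for row in score_matrix]
    return f1_score(y_true, preds)

def pso_weights(score_matrix, y_true, n_particles=20, n_iters=50,
                inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for vote weights in [0, 1] with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(score_matrix[0])  # one weight per classifier
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]  # personal best positions
    best_fit = [fitness(p, score_matrix, y_true) for p in pos]
    g = max(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g][:], best_fit[g]  # global best so far
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Clamp each weight to the assumed range [0, 1].
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i], score_matrix, y_true)
            if fit > best_fit[i]:
                best_fit[i], best_pos[i] = fit, pos[i][:]
                if fit > g_fit:
                    g_fit, g_pos = fit, pos[i][:]
    return g_pos, g_fit

# Hypothetical decision values of three classifiers on six validation documents.
scores = [
    [1.2, 0.3, -0.4],
    [0.8, -0.2, 0.5],
    [-1.0, 0.4, 0.1],
    [-0.7, -0.6, 0.3],
    [0.9, 0.1, -0.2],
    [-1.1, 0.2, -0.5],
]
labels = [1, 1, -1, -1, 1, -1]

weights, best_f1 = pso_weights(scores, labels)
print("weights:", weights, "F1:", best_f1)
```

Note that in a true PU setting no labeled negatives exist, so the labeled validation set used here is itself an assumption; the abstract does not state which fitness function the authors optimize.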
References
Bing L, Wee SL, Philip S Yu, Xiaoli L (2002) Partially supervised classification of text documents. The nineteenth international conference on machine learning (ICML), Sydney, Australia, pp 384–397
Bing L, Yang D, Xiaoli L, Wee SL, Philip S Yu (2003) Building text classifiers using positive and unlabeled examples. In: Proceedings of the 3rd IEEE international conference on data mining (ICDM 2003), Melbourne, Florida, USA, pp 179–188
Carlisle A, Dozier G (2001) An Off-The-Shelf PSO. In: Proceedings of the workshop on particle swarm optimization, Indianapolis, pp 1–6
Craven M, DiPasquo D, Freitag D, McCallum A, Mitchell T, Nigam K, Slattery S (1998) Learning to extract symbolic knowledge from the World Wide Web. In: Proceedings of the fifteenth national conference on artificial intelligence (AAAI-98), Madison, USA, pp 509–516
De Falco I, Della Cioppa A, Iazzetta A and Tarantino E (2005). An evolutionary approach for automatically extracting intelligible classification rules. Knowl Inf Syst 7(2): 179–201
Denis F (1998) PAC learning from positive statistical queries. In: Proceedings of the 9th international conference on algorithmic learning theory. Lecture Notes in Artificial Intelligence. vol 1501, Springer, Heidelberg, pp 112–126
Denis F, Gilleron R, Tommasi M (2002) Text classification from positive and unlabeled examples. Conference on information processing and management of uncertainty in knowledge-based systems (IPMU), Annecy, France, pp 1927–1934
DeComite F, Denis F, Gilleron R (1999) Positive and unlabeled examples help learning. In: Proceedings of the 10th international conference on algorithmic learning theory, Tokyo, Japan, pp 219–230
Eberhart RC, Shi Y (2000) Comparing inertia weights and constriction factors in particle swarm optimization. In: Proceedings of the 2000 congress on evolutionary computation. Washington, DC, vol 1, pp 84–88
Hwanjo Y, Jiawei H, Kevin Chen-Chuan Chang (2002) PEBL: Positive example based learning for Web page classification using SVM. In: Proceedings 8th international conference on knowledge discovery and data mining (KDD’02), Edmonton, Canada, pp 239–248
Han ES, Karypis G, Kumar V (2001) Text categorization using weight adjusted k-nearest neighbor classification. In: Proceedings of the 5th Pacific-Asia conference on knowledge discovery and data mining, Hong Kong, pp 53–65
Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings of the IEEE international conference on neural networks, Perth, Australia, vol 4, pp 1942–1948
Lin WY and Kuo IC (2004). A genetic selection algorithm for OLAP data cubes. Knowl Inf Syst 6(1): 83–102
Larry MM and Malik Yousef (2001). One-Class SVMs for document classification. J Mach Learn Res 2: 139–154
Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. In: SIGIR ’94: Proceedings of the seventeenth annual international ACM SIGIR conference on research and development in information retrieval, Dublin, Ireland, pp 3–12
Lang K (1995) NewsWeeder: Learning to Filter Netnews. In: Machine Learning: Proceedings of the twelfth international conference (ICML ’95), San Francisco, CA, USA, pp 331–339
Lewis DD, Ringuette M (1994) A comparison of two learning algorithms for text categorization. In: Third annual symposium on document analysis and information retrieval, Las Vegas, US, pp 81–93
Letouzey F, Denis F, Gilleron R (2000) Learning from positive and unlabeled examples. In: Proceedings of the 11th international conference on algorithmic learning theory, Sydney, Australia, pp 71–85
Mukherjea S (2004). Discovering and analyzing World Wide Web collections. Knowl Inf Syst 6(2): 230–241
Muslea I, Minton S, Knoblock CA (2002) Active + semi-supervised learning = robust multi-view learning. In: Proceedings of the nineteenth international conference on machine learning, Morgan Kaufmann Publishers Inc, San Francisco, USA, pp 435–442
Merwe VD, Engelbrecht AP (2003) Data clustering using particle swarm optimisation. In: Proceedings of IEEE congress on evolutionary computation (CEC 2003), Canberra, Australia, pp 215–220
Omran M, Salman A, Engelbrecht AP (2002) Image classification using particle swarm optimization. In: Proceedings of the 4th Asia-Pacific conference on simulated evolution and learning (SEAL 2002), Singapore, pp 370–374
Pazzani MJ, Muramatsu J, Billsus D (1996) Syskill & Webert: Identifying interesting Web sites. In: Proceedings of the thirteenth national conference on artificial intelligence (AAAI-96), Portland, USA. AAAI Press/MIT Press, Cambridge, MA, pp 54–61
Srinivasan P, Menczer F and Pant G (2005). A general evaluation framework for topical crawlers. Inf Retr 8(3): 417–447
Salton G and Buckley C (1988). Term weighting approaches in automatic text retrieval. Inf Process Manage 24(5): 513–523
Thorsten Joachims (1998) Text Categorization with support vector machines: learning with many relevant features. In: Proceedings of ECML-98, 10th European conference on machine learning, Heidelberg, Germany, pp 137–142
Xiaoli L, Bing L (2003) Learning to classify text using positive and unlabeled data. In: Proceedings of eighteenth international joint conference on artificial intelligence (IJCAI-03), Acapulco, Mexico, pp 587–594
Yang Y, Pedersen JP (1997) Feature selection in statistical learning of text categorization. In: Proceedings of the fourteenth international conference on machine learning, Nashville, Tennessee, USA, pp 412–420
Zhihua Z, Shifu C and Zhaoqian C (2000). FANNC: A fast adaptive neural network classifier. Knowl Inf Syst 2(1): 115–129
Cite this article
Peng, T., Zuo, W. & He, F. SVM based adaptive learning method for text classification from positive and unlabeled documents. Knowl Inf Syst 16, 281–301 (2008). https://doi.org/10.1007/s10115-007-0107-1
DOI: https://doi.org/10.1007/s10115-007-0107-1