A Novel Active Learning Method Using SVM for Text Classification

  • Mohamed Goudjil
  • Mouloud Koudil
  • Mouldi Bedda
  • Noureddine Ghoggali
Research Article

Abstract

Support vector machines (SVMs) are a popular class of supervised learning algorithms and are particularly well suited to large, high-dimensional classification problems. Like most machine learning methods for data classification and information retrieval, they require manually labeled data samples in the training stage. However, manual labeling is a time-consuming and error-prone task. One possible solution to this issue is to exploit the large number of unlabeled samples that are easily accessible via the internet. This paper presents a novel active learning method for text categorization. The main objective of active learning is to reduce the labeling effort, without compromising classification accuracy, by intelligently selecting which samples should be labeled. The proposed method selects a batch of informative samples using the posterior probabilities provided by a set of multi-class SVM classifiers, and these samples are then manually labeled by an expert. Experimental results indicate that the proposed active learning method significantly reduces the labeling effort while also improving classification accuracy.
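The batch selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact selection rule, so the code below assumes a common one: rank unlabeled samples by the gap between their two largest class posteriors and pick those with the smallest gap. The function names `margin` and `select_batch` and the posterior values are hypothetical, and in practice the posteriors would come from a trained multi-class SVM rather than being typed in by hand.

```python
from typing import List, Sequence


def margin(posteriors: Sequence[float]) -> float:
    """Gap between the two largest class posteriors (small gap = uncertain)."""
    top = sorted(posteriors, reverse=True)
    return top[0] - top[1]


def select_batch(pool_posteriors: Sequence[Sequence[float]],
                 batch_size: int) -> List[int]:
    """Return indices of the batch_size pool samples with the smallest margin.

    Assumed criterion for illustration only; the paper's own rule may differ.
    """
    ranked = sorted(range(len(pool_posteriors)),
                    key=lambda i: margin(pool_posteriors[i]))
    return ranked[:batch_size]


# Hypothetical posteriors for four unlabeled documents over three classes.
pool = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # uncertain
    [0.50, 0.48, 0.02],  # nearly tied between two classes
    [0.70, 0.20, 0.10],  # fairly confident
]

# Indices of the two samples that would be sent to the expert for labeling.
batch = select_batch(pool, batch_size=2)
print(batch)
```

In a full pool-based loop, the selected samples would be labeled by the expert, moved from the pool to the training set, and the classifiers retrained before the next selection round.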

Keywords

Text categorization, active learning, support vector machine (SVM), pool-based active learning, pairwise coupling

Copyright information

© Institute of Automation, Chinese Academy of Sciences and Springer-Verlag GmbH Germany, part of Springer Nature 2016

Authors and Affiliations

  1. École nationale Supérieure d’Informatique (ESI), Oued Smar, Algiers, Algeria
  2. AL Jouf University, Sakaka, Kingdom of Saudi Arabia
  3. LAAAS laboratory, Faculté de Technologie, Université Batna 2, Batna, Algeria
