A Novel Active Learning Method Using SVM for Text Classification

  • Research Article
International Journal of Automation and Computing

Abstract

Support vector machines (SVMs) are a popular class of supervised learning algorithms and are particularly well suited to large, high-dimensional classification problems. Like most machine learning methods for data classification and information retrieval, they require manually labeled samples in the training stage. However, manual labeling is a time-consuming and error-prone task. One possible solution is to exploit the large number of unlabeled samples that are easily accessible via the internet. This paper presents a novel active learning method for text categorization. The main objective of active learning is to reduce the labeling effort, without compromising classification accuracy, by intelligently selecting which samples should be labeled. The proposed method selects a batch of informative samples using the posterior probabilities provided by a set of multi-class SVM classifiers, and these samples are then manually labeled by an expert. Experimental results indicate that the proposed active learning method significantly reduces the labeling effort while also improving classification accuracy.
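The batch-selection step described in the abstract can be illustrated with a short sketch. The abstract does not state the exact selection criterion, so the code below assumes a common uncertainty heuristic: posteriors are averaged over the classifiers, and the samples with the smallest gap between their two most probable classes are sent to the expert. The function names (`average_posteriors`, `margin`, `select_batch`) and the toy probabilities are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

# per_classifier[k][i][c] is classifier k's estimate of P(class c | sample i).
Posteriors = Sequence[Sequence[Sequence[float]]]


def average_posteriors(per_classifier: Posteriors) -> List[List[float]]:
    """Average the class posteriors of each unlabeled sample over all classifiers."""
    n_classifiers = len(per_classifier)
    n_samples = len(per_classifier[0])
    n_classes = len(per_classifier[0][0])
    return [
        [sum(clf[i][c] for clf in per_classifier) / n_classifiers
         for c in range(n_classes)]
        for i in range(n_samples)
    ]


def margin(probs: Sequence[float]) -> float:
    """Gap between the two largest posteriors; a small gap means an uncertain sample."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def select_batch(per_classifier: Posteriors, batch_size: int) -> List[int]:
    """Return indices of the batch_size most uncertain samples (assumed criterion)."""
    avg = average_posteriors(per_classifier)
    ranked = sorted(range(len(avg)), key=lambda i: margin(avg[i]))
    return ranked[:batch_size]


# Toy pool: two classifiers, four unlabeled documents, three classes.
clf_a = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.34, 0.33, 0.33], [0.10, 0.80, 0.10]]
clf_b = [[0.80, 0.10, 0.10], [0.50, 0.30, 0.20], [0.30, 0.40, 0.30], [0.20, 0.70, 0.10]]
pool_posteriors = [clf_a, clf_b]

# Indices of the documents an expert would be asked to label in this round.
to_label = select_batch(pool_posteriors, batch_size=2)
print(to_label)
```

In a full active learning loop, the selected documents would be labeled, moved from the unlabeled pool to the training set, and the classifiers retrained before the next round of selection.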

Author information

Corresponding author

Correspondence to Mohamed Goudjil.

Additional information

Recommended by Editor-in-Chief Huo-Sheng Hu

Mohamed Goudjil received the M.Sc. degree in computer engineering from Boumerdes University, Algeria in 2008. He is currently a Ph.D. candidate in computer engineering at Ecole nationale Supérieure d’Informatique (ESI), Algeria. From 2005 to 2008, he was a researcher at Advanced Technologies & Research Centre, and he lectured for seven years at different universities.

His research interests include text classification, Arabic language processing and machine learning.

Mouloud Koudil received the Ph.D. degree in computer science from l’Ecole nationale Supérieure d’Informatique (ESI), Algeria in 2002. He is currently a full-time professor and rector of the same institution.

His research interests include wireless sensor networks, networks on chips, and hardware/software codesign.

Mouldi Bedda received the Ph.D. degree in electrical engineering from the University Nancy 2, France in 1985. From 1985 to 2006, he worked with the University Badji Mokhtar Annaba, Algeria. He was the director of Automatic and Signals Laboratory from 2001 to 2006. Since 2006, he has been a full professor at the College of Engineering of Al Jouf University, KSA. He has supervised several Ph.D. students in speech processing, biomedical signals, handwriting recognition and image processing.

His research interests include speech processing, biomedical signals, handwriting recognition and image processing.

Noureddine Ghoggali received the State Engineer degree in electronics from the University of Batna, Algeria in 2000, and the Ph.D. degree in information and communication technologies from the Department of Information Engineering and Computer Science, University of Trento, Italy. He is currently an assistant professor at the University of Batna, Algeria.

His research interests include pattern recognition and evolutionary computation methodologies for remote sensing image analysis.

Cite this article

Goudjil, M., Koudil, M., Bedda, M. et al. A Novel Active Learning Method Using SVM for Text Classification. Int. J. Autom. Comput. 15, 290–298 (2018). https://doi.org/10.1007/s11633-015-0912-z

  • DOI: https://doi.org/10.1007/s11633-015-0912-z
