Abstract
Large amounts of unlabeled data exist across many domains, and effective machine learning techniques are needed to exploit them efficiently. Semi-supervised learning (SSL) methods extract useful information from such unlabeled data. In this study, Incremental Parallel Training with Cross-Validation (IPT-CV) is proposed as a novel semi-supervised learning method. The method employs several classifiers and different views of the datasets to label the unlabeled data efficiently. In each round, the classifiers work in parallel and enlarge the labeled set according to a validation rule. The method was compared with two well-known SSL methods from the literature. The web was chosen as the experimental domain because it holds an abundance of unlabeled documents. Nine binary classification datasets were drawn from the publicly available WebKB and Banksearch datasets and from the individually collected Conference dataset. Statistical analysis of the results showed that IPT-CV achieved the highest classification accuracy among the methods examined.
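The abstract gives only the outline of the method: several classifiers, several views, and a labeled set that grows each round under a validation rule. The following Python sketch illustrates that general pattern and should not be read as the authors' algorithm. Everything specific in it is an assumption made for illustration: the name `ipt_cv_sketch`, the nearest-centroid stand-in classifier, the representation of a view as a tuple of feature indices, and in particular the validation rule (a model is trusted only if its k-fold cross-validated accuracy on the labeled set reaches `threshold`, and an unlabeled example is added only when all trusted models agree on its label).

```python
from statistics import mean

def fit_centroids(X, y):
    """Stand-in classifier (assumption): one mean vector per class."""
    cents = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [mean(col) for col in zip(*rows)]
    return cents

def predict(cents, x):
    # Label of the closest centroid by squared Euclidean distance.
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

def cv_accuracy(X, y, k=3):
    """k-fold cross-validated accuracy on the labeled set."""
    hits = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        cents = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits += sum(predict(cents, X[i]) == y[i] for i in test)
    return hits / len(X)

def project(X, view):
    # A "view" here is a tuple of feature indices (assumption).
    return [[x[j] for j in view] for x in X]

def ipt_cv_sketch(X_lab, y_lab, X_unl, views, threshold=0.8, max_rounds=10):
    """Grow the labeled set round by round; returns (X_lab, y_lab, X_unl)."""
    X_lab, y_lab, X_unl = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(max_rounds):
        if not X_unl:
            break
        # One model per view; the models are independent of each other,
        # so this loop could be run concurrently.
        trusted = []
        for view in views:
            Xv = project(X_lab, view)
            if cv_accuracy(Xv, y_lab) >= threshold:  # assumed validation rule
                trusted.append((view, fit_centroids(Xv, y_lab)))
        if not trusted:
            break
        # Accept only unlabeled examples on which all trusted models agree.
        remaining, added = [], 0
        for x in X_unl:
            votes = {predict(c, [x[j] for j in v]) for v, c in trusted}
            if len(votes) == 1:
                X_lab.append(x)
                y_lab.append(votes.pop())
                added += 1
            else:
                remaining.append(x)
        X_unl = remaining
        if added == 0:
            break
    return X_lab, y_lab, X_unl
```

On a toy problem with two single-feature views, for example `views=[(0,), (1,)]`, examples that both views classify identically are absorbed into the labeled set, while an example on which the views disagree stays unlabeled. The agreement requirement is a conservative choice that trades coverage for fewer wrong pseudo-labels.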
About this article
Cite this article
Ünal, H.E., Özel, S.A. A Novel Approach for Semi-supervised Learning: Incremental Parallel Training with Cross-Validation (IPT-CV). Arab J Sci Eng 48, 10457–10477 (2023). https://doi.org/10.1007/s13369-022-07433-w