Abstract
Feature selection, which reduces the dimensionality of the feature space without sacrificing classifier performance, is an effective technique for text classification. Because many classifiers cannot cope with high-dimensional features, filtering redundant information out of the original feature space has become one of the core goals of feature selection. In this paper, the concept of the equivalence word set is introduced, and a set of equivalence word sets (denoted EWS1) is constructed from the rich semantic information of the Open Directory Project (ODP). On this basis, an artificial bee colony based feature selection method is proposed for filtering redundant information: a feature subset FS is obtained by an optimal feature selection (OFS) method together with two predetermined thresholds. To obtain the best values of these thresholds, an improved memory-based artificial bee colony method (IABCM) is proposed. In the experiments, fuzzy support vector machine (FSVM) and Naïve Bayes (NB) classifiers are used on six datasets: LingSpam, WebKB, SpamAssassin, 20-Newsgroups, Reuters-21578 and TREC 2007. Experimental results verify that, with both FSVM and NB, the proposed method is efficient and achieves better accuracy than several representative feature selection methods.
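The abstract does not spell out the OFS procedure, the two thresholds, or the IABCM variant, so the following is only a minimal sketch of artificial-bee-colony-style search over binary feature masks, using a generic relevance-minus-redundancy fitness as a stand-in for the paper's objective. The function names (`fitness`, `abc_select`) and all parameter values are illustrative assumptions, not the authors' implementation.

```python
# Minimal ABC-style feature-subset search (illustrative sketch, NumPy only).
import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Relevance-minus-redundancy score of a binary feature mask (assumed objective):
    mean |corr(feature, label)| of the selected features, minus their mean pairwise |corr|."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return -1.0                                   # worst possible score for an empty subset
    rel = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in idx])
    if idx.size == 1:
        return rel
    C = np.abs(np.corrcoef(X[:, idx], rowvar=False))
    red = (C.sum() - idx.size) / (idx.size * (idx.size - 1))
    return rel - red

def _neighbour(mask, n_flip):
    """Produce a neighbouring solution by flipping a few randomly chosen bits."""
    cand = mask.copy()
    cand[rng.integers(cand.size, size=n_flip)] ^= 1
    return cand

def abc_select(X, y, n_bees=10, limit=5, iters=30):
    n_feat = X.shape[1]
    n_flip = max(1, n_feat // 10)
    sources = rng.integers(0, 2, size=(n_bees, n_feat))   # food sources = binary masks
    fits = np.array([fitness(s, X, y) for s in sources])
    trials = np.zeros(n_bees, dtype=int)

    def try_improve(i):
        cand = _neighbour(sources[i], n_flip)
        f = fitness(cand, X, y)
        if f > fits[i]:
            sources[i], fits[i], trials[i] = cand, f, 0
        else:
            trials[i] += 1

    for _ in range(iters):
        for i in range(n_bees):                            # employed bees: local search per source
            try_improve(i)
        p = fits - fits.min() + 1e-12                      # onlooker bees: fitness-proportional choice
        for i in rng.choice(n_bees, size=n_bees, p=p / p.sum()):
            try_improve(i)
        for i in np.flatnonzero(trials > limit):           # scouts: abandon exhausted sources
            sources[i] = rng.integers(0, 2, size=n_feat)
            fits[i], trials[i] = fitness(sources[i], X, y), 0
    return sources[np.argmax(fits)]

# Toy usage: 200 samples, 30 features, labels driven by the first 5 features.
X = rng.random((200, 30))
y = (X[:, :5].sum(axis=1) > 2.5).astype(int)
print(np.flatnonzero(abc_select(X, y)))
```

In the paper the fitness would instead be driven by the EWS-based OFS criterion and the two thresholds tuned by IABCM; this sketch only illustrates the employed/onlooker/scout structure of ABC applied to subset selection.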
Acknowledgments
This research was supported by the Beijing Natural Science Foundation, under grant no. 4174105, the Joint Funds of the National Natural Science Foundation of China, under grant no. U1509214, and the Discipline Construction Foundation of the Central University of Finance and Economics, under grant no. 2016XX02.
Cite this article
Wang, Y., Feng, L. & Zhu, J. Novel artificial bee colony based feature selection method for filtering redundant information. Appl Intell 48, 868–885 (2018). https://doi.org/10.1007/s10489-017-1010-4