Abstract
Machine learning algorithms (MLAs) usually process large and complex datasets containing a substantial number of features in order to extract meaningful information about the target concept (i.e., the class). In most cases, MLAs suffer from latency and computational-complexity issues while processing such datasets due to the presence of low-weight (i.e., irrelevant or redundant) features. The computing time of an MLA grows rapidly with the number of features, the dependencies among features, the number of records, the feature types, and the nested feature categories present in the data. Appropriate feature selection before applying an MLA is an effective way to resolve the trade-off between computing speed and accuracy on such datasets. However, selecting features that are sufficient, necessary, and highly correlated with the target concept is very challenging. This paper presents an efficient feature selection algorithm based on random forest that improves the performance of MLAs on large and complex datasets without sacrificing accuracy. The proposed algorithm yields a unique set of features that are closely related to the target concept, and it significantly reduces the computing time of the MLAs while degrading accuracy only marginally. Simulation results confirm the efficacy and effectiveness of the proposed algorithm.
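To make the general idea concrete, the following Python sketch shows one common way to rank features by random-forest importance weights and keep only the top k most highly weighted ones. This is a minimal illustration of the approach the abstract describes, not the paper's exact algorithm (whose details are not reproduced here); the dataset, the value of k, and the forest hyperparameters are illustrative assumptions.

```python
# Minimal sketch: rank features by random-forest importance and keep the
# top-k most highly weighted ones. Illustrative only; not the paper's
# exact algorithm. Dataset, k, and hyperparameters are assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
k = 10  # number of top-weighted features to keep (assumed value)

# Fit a random forest and read off its impurity-based importance weights.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Indices of the k features with the largest importance weights.
top_k = np.argsort(forest.feature_importances_)[::-1][:k]

# Reduced dataset containing only the top-k features; any downstream
# learner is then trained on X_reduced instead of the full X.
X_reduced = X[:, top_k]
print("Selected feature indices:", top_k)
```

Any subsequent classifier trained on X_reduced sees only the k selected features, which is the mechanism by which this style of selection trades a small amount of accuracy for a substantial reduction in training time.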
Cite this article
Majeed, A. Improving Time Complexity and Accuracy of the Machine Learning Algorithms Through Selection of Highly Weighted Top k Features from Complex Datasets. Ann. Data. Sci. 6, 599–621 (2019). https://doi.org/10.1007/s40745-019-00217-4