Abstract
Machine learning algorithms (MLAs) usually process large and complex datasets containing a substantial number of features in order to extract meaningful information about the target concept (i.e., the class). In most cases, MLAs suffer from latency and computational-complexity issues while processing such datasets due to the presence of low-weight (i.e., irrelevant or redundant) features. The computing time of an MLA grows rapidly with the number of features, the dependencies among features, the number of records, the feature types, and the nested feature categories present in the data. Appropriate feature selection before applying an MLA is an effective way to resolve the trade-off between computing speed and accuracy on such datasets. However, selecting features that are sufficient, necessary, and highly correlated with the target concept is very challenging. This paper presents an efficient feature selection algorithm based on random forest that improves the performance of MLAs on large and complex datasets without sacrificing accuracy. The proposed algorithm yields a unique set of features that are closely related to the target concept, and it significantly reduces the computing time of the MLAs while degrading accuracy only marginally. Simulation results confirm the efficacy and effectiveness of the proposed algorithm.
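To make the general idea concrete, the following Python sketch shows one common way to rank features by random-forest importance weights and keep only the top k most highly weighted ones. This is a minimal illustration of the approach the abstract describes, not the paper's exact algorithm (whose details are not reproduced here); the dataset, the value of k, and the forest hyperparameters are illustrative assumptions.

```python
# Minimal sketch: rank features by random-forest importance and keep the
# top-k most highly weighted ones. Illustrative only; not the paper's
# exact algorithm. Dataset, k, and hyperparameters are assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
k = 10  # number of top-weighted features to keep (assumed value)

# Fit a random forest and read off its impurity-based importance weights.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Indices of the k features with the largest importance weights.
top_k = np.argsort(forest.feature_importances_)[::-1][:k]

# Reduced dataset containing only the top-k features; any downstream
# learner is then trained on X_reduced instead of the full X.
X_reduced = X[:, top_k]
print("Selected feature indices:", top_k)
```

Any subsequent classifier trained on X_reduced sees only the k selected features, which is the mechanism by which this style of selection trades a small amount of accuracy for a substantial reduction in training time.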
Cite this article
Majeed, A. Improving Time Complexity and Accuracy of the Machine Learning Algorithms Through Selection of Highly Weighted Top k Features from Complex Datasets. Ann. Data. Sci. 6, 599–621 (2019). https://doi.org/10.1007/s40745-019-00217-4