
Improving Time Complexity and Accuracy of the Machine Learning Algorithms Through Selection of Highly Weighted Top k Features from Complex Datasets

Published in: Annals of Data Science

Abstract

Machine learning algorithms (MLAs) typically process large and complex datasets containing a substantial number of features in order to extract meaningful information about the target concept (a.k.a. the class). When processing such datasets, MLAs often suffer from latency and computational-complexity problems caused by low-weight (i.e., irrelevant or redundant) features. Their computing time grows rapidly with the number of features, the dependencies among features, the number of records, the feature types, and the nested feature categories present in the data. Appropriate feature selection before applying an MLA is an effective way to resolve the trade-off between computing speed and accuracy on large and complex datasets. However, selecting features that are sufficient, necessary, and highly correlated with the target concept is very challenging. This paper presents an efficient feature selection algorithm based on random forest that improves the performance of MLAs on large and complex datasets without sacrificing guarantees on accuracy. The proposed algorithm yields a unique set of features closely related to the target concept, and it significantly reduces the computing time of MLAs with little loss of accuracy while learning the target concept. Simulation results confirm the efficacy and effectiveness of the proposed algorithm.
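As a rough illustration of the general idea the abstract describes, the sketch below ranks features by random-forest importance weights and keeps only the top k before training a downstream learner. This is a minimal approximation under stated assumptions, not the paper's actual method: the dataset, the value of k, and the choice of a support vector classifier as the downstream learner are illustrative placeholders.

```python
# Sketch of random-forest-based top-k feature selection, in the spirit
# of the abstract. Dataset, k, and downstream learner are illustrative
# assumptions, not the paper's experimental configuration.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit a random forest and rank features by their importance weights.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
k = 10  # assumed value of k
top_k_idx = np.argsort(rf.feature_importances_)[::-1][:k]

# Compare a downstream learner on all features vs. the top-k subset.
for label, cols in [("all features", slice(None)),
                    (f"top-{k} features", top_k_idx)]:
    clf = SVC(kernel="rbf")
    clf.fit(X_train[:, cols], y_train)
    acc = accuracy_score(y_test, clf.predict(X_test[:, cols]))
    print(f"{label}: test accuracy = {acc:.3f}")
```

On a small dataset like this the speed difference is negligible; on high-dimensional data, training on the reduced feature matrix is where the computing-time savings claimed in the abstract would come from.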



Author information

Correspondence to Abdul Majeed.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Majeed, A. Improving Time Complexity and Accuracy of the Machine Learning Algorithms Through Selection of Highly Weighted Top k Features from Complex Datasets. Ann. Data. Sci. 6, 599–621 (2019). https://doi.org/10.1007/s40745-019-00217-4

