A statistical feature selection technique

Abstract

Many feature selection techniques have been proposed in machine learning. Filter approaches are typically faster, while wrapper approaches are more reliable but computationally expensive. Feature selection techniques often strive to match the performance of wrapper approaches using a variety of computational strategies. Their results typically depend on how they compute feature–feature correlation and feature–class correlation, and both computations are governed largely by the correlation measure used. This work develops a method named enhanced correlation-based feature selection (ECFS), which uses feature–feature and feature–class correlations to extract a relevant feature subset from multi-class gene expression data as well as other machine learning datasets. The performance of ECFS, measured by the classification accuracy of decision tree, random forest and KNN classifiers, was found to be highly satisfactory on several benchmark datasets.
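
To make the roles of the two correlations concrete, the following is a minimal sketch of the classic correlation-based feature selection (CFS) merit heuristic combined with a greedy forward search. It is not the ECFS method, whose details are not given in this abstract. The function names `cfs_merit` and `greedy_select`, the use of absolute Pearson correlation as the correlation measure, the numeric encoding of the class label, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def cfs_merit(X, y, subset):
    """CFS merit of a feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf is the mean absolute feature-class correlation and r_ff the mean
    absolute feature-feature correlation. Pearson correlation is an
    assumption here; other correlation measures could be substituted.
    Constant columns would produce NaN and are not handled in this sketch.
    """
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return float(r_cf)
    r_ff = np.mean([
        abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
        for i, a in enumerate(subset)
        for b in subset[i + 1:]
    ])
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))


def greedy_select(X, y, max_features):
    """Forward selection: add the feature that most improves the merit."""
    selected, best_merit = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if merit <= best_merit:
            break  # no candidate improves the merit, so stop
        selected.append(j)
        remaining.remove(j)
        best_merit = merit
    return selected, best_merit


# Hypothetical synthetic data: feature 0 is informative, feature 1 is a
# noisier copy of feature 0 (redundant), feature 2 is pure noise.
rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n).astype(float)
f0 = y + 0.1 * rng.standard_normal(n)
f1 = f0 + 0.5 * rng.standard_normal(n)
f2 = rng.standard_normal(n)
X = np.column_stack([f0, f1, f2])

selected, merit = greedy_select(X, y, max_features=3)
print(selected, round(merit, 3))
```

The merit rewards subsets whose features correlate strongly with the class and penalizes subsets whose features correlate strongly with each other, so on data like this the redundant and noisy features would be expected to be left out. For multi-class labels, treating the class as a number is a simplification; a measure suited to categorical targets would normally be used instead.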

Author information

Corresponding author

Correspondence to Dhruba K. Bhattacharyya.

About this article

Cite this article

Borah, P., Ahmed, H.A. & Bhattacharyya, D.K. A statistical feature selection technique. Netw Model Anal Health Inform Bioinforma 3, 55 (2014). https://doi.org/10.1007/s13721-014-0055-0

  • DOI: https://doi.org/10.1007/s13721-014-0055-0

Keywords

  • Feature selection
  • Machine learning
  • Feature–feature correlation
  • Feature–class correlation