Abstract
Various feature selection techniques have been proposed in the field of machine learning. Filter approaches are typically faster, whereas wrapper approaches are more reliable but computationally expensive. Feature selection techniques often strive to match the performance of wrapper approaches by employing various computational strategies. They typically depend on how feature–feature correlation and feature–class correlation are computed, and both computations are largely governed by the correlation measure used. In this work, a method named enhanced correlation-based feature selection (ECFS) is developed to effectively employ feature–feature and feature–class correlations to extract a relevant feature subset from multi-class gene expression data as well as machine learning datasets. The performance of ECFS, measured by the classification accuracy of decision tree, random forest and KNN classifiers, has been found highly satisfactory over several benchmark datasets.
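To make the roles of the two correlations concrete, the following is a minimal sketch of the classic correlation-based feature selection (CFS) merit of Hall, on which correlation-based methods of this kind build. It is not the ECFS algorithm, whose details are not given in this abstract. The use of absolute Pearson correlation as the correlation measure, the treatment of class labels as numeric values, and the function names `abs_pearson`, `cfs_merit` and `greedy_forward_select` are all assumptions made for illustration.

```python
import numpy as np


def abs_pearson(a, b):
    """Absolute Pearson correlation; returns 0.0 if either vector is constant."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    sa, sb = a.std(), b.std()
    if sa == 0 or sb == 0:
        return 0.0
    cov = np.mean((a - a.mean()) * (b - b.mean()))
    return float(abs(cov / (sa * sb)))


def cfs_merit(X, y, subset):
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf is the mean feature-class correlation over the subset and
    r_ff is the mean pairwise feature-feature correlation.
    """
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = np.mean([abs_pearson(X[:, j], y) for j in subset])
    if k == 1:
        return float(r_cf)
    r_ff = np.mean([
        abs_pearson(X[:, i], X[:, j])
        for pos, i in enumerate(subset)
        for j in subset[pos + 1:]
    ])
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))


def greedy_forward_select(X, y, max_features=None):
    """Add the feature that most improves the merit; stop when none does."""
    n_features = X.shape[1]
    limit = n_features if max_features is None else max_features
    selected, best = [], 0.0
    while len(selected) < limit:
        candidates = [j for j in range(n_features) if j not in selected]
        scores = [cfs_merit(X, y, selected + [j]) for j in candidates]
        top = int(np.argmax(scores))
        if scores[top] <= best:
            break  # no candidate improves the merit
        selected.append(candidates[top])
        best = scores[top]
    return selected, best
```

Calling `greedy_forward_select(X, y)` on a samples-by-features array `X` and a label vector `y` returns the chosen column indices and the merit of that subset. The merit rewards subsets whose features correlate strongly with the class while penalising features that correlate strongly with each other, which is why the choice of correlation measure matters so much for methods of this family.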
Cite this article
Borah, P., Ahmed, H.A. & Bhattacharyya, D.K. A statistical feature selection technique. Netw Model Anal Health Inform Bioinforma 3, 55 (2014). https://doi.org/10.1007/s13721-014-0055-0
DOI: https://doi.org/10.1007/s13721-014-0055-0