A statistical feature selection technique

Borah, Pallabi; Ahmed, Hasin A.; Bhattacharyya, Dhruba K.

doi:10.1007/s13721-014-0055-0



Abstract

Various feature selection techniques have been proposed in the field of machine learning. Filter approaches are typically faster, while wrapper approaches are more reliable but computationally expensive. Feature selection techniques often strive to match the performance of wrapper approaches through various computational strategies. They typically depend on how they compute feature–feature correlation and feature–class correlation, and both computations are largely governed by the correlation measure being used. In this work, a method named enhanced correlation-based feature selection (ECFS) is developed to effectively employ feature–feature and feature–class correlations to extract a relevant feature subset from multi-class gene expression data as well as machine learning datasets. The performance of ECFS, in terms of the classification accuracies obtained by decision tree, random forest and KNN classifiers, has been found highly satisfactory over several benchmark datasets.
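The abstract does not state how ECFS scores a feature subset, so the following is not the authors' method. It is a minimal sketch of the general idea of combining feature–class and feature–feature correlation, using the subset merit from Hall's correlation-based feature selection (CFS) with a greedy forward search. The choice of absolute Pearson correlation as the correlation measure, the function names, the `min_gain` stopping tolerance and the toy data are all assumptions made for illustration.

```python
import numpy as np


def abs_corr(a, b):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return abs(float(np.corrcoef(a, b)[0, 1]))


def cfs_merit(subset, r_fc, r_ff):
    """CFS merit: k * mean(r_fc) / sqrt(k + k * (k - 1) * mean(r_ff))."""
    k = len(subset)
    mean_fc = float(np.mean([r_fc[i] for i in subset]))
    if k == 1:
        return mean_fc
    # Average correlation over all unordered feature pairs in the subset.
    pairs = [r_ff[i, j] for n, i in enumerate(subset) for j in subset[n + 1:]]
    mean_ff = float(np.mean(pairs))
    return k * mean_fc / float(np.sqrt(k + k * (k - 1) * mean_ff))


def greedy_cfs(X, y, min_gain=1e-9):
    """Forward selection: add the feature that most improves the merit,
    and stop when no candidate improves it by more than min_gain."""
    n_features = X.shape[1]
    r_fc = np.array([abs_corr(X[:, j], y) for j in range(n_features)])
    r_ff = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            r_ff[i, j] = r_ff[j, i] = abs_corr(X[:, i], X[:, j])

    selected, best = [], 0.0
    while len(selected) < n_features:
        candidates = [j for j in range(n_features) if j not in selected]
        scores = [cfs_merit(selected + [j], r_fc, r_ff) for j in candidates]
        top = int(np.argmax(scores))
        if scores[top] - best <= min_gain:
            break
        selected.append(candidates[top])
        best = scores[top]
    return selected, best


# Toy data (hypothetical): column 1 duplicates column 0, and column 2 is
# uncorrelated with the class label.
f0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
f2 = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
X = np.column_stack([f0, f0.copy(), f2])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

selected, merit = greedy_cfs(X, y)
print(selected, round(merit, 4))
```

On this toy input the duplicated column adds no merit and the uninformative column lowers it, which is the behaviour a correlation-based filter is meant to show: relevance to the class is rewarded and redundancy among selected features is penalised. A different correlation measure, such as one based on mutual information, could be substituted in `abs_corr` without changing the search.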



Author information

Authors and Affiliations

Department of Computer Science and Engineering, Assam down town University, Guwahati, 781026, India
Pallabi Borah
Department of Computer Science and Engineering, Tezpur University, Napaam, Tezpur, 784028, India
Hasin A. Ahmed & Dhruba K. Bhattacharyya


Corresponding author

Correspondence to Dhruba K. Bhattacharyya.


About this article

Cite this article

Borah, P., Ahmed, H.A. & Bhattacharyya, D.K. A statistical feature selection technique. Netw Model Anal Health Inform Bioinforma 3, 55 (2014). https://doi.org/10.1007/s13721-014-0055-0


Received: 28 June 2013
Revised: 14 February 2014
Accepted: 21 February 2014
Published: 20 March 2014
DOI: https://doi.org/10.1007/s13721-014-0055-0
