A genetic algorithm-based method for feature subset selection
- 514 Downloads
- 51 Citations
Abstract
As a commonly used technique in data preprocessing, feature selection selects a subset of informative attributes or variables to build models describing data. By removing redundant and irrelevant or noise features, feature selection can improve the predictive accuracy and the comprehensibility of the predictors or classifiers. Many feature selection algorithms with different selection criteria has been introduced by researchers. However, it is discovered that no single criterion is best for all applications. In this paper, we propose a framework based on a genetic algorithm (GA) for feature subset selection that combines various existing feature selection methods. The advantages of this approach include the ability to accommodate multiple feature selection criteria and find small subsets of features that perform well for a particular inductive learning algorithm of interest to build the classifier. We conducted experiments using three data sets and three existing feature selection methods. The experimental results demonstrate that our approach is a robust and effective approach to find subsets of features with higher classification accuracy and/or smaller size compared to each individual feature selection algorithm.
Keywords
Feature selection Gene Selection Genetic algorithm Microarray gene expression data analysisPreview
Unable to display preview. Download preview PDF.
References
- Alon U et al (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci 96:6745–6750CrossRefGoogle Scholar
- Breiman L, Forest R Technical Report. Stat. Dept, UCBGoogle Scholar
- Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data mining Knowl Dis 2(2):121–167CrossRefGoogle Scholar
- Chuang H-Y et al (2004) Identifying significant genes from microarray data. Fourth IEEE symposium on bioinformatics and bioengineering (BIBE’04) p. 358Google Scholar
- Dash M, Liu H (1999) Handling large unsupervised data via dimensionality reduction. ACM SIGMOD workshop on research issues in data mining and knowledge discoveryGoogle Scholar
- Dy JG, Brodley CE (2004) Feature selection for unsupervised learning. J Mach Learn Res 5:845–889Google Scholar
- Furey T, Cristianini N, Bednarski DN, Schummer DM (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16:906–914CrossRefGoogle Scholar
- Golub TR et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286Google Scholar
- Guyon I, Elisseeff A (2003) An introduction to variable and feature selection (Kernel Machines Section). JMLR 3:1157–1182zbMATHCrossRefGoogle Scholar
- Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422zbMATHCrossRefGoogle Scholar
- Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer series in statistics. Springer, New YorkGoogle Scholar
- Hsu FD, Shapiro J, Taksa I (2002) Methods of data fusion in information retreival: Rank vs. Score combination. DIMACS Technical report 58Google Scholar
- Jirapech-Umpai T, Aitken S (2005) Feature selection and classification for microarray data analysis: evolutionary methods for identifying predictive genes. BMC Bioinform 6:148Google Scholar
- LeCun Y, Denker JS, Solla SA (1990) Optimum brain damage. Touretzky DS (ed) Advances in neural information processing systems II, Morgan Kaufmann, MateoGoogle Scholar
- Liu Y (2004) A comparative study on feature selection methods for drug discovery. J Chem Inform Comput Sci 44(5):1823–1828CrossRefGoogle Scholar
- Liu H, Setiono R (1995) χ2: feature selection and discretization of numeric attributes. In: Proceedings IEEE 7th international conference on tools with artificial intelligence, pp 338–391Google Scholar
- Liu H, Li J, Wong L (2002) A comparative study on feature selection and classification methods using gene expression profiles and proteomic pattern. Genom Inform 13:51–60Google Scholar
- Liu H et al. (2005) Evolving feature selection. Intelligent systems. IEEE Vol 20(6): 64–76Google Scholar
- Liu X, Krishnan A, Mondry A (2005) An entropy-based gene selection method for cancer classification using microarray data. BMC Bioinform 6:76Google Scholar
- Mao Y, Zhou X, Pi D, Sun Y, STC Wong (2005) Multiclass cancer classification by using fuzzy support vector machine and binary decision tree with gene selection. J Biomed Biotechnol 2:160–171CrossRefGoogle Scholar
- Noble WS (2004) Support vector machine applications in computational biology. In: Schoelkopf B, suda KT, Vert J.-P (eds). Kernel methods in computational biology. MIT, New York, pp 71–92Google Scholar
- Peng HC, Long FH, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Patt Anal Mach Intell 27(8):1226–1238CrossRefGoogle Scholar
- Schölkopf B, Guyon I, Weston J (2003) Statistical learning and kernel methods in bioinformatics. In: Frasconi P, Shamir R (eds) Artificial intelligence and heuristic methods in bioinformatics. vol 183. IOS Press, Amsterdam, pp 1–21Google Scholar
- Singh D (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1:203–209CrossRefGoogle Scholar
- Singh D et al (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1:203–209CrossRefGoogle Scholar
- Somorjai RL, Dolenko B, Baumgartner R (2003) Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics. 12; 19(12):1484–91Google Scholar
- Space Physics Group; Applied Physics Laboratory; Johns Hopkins University; Johns Hopkins Road; Laurel; MD 20723Google Scholar
- Vapnik V (1998) Statistical learning theory. Wiley, New YorkzbMATHGoogle Scholar
- Yang J, Honavar V (1998) Feature subset selection using a genetic algorithm. IEEE Intell Syst 13:44–49CrossRefGoogle Scholar
- Yu L, Liu H (2003) Efficiently handling feature redundancy in high-dimensional data. In: Proceedings of ACM SIGKDD international conference knowledge discovery and data mining (KDD 03), ACM, New york, pp. 685–690Google Scholar
- Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V (2000) Feature Selection for SVMs. Adv Neural Inform Process Syst 13Google Scholar