Abstract
Gene selection is a key problem in gene-expression-based cancer recognition and related tasks. In this work, a measure called neighborhood mutual information (NMI) is introduced to evaluate the relevance between genes and the associated decision. The measure is then combined with the minimal-redundancy-maximal-relevance (mRMR) search strategy to construct an NMI-based mRMR gene selection algorithm (NMI_mRMR). It is also found that the first k best genes with respect to NMI are usually enough for cancer classification. mRMR can therefore be performed on these k genes alone, with the remaining genes removed in a preprocessing step, which reduces computation time. Based on this observation, an efficient gene selection algorithm, denoted NMI_EmRMR, is proposed. Several cancer recognition tasks are used to test the proposed technique. The experimental results show that NMI_EmRMR is effective and efficient.
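The two-stage idea in the abstract (rank genes by NMI, keep the first k, then run mRMR on those k only) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define NMI, so the sketch assumes a neighborhood-entropy form in which each sample's neighborhood is the set of samples within a distance `delta` on a gene, and the equivalence class for the class label. The function names, the min-max scaling, the difference form of the mRMR criterion, and the default values of `k`, `m`, and `delta` are all assumptions made for illustration.

```python
import numpy as np

def numeric_neighbors(x, delta):
    """Boolean matrix: entry (i, j) is True when samples i and j differ by at most delta."""
    x = np.asarray(x, dtype=float)
    return np.abs(x[:, None] - x[None, :]) <= delta

def label_neighbors(y):
    """Boolean matrix: entry (i, j) is True when samples i and j share a class label."""
    y = np.asarray(y)
    return y[:, None] == y[None, :]

def nmi(na, nb):
    """Assumed NMI form: -mean(log2(|N_a(i)| * |N_b(i)| / (n * |N_a(i) & N_b(i)|)))."""
    n = na.shape[0]
    ca = na.sum(axis=1)
    cb = nb.sum(axis=1)
    cj = (na & nb).sum(axis=1)  # at least 1: every sample is its own neighbor
    return float(-np.mean(np.log2(ca * cb / (n * cj))))

def nmi_emrmr(X, y, k=50, m=5, delta=0.1):
    """Pre-filter to the k most relevant genes, then greedily pick m of them with mRMR."""
    X = np.asarray(X, dtype=float)
    # Scale every gene to [0, 1] so that a single delta is meaningful (assumption).
    low = X.min(axis=0)
    span = X.max(axis=0) - low
    span[span == 0] = 1.0
    Xs = (X - low) / span

    ny = label_neighbors(y)
    relevance = np.array(
        [nmi(numeric_neighbors(Xs[:, j], delta), ny) for j in range(Xs.shape[1])]
    )

    # Preprocessing step: discard everything except the k genes with highest NMI.
    candidates = [int(j) for j in np.argsort(-relevance)[:k]]
    nbrs = {j: numeric_neighbors(Xs[:, j], delta) for j in candidates}

    # Greedy mRMR: start from the most relevant gene, then repeatedly add the
    # gene maximizing relevance minus mean redundancy with the selected set.
    selected = [candidates.pop(0)]
    while candidates and len(selected) < m:
        scores = [
            relevance[j] - np.mean([nmi(nbrs[j], nbrs[s]) for s in selected])
            for j in candidates
        ]
        selected.append(candidates.pop(int(np.argmax(scores))))
    return selected
```

With `X` as a samples-by-genes expression matrix and `y` as the class labels, `nmi_emrmr(X, y, k=50, m=5)` returns the column indices of the chosen genes. Restricting the redundancy computation to the k pre-filtered genes is what reduces the cost relative to running mRMR over all genes.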
Acknowledgments
Supported by the National Natural Science Foundation of China under Grants No. 60703013 and 61070089.
Cite this article
Hu, Q., Pan, W., An, S. et al. An efficient gene selection technique for cancer recognition based on neighborhood mutual information. Int. J. Mach. Learn. & Cyber. 1, 63–74 (2010). https://doi.org/10.1007/s13042-010-0008-6