Abstract
This paper proposes an ensemble framework for gene selection that addresses the instability problems arising in gene filtering. When genes are selected from gene expression data, different filter methods return different subsets of informative genes, which makes it difficult for experts to identify the genes that are truly significant. Instability can arise from four sources: the filter methods, the classification methods, different datasets of the same disease, and the existence of multiple valid groups of biomarkers. Although many approaches have been proposed, the problem remains a challenge. This work proposes a framework with five stages of gene filtering to discover biomarkers for diagnosis and classification tasks. The framework performs stable feature selection that addresses the problems above and thus provides a more suitable and reliable solution for clinical and research purposes. It is a multistage gene filtering process into which several ensemble strategies for gene selection are integrated, so that different classifiers simultaneously assess gene subsets. First, we apply an ensemble of recent gene selection methods to obtain diversity in the genes found (stability with respect to filter methods). Next, we apply an ensemble of well-known classifiers to retain the genes that are relevant to all classifiers at once (stability with respect to classification methods). The results were evaluated on two different datasets of the same disease (pancreatic ductal adenocarcinoma) to assess stability with respect to the disease, and they were promising.
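The two ensemble steps named in the abstract (an ensemble of filters followed by an ensemble of classifiers) can be illustrated with a minimal Python sketch. This is not the authors' five-stage framework: the two scoring filters, the two single-gene classifiers, the toy expression values, and the names `ensemble_select`, `top_k` and `min_acc` are all assumptions chosen only to show the idea of taking the union of filter results and then keeping the genes on which every classifier agrees.

```python
from statistics import mean, pstdev

def split_by_class(values, labels):
    """Split one gene's expression values into class-1 and class-0 lists."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return pos, neg

def mean_diff_score(values, labels):
    """Hypothetical filter 1: absolute difference of the class means."""
    pos, neg = split_by_class(values, labels)
    return abs(mean(pos) - mean(neg))

def signal_to_noise_score(values, labels):
    """Hypothetical filter 2: mean difference scaled by the class spreads."""
    pos, neg = split_by_class(values, labels)
    return abs(mean(pos) - mean(neg)) / (pstdev(pos) + pstdev(neg) + 1e-9)

def centroid_accuracy(values, labels):
    """Hypothetical classifier 1: nearest class mean, scored on the training data."""
    pos, neg = split_by_class(values, labels)
    c1, c0 = mean(pos), mean(neg)
    preds = [1 if abs(v - c1) < abs(v - c0) else 0 for v in values]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def one_nn_loo_accuracy(values, labels):
    """Hypothetical classifier 2: leave-one-out 1-nearest-neighbour on one gene."""
    hits = 0
    for i, v in enumerate(values):
        others = [k for k in range(len(values)) if k != i]
        nearest = min(others, key=lambda k: abs(values[k] - v))
        hits += labels[nearest] == labels[i]
    return hits / len(values)

def ensemble_select(expr, labels, filters, classifiers, top_k, min_acc):
    """Two-step ensemble selection (illustrative only).

    Step 1: union of the top_k genes of every filter (diversity).
    Step 2: keep only genes on which every classifier reaches min_acc.
    """
    candidates = set()
    for score in filters:
        ranked = sorted(expr, key=lambda g: score(expr[g], labels), reverse=True)
        candidates.update(ranked[:top_k])
    return sorted(
        g for g in candidates
        if all(acc(expr[g], labels) >= min_acc for acc in classifiers)
    )

# Toy data (invented): 3 samples of class 1 followed by 3 samples of class 0.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "g1": [5.0, 5.2, 4.8, 1.0, 1.1, 0.9],  # cleanly separated
    "g2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # uninformative
    "g3": [9.0, 1.0, 9.5, 1.3, 1.6, 0.6],  # large mean gap, one outlying sample
    "g4": [3.0, 3.1, 2.9, 2.0, 2.1, 1.9],  # small but consistent gap
}

selected = ensemble_select(
    expr, labels,
    filters=[mean_diff_score, signal_to_noise_score],
    classifiers=[centroid_accuracy, one_nn_loo_accuracy],
    top_k=2, min_acc=0.9,
)
print(selected)
```

On this toy data the two filters disagree about which genes rank highest, so their union is larger than either list alone. The gene `g3` enters through the mean-difference filter but is dropped at the classifier step because one of its samples is misclassified. In a real setting the accuracies would be estimated with proper cross-validation rather than on the training samples.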
Cite this article
Castellanos-Garzón, J.A., Ramos, J., López-Sánchez, D. et al. An Ensemble Framework Coping with Instability in the Gene Selection Process. Interdiscip Sci Comput Life Sci 10, 12–23 (2018). https://doi.org/10.1007/s12539-017-0274-z
DOI: https://doi.org/10.1007/s12539-017-0274-z