An Ensemble Framework Coping with Instability in the Gene Selection Process

  • Original Research Article
Interdisciplinary Sciences: Computational Life Sciences

Abstract

This paper proposes an ensemble framework for gene selection that addresses the instability problems arising in the gene filtering task. When genes are selected from gene expression data, different filter methods return different subsets of informative genes, which makes it difficult for experts to identify the significant ones. This instability can stem from four sources: the filter methods, the classification methods, different datasets of the same disease, and the existence of multiple valid groups of biomarkers. Although a large number of proposals exist, the complexity of this problem remains a challenge. This work proposes a framework with five stages of gene filtering to discover biomarkers for diagnosis and classification tasks. The framework performs stable feature selection by addressing the problems above and thus provides a more suitable and reliable solution for clinical and research purposes. The multistage filtering process incorporates several ensemble strategies for gene selection, so that different classifiers assess gene subsets simultaneously to counter instability. First, we apply an ensemble of recent gene selection methods to obtain diversity in the genes found (stability according to filter methods). Next, we apply an ensemble of known classifiers to retain only the genes that are relevant to all classifiers at once (stability according to classification methods). The results were evaluated on two different datasets of the same disease (pancreatic ductal adenocarcinoma) to assess stability according to the disease, and they were promising.
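The two ensemble stages named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two filter scores, the two single-gene classifiers, the toy expression values, the gene names, the `k` and `min_acc` parameters and the Jaccard overlap are simple stand-ins, not the methods, data or stability measure used in the paper. The sketch also scores the midpoint classifier on its training samples, which is optimistic.

```python
from statistics import mean, pstdev

def split(values, labels):
    """Split one gene's expression values into the two classes."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return pos, neg

def mean_diff_score(values, labels):
    """Stand-in filter 1: absolute difference between class means."""
    pos, neg = split(values, labels)
    return abs(mean(pos) - mean(neg))

def signal_to_noise_score(values, labels):
    """Stand-in filter 2: mean difference scaled by within-class spread."""
    pos, neg = split(values, labels)
    return abs(mean(pos) - mean(neg)) / (pstdev(pos) + pstdev(neg) + 1e-9)

def midpoint_accuracy(values, labels):
    """Stand-in classifier 1: threshold halfway between the class means."""
    pos, neg = split(values, labels)
    cut = (mean(pos) + mean(neg)) / 2
    high_is_pos = mean(pos) >= mean(neg)
    preds = [1 if (v > cut) == high_is_pos else 0 for v in values]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def nearest_neighbour_accuracy(values, labels):
    """Stand-in classifier 2: leave-one-out 1-nearest-neighbour on one gene."""
    correct = 0
    for i, v in enumerate(values):
        others = [j for j in range(len(values)) if j != i]
        nearest = min(others, key=lambda j: abs(values[j] - v))
        correct += labels[nearest] == labels[i]
    return correct / len(values)

FILTERS = [mean_diff_score, signal_to_noise_score]
CLASSIFIERS = [midpoint_accuracy, nearest_neighbour_accuracy]

def candidate_genes(data, labels, k):
    """Stage 1: union of the top-k genes of every filter (diversity)."""
    tops = []
    for score in FILTERS:
        ranked = sorted(data, key=lambda g: score(data[g], labels), reverse=True)
        tops.append(set(ranked[:k]))
    return set().union(*tops)

def ensemble_select(data, labels, k=2, min_acc=0.8):
    """Stage 2: keep candidates that every classifier finds relevant."""
    return {g for g in candidate_genes(data, labels, k)
            if all(clf(data[g], labels) >= min_acc for clf in CLASSIFIERS)}

def jaccard(a, b):
    """Overlap of two gene subsets, e.g. from two cohorts of one disease."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Toy expression matrix (hypothetical values): 3 tumour (1), 3 control (0).
labels = [1, 1, 1, 0, 0, 0]
data = {
    "GENE_A": [9.0, 8.5, 9.2, 2.0, 2.5, 1.8],   # cleanly separated
    "GENE_B": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],   # uninformative
    "GENE_C": [7.0, 7.4, 6.8, 3.0, 3.3, 2.9],   # cleanly separated
    "GENE_D": [20.0, 3.0, 3.1, 3.2, 2.9, 3.0],  # one outlier inflates the mean
}
selected = ensemble_select(data, labels)
print(sorted(candidate_genes(data, labels, 2)))
print(sorted(selected))
```

On this toy data the two filters disagree: the mean-difference filter ranks the outlier-driven GENE_D in its top two, while the signal-to-noise filter prefers GENE_C, so the union holds three candidates. The classifier stage then drops GENE_D because the midpoint classifier misclassifies two of its six samples. A call such as `jaccard(selected, second_cohort_selection)`, where `second_cohort_selection` is a hypothetical result from another dataset of the same disease, would give one possible way to quantify agreement across datasets.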

Author information

Correspondence to José A. Castellanos-Garzón.

About this article

Cite this article

Castellanos-Garzón, J.A., Ramos, J., López-Sánchez, D. et al. An Ensemble Framework Coping with Instability in the Gene Selection Process. Interdiscip Sci Comput Life Sci 10, 12–23 (2018). https://doi.org/10.1007/s12539-017-0274-z

  • DOI: https://doi.org/10.1007/s12539-017-0274-z
