Soft Computing, Volume 21, Issue 22, pp 6895–6906

Multistage feature selection approach for high-dimensional cancer data

  • Alhasan Alkuhlani
  • Mohammad Nassef
  • Ibrahim Farag
Methodologies and Application


Abstract

Cancer is a major cause of death worldwide. DNA methylation (DNAm) is an epigenetic mechanism that regulates gene expression and is useful for the early detection of cancer. The main challenge with DNA methylation microarray datasets is that the number of CpG sites far exceeds the number of samples. Recent research has attempted to reduce this high dimensionality with various feature selection techniques. This article proposes a multistage feature selection (MSFS) approach that selects the optimal CpG sites from three DNAm cancer datasets (breast, colon and lung). The approach combines three filter feature selection methods: Fisher Criterion, t-test and Area Under the ROC Curve. In addition, as a wrapper feature selection step, we apply a genetic algorithm with Support Vector Machine Recursive Feature Elimination (SVM-RFE) as its fitness function and SVM as its evaluator. Using the Incremental Feature Selection (IFS) strategy, subsets of 24, 13 and 27 optimal CpG sites are selected for the breast, colon and lung cancer datasets, respectively. Under fivefold cross-validation on the training datasets, these subsets achieved classification accuracies of 100, 100 and 97.67%, respectively. Moreover, testing on the three independent cancer datasets with these final subsets yielded accuracies of 96.02, 98.81 and 94.51%, respectively. The experimental results thus demonstrate high classification performance with small feature subsets. Finally, the biological significance of the genes corresponding to these feature subsets is validated using enrichment analysis.
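The filter-then-IFS idea described above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: it scores features with the three filters named in the abstract (Fisher criterion, t-test, AUC), merges them by summed rank (an assumption, since the abstract does not say how the filter results are combined), and then runs IFS with fivefold cross-validation. The genetic algorithm and SVM-RFE wrapper stage is omitted, a nearest-centroid classifier stands in for the SVM, the data are synthetic, and all function names are hypothetical.

```python
import random
from statistics import mean, variance

def fisher(pos, neg):
    # Fisher criterion: squared gap between class means over summed class variances.
    return (mean(pos) - mean(neg)) ** 2 / (variance(pos) + variance(neg) + 1e-12)

def welch_t(pos, neg):
    # Absolute Welch t statistic (unequal-variance two-sample t-test).
    se = (variance(pos) / len(pos) + variance(neg) / len(neg)) ** 0.5
    return abs(mean(pos) - mean(neg)) / (se + 1e-12)

def auc(pos, neg):
    # Probability that a positive value exceeds a negative one (ties count 0.5),
    # folded so that 0.5 means no separation regardless of direction.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    a = wins / (len(pos) * len(neg))
    return max(a, 1 - a)

def rank_features(X, y):
    # Assumed combination rule: sum each feature's rank under the three filters.
    n_feat = len(X[0])
    total_rank = [0] * n_feat
    for score in (fisher, welch_t, auc):
        scores = []
        for j in range(n_feat):
            pos = [row[j] for row, lab in zip(X, y) if lab == 1]
            neg = [row[j] for row, lab in zip(X, y) if lab == 0]
            scores.append(score(pos, neg))
        for r, j in enumerate(sorted(range(n_feat), key=lambda j: -scores[j])):
            total_rank[j] += r
    return sorted(range(n_feat), key=lambda j: total_rank[j])

def cv_accuracy(X, y, feats, k=5):
    # k-fold cross-validated accuracy of a nearest-centroid classifier
    # (a stand-in for the SVM evaluator used in the paper).
    idx = list(range(len(X)))
    correct = 0
    for fold in range(k):
        test = idx[fold::k]
        train = [i for i in idx if i % k != fold]
        cents = {}
        for c in (0, 1):
            rows = [X[i] for i in train if y[i] == c]
            cents[c] = [mean(r[j] for r in rows) for j in feats]
        for i in test:
            d = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cents[c]))
                 for c in (0, 1)}
            correct += (min(d, key=d.get) == y[i])
    return correct / len(X)

def incremental_feature_selection(X, y, max_feats=10):
    # IFS: grow the subset one ranked feature at a time, keep the best-scoring prefix.
    ranked = rank_features(X, y)
    best_acc, best_k = -1.0, 0
    for k in range(1, max_feats + 1):
        acc = cv_accuracy(X, y, ranked[:k])
        if acc > best_acc:
            best_acc, best_k = acc, k
    return ranked[:best_k], best_acc

def make_data(n=40, n_feat=30, informative=(0, 1, 2), seed=0):
    # Synthetic stand-in for methylation values: only a few features carry signal.
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(3.0 * lab if j in informative else 0.0, 1.0)
          for j in range(n_feat)] for lab in y]
    return X, y

X, y = make_data()
selected, acc = incremental_feature_selection(X, y)
print("selected features:", selected, "CV accuracy:", acc)
```

For brevity the ranking is computed once on all samples before cross-validation; ranking inside each training fold would give a less optimistic accuracy estimate.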


Keywords

DNA methylation (DNAm); CpG sites; Feature selection; Genetic algorithms; Support vector machine (SVM); Incremental feature selection (IFS); Enrichment analysis



Acknowledgements

We are grateful to Prof. Amr Badr for the valuable consultations and advice provided during this research.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Supplementary material

500_2016_2439_MOESM1_ESM.pdf (221 kb)
Supplementary 1: The accuracy (Ac), sensitivity (Sn) and specificity (Sp) of each run of IFS for each of the three training sets using MSFS (NoGA-Mode).
500_2016_2439_MOESM2_ESM.pdf (206 kb)
Supplementary 2: The accuracy (Ac), sensitivity (Sn) and specificity (Sp) of each run of IFS for each of the three training sets using MSFS (GA-Mode).
500_2016_2439_MOESM3_ESM.pdf (261 kb)
Supplementary 3: The accuracy (Ac), sensitivity (Sn) and specificity (Sp) of each run of IFS for each of the three independent sets using MSFS (NoGA-Mode).
500_2016_2439_MOESM4_ESM.pdf (262 kb)
Supplementary 4: The accuracy (Ac), sensitivity (Sn) and specificity (Sp) of each run of IFS for each of the three independent sets using MSFS (GA-Mode).
500_2016_2439_MOESM5_ESM.pdf (241 kb)
Supplementary 5: GO terms and KEGG pathways of the genes corresponding to the selected CpG sites for the three cancer datasets.



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Alhasan Alkuhlani (1)
  • Mohammad Nassef (1)
  • Ibrahim Farag (1)

  1. Department of Computer Science, Faculty of Computers and Information, Cairo University, Giza, Egypt
