Abstract
Cancer is a leading cause of death worldwide. DNA methylation (DNAm) is an epigenetic mechanism that regulates gene expression and is useful for the early detection of cancer. The main challenge with DNA methylation microarray datasets is that the number of CpG sites far exceeds the number of samples. Recent research has attempted to reduce this high dimensionality with various feature selection techniques. This article proposes a multistage feature selection approach for selecting the optimal CpG sites from three DNAm cancer datasets (breast, colon and lung). The approach combines three filter feature selection methods: Fisher Criterion, t-test and Area Under the ROC Curve. As a wrapper stage, we then apply a genetic algorithm with Support Vector Machine Recursive Feature Elimination (SVM-RFE) as its fitness function and SVM as its evaluator. Using the Incremental Feature Selection (IFS) strategy, subsets of 24, 13 and 27 optimal CpG sites are selected for the breast, colon and lung cancer datasets, respectively. Under fivefold cross-validation on the training datasets, these subsets achieved classification accuracies of 100, 100 and 97.67%, respectively. Testing on three independent cancer datasets with these final subsets yielded accuracies of 96.02, 98.81 and 94.51%, respectively. The experimental results demonstrate high classification performance with small feature subsets. Finally, the biological significance of the genes corresponding to these feature subsets is validated using enrichment analysis.
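To make the filter stage and the IFS step more concrete, the following is a minimal Python sketch on synthetic data, not the authors' implementation. Several parts are assumptions: the abstract does not say how the three filter rankings are combined, so rank averaging is used here; a nearest-centroid classifier stands in for the SVM evaluator so that the example needs only NumPy; the genetic algorithm and SVM-RFE stages are omitted; and all function names are hypothetical.

```python
import numpy as np

def fisher_score(X, y):
    """Fisher criterion per feature: (mu0 - mu1)^2 / (var0 + var1)."""
    a, b = X[y == 0], X[y == 1]
    return (a.mean(0) - b.mean(0)) ** 2 / (a.var(0) + b.var(0) + 1e-12)

def welch_t(X, y):
    """Absolute Welch t-statistic per feature."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
    return np.abs(a.mean(0) - b.mean(0)) / (se + 1e-12)

def auc_score(X, y):
    """Per-feature AUC via the rank-sum identity, folded so 0.5 is worst."""
    n1, n0 = int((y == 1).sum()), int((y == 0).sum())
    ranks = X.argsort(0).argsort(0) + 1  # ranks 1..n; ties are not averaged
    u = ranks[y == 1].sum(0) - n1 * (n1 + 1) / 2
    auc = u / (n1 * n0)
    return np.maximum(auc, 1 - auc)

def combined_ranking(X, y):
    """Assumed aggregation: average each feature's rank over the three filters."""
    scores = [fisher_score(X, y), welch_t(X, y), auc_score(X, y)]
    mean_rank = np.mean([(-s).argsort().argsort() for s in scores], axis=0)
    return mean_rank.argsort()  # feature indices, best first

def cv_accuracy(X, y, k=5, seed=0):
    """k-fold accuracy of a nearest-centroid classifier (stand-in for SVM)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    correct = 0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        c0 = X[train][y[train] == 0].mean(0)
        c1 = X[train][y[train] == 1].mean(0)
        d0 = ((X[fold] - c0) ** 2).sum(1)
        d1 = ((X[fold] - c1) ** 2).sum(1)
        correct += int(((d1 < d0).astype(int) == y[fold]).sum())
    return correct / len(y)

def incremental_feature_selection(X, y, order, max_k=30):
    """Add ranked features one at a time; keep the smallest best-scoring prefix."""
    best_k, best_acc = 1, -1.0
    for k in range(1, min(max_k, len(order)) + 1):
        acc = cv_accuracy(X[:, order[:k]], y)
        if acc > best_acc:  # strict '>' keeps the smaller subset on ties
            best_k, best_acc = k, acc
    return order[:best_k], best_acc

# Synthetic demo: 60 samples, 200 features, the first five informative.
rng = np.random.default_rng(42)
y = np.array([0] * 30 + [1] * 30)
X = rng.uniform(0, 1, size=(60, 200))
X[y == 1, :5] += 0.8
order = combined_ranking(X, y)
selected, acc = incremental_feature_selection(X, y, order)
print("selected features:", selected, "CV accuracy:", acc)
```

In this sketch the features are ranked on the full data before cross-validation, which is enough to illustrate the mechanics but gives optimistic accuracy estimates; a stricter setup would repeat the ranking inside each training fold or, as the abstract describes, confirm the chosen subset on an independent dataset.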
Acknowledgements
We are grateful to Prof. Amr Badr for his valuable consultations and advice during this research.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by V. Loia.
Electronic supplementary material
Below is the link to the electronic supplementary material.
500_2016_2439_MOESM1_ESM.pdf
Supplementary 1: The accuracy (Ac), sensitivity (Sn), specificity (Sp) of each run of IFS for each of the three training sets using MSFS (NoGA-Mode) (PDF 221 kb).
500_2016_2439_MOESM2_ESM.pdf
Supplementary 2: The accuracy (Ac), sensitivity (Sn), specificity (Sp) of each run of IFS for each of the three training sets using MSFS (GA-Mode) (PDF 206 kb).
500_2016_2439_MOESM3_ESM.pdf
Supplementary 3: The accuracy (Ac), sensitivity (Sn), specificity (Sp) of each run of IFS for each of the three independent sets using MSFS (NoGA-Mode) (PDF 260 kb).
500_2016_2439_MOESM4_ESM.pdf
Supplementary 4: The accuracy (Ac), sensitivity (Sn), specificity (Sp) of each run of IFS for each of the three independent sets using MSFS (GA-Mode) (PDF 261 kb).
500_2016_2439_MOESM5_ESM.pdf
Supplementary 5: GO terms and KEGG pathways of genes corresponding to the selected CpG sites for the three studied cancer datasets (PDF 241 kb).
About this article
Cite this article
Alkuhlani, A., Nassef, M. & Farag, I. Multistage feature selection approach for high-dimensional cancer data. Soft Comput 21, 6895–6906 (2017). https://doi.org/10.1007/s00500-016-2439-9
DOI: https://doi.org/10.1007/s00500-016-2439-9