Multistage feature selection approach for high-dimensional cancer data

  • Methodologies and Application
  • Soft Computing

Abstract

Cancer is a leading cause of death worldwide. DNA methylation (DNAm) is an epigenetic mechanism that regulates gene expression and is useful for the early detection of cancer. The main challenge with DNA methylation microarray datasets is that the number of CpG sites far exceeds the number of samples. Recent research has attempted to reduce this high dimensionality using various feature selection techniques. This article proposes a multistage feature selection approach to select the optimal CpG sites from three DNAm cancer datasets (breast, colon and lung). The proposed approach combines three filter feature selection methods: Fisher Criterion, t-test and Area Under the ROC Curve. In addition, as a wrapper feature selection step, we apply a genetic algorithm with Support Vector Machine Recursive Feature Elimination (SVM-RFE) as its fitness function and SVM as its evaluator. Using the Incremental Feature Selection (IFS) strategy, subsets of 24, 13 and 27 optimal CpG sites are selected for the breast, colon and lung cancer datasets, respectively. Under fivefold cross-validation on the training datasets, these subsets achieved classification accuracies of 100%, 100% and 97.67%, respectively. Moreover, testing on three independent cancer datasets with these final subsets yielded accuracies of 96.02%, 98.81% and 94.51%, respectively. The experimental results demonstrate high classification performance with small feature subsets. Finally, the biological significance of the genes corresponding to these feature subsets is validated using enrichment analysis.
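To make the filter and IFS stages more concrete, the following is a minimal Python sketch, not the authors' implementation. Several pieces are assumptions made only for illustration: the three filter scores are combined by summing their ranks (the abstract does not say how they are combined), a nearest-centroid classifier with leave-one-out validation stands in for the SVM with fivefold cross-validation, the genetic algorithm and SVM-RFE stage is omitted, and the data are synthetic. All function names are hypothetical.

```python
import random
from statistics import mean, variance

def fisher_score(pos, neg):
    """Fisher criterion: squared mean gap over summed class variances."""
    return (mean(pos) - mean(neg)) ** 2 / (variance(pos) + variance(neg) + 1e-12)

def t_stat(pos, neg):
    """Absolute Welch t-statistic between the two classes."""
    se = (variance(pos) / len(pos) + variance(neg) / len(neg)) ** 0.5
    return abs(mean(pos) - mean(neg)) / (se + 1e-12)

def auc_score(pos, neg):
    """AUC as the share of correctly ordered (pos, neg) pairs, folded so 0.5 is worst."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return max(auc, 1 - auc)

def rank_features(X, y):
    """Order feature indices by summed rank over the three filters (assumed rule)."""
    n_feat = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_feat)]
    total_rank = [0] * n_feat
    for score in (fisher_score, t_stat, auc_score):
        vals = []
        for col in cols:
            pos = [v for v, label in zip(col, y) if label == 1]
            neg = [v for v, label in zip(col, y) if label == 0]
            vals.append(score(pos, neg))
        order = sorted(range(n_feat), key=lambda j: -vals[j])
        for rank, j in enumerate(order):
            total_rank[j] += rank
    return sorted(range(n_feat), key=lambda j: total_rank[j])

def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in for the SVM)."""
    correct = 0
    for i in range(len(X)):
        dist = {}
        for c in (0, 1):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == c]
            centroid = [mean(r[j] for r in rows) for j in feats]
            dist[c] = sum((X[i][j] - m) ** 2 for j, m in zip(feats, centroid))
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)

def incremental_feature_selection(X, y, ranked, max_k):
    """Grow the subset one ranked feature at a time; keep the smallest best prefix."""
    best_k, best_acc = 1, -1.0
    for k in range(1, max_k + 1):
        acc = loo_accuracy(X, y, ranked[:k])
        if acc > best_acc:  # strict '>' keeps the smaller subset on ties
            best_k, best_acc = k, acc
    return ranked[:best_k], best_acc

# Synthetic data: 40 samples, 30 features; only features 0 and 1 carry signal.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(3.0 * label if j < 2 else 0.0, 1.0) for j in range(30)] for label in y]

ranked = rank_features(X, y)
subset, acc = incremental_feature_selection(X, y, ranked, max_k=10)
print("selected features:", subset, "accuracy:", round(acc, 3))
```

In this sketch the features are ranked on all samples before validation, which tends to inflate the accuracy estimate. A more careful evaluation would repeat the ranking inside each validation fold or, as the abstract reports, check the final subset on independent datasets.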

Acknowledgements

We are grateful to Prof. Amr Badr for his valuable consultation and advice throughout this research.

Author information

Corresponding author

Correspondence to Alhasan Alkuhlani.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest regarding the content of this article.

Additional information

Communicated by V. Loia.

Electronic supplementary material

Below is the link to the electronic supplementary material.

500_2016_2439_MOESM1_ESM.pdf

Supplementary 1: The accuracy (Ac), sensitivity (Sn), specificity (Sp) of each run of IFS for each of the three training sets using MSFS (NoGA-Mode) (PDF 221 kb).

500_2016_2439_MOESM2_ESM.pdf

Supplementary 2: The accuracy (Ac), sensitivity (Sn), specificity (Sp) of each run of IFS for each of the three training sets using MSFS (GA-Mode) (PDF 206 kb).

500_2016_2439_MOESM3_ESM.pdf

Supplementary 3: The accuracy (Ac), sensitivity (Sn), specificity (Sp) of each run of IFS for each of the three independent sets using MSFS (NoGA-Mode) (PDF 260 kb).

500_2016_2439_MOESM4_ESM.pdf

Supplementary 4: The accuracy (Ac), sensitivity (Sn), specificity (Sp) of each run of IFS for each of the three independent sets using MSFS (GA-Mode) (PDF 261 kb).

500_2016_2439_MOESM5_ESM.pdf

Supplementary 5: GO terms and KEGG pathways of genes corresponding to the selected CpG sites for the three studied cancer datasets (PDF 241 kb).

About this article

Cite this article

Alkuhlani, A., Nassef, M. & Farag, I. Multistage feature selection approach for high-dimensional cancer data. Soft Comput 21, 6895–6906 (2017). https://doi.org/10.1007/s00500-016-2439-9

  • DOI: https://doi.org/10.1007/s00500-016-2439-9
