Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification

  • Original Article
Medical & Biological Engineering & Computing

Abstract

Microarray gene expression data typically contain a large number of genes but only a small number of samples. Moreover, only a few of these genes are relevant to cancer, which makes gene selection a significant challenge. We therefore propose XGBoost-MOGA, a two-stage gene selection approach that combines extreme gradient boosting (XGBoost) with a multi-objective optimization genetic algorithm, for cancer classification on microarray datasets. In the first stage, the genes are ranked by an ensemble-based feature selection method built on XGBoost. This stage effectively removes irrelevant genes and yields a group of the genes most relevant to the class. In the second stage, XGBoost-MOGA uses a multi-objective optimization genetic algorithm to search this group for an optimal gene subset. We performed comprehensive experiments comparing XGBoost-MOGA with other state-of-the-art feature selection methods, using two well-known learning classifiers on 14 publicly available microarray expression datasets. The experimental results show that XGBoost-MOGA yields significantly better results than previous state-of-the-art algorithms on several evaluation criteria, including accuracy, F-score, precision, and recall.
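The two-stage idea can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The importance scores, gene names, and accuracy values are hypothetical placeholders; in practice the scores would come from a fitted XGBoost model. The abstract does not state the objectives, so the sketch assumes two: maximize classification accuracy and minimize the number of selected genes. It shows only the stage-one ranking filter and the Pareto-dominance test that a multi-objective genetic algorithm would use to compare candidate subsets, not the evolutionary loop itself.

```python
from typing import Dict, List, Sequence, Tuple

# A candidate solution: (selected gene subset, classification accuracy).
Candidate = Tuple[Tuple[str, ...], float]


def top_k_genes(importance: Dict[str, float], k: int) -> List[str]:
    """Stage 1 (sketch): keep the k genes with the highest importance score.

    The scores are assumed to come from a fitted boosting model; here they
    are plain numbers so the example has no external dependencies.
    Ties are broken alphabetically by gene name.
    """
    ranked = sorted(importance, key=lambda g: (-importance[g], g))
    return ranked[:k]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives and better on one.

    Assumed objectives: maximize accuracy, minimize number of genes.
    """
    (genes_a, acc_a), (genes_b, acc_b) = a, b
    no_worse = acc_a >= acc_b and len(genes_a) <= len(genes_b)
    better = acc_a > acc_b or len(genes_a) < len(genes_b)
    return no_worse and better


def pareto_front(candidates: Sequence[Candidate]) -> List[Candidate]:
    """Stage 2 (sketch): return the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical importance scores for five genes (not taken from the paper).
scores = {"g1": 0.40, "g2": 0.05, "g3": 0.30, "g4": 0.00, "g5": 0.25}
pool = top_k_genes(scores, k=3)

# Hypothetical (subset, accuracy) pairs that a genetic algorithm might evaluate.
evaluated = [
    (("g1",), 0.80),
    (("g1", "g3"), 0.90),
    (("g1", "g5"), 0.85),
    (("g1", "g3", "g5"), 0.90),
]
front = pareto_front(evaluated)
```

With these placeholder values, `pool` holds the three highest-scoring genes, and `front` keeps only the subsets that trade accuracy against size without being beaten on both counts. A full multi-objective genetic algorithm would repeatedly generate new subsets from `pool` by crossover and mutation and apply this kind of dominance comparison during selection.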

Acknowledgements

The authors would like to thank the anonymous referees for their comments. This work was partially supported by grants from the National Natural Science Foundation of China (No. 621660028).

Author information

Corresponding author

Correspondence to Li Min.

About this article

Cite this article

Deng, X., Li, M., Deng, S. et al. Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification. Med Biol Eng Comput 60, 663–681 (2022). https://doi.org/10.1007/s11517-021-02476-x

  • DOI: https://doi.org/10.1007/s11517-021-02476-x
