HBS–STACK: hierarchical biomarker selection and stacked ensemble model for biomarker identification and cancer prediction on multi-omics

  • Original Article
Neural Computing and Applications

Abstract

Advances in genomic and transcriptomic data have opened new prospects for biomarker identification and cancer prediction. However, existing biomarker identification and cancer diagnosis techniques struggle to capture the complex, nonlinear associations in biological datasets. Machine learning offers considerable potential for building feature selection techniques and models that identify cancer biomarkers. In this article, we propose HBS–STACK, a Hierarchical Biomarker Selection and Stacked Ensemble model for Biomarker Identification and Cancer Prediction, applied to miRNA, gene expression, and DNA methylation (DM) datasets. Biomarker selection proceeds in three stages: stage 1 aggregates information between CpG sites and genes according to their biological relations, stage 2 applies Fold Change and False Discovery Rate selection, and stage 3 applies Light Gradient Boosting Machine with Recursive Feature Elimination (LGBMRFE) selection. The selected features and markers are integrated and passed to a stacked ML model with Gradient Boosting Machine (GBM), Naïve Bayes (NB), and Random Forest (RF) as level 1 learners and a deep neural network (DNN) as the level 2 learner. HBS–STACK is evaluated on breast cancer (BRCA) and validated on kidney renal clear cell carcinoma (KIRC) data from The Cancer Genome Atlas (TCGA) portal and on Alzheimer's disease. We identified several genomic and transcriptomic biomarkers, including IQSEC1 for BRCA; ZFHX3, CTBP2, and SLC9AR2 for KIRC; and TMEM61 for Alzheimer's disease. The experimental results show that HBS–STACK outperformed GBM, NB, and RF, reaching 99.60, 99.03, and 92.05% accuracy on BRCA, KIRC, and Alzheimer's disease, respectively, an improvement of 2.27, 26.03, and 10.05% over existing techniques.
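The stage 2 filter (Fold Change plus False Discovery Rate) can be illustrated with a small sketch. The abstract does not state the cut-offs or the statistical test used, so the thresholds `fc_cut` and `fdr_cut`, the pseudocount, and all function names below are assumptions made for illustration. The false discovery rate is controlled here with the Benjamini-Hochberg procedure, which is one common choice and not necessarily the authors' method, and the raw p-values are assumed to come from an upstream test.

```python
import math


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """Log2 ratio of group means; the pseudocount (an assumption) avoids division by zero."""
    mean_tumor = sum(tumor) / len(tumor)
    mean_normal = sum(normal) / len(normal)
    return math.log2((mean_tumor + pseudocount) / (mean_normal + pseudocount))


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_features(names, log2fc, pvals, fc_cut=1.0, fdr_cut=0.05):
    """Keep features whose |log2 FC| and adjusted p-value pass both (hypothetical) cut-offs."""
    qvals = bh_adjust(pvals)
    return [
        name
        for name, fc, q in zip(names, log2fc, qvals)
        if abs(fc) >= fc_cut and q <= fdr_cut
    ]
```

For example, `select_features(["a", "b"], [2.0, 0.2], [0.01, 0.04])` keeps only `"a"`, because `"b"` fails the fold-change cut-off even though its adjusted p-value passes.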

Data availability

The datasets will be made available on reasonable request.

Funding

The authors have no funding to report.

Author information

Corresponding author

Correspondence to Arwinder Dhillon.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Ethical standards

The authors declare that this article complies with ethical standards.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix I: Algorithm for proposed HBS–STACK

Algorithm 1: Pseudocode of HBS–STACK (figure not reproduced)
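The stacked architecture described in the abstract (GBM, NB, and RF at level 1, a DNN at level 2) can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: the synthetic data, every hyperparameter, and the helper name `build_stack` are hypothetical, and `MLPClassifier` stands in for the level 2 DNN. The biomarker selection stages are omitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier


def build_stack(random_state=0):
    """Two-level stack: GBM, NB, RF feed class probabilities to a small neural network."""
    level1 = [
        ("gbm", GradientBoostingClassifier(random_state=random_state)),
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=random_state)),
    ]
    # Assumed architecture: two small hidden layers as a stand-in for the DNN.
    level2 = MLPClassifier(
        hidden_layer_sizes=(16, 8), max_iter=2000, random_state=random_state
    )
    # cv=5 trains the level 2 learner on out-of-fold level 1 predictions,
    # which limits leakage from the training labels into the meta-features.
    return StackingClassifier(
        estimators=level1,
        final_estimator=level2,
        cv=5,
        stack_method="predict_proba",
    )


# Synthetic stand-in for an integrated multi-omics feature matrix (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = build_stack().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # held-out accuracy on the synthetic data
```

Training the level 2 learner on cross-validated predictions, instead of predictions on data the level 1 learners have already seen, is the usual design choice in stacking because it keeps the meta-learner from simply trusting whichever base model overfits most.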

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dhillon, A., Singh, A. & Bhalla, V.K. HBS–STACK: hierarchical biomarker selection and stacked ensemble model for biomarker identification and cancer prediction on multi-omics. Neural Comput & Applic 36, 5413–5431 (2024). https://doi.org/10.1007/s00521-023-09359-2

  • DOI: https://doi.org/10.1007/s00521-023-09359-2
