Abstract
A more efficient feature selection method was developed to screen for genes associated with specific cancers, as a basis for further investigation of their pathogenesis. The proposed LASSO-RFE model is a least absolute shrinkage and selection operator (LASSO) classifier built on the idea of recursive feature elimination (RFE). To verify the efficiency of the algorithm, performance tests were conducted on four kinds of gene expression RNA sequencing data publicly available from The Cancer Genome Atlas (TCGA). The numerical experiments illustrate that, compared with three typical feature selection algorithms, LASSO-RFE gives the classification prediction model higher accuracy and gives the selected gene features clearer biological interpretability. The results showed that LASSO-RFE effectively reduced the tens of thousands of features in the original data to three dimensions and gave the classification model better performance than mutual information, L1-SVM, and tree-based selection methods. The model retains the ability of the common LASSO algorithm to filter out redundant and irrelevant features and, through RFE, enhances biological interpretability relative to traditional feature reduction methods. Only a limited number of data cases have been validated in this paper, and the application of LASSO-RFE to more recent data remains to be investigated.
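The combination described above can be sketched in a few lines. The following is only an illustration under stated assumptions, not the paper's implementation: it uses scikit-learn's `RFE` wrapper around an L1-penalised logistic regression as a stand-in for the LASSO classifier, a synthetic binary dataset in place of TCGA expression data, and an arbitrary elimination step of 25 features per round. Only the target of three retained features comes from the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for an expression matrix (samples x genes); NOT TCGA data.
X, y = make_classification(
    n_samples=200, n_features=500, n_informative=5, n_redundant=0, random_state=0
)

# Assumption: an L1-penalised logistic regression plays the role of the
# LASSO classifier. The L1 penalty drives many coefficients to exactly zero.
lasso_clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

# RFE refits the estimator repeatedly and discards the 25 features with the
# smallest coefficient magnitude each round until three features remain.
selector = RFE(estimator=lasso_clf, n_features_to_select=3, step=25)

# Selection sits inside the pipeline so that each cross-validation fold
# selects features from its own training split only.
pipeline = Pipeline([
    ("select", selector),
    ("classify", LogisticRegression(max_iter=1000)),
])

mean_acc = cross_val_score(pipeline, X, y, cv=5).mean()

# Refit on all samples to report which feature columns were kept.
pipeline.fit(X, y)
selected = np.flatnonzero(pipeline.named_steps["select"].support_)

print("selected feature indices:", selected)
print("mean 5-fold CV accuracy:", mean_acc)
```

Placing the selector inside the pipeline avoids estimating accuracy on features that were chosen with knowledge of the held-out samples. Whether the paper evaluated its models this way is not stated in the abstract.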
Data Availability
All data are available from the corresponding author.
Funding
None.
Author information
Contributions
CA contributed to conceptualization, data curation, writing—original draft, and writing—review and editing.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Ai, C. A Method for Cancer Genomics Feature Selection Based on LASSO-RFE. Iran J Sci Technol Trans Sci 46, 731–738 (2022). https://doi.org/10.1007/s40995-022-01292-8