Abstract
During cancer progression, the expression levels of relevant genes change significantly in tumors compared with their healthy counterparts. The discovery of specific genes that serve as biomarkers is therefore of practical significance for diagnosis and prognosis. High-throughput ‘-omic’ datasets, such as the public RNA-sequencing data generated by The Cancer Genome Atlas (TCGA) consortium, provide unprecedented resources and opportunities for deriving cancer biomarkers. Here, we use machine learning to identify biomarker genes in 12 types of cancer, based on how well they classify control and disease samples. We first identify differentially expressed genes individually. We then perform feature selection by integrating recursive feature reduction with random forest classification and feature ranking. The final number of features is determined by a parsimony principle: the features should be as few as possible while still achieving the highest classification accuracy. For each cancer, the biomarker genes are then evaluated by tenfold cross-validation with several classification algorithms. We find that the extreme learning machine achieves the best classification performance among the compared methods. Further gene enrichment analyses indicate the dysfunctional and pathogenic mechanisms associated with the identified biomarkers.
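The feature-selection step and the parsimony rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Python with scikit-learn, uses scikit-learn's `RFE` (recursive feature elimination) wrapped around a random forest, and runs on synthetic data from `make_classification` in place of TCGA expression values. The function name `parsimonious_rfe`, the candidate sizes, and all hyperparameters are hypothetical choices for illustration. The differential expression pre-filter and the extreme learning machine classifier are omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score


def parsimonious_rfe(X, y, candidate_sizes, n_splits=10, seed=0):
    """Pick the smallest feature count that reaches the best CV accuracy.

    Returns (selected_indices, best_accuracy, scores), where scores maps
    each candidate feature count to its mean cross-validated accuracy.
    """
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

    def make_rfe(k):
        # Random forest importances rank the features; RFE drops the
        # lowest-ranked ones until k remain (20% of features per round).
        forest = RandomForestClassifier(n_estimators=30, random_state=seed)
        return RFE(forest, n_features_to_select=k, step=0.2)

    # Selection runs inside each fold, so held-out samples never
    # influence which features are kept.
    scores = {
        k: cross_val_score(make_rfe(k), X, y, cv=cv, scoring="accuracy").mean()
        for k in candidate_sizes
    }

    # Parsimony rule: fewest features among those with the top accuracy.
    best_accuracy = max(scores.values())
    best_k = min(k for k, acc in scores.items() if acc == best_accuracy)

    # Refit on all samples to report the final feature indices.
    final = make_rfe(best_k).fit(X, y)
    return np.flatnonzero(final.support_), best_accuracy, scores


# Synthetic stand-in for an expression matrix: 100 samples x 20 "genes".
X, y = make_classification(
    n_samples=100, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
selected, best_accuracy, scores = parsimonious_rfe(X, y, candidate_sizes=[2, 4, 8])
print("selected feature indices:", selected)
print("mean tenfold accuracy:", round(best_accuracy, 3))
```

Running the elimination inside each cross-validation fold is a design choice made here to avoid an optimistic accuracy estimate; whether the paper nests selection in this way is not stated in the abstract.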
Acknowledgement
This work was partially supported by the National Natural Science Foundation of China (Nos. 61572287 and 61533011), the Shandong Provincial Key Research and Development Program, China (No. 2018GSF118043), the Innovation Method Fund of China (Ministry of Science and Technology of China, No. 2018IM020200), and the Program of Qilu Young Scholars of Shandong University.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Z., Liu, ZP. (2019). Identifying Cancer Biomarkers from High-Throughput RNA Sequencing Data by Machine Learning. In: Huang, DS., Jo, KH., Huang, ZK. (eds) Intelligent Computing Theories and Application. ICIC 2019. Lecture Notes in Computer Science, vol 11644. Springer, Cham. https://doi.org/10.1007/978-3-030-26969-2_49
DOI: https://doi.org/10.1007/978-3-030-26969-2_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26968-5
Online ISBN: 978-3-030-26969-2
eBook Packages: Computer Science, Computer Science (R0)