Identifying Cancer Biomarkers from High-Throughput RNA Sequencing Data by Machine Learning

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11644)

Abstract

During cancer progression, the expression levels of relevant genes change significantly in tumors compared with their healthy counterparts. The discovery of specific genes that serve as biomarkers is therefore of practical significance for diagnosis and prognosis. High-throughput ‘-omic’ datasets, such as the public RNA-sequencing data generated by The Cancer Genome Atlas (TCGA) consortium, provide unprecedented resources and opportunities for deriving cancer biomarkers. Here, we explore the identification of biomarker genes in 12 types of cancer, based on how well they classify control and disease samples by machine learning. We first identify differentially expressed genes individually. We then perform feature selection by integrating recursive feature reduction with random forest classification and feature ranking. The final number of features is determined by a parsimony principle: use as few features as possible while retaining the highest classification accuracy. For each cancer, the biomarker genes are then evaluated by tenfold cross-validation with several classification algorithms. We find that the extreme learning machine achieves the best classification performance among the methods compared. Further gene enrichment analyses indicate the dysfunctional and pathogenic mechanisms associated with the identified biomarkers.
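
The selection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`recursive_feature_reduction`, `most_parsimonious`), the gene names, and the toy importance and accuracy values are all hypothetical. In a real analysis, `rank` would presumably return random forest feature importances and `evaluate` a cross-validated classification accuracy; here both are replaced by deterministic stand-ins so that the sketch runs with the standard library alone.

```python
from typing import Callable, Dict, List, Sequence, Tuple

History = List[Tuple[List[str], float]]


def recursive_feature_reduction(
    features: Sequence[str],
    rank: Callable[[List[str]], Dict[str, float]],
    evaluate: Callable[[List[str]], float],
    step: int = 1,
) -> History:
    """Drop the `step` lowest-ranked features per round, scoring each subset."""
    if not features:
        raise ValueError("at least one feature is required")
    current = list(features)
    history: History = []
    while True:
        history.append((list(current), evaluate(current)))
        if len(current) == 1:
            break
        importance = rank(current)  # re-rank using only the surviving features
        current.sort(key=lambda g: importance[g], reverse=True)
        current = current[: max(1, len(current) - step)]
    return history


def most_parsimonious(history: History, tol: float = 0.0) -> List[str]:
    """Smallest subset whose accuracy is within `tol` of the best one seen."""
    best = max(acc for _, acc in history)
    candidates = [subset for subset, acc in history if acc >= best - tol]
    return min(candidates, key=len)


# Hypothetical stand-ins (assumptions for illustration only): fixed importance
# scores in place of a random forest, and an accuracy that depends on how many
# "informative" genes remain in place of cross-validated accuracy.
TOY_IMPORTANCE = {
    "GENE_A": 0.50, "GENE_B": 0.30, "GENE_C": 0.15,
    "GENE_D": 0.04, "GENE_E": 0.01,
}
INFORMATIVE = {"GENE_A", "GENE_B", "GENE_C"}


def toy_rank(subset: List[str]) -> Dict[str, float]:
    return {g: TOY_IMPORTANCE[g] for g in subset}


def toy_accuracy(subset: List[str]) -> float:
    return round(0.6 + 0.1 * len(INFORMATIVE.intersection(subset)), 2)


if __name__ == "__main__":
    history = recursive_feature_reduction(
        list(TOY_IMPORTANCE), toy_rank, toy_accuracy
    )
    for subset, acc in history:
        print(len(subset), acc)
    print(most_parsimonious(history))
```

Passing the ranker and the evaluator as callbacks keeps the parsimony rule separate from any particular classifier, so the same loop could wrap a random forest or another model. The `tol` parameter is an added convenience, not something the abstract specifies; leaving it at zero matches the stated rule of keeping the highest accuracy.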

Acknowledgement

This work was partially supported by the National Natural Science Foundation of China (Nos. 61572287 and 61533011), the Shandong Provincial Key Research and Development Program, China (No. 2018GSF118043), the Innovation Method Fund of China (Ministry of Science and Technology of China, No. 2018IM020200), and the Program of Qilu Young Scholars of Shandong University.

Author information

Corresponding author

Correspondence to Zhi-Ping Liu.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, Z., Liu, ZP. (2019). Identifying Cancer Biomarkers from High-Throughput RNA Sequencing Data by Machine Learning. In: Huang, DS., Jo, KH., Huang, ZK. (eds) Intelligent Computing Theories and Application. ICIC 2019. Lecture Notes in Computer Science, vol 11644. Springer, Cham. https://doi.org/10.1007/978-3-030-26969-2_49

  • DOI: https://doi.org/10.1007/978-3-030-26969-2_49

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26968-5

  • Online ISBN: 978-3-030-26969-2

  • eBook Packages: Computer Science, Computer Science (R0)
