The Efficacy of Various Machine Learning Models for Multi-class Classification of RNA-Seq Expression Data

  • Conference paper
Intelligent Computing (CompCom 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 997)

Abstract

Late diagnosis and high costs are key factors that negatively impact the care of cancer patients worldwide. Although the availability of biological markers for diagnosing cancer type is increasing, the cost and reliability of tests currently present a barrier to their routine use. There is a pressing need for accurate methods that enable early diagnosis and cover a broad range of cancers. The use of machine learning and RNA-seq expression analysis has shown promise in the classification of cancer type. However, research is inconclusive about which types of machine learning model are optimal. The suitability of five algorithms was assessed for the classification of 17 different cancer types. Each algorithm was fine-tuned and trained on the full array of 18,015 genes per sample, for 4,221 samples (75% of the dataset). Each was then tested on 1,408 samples (25% of the dataset) whose cancer types were withheld, to determine the accuracy of prediction. The results show that ensemble algorithms achieve 100% accuracy in the classification of 14 out of 17 types of cancer. The clustering and classification models, while faster than the ensembles, performed poorly due to the high level of noise in the dataset. When the features were reduced to a list of 20 genes, the ensemble algorithms maintained an accuracy above 95%, whereas the clustering and classification models did not.
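The evaluation protocol described in the abstract (a 75%/25% split, with accuracy then reported per cancer type) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy labels, the placeholder predictions, and the choice to stratify the split by class are all assumptions made for illustration, and the abstract does not say whether the split was stratified.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Hold out roughly test_fraction of each class (stratification is an assumption)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples for each true class."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {c: correct[c] / total[c] for c in total}


# Toy labels standing in for cancer types (hypothetical, not the paper's data).
labels = ["BRCA"] * 40 + ["LUAD"] * 20 + ["KIRC"] * 20
train_idx, test_idx = stratified_split(labels)

# Withheld labels for the test samples.
y_true = [labels[i] for i in test_idx]

# Placeholder predictions: copy the truth and mislabel one LUAD sample.
y_pred = list(y_true)
y_pred[y_true.index("LUAD")] = "BRCA"

accuracy = per_class_accuracy(y_true, y_pred)
perfect_classes = sorted(c for c, a in accuracy.items() if a == 1.0)
print(accuracy)
print(perfect_classes)
```

Counting the classes whose accuracy equals 1.0 mirrors the way the abstract summarises results ("14 out of 17 types"). In a real experiment, `y_pred` would come from a trained classifier rather than from a copy of the withheld labels.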

Author information

Corresponding authors

Correspondence to Sterling Ramroach, Melford John or Ajay Joshi.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Ramroach, S., John, M., Joshi, A. (2019). The Efficacy of Various Machine Learning Models for Multi-class Classification of RNA-Seq Expression Data. In: Arai, K., Bhatia, R., Kapoor, S. (eds) Intelligent Computing. CompCom 2019. Advances in Intelligent Systems and Computing, vol 997. Springer, Cham. https://doi.org/10.1007/978-3-030-22871-2_65
