Machine Learning-Based State-of-the-Art Methods for the Classification of RNA-Seq Data

  • Almas Jabeen
  • Nadeem Ahmad
  • Khalid Raza
Part of the Lecture Notes in Computational Vision and Biomechanics book series (LNCVB, volume 26)


Ribonucleic acid sequencing (RNA-Seq) measures the expression levels of many transcripts simultaneously. The resulting reads can be summarized over genes, exons, or other regions of interest. Various computational tools have been developed for studying pathogens or viruses from RNA-Seq data by classifying them, according to their attributes, into several pre-defined classes. However, computational tools and approaches for analyzing complex datasets are still lacking. Classification models are therefore needed for the diagnosis and classification of diseases, for disease monitoring at the molecular level, and for research into potential disease biomarkers. In this chapter, we discuss various machine learning approaches for RNA-Seq data classification and their implementation. Advances in bioinformatics, together with developments in machine learning-based classification, should provide powerful toolboxes for classifying the transcriptome information available through RNA-Seq data.
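To make the idea of assigning RNA-Seq samples to pre-defined classes concrete, the following is a minimal sketch and not a method taken from the chapter. It assumes a toy count matrix of four hypothetical genes, the hypothetical class labels "tumor" and "normal", a log2 counts-per-million transform to remove sequencing-depth differences, and a simple nearest-centroid rule standing in for the classifiers named in the keywords. All function names are illustrative.

```python
import math

def log_cpm(counts):
    """Convert one sample's raw read counts to log2(counts-per-million + 1)."""
    total = sum(counts)
    return [math.log2(c / total * 1e6 + 1.0) for c in counts]

def fit_centroids(profiles, labels):
    """Average the normalized profiles of each class into one centroid per class."""
    sums, sizes = {}, {}
    for profile, label in zip(profiles, labels):
        acc = sums.setdefault(label, [0.0] * len(profile))
        for i, value in enumerate(profile):
            acc[i] += value
        sizes[label] = sizes.get(label, 0) + 1
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, profile):
    """Return the class whose centroid is nearest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, profile))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Toy training data (made up for illustration): raw counts for 4 genes per sample.
TRAIN = [
    ([500, 20, 300, 180], "tumor"),
    ([450, 30, 280, 240], "tumor"),
    ([480, 25, 310, 185], "tumor"),
    ([100, 400, 290, 210], "normal"),
    ([120, 380, 300, 200], "normal"),
    ([90, 420, 310, 180], "normal"),
]

centroids = fit_centroids(
    [log_cpm(counts) for counts, _ in TRAIN],
    [label for _, label in TRAIN],
)

# The first new sample was sequenced at twice the depth of the training
# samples; the CPM transform puts it on the same scale before prediction.
deep_sample = [940, 70, 590, 400]
shallow_sample = [110, 390, 305, 195]
print(predict(centroids, log_cpm(deep_sample)))
print(predict(centroids, log_cpm(shallow_sample)))
```

In practice the nearest-centroid rule would be replaced by one of the methods the chapter covers, such as a support vector machine or a random forest, and the model would be evaluated on held-out samples rather than on a handful of toy profiles.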


RNA-Seq data; Deep learning; Deep neural networks; Supervised; Unsupervised; Classification; Clustering; Support vector machine (SVM); BagSVM; Classification and regression trees (CART); Random forest; Feature selection



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Department of Biosciences, Jamia Millia Islamia, New Delhi, India
  2. Department of Computer Science, Jamia Millia Islamia, New Delhi, India
