Classification of RNA-seq Data

  • Kean Ming Tan
  • Ashley Petersen
  • Daniela Witten
Part of the Frontiers in Probability and the Statistical Sciences book series (FROPROSTAS)


Next-generation sequencing technologies have made it possible to obtain, at a relatively low cost, a detailed snapshot of the RNA transcripts present in a tissue sample. The resulting reads are usually binned by gene, exon, or other region of interest; thus the data typically amount to read counts for tens of thousands of features, on no more than dozens or hundreds of observations. It is often of interest to use these data to develop a classifier in order to assign an observation to one of several pre-defined classes. However, the high dimensionality of the data poses statistical challenges: because there are far more features than observations, many existing classification techniques cannot be directly applied. In recent years, a number of proposals have been made to extend existing classification approaches to the high-dimensional setting. In this chapter, we discuss the use of, and modifications to, logistic regression, linear discriminant analysis, principal components analysis, partial least squares, and the support vector machine in the high-dimensional setting. We illustrate these methods on two RNA-sequencing data sets.
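As a minimal sketch of why having far more features than observations is a problem, the toy example below builds a small matrix of simulated read counts and checks the rank of its sample covariance matrix. The data, the dimensions (n = 5, p = 20), and the helper names `sample_covariance` and `matrix_rank` are illustrative assumptions and do not come from the chapter. With n observations the covariance matrix has rank at most n − 1, so when p exceeds n it is singular and cannot be inverted, which is the step that unmodified linear discriminant analysis relies on.

```python
from fractions import Fraction
import random


def sample_covariance(X):
    """Return the p x p sample covariance matrix of an n x p data matrix."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    return [
        [sum(r[j] * r[k] for r in centered) / (n - 1) for k in range(p)]
        for j in range(p)
    ]


def matrix_rank(M):
    """Rank via exact Gaussian elimination (entries are Fractions)."""
    A = [row[:] for row in M]
    n_rows, n_cols = len(A), len(A[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot row with a nonzero entry.
        pivot = next((r for r in range(rank, n_rows) if A[r][col] != 0), None)
        if pivot is None:
            continue
        A[rank], A[pivot] = A[pivot], A[rank]
        # Eliminate the entries below the pivot.
        for r in range(rank + 1, n_rows):
            factor = A[r][col] / A[rank][col]
            A[r] = [a - factor * b for a, b in zip(A[r], A[rank])]
        rank += 1
    return rank


# Hypothetical toy data: n = 5 observations, p = 20 features (simulated counts).
random.seed(0)
n, p = 5, 20
counts = [[Fraction(random.randint(0, 50)) for _ in range(p)] for _ in range(n)]

S = sample_covariance(counts)
rank_S = matrix_rank(S)
print(f"n = {n}, p = {p}, rank of covariance = {rank_S}")
```

Exact rational arithmetic is used here only so that the rank computation is not affected by floating-point round-off; real RNA-seq data sets, with tens of thousands of features, would call for a numerical linear algebra library instead.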





D.W. received support for this work from NIH Grant DP5OD009145, NSF CAREER Award DMS-1252624, and a Sloan Foundation Research Fellowship.



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Department of Biostatistics, University of Washington, Seattle, USA
