Discriminant Analysis and Normalization Methods for Next-Generation Sequencing Data

  • Yan Zhou
  • Junhui Wang
  • Yichuan Zhao
  • Tiejun Tong
Part of the ICSA Book Series in Statistics book series (ICSABSS)


Next-generation sequencing has become a powerful tool for gene expression analysis with the development of high-throughput techniques. Determining which type of disease a new sample belongs to is a fundamental problem in medical and biological studies. Unlike microarray data, which are continuous, next-generation sequencing data consist of reads mapped onto the reference genome and are discrete. Consequently, existing discriminant analysis methods for microarray data may not be readily applicable to next-generation sequencing data. In recent years, a number of new discriminant analysis methods have been proposed for classifying next-generation sequencing data. In this chapter, we introduce three such methods: Poisson linear discriminant analysis, zero-inflated Poisson logistic discriminant analysis, and negative binomial linear discriminant analysis. Given the importance of normalization in processing next-generation sequencing data, we also introduce several normalization methods. Simulation studies and analyses of two real datasets are carried out to demonstrate the usefulness of the newly developed methods.



The authors thank the editor and two referees for their helpful comments that have led to some significant improvements of this chapter. Yan Zhou’s research was supported by the National Natural Science Foundation of China (Grant No. 11701385), National Statistical Research Project (Grant No. 2017LY56), the Doctor Start Fund of Guangdong Province (Grant No. 2016A030310062), and the National Social Science Foundation of China (Grant No. 15CTJ008). Junhui Wang’s research was supported by HK RGC grants GRF-11302615 and GRF-11331016. Yichuan Zhao’s research was partially supported by the NSF Grant DMS-1406163 and NSA Grant H98230-12-1-0209. Tiejun Tong’s research was supported by the Health and Medical Research Fund (Grant No. 04150476) and the National Natural Science Foundation of China (Grant No. 11671338).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Yan Zhou (1)
  • Junhui Wang (2)
  • Yichuan Zhao (3)
  • Tiejun Tong (4, corresponding author)

  1. College of Mathematics and Statistics, Institute of Statistical Sciences, Shenzhen University, Shenzhen, China
  2. School of Data Science, City University of Hong Kong, Kowloon, Hong Kong
  3. Department of Mathematics and Statistics, Georgia State University, Atlanta, USA
  4. Department of Mathematics, Hong Kong Baptist University, Kowloon, Hong Kong
