Discriminant Analysis and Normalization Methods for Next-Generation Sequencing Data

New Frontiers of Biostatistics and Bioinformatics

Part of the book series: ICSA Book Series in Statistics ((ICSABSS))

Abstract

Next-generation sequencing has become a powerful tool for gene expression analysis with the development of high-throughput techniques. Determining which disease type a new sample belongs to is a fundamental problem in medical and biological studies. Unlike continuous microarray data, next-generation sequencing data are discrete counts of reads mapped onto the reference genome. Consequently, existing discriminant analysis methods for microarray data may not be readily applicable to next-generation sequencing data. In recent years, a number of new discriminant analysis methods have been proposed for classifying next-generation sequencing data. In this chapter, we introduce three such methods: the Poisson linear discriminant analysis, the zero-inflated Poisson logistic discriminant analysis, and the negative binomial linear discriminant analysis. In view of their importance, we further introduce several normalization methods for processing next-generation sequencing data. Simulation studies and analyses of two real datasets are also carried out to demonstrate the usefulness of the newly developed methods.

Acknowledgements

The authors thank the editor and two referees for their helpful comments that have led to some significant improvements of this chapter. Yan Zhou’s research was supported by the National Natural Science Foundation of China (Grant No. 11701385), National Statistical Research Project (Grant No. 2017LY56), the Doctor Start Fund of Guangdong Province (Grant No. 2016A030310062), and the National Social Science Foundation of China (Grant No. 15CTJ008). Junhui Wang’s research was supported by HK RGC grants GRF-11302615 and GRF-11331016. Yichuan Zhao’s research was partially supported by the NSF Grant DMS-1406163 and NSA Grant H98230-12-1-0209. Tiejun Tong’s research was supported by the Health and Medical Research Fund (Grant No. 04150476) and the National Natural Science Foundation of China (Grant No. 11671338).

Author information

Corresponding author

Correspondence to Tiejun Tong.

Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Zhou, Y., Wang, J., Zhao, Y., Tong, T. (2018). Discriminant Analysis and Normalization Methods for Next-Generation Sequencing Data. In: Zhao, Y., Chen, DG. (eds) New Frontiers of Biostatistics and Bioinformatics. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-99389-8_18
