Discriminant Analysis and Normalization Methods for Next-Generation Sequencing Data

Zhou, Yan; Wang, Junhui; Zhao, Yichuan; Tong, Tiejun

doi:10.1007/978-3-319-99389-8_18

Part of the book series: ICSA Book Series in Statistics (ICSABSS)

Abstract

Next-generation sequencing has become a powerful tool for gene expression analysis with the development of high-throughput techniques. Determining which type of disease a new sample belongs to is a fundamental issue in medical and biological studies. Unlike microarray data, which are continuous, next-generation sequencing data are counts of reads mapped onto the reference genome and are therefore discrete. Consequently, existing discriminant analysis methods for microarray data may not be readily applicable to next-generation sequencing data. In recent years, a number of new discriminant analysis methods have been proposed for classifying next-generation sequencing data. In this chapter, we introduce three such methods: Poisson linear discriminant analysis, zero-inflated Poisson logistic discriminant analysis, and negative binomial linear discriminant analysis. Given the importance of normalization, we further introduce several normalization methods for processing next-generation sequencing data. Simulation studies and analyses of two real datasets are also carried out to demonstrate the usefulness of the newly developed methods.

Acknowledgements

The authors thank the editor and two referees for their helpful comments that have led to some significant improvements of this chapter. Yan Zhou’s research was supported by the National Natural Science Foundation of China (Grant No. 11701385), National Statistical Research Project (Grant No. 2017LY56), the Doctor Start Fund of Guangdong Province (Grant No. 2016A030310062), and the National Social Science Foundation of China (Grant No. 15CTJ008). Junhui Wang’s research was supported by HK RGC grants GRF-11302615 and GRF-11331016. Yichuan Zhao’s research was partially supported by the NSF Grant DMS-1406163 and NSA Grant H98230-12-1-0209. Tiejun Tong’s research was supported by the Health and Medical Research Fund (Grant No. 04150476) and the National Natural Science Foundation of China (Grant No. 11671338).

Author information

Authors and Affiliations

College of Mathematics and Statistics, Institute of Statistical Sciences, Shenzhen University, Shenzhen, China
Yan Zhou
School of Data Science, City University of Hong Kong, Kowloon, Hong Kong
Junhui Wang
Department of Mathematics and Statistics, Georgia State University, Atlanta, GA, USA
Yichuan Zhao
Department of Mathematics, Hong Kong Baptist University, Kowloon, Hong Kong
Tiejun Tong

Corresponding author

Correspondence to Tiejun Tong.

Editor information

Editors and Affiliations

Department of Mathematics and Statistics, Georgia State University, Atlanta, GA, USA
Yichuan Zhao
Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Ding-Geng Chen

About this chapter

Cite this chapter

Zhou, Y., Wang, J., Zhao, Y., Tong, T. (2018). Discriminant Analysis and Normalization Methods for Next-Generation Sequencing Data. In: Zhao, Y., Chen, DG. (eds) New Frontiers of Biostatistics and Bioinformatics. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-99389-8_18

DOI: https://doi.org/10.1007/978-3-319-99389-8_18
Published: 06 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99388-1
Online ISBN: 978-3-319-99389-8
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
