Discriminant Analysis and Normalization Methods for Next-Generation Sequencing Data
Next-generation sequencing has become a powerful tool for gene expression analysis with the development of high-throughput techniques. Determining which disease class a new sample belongs to is a fundamental problem in medical and biological studies. Unlike microarray data, which are continuous, next-generation sequencing data consist of counts of reads mapped onto the reference genome and are therefore discrete. Consequently, existing discriminant analysis methods for microarray data may not be readily applicable to next-generation sequencing data. In recent years, a number of new discriminant analysis methods have been proposed for classifying next-generation sequencing data. In this chapter, we introduce three such methods: Poisson linear discriminant analysis, zero-inflated Poisson logistic discriminant analysis, and negative binomial linear discriminant analysis. In view of the importance of normalization, we further introduce several normalization methods for processing next-generation sequencing data. Simulation studies and analyses of two real datasets are also presented to demonstrate the usefulness of the newly developed methods.
The authors thank the editor and two referees for their helpful comments that have led to some significant improvements of this chapter. Yan Zhou’s research was supported by the National Natural Science Foundation of China (Grant No. 11701385), National Statistical Research Project (Grant No. 2017LY56), the Doctor Start Fund of Guangdong Province (Grant No. 2016A030310062), and the National Social Science Foundation of China (Grant No. 15CTJ008). Junhui Wang’s research was supported by HK RGC grants GRF-11302615 and GRF-11331016. Yichuan Zhao’s research was partially supported by the NSF Grant DMS-1406163 and NSA Grant H98230-12-1-0209. Tiejun Tong’s research was supported by the Health and Medical Research Fund (Grant No. 04150476) and the National Natural Science Foundation of China (Grant No. 11671338).