Abstract
Statistical modeling of count data from RNA sequencing (RNA-seq) experiments is important for proper interpretation of results. Here I describe how count data can be modeled using count distributions, or alternatively analyzed using nonparametric methods. I focus on basic routines for data input, scaling/normalization, visualization, and statistical testing to determine sets of features where the counts reflect differences in gene expression across samples. Finally, I discuss limitations of, and possible extensions to, the models presented here.
Copyright information
© 2021 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Love, M.I. (2021). Statistical Modeling of High Dimensional Counts. In: Picardi, E. (eds) RNA Bioinformatics. Methods in Molecular Biology, vol 2284. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1307-8_7
DOI: https://doi.org/10.1007/978-1-0716-1307-8_7
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-1306-1
Online ISBN: 978-1-0716-1307-8
eBook Packages: Springer Protocols