Statistical Modeling of High Dimensional Counts

Love, Michael I.

doi:10.1007/978-1-0716-1307-8_7

Michael I. Love^3,4

Part of the book series: Methods in Molecular Biology ((MIMB,volume 2284))

8162 Accesses
1 Citations

Abstract

Statistical modeling of count data from RNA sequencing (RNA-seq) experiments is important for proper interpretation of results. Here I will describe how count data can be modeled using count distributions, or alternatively analyzed using nonparametric methods. I will focus on basic routines for performing data input, scaling/normalization, visualization, and statistical testing to determine sets of features where the counts reflect differences in gene expression across samples. Finally, I discuss limitations and possible extensions to the models presented here.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Protocol: USD 49.95; Price excludes VAT (USA)

eBook: USD 139.00; Price excludes VAT (USA)

Softcover Book: USD 179.99; Price excludes VAT (USA)

Hardcover Book: USD 249.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Love M, Anders S, Kim V, Huber W (2015) RNA-seq workflow: gene-level exploratory analysis and differential expression. F1000research 4:1070
Article PubMed PubMed Central Google Scholar
Love M, Soneson C, Patro R (2018) Swimming downstream: statistical analysis of differential transcript usage following salmon quantification. F1000research 7:952
Article CAS PubMed PubMed Central Google Scholar
Van den Berge K, Hembach KM, Soneson C, Tiberi S, Clement L et al (2019) RNA sequencing data: Hitchhiker’s guide to expression analysis. Ann Rev Biomed Data Sci 2(1):139–173
Article Google Scholar
Ewels P, Magnusson M, Lundin S, Käller M (2016) MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32(19):3047–3048
Article CAS PubMed PubMed Central Google Scholar
King HW, Klose RJ (2017) The pioneer factor oct4 requires the chromatin remodeller brg1 to support gene regulatory element function in mouse embryonic stem cells. Elife 6:e22631
Article PubMed PubMed Central Google Scholar
Patro R, Duggal G, Love M, Irizarry R, Kingsford C (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417–419
Article CAS PubMed PubMed Central Google Scholar
Köster J, Rahmann S (2012) Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28(19):2520–2522
Article PubMed Google Scholar
Huber W, Carey VJ, Gentleman R, Anders S, Carlson M et al (2015) Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods 12(2):115–121
Article CAS PubMed PubMed Central Google Scholar
Love MI, Soneson C, Hickey PF, Johnson LK, Pierce NT et al (2020) Tximeta: reference sequence checksums for provenance identification in RNA-seq. PLoS Comput Biol 16(2):e1007664
Article CAS PubMed PubMed Central Google Scholar
Srivastava A, Malik L, Smith TS, Sudbery I, Patro R (2019) Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biol 20:65
Article PubMed PubMed Central Google Scholar
Frankish A, GENCODE-consoritum, Flicek P. (2018) GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res 47(D1):D766–D773
Article PubMed Central Google Scholar
Soneson C, Love MI, Robinson M (2015) Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000research 4:1521
Article PubMed Google Scholar
Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M et al (2013) Software for computing and annotating genomic ranges. PLoS Comput Biol 9(8):e1003118
Article CAS PubMed PubMed Central Google Scholar
Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12):550
Article PubMed PubMed Central Google Scholar
Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139
Article CAS PubMed Google Scholar
McCarthy DJ, Chen Y, Smyth GK (2012) Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic Acids Res 40:4288–4297
Article CAS PubMed PubMed Central Google Scholar
Law CW, Chen Y, Shi W, Smyth GK (2014) voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15(2):29
Article Google Scholar
Wu H, Wang C, Wu Z (2012) A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 14(2):232–243
Article PubMed PubMed Central Google Scholar
Ignatiadis N, Klaus B, Zaugg J, Huber W (2016) Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nat Methods 13(7):577–580
Article CAS PubMed PubMed Central Google Scholar
Dudoit S, Yang YH, Callow MJ, Speed TP (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sin 12(1):111–139
Google Scholar
Roberts CJ, Nelson B, Marton MJ, Stoughton R, Meyer MR et al (2000) Signaling and circuitry of multiple mapk pathways revealed by a matrix of global gene expression profiles. Science 287(5454):873–880
Article CAS PubMed Google Scholar
Cox DR, Reid N (1987) Parameter orthogonality and approximate conditional inference. J R Stat Soc B 49(1):1–39
Google Scholar
Tibshirani R (1988) Estimating transformations for regression via additivity and variance stabilization. J Am Stat Assoc 83:394–405
Article Google Scholar
Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11:R106
Article CAS PubMed PubMed Central Google Scholar
Witten DM (2011) Classification and clustering of sequencing data using a Poisson model. Annal Appl Stat 5(4):2493–2518
Google Scholar
Townes FW, Hicks SC, Aryee MJ, Irizarry RA (2019) Feature selection and dimension reduction for single cell RNA-seq based on a multinomial model. Genome Biol 20:295
Article CAS PubMed PubMed Central Google Scholar
Zhu A, Ibrahim JG, Love MI (2018) Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics 35(12):2084–2092
Article PubMed Central Google Scholar
Stephens M (2016) False discovery rates: a new deal. Biostatistics 18(2):41
Google Scholar
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300
Google Scholar
Soneson C, Matthes KL, Nowicka M, Law CW, Robinson MD (2016) Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage. Genome Biol 17(1):12
Article PubMed PubMed Central Google Scholar
Anders S, Reyes A, Huber W (2012) Detecting differential usage of exons from RNA-seq data. Genome Res 22(10):2008–2017
Article CAS PubMed PubMed Central Google Scholar
Nowicka M, Robinson M (2016) DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000research 5:1356
Article PubMed PubMed Central Google Scholar
Van den Berge K, Soneson C, Robinson MD, Clement L (2017) stageR: a general stage-wise method for controlling the gene-level false discovery rate in differential expression and differential transcript usage. Genome Biol 18(1):151
Article PubMed PubMed Central Google Scholar
Alasoo K, Rodrigues J, Mukhopadhyay S, Knights A, Mann A et al (2018) Shared genetic effects on chromatin and gene expression indicate a role for enhancer priming in immune response. Nat Genet 50:424–431
Article CAS PubMed PubMed Central Google Scholar
Love MI, Hogenesch JB, Irizarry RA (2016) Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nat Biotechnol 34(12):1287–1291
Article CAS PubMed PubMed Central Google Scholar
Glaus P, Honkela A, Rattray M (2012) Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics 28(13):1721–1728
Article CAS PubMed PubMed Central Google Scholar
Turro E, Astle WJ, Tavaré S (2013) Flexible analysis of RNA-seq data using mixed effects models. Bioinformatics 30(2):180–188
Article PubMed Google Scholar
Al Seesi S, Temate-Tiagueu Y, Zelikovsky A, Măndoiu II (2014) Bootstrap-based differential gene expression analysis for RNA-seq data with and without replicates. BMC Genomics 15(Suppl 8):S2
Article PubMed PubMed Central Google Scholar
Pimentel H, Bray NL, Puente S, Melsted P, Pachter L (2017) Differential analysis of RNA-seq incorporating quantification uncertainty. Nat Methods 14(7):687–690
Article CAS PubMed Google Scholar
Bray NL, Pimentel H, Melsted P, Pachter L (2016) Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34(5):525
Article CAS PubMed Google Scholar
Zhu A, Srivastava A, Ibrahim J, Patro R, Love M (2019) Nonparametric expression analysis using inferential replicate counts. Nucleic Acids Res 47(18):e105
Article CAS PubMed PubMed Central Google Scholar
Li J, Tibshirani R (2011) Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Stat Methods Med Res 22(5):519–536
Article PubMed PubMed Central Google Scholar
Turro E, Su S-Y, Gonçalves Â, Coin LJ, Richardson S, Lewin A (2011) Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biol 12(2):R13
Article CAS PubMed PubMed Central Google Scholar
Storey J, Tibshirani R (2003) Statistical significance for genome-wide experiments. Proc Natl Acad Sci 100(16):9440–9445
Article CAS PubMed PubMed Central Google Scholar
Amezquita RA, Lun ATL, Becht E, Carey VJ, Carpp LN et al (2020) Orchestrating single-cell analysis with bioconductor. Nat Methods 17(2):137–145
Article CAS PubMed Google Scholar
Soneson C, Robinson MD (2018) Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods 15(4):255–261
Article CAS PubMed Google Scholar
Sun S, Zhu J, Ma Y, Zhou X (2019) Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biology 20(1):269
Google Scholar
Duo A, Robinson M, Soneson C (2018) A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000research 7:1141
Article PubMed Google Scholar
Van den Berge K, Perraudeau F, Soneson C, Love MI, Risso D et al (2018) Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications. Genome Biol 19:24
Article PubMed PubMed Central Google Scholar
Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18):3094–3100
Article CAS PubMed PubMed Central Google Scholar
Soneson C, Yao Y, Bratus-Neuenschwander A, Patrignani A, Robinson MD, Hussain S (2019) A comprehensive examination of nanopore native RNA sequencing for characterization of complex transcriptomes. Nat Commun 10(1):3359
Article PubMed PubMed Central Google Scholar
Cruz-Garcia L, O’Brien G, Sipos B, Mayes S, Love M et al (2019) Generation of a transcriptional radiation exposure signature in human blood using long-read nanopore sequencing. Radiat Res 193(2):143–154
Article PubMed PubMed Central Google Scholar
Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q (2020) Opportunities and challenges in long-read sequencing data analysis. Genome Biol 21(1):30
Article PubMed PubMed Central Google Scholar
Castel SE, Levy-Moonshine A, Mohammadi P, Banks E, Lappalainen T (2015) Tools and best practices for data processing in allelic expression analysis. Genome Biol 16(1):195
Article PubMed PubMed Central Google Scholar
Raghupathy N, Choi K, Vincent MJ, Beane GL, Sheppard KS et al (2018) Hierarchical analysis of RNA-seq reads improves the accuracy of allele-specific expression. Bioinformatics 34(13):2177–2184
Article CAS PubMed PubMed Central Google Scholar
Srivastava A, Malik L, Sarkar H, Zakeri M, Almodaresi F et al (2019) Alignment and mapping methodology influence transcript abundance estimation. Genome Biol 21:239
Article Google Scholar
Fernandes AD, Reid JN, Macklaim JM, McMurrough TA, Edgell DR, Gloor GB (2014) Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome 2(1):15
Article PubMed PubMed Central Google Scholar
Calgaro M, Romualdi C, Waldron L, Risso D, Vitulo N (2020) Assessment of single cell RNA-seq statistical methods on microbiome data. Genome Biol 21:191
Article CAS PubMed PubMed Central Google Scholar
Callahan B, Sankaran K, Fukuyama J, McMurdie P, Holmes S (2016) Bioconductor workflow for microbiome data analysis: from raw reads to community analyses. F1000research 5:1492
Article PubMed PubMed Central Google Scholar
Sankaran K, Holmes SP (2018) Latent variable modeling for the microbiome. Biostatistics 20(4):599–614
Article PubMed Central Google Scholar
Willis AD (2019) Rarefaction, alpha diversity, and statistics. Front Microbiol 10:2407
Article PubMed PubMed Central Google Scholar

Download references

Author information

Authors and Affiliations

Department of Biostatistics, University of North Carolina-Chapel Hill, Chapel Hill, NC, USA
Michael I. Love
Department of Genetics, University of North Carolina-Chapel Hill, Chapel Hill, NC, USA
Michael I. Love

Authors

Michael I. Love
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Dipartimento Di Bioscienze Biotecnologie E Biofarmaceutica, Università degli Studi di Bari Aldo Moro, Bari, Italy
Ernesto Picardi

Rights and permissions

Reprints and permissions

Copyright information

About this protocol

Cite this protocol

Love, M.I. (2021). Statistical Modeling of High Dimensional Counts. In: Picardi, E. (eds) RNA Bioinformatics. Methods in Molecular Biology, vol 2284. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1307-8_7

Download citation

DOI: https://doi.org/10.1007/978-1-0716-1307-8_7
Published: 10 April 2021
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-1306-1
Online ISBN: 978-1-0716-1307-8
eBook Packages: Springer Protocols

Publish with us

Policies and ethics