Statistical Modeling of High Dimensional Counts

  • Protocol
RNA Bioinformatics

Part of the book series: Methods in Molecular Biology ((MIMB,volume 2284))

Abstract

Statistical modeling of count data from RNA sequencing (RNA-seq) experiments is important for proper interpretation of results. Here I describe how count data can be modeled using count distributions, or alternatively analyzed using nonparametric methods. I focus on basic routines for data input, scaling/normalization, visualization, and statistical testing to determine sets of features where the counts reflect differences in gene expression across samples. Finally, I discuss limitations of, and possible extensions to, the models presented here.

Copyright information

© 2021 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Love, M.I. (2021). Statistical Modeling of High Dimensional Counts. In: Picardi, E. (eds) RNA Bioinformatics. Methods in Molecular Biology, vol 2284. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1307-8_7

  • DOI: https://doi.org/10.1007/978-1-0716-1307-8_7

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-1306-1

  • Online ISBN: 978-1-0716-1307-8

  • eBook Packages: Springer Protocols
