Measurement, Summary, and Methodological Variation in RNA-sequencing

  • Alyssa C. Frazee
  • Leonardo Collado Torres
  • Andrew E. Jaffe
  • Ben Langmead
  • Jeffrey T. Leek
Part of the Frontiers in Probability and the Statistical Sciences book series (FROPROSTAS)


There has been a major shift from microarrays to RNA-sequencing (RNA-seq) for measuring gene expression as the price per measurement of the two technologies has become comparable. The main advantage of RNA-seq is greater measurement flexibility: it can detect alternative transcription, allele-specific transcription, and transcription outside of known coding regions. The price of this increased flexibility is (a) an increase in raw data size and (b) more decisions that must be made by the data analyst. Here we provide a selective review and extension of our previous work on measuring the variability in results that arises from different choices about how to summarize and analyze RNA-sequencing data. We discuss a standard model for gene expression measurements that breaks variability down into variation due to technology, biology, and measurement error. Finally, we show the importance of gene model selection, normalization, and choice of statistical model for the ultimate results of an RNA-sequencing experiment.
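The variance breakdown mentioned in the abstract can be illustrated with a small simulation. This is only a sketch under assumptions that the abstract does not state: the three components are taken to be independent, additive, and Gaussian on a log-expression scale, and the function names, standard deviations, and mean below are hypothetical values chosen for illustration, not figures from the chapter.

```python
import random
import statistics


def simulate_expression(n_samples=20000, sd_bio=1.0, sd_tech=0.5,
                        sd_err=0.25, mean=8.0, seed=0):
    """Simulate log-expression values for one gene.

    Assumption (not from the chapter): each measurement is the sum of a
    fixed mean plus independent Gaussian terms for biological variation,
    technical variation, and measurement error.
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    values = []
    for _ in range(n_samples):
        bio = rng.gauss(0.0, sd_bio)    # biological variation
        tech = rng.gauss(0.0, sd_tech)  # technology-related variation
        err = rng.gauss(0.0, sd_err)    # residual measurement error
        values.append(mean + bio + tech + err)
    return values


def expected_variance(sd_bio, sd_tech, sd_err):
    """Total variance when the three components are independent."""
    return sd_bio ** 2 + sd_tech ** 2 + sd_err ** 2


values = simulate_expression()
observed = statistics.pvariance(values)
expected = expected_variance(1.0, 0.5, 0.25)
print(f"observed variance: {observed:.3f}")
print(f"sum of component variances: {expected:.4f}")
```

Under these assumptions the observed variance should land close to the sum of the component variances. In real data the components are not observed separately, so telling them apart requires an experimental design with both biological and technical replicates.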





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
  2. Lieber Institute for Brain Development, Baltimore, USA
  3. Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, USA
