The Role of Spike-In Standards in the Normalization of RNA-seq

  • Davide Risso
  • John Ngai
  • Terence P. Speed
  • Sandrine Dudoit
Part of the Frontiers in Probability and the Statistical Sciences book series (FROPROSTAS)

Abstract

Normalization of RNA-seq data is essential to ensure accurate inference of expression levels, by adjusting for sequencing depth and other more complex nuisance effects, both within and between samples. Recently, the External RNA Controls Consortium (ERCC) developed a set of 92 synthetic spike-in standards that are commercially available and relatively easy to add to a typical library preparation. In this chapter, we compare the performance of several state-of-the-art normalization methods, including adaptations that directly use spike-in sequences as controls. We show that although the ERCC spike-ins could in principle be valuable for assessing accuracy in RNA-seq experiments, their read counts are not stable enough to be used for normalization purposes. We propose a novel approach to normalization that can successfully make use of control sequences to remove unwanted effects and lead to accurate estimation of expression fold-changes and tests of differential expression.

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Davide Risso (1)
  • John Ngai (2)
  • Terence P. Speed (1, 3, 4)
  • Sandrine Dudoit (5)
  1. Department of Statistics, University of California, Berkeley, USA
  2. Department of Molecular and Cell Biology, Helen Wills Neuroscience Institute, and Functional Genomics Laboratory, University of California, Berkeley, USA
  3. Bioinformatics Division, Walter and Eliza Hall Institute, Melbourne, Australia
  4. Department of Mathematics and Statistics, The University of Melbourne, Victoria, Australia
  5. Division of Biostatistics and Department of Statistics, University of California, Berkeley, USA
