Abstract
The power of deep sequencing technology to reliably detect single RNA reads leads to a paradoxical problem of high sensitivity. In hybridization- or PCR-based methods for RNA quantification, the concern is low sensitivity, i.e., the signal from truly expressed genes might not be distinguishable from noise. In contrast, the problem with RNA-seq is that it is unclear whether very low read counts come from lowly expressed genes or merely reflect transcriptional noise. The frequency distribution of read counts does not show a clear separation into two classes of genes, which makes the decision whether a gene is to be considered expressed seemingly arbitrary. Here we address this problem by proposing a statistical model that treats the number of transcripts detected in an RNA-seq study as a mixture of two distributions: an exponential distribution for transcripts from inactive genes and a negative binomial distribution for transcripts from actively transcribed genes. We apply this model to a number of RNA-seq data sets and find that it fits the data very well. The calculated criterion for distinguishing between expressed and non-expressed genes is remarkably consistent among data sets, suggesting that genes with more than two transcripts per million transcripts (TPM) are highly likely to be actively transcribed. This criterion is consistent with the criterion of 1 RPKM proposed by Hebenstreit et al., Mol Syst Biol 7:497 (2011), based on chromatin modification and per-cell RNA expression data. Hence, the regression model correctly identifies the class of genes that are not actively expressed and thus provides an operational criterion for classifying genes into expressed and non-expressed sets, facilitating the interpretation of RNA-seq data.
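The two-component mixture described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice to discretize the exponential component so that it applies to integer counts, the 0.95 posterior cutoff, and all parameter values are assumptions made for illustration, and no fitting procedure is shown. Given fixed parameters, the sketch computes the posterior probability that a count comes from the actively transcribed component and finds the smallest count at which that probability reaches the cutoff.

```python
import math


def discretized_exponential_pmf(k, rate):
    """P(k <= X < k + 1) for X ~ Exponential(rate).

    Assumption: the exponential component is discretized so it can be
    compared with a count distribution on k = 0, 1, 2, ...
    """
    return math.exp(-rate * k) * (1.0 - math.exp(-rate))


def negative_binomial_pmf(k, r, p):
    """Negative binomial pmf with size r > 0 and success probability p."""
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def posterior_active(k, weight_inactive, rate, r, p):
    """Posterior probability that count k comes from the active component."""
    inactive = weight_inactive * discretized_exponential_pmf(k, rate)
    active = (1.0 - weight_inactive) * negative_binomial_pmf(k, r, p)
    return active / (inactive + active)


def expression_threshold(weight_inactive, rate, r, p,
                         cutoff=0.95, max_count=10_000):
    """Smallest count whose posterior of being active reaches the cutoff.

    Returns None if no count up to max_count qualifies.
    """
    for k in range(max_count + 1):
        if posterior_active(k, weight_inactive, rate, r, p) >= cutoff:
            return k
    return None


# Hypothetical parameter values, chosen only to make the example run;
# they are NOT estimates reported in the paper.
PARAMS = dict(weight_inactive=0.4, rate=1.0, r=0.5, p=0.01)

if __name__ == "__main__":
    threshold = expression_threshold(**PARAMS)
    print("smallest count called expressed:", threshold)
```

In practice the mixture weight and the component parameters would be estimated from the observed count distribution of each data set, and the resulting threshold would be expressed in the same units as the input (for example, TPM).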
References
Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11:R106
Bishop JO, Morton JG, Rosbash M, Richardson M (1974) Three abundance classes in HeLa cell messenger RNA. Nature 250:199–204
Hebenstreit D, Fang M, Gu M, Charoensawan V, van Oudenaarden A, Teichmann SA (2011) RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol Syst Biol 7:497
Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26:493–500
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621–628
Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M (2008) The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320:1344–1349
R Development Core Team (2009) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11:R25
Struhl K (2007) Transcriptional noise and the fidelity of initiation by RNA polymerase II. Nat Struct Mol Biol 14:103–105
Visel A et al (2009) ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 457:854–858
Wagner GP, Kin K, Lynch VJ (2012) Measurement of mRNA abundance using RNA-seq data: RPKM is inconsistent among samples. Theory Biosci. doi:10.1007/s12064-012-0162-3
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63
Acknowledgments
The authors thank Nicholas Carriero for help in mapping the RNA-seq data to reference genomes and for providing a list of feature lengths for the human genome. The research for this paper was supported by the John Templeton Foundation grant #12793 (The opinions expressed in this paper are not those of the JTF) and the Foundational Questions in Evolutionary Biology grant (FQEB # RFP-12-23).
Cite this article
Wagner, G.P., Kin, K. & Lynch, V.J. A model based criterion for gene expression calls using RNA-seq data. Theory Biosci. 132, 159–164 (2013). https://doi.org/10.1007/s12064-013-0178-3
DOI: https://doi.org/10.1007/s12064-013-0178-3