A model based criterion for gene expression calls using RNA-seq data

  • SHORT COMMUNICATION
Theory in Biosciences

Abstract

The power of deep sequencing technology to reliably detect single RNA reads leads to a paradoxical problem of high sensitivity. In hybridization- or PCR-based methods for RNA quantification, the concern is low sensitivity, i.e., the signal from truly expressed genes might not be distinguishable from noise. In contrast, the problem with RNA-seq is that it is not clear whether very low read counts come from genes expressed at a low level or merely reflect transcriptional noise. The frequency distribution of read counts does not show a clear separation into two classes of genes, which makes the decision whether a gene is to be considered expressed or not seemingly arbitrary. Here we address this problem by suggesting a statistical model that considers the number of transcripts detected in an RNA-seq study as a mixture of two distributions: an exponential distribution for transcripts from inactive genes and a negative binomial distribution for transcripts from actively transcribed genes. We apply this model to a number of RNA-seq data sets and find that the model fits the data very well. The calculated criterion for distinguishing between expressed and non-expressed genes is remarkably consistent among data sets, suggesting that genes with more than two transcripts per million transcripts (TPM) are highly likely to be actively transcribed. This criterion is consistent with the criterion of 1 RPKM proposed by Hebenstreit et al. (2011) based on chromatin modification and per-cell RNA expression data. Hence, the regression model correctly identifies the class of genes that are not actively expressed and thus provides an operational criterion for classifying genes into expressed and non-expressed sets, facilitating the interpretation of RNA-seq data.
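The two-component mixture described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' fitting procedure: the function names, the parameter values, and the 0.95 posterior threshold are all hypothetical. The sketch also assumes that expression values are binned to integers, so that the exponential component can be compared with the negative binomial component on the same support. Given assumed mixture parameters, it computes the probability that a gene at expression level k belongs to the actively transcribed component and reports the smallest k at which that probability reaches the threshold.

```python
import math


def discretized_exponential_pmf(k: int, rate: float) -> float:
    """P(K = k) when an Exponential(rate) variable is binned to integers.

    Assumption of this sketch: the inactive-gene component is discretized
    so it shares the integer support of the negative binomial component.
    """
    return math.exp(-rate * k) - math.exp(-rate * (k + 1))


def negative_binomial_pmf(k: int, size: float, mean: float) -> float:
    """Negative binomial pmf in the (size, mean) parameterization."""
    p = size / (size + mean)
    log_pmf = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def posterior_active(k: int, weight_active: float, rate: float,
                     size: float, mean: float) -> float:
    """Probability that a gene observed at level k is actively transcribed."""
    active = weight_active * negative_binomial_pmf(k, size, mean)
    inactive = (1.0 - weight_active) * discretized_exponential_pmf(k, rate)
    return active / (active + inactive)


def expression_cutoff(weight_active: float, rate: float, size: float,
                      mean: float, min_posterior: float = 0.95,
                      max_k: int = 1000):
    """Smallest level k whose posterior reaches min_posterior, else None."""
    for k in range(max_k + 1):
        if posterior_active(k, weight_active, rate, size, mean) >= min_posterior:
            return k
    return None


# Hypothetical parameters, chosen only for illustration (not from the paper).
PARAMS = dict(weight_active=0.6, rate=1.0, size=0.5, mean=30.0)

if __name__ == "__main__":
    for level in range(0, 8):
        print(level, round(posterior_active(level, **PARAMS), 3))
    print("cutoff:", expression_cutoff(**PARAMS))
```

With fitted rather than invented parameters, the same posterior calculation is one way to turn a mixture fit into an operational cutoff of the kind the abstract reports. The cutoff depends on the chosen posterior threshold, which here is an arbitrary choice.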

References

  • Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11:R106

  • Bishop OJ, Morton JG, Rosbash M, Richardson M (1974) Three abundance classes in HeLa cell messenger RNA. Nature 250:199–204

  • Hebenstreit D, Fang M, Gu M, Charoensawan V, van Oudenaarden A, Teichmann SA (2011) RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol Syst Biol 7:497

  • Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26:493–500

  • Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621–628

  • Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M (2008) The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320:1344–1349

  • R Development Core Team (2009) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna

  • Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11:R25

  • Struhl K (2007) Transcriptional noise and the fidelity of initiation by RNA polymerase II. Nat Struct Mol Biol 14:103–105

  • Visel A et al (2009) ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 457:854–858

  • Wagner GP, Kin K, Lynch VJ (2012) Measurement of mRNA abundance using RNA-seq data: RPKM is inconsistent among samples. Theory Biosci. doi:10.1007/s12064-012-0162-3

  • Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63

Acknowledgments

The authors thank Nicholas Carriero for help in mapping the RNA-seq data to reference genomes and for providing a list of feature lengths for the human genome. The research for this paper was supported by the John Templeton Foundation grant #12793 (The opinions expressed in this paper are not those of the JTF) and the Foundational Questions in Evolutionary Biology grant (FQEB # RFP-12-23).

Author information

Corresponding author

Correspondence to Günter P. Wagner.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 50 kb)

Supplementary material 2 (TIFF 59 kb)

About this article

Cite this article

Wagner, G.P., Kin, K. & Lynch, V.J. A model based criterion for gene expression calls using RNA-seq data. Theory Biosci. 132, 159–164 (2013). https://doi.org/10.1007/s12064-013-0178-3
