PBSeq: Modeling base-level bias to estimate gene and isoform expression for RNA-seq data

Original Article

Abstract

Owing to its unprecedented throughput and resolution, RNA-seq has rapidly become a revolutionary and powerful technology for transcriptome analysis. However, RNA-seq library preparation results in a non-uniform distribution of reads across the represented genes. When estimating gene and isoform expression levels, this non-uniformity needs to be accounted for and corrected to improve estimation accuracy. In this paper, we propose PBSeq, a Poisson model that uses a base-level bias correction strategy to estimate gene and isoform expression. The base-level bias correction strategy simultaneously considers the positional and sequence-specific biases at the starting positions of reads mapped to the genes of interest. PBSeq not only provides expression values but also estimates the uncertainty associated with each expression estimate, which represents the variation across replicates and is useful for downstream analysis. We use a simulated dataset and three real RNA-seq datasets to validate the PBSeq model. Results show that PBSeq estimates gene and isoform expression levels accurately and is computationally efficient compared with other state-of-the-art methods.
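
The abstract does not state the model's equations, so the following is only a rough illustration of the general idea of a bias-weighted Poisson model, not the PBSeq implementation. It assumes that the number of reads starting at base i follows a Poisson distribution with mean theta * w_i, where theta is the expression level and w_i is a hypothetical, pre-computed bias weight for that base. The function name, the weights, and the counts are all invented for illustration, and the standard error shown comes from the Fisher information of this toy model, which is not the replicate-based uncertainty described above.

```python
import math


def estimate_expression(read_start_counts, bias_weights):
    """Toy maximum-likelihood estimate under counts[i] ~ Poisson(theta * w[i]).

    Both arguments are hypothetical inputs: per-base read-start counts and
    per-base bias weights assumed to have been estimated elsewhere.
    Returns (theta_hat, approximate standard error).
    """
    if len(read_start_counts) != len(bias_weights):
        raise ValueError("counts and weights must have the same length")
    total_weight = sum(bias_weights)
    if total_weight <= 0:
        raise ValueError("bias weights must sum to a positive value")

    # Setting the derivative of the Poisson log-likelihood to zero gives
    # theta_hat = sum(counts) / sum(weights).
    theta_hat = sum(read_start_counts) / total_weight

    # The expected Fisher information is sum(weights) / theta, so the
    # asymptotic variance of theta_hat is theta / sum(weights).
    std_err = math.sqrt(theta_hat / total_weight)
    return theta_hat, std_err


# Made-up example: four bases with uneven bias weights.
counts = [4, 0, 6, 10]
weights = [1.0, 0.5, 1.5, 2.0]
theta_hat, std_err = estimate_expression(counts, weights)
print(theta_hat, std_err)
```

With uniform weights this reduces to the plain average number of read starts per base; non-uniform weights down-weight bases where the bias alone would be expected to produce more reads.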

Keywords

RNA-seq; Base-level bias; Gene and isoform expression level; Expression of uncertainty

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
