Simultaneous Isoform Discovery and Quantification from RNA-Seq
RNA sequencing is a recent technology that has seen an explosion of methods addressing all levels of analysis, from read mapping to transcript assembly to differential expression modeling. In particular, the discovery of isoforms at the transcript assembly stage is a complex problem, and current approaches suffer from various limitations. For instance, many approaches use graphs to construct a minimal set of isoforms that covers the observed reads, then run a separate algorithm to quantify those isoforms, which can result in a loss of power. Current methods also use ad hoc solutions to deal with the vast number of possible isoforms that can be constructed from a given set of reads. Finally, while the need to take into account features such as read pairing and the sampling rate of reads has been acknowledged, most existing methods do not seamlessly integrate these features into the model. We present Montebello, an integrated statistical approach that performs simultaneous isoform discovery and quantification by using a Monte Carlo simulation to find the most likely isoform composition leading to a set of observed reads. We compare Montebello to Cufflinks, a popular isoform discovery approach, on a simulated data set and on 46.3 million brain reads from an Illumina tissue panel. On this data set, Montebello appears to offer a modest improvement over Cufflinks when considering discovery and parsimony metrics. In addition, Montebello mitigates specific difficulties inherent in the Cufflinks approach. Finally, Montebello can be fine-tuned depending on the type of solution desired.
Keywords: Alternative splicing, RNA-seq, Isoform discovery, Algorithms, Monte Carlo
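To make the idea of simultaneous discovery and quantification concrete, the following is a minimal toy sketch and not Montebello itself, whose likelihood and proposal scheme are not specified in the abstract. It assumes a deliberately simplified model: exon-level read counts are Poisson with mean equal to the summed abundance of the isoforms containing that exon, abundances are fitted by standard multiplicative EM updates, and a Metropolis-style random walk over isoform subsets maximizes a log-likelihood with a per-isoform parsimony penalty. The names `CANDIDATES`, `COUNTS`, `fit_abundances`, `score`, `search`, and the `penalty` and `temperature` parameters are all hypothetical, and read pairing and sampling rate are ignored.

```python
import math
import random

# Hypothetical toy data: 4 exons, 4 candidate isoforms (1 = exon included).
CANDIDATES = [
    (1, 1, 1, 1),  # isoform 0: all exons
    (1, 0, 1, 1),  # isoform 1: skips exon 2
    (1, 1, 0, 1),  # isoform 2: skips exon 3
    (1, 0, 0, 1),  # isoform 3: skips exons 2 and 3
]
# Noise-free exon counts generated from isoform 0 at 300 and isoform 1 at 100.
COUNTS = [400, 300, 400, 400]


def expected_counts(isoforms, abundances, n_exons):
    """Expected reads per exon: summed abundance of isoforms containing it."""
    return [sum(a * iso[e] for a, iso in zip(abundances, isoforms))
            for e in range(n_exons)]


def fit_abundances(isoforms, counts, n_iter=200):
    """Quantification step: multiplicative EM updates for the Poisson model."""
    a = [1.0] * len(isoforms)
    for _ in range(n_iter):
        lam = expected_counts(isoforms, a, len(counts))
        ratio = [c / l if l > 0 else 0.0 for c, l in zip(counts, lam)]
        a = [a_i * sum(r * m for r, m in zip(ratio, iso)) / sum(iso)
             for a_i, iso in zip(a, isoforms)]
    return a


def score(subset, candidates, counts, penalty):
    """Fitted Poisson log-likelihood (up to a constant) minus a parsimony penalty."""
    isoforms = [candidates[i] for i in sorted(subset)]
    # A subset that leaves an exon with reads uncovered cannot explain the data.
    for e, c in enumerate(counts):
        if c > 0 and not any(iso[e] for iso in isoforms):
            return -math.inf
    lam = expected_counts(isoforms, fit_abundances(isoforms, counts), len(counts))
    loglik = 0.0
    for c, l in zip(counts, lam):
        if c > 0:
            if l <= 0:
                return -math.inf
            loglik += c * math.log(l)
        loglik -= l
    return loglik - penalty * len(subset)


def search(candidates, counts, penalty=2.0, n_steps=2000, temperature=1.0, seed=0):
    """Discovery step: Metropolis-style walk over isoform subsets."""
    rng = random.Random(seed)
    cache = {}

    def cached_score(subset):
        if subset not in cache:
            cache[subset] = score(subset, candidates, counts, penalty)
        return cache[subset]

    current = frozenset(range(len(candidates)))  # start from all candidates
    best = current
    for _ in range(n_steps):
        # Symmetric proposal: add or remove one randomly chosen isoform.
        proposal = current ^ {rng.randrange(len(candidates))}
        delta = cached_score(proposal) - cached_score(current)
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = proposal
            if cached_score(current) > cached_score(best):
                best = current
    return best
```

Calling `search(CANDIDATES, COUNTS)` returns the highest-scoring subset visited; on this noise-free toy input that is the pair of isoforms used to generate the counts, and `fit_abundances` on that pair recovers abundances close to 300 and 100. Raising the hypothetical `penalty` favors more parsimonious solutions, which is one simple way a method of this kind could be tuned toward the type of solution desired.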
We thank Hui Jiang and Nicholas Johnson for useful discussions. D.H. developed and tested the model. W.H.W. initiated and supervised the project. D.H. drafted and W.H.W. revised the paper. D.H. was funded by a Ric Weiland Graduate Fellowship (Stanford University) and by NIH grants R01 HG004634 and R01 HG005220. W.H.W. was supported by NIH grants R01 HG004634 and R01 HG005717.