Advertisement

Inference of Isoforms from Short Sequence Reads

(Extended Abstract)
  • Jianxing Feng
  • Wei Li
  • Tao Jiang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6044)

Abstract

Due to alternative splicing events in eukaryotic species, the identification of mRNA isoforms (or splicing variants) is a difficult problem. Traditional experimental methods for this purpose are time consuming and cost ineffective. The emerging RNA-Seq technology provides a possible effective method to address this problem. Although the advantages of RNA-Seq over traditional methods in transcriptome analysis have been confirmed by many studies, the inference of isoforms from millions of short sequence reads (e.g., Illumina/Solexa reads) has remained computationally challenging. In this work, we propose a method to calculate the expression levels of isoforms and infer isoforms from short RNA-Seq reads using exon-intron boundary, transcription start site (TSS) and poly-A site (PAS) information. We first formulate the relationship among exons, isoforms, and single-end reads as a convex quadratic program, and then use an efficient algorithm (called IsoInfer) to search for isoforms. IsoInfer can calculate the expression levels of isoforms accurately if all the isoforms are known and infer novel isoforms from scratch. Our experimental tests on known mouse isoforms with both simulated expression levels and reads demonstrate that IsoInfer is able to calculate the expression levels of isoforms with an accuracy comparable to the state-of-the-art statistical method and a 60 times faster speed. Moreover, our tests on both simulated and real reads show that it achieves a good precision and sensitivity in inferring isoforms when given accurate exon-intron boundary, TSS and PAS information, especially for isoforms whose expression levels are significantly high.

Keywords

Transcription Start Site Alternative Splice Event Massively Parallel Signature Sequencing Exon Junction Splice Graph 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Boguski, M.S., et al.: Gene discovery in dbEST. Science 265(5181), 1993–(1994)CrossRefGoogle Scholar
  2. 2.
    Boguski, M.S.: The turning point in genome research. Trends in Biochemical Sciences 20(8), 295–296 (1995)CrossRefGoogle Scholar
  3. 3.
    The FANTOM Consortium: The transcriptional landscape of the mammalian genome. Science 309(5740), 1559–1563 (2005)Google Scholar
  4. 4.
    The ENCODE Project Consortium: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447(7146), 799–816 (2007)Google Scholar
  5. 5.
    Weinstock, G.M.: ENCODE: more genomic empowerment. Genome Res. 17(6), 667–668 (2007)CrossRefGoogle Scholar
  6. 6.
    Bertone, P., et al.: Global identification of human transcribed sequences with genome tiling arrays. Science 306(5705), 2242–2246 (2004)CrossRefGoogle Scholar
  7. 7.
    Kwan, T., et al.: Genome-wide analysis of transcript isoform variation in humans. Nat. Genetics (2008)Google Scholar
  8. 8.
    Johnson, J.M., et al.: Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302(5653), 2141–2144 (2003)CrossRefGoogle Scholar
  9. 9.
    Kapranov, P., et al.: RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316(5830), 1484–1488 (2007)CrossRefGoogle Scholar
  10. 10.
    Brenner, S., et al.: Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 18(6), 630–634 (2000)CrossRefGoogle Scholar
  11. 11.
    Reinartz, J., et al.: Massively parallel signature sequencing (MPSS) as a tool for in-depth quantitative gene expression profiling in all organisms. Brief Funct. Genomic Proteomic 1(1), 95–104 (2002)CrossRefGoogle Scholar
  12. 12.
    Velculescu, V.E., et al.: Serial analysis of gene expression. Science 270(5235), 484–487 (1995)CrossRefGoogle Scholar
  13. 13.
    Harbers, M., Carninci, P.: Tag-based approaches for transcriptome research and genome annotation. Nat. Meth. 2(7), 495–502 (2005)CrossRefGoogle Scholar
  14. 14.
    Shiraki, T., et al.: Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proceedings of the National Academy of Sciences of the United States of America 100(26), 15776–15781 (2003)CrossRefGoogle Scholar
  15. 15.
    Kodzius, R., et al.: CAGE: cap analysis of gene expression. Nat. Meth. 3(3), 211–222 (2005)CrossRefGoogle Scholar
  16. 16.
    Kim, J.B., et al.: Polony multiplex analysis of gene expression (PMAGE) in mouse hypertrophic cardiomyopathy. Science 316(5830), 1481–1484 (2007)CrossRefGoogle Scholar
  17. 17.
    Ng, P., et al.: Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat. Methods 2, 105–111 (2005)CrossRefGoogle Scholar
  18. 18.
    Nagalakshmi, U., et al.: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320(5881), 1344–1349 (2008)CrossRefGoogle Scholar
  19. 19.
    Trapnell, C., et al.: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9), 1105–1111 (2009)CrossRefGoogle Scholar
  20. 20.
    Graveley, B.R.: Molecular biology: power sequencing. Nature 453(7199), 1197–1198 (2008)CrossRefGoogle Scholar
  21. 21.
    Yassour, M., et al.: Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing. Proceedings of the National Academy of Sciences 106(9), 3264–3269 (2009)CrossRefGoogle Scholar
  22. 22.
    Wilhelm, B.T., et al.: Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature 453(7199), 1239–1243 (2008)CrossRefGoogle Scholar
  23. 23.
    Cloonan, N., et al.: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Methods (2008)Google Scholar
  24. 24.
    Mortazavi, A., et al.: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5(7), 621–628 (2008)CrossRefGoogle Scholar
  25. 25.
    Marioni, J., et al.: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18(9), 1509–1517 (2008)CrossRefGoogle Scholar
  26. 26.
    Sultan, M., et al.: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321(5891), 956–960 (2008)CrossRefGoogle Scholar
  27. 27.
    Wang, Z., et al.: RNA-Seq: a revolutionary tool for transcriptomics. Genetics Nature reviews (2008)Google Scholar
  28. 28.
    Lacroix, V., et al.: Exact transcriptome reconstruction from short sequence reads. In: Crandall, K.A., Lagergren, J. (eds.) WABI 2008. LNCS (LNBI), vol. 5251, pp. 50–63. Springer, Heidelberg (2008)CrossRefGoogle Scholar
  29. 29.
    Jiang, H., Wong, W.H.: Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 25(8), 1026–1032 (2009)CrossRefGoogle Scholar
  30. 30.
    Pagani, F., Baralle, F.E.: Genomic variants in exons and introns: identifying the splicing spoilers. Nat. Rev. Genet. 5(5), 389–396 (2004)CrossRefGoogle Scholar
  31. 31.
    Srebrow, A., Kornblihtt, A.R.: The connection between splicing and cancer. J. Cell Sci. 119(13), 2635–2641 (2006)CrossRefGoogle Scholar
  32. 32.
    Williams, W.V.: Editorial hot topic: Transcriptome analysis in drug development (executive editor: williams, W.v.). Current Molecular Medicine 5(2), 1–2 (2005)CrossRefGoogle Scholar
  33. 33.
    Heber, S., et al.: Splicing graphs and EST assembly problem. Bioinformatics 18(suppl.1), S181–S188 (2002)Google Scholar
  34. 34.
    Sammeth, M., Valiente, G., Guigó, R.: Bubbles: Alternative splicing events of arbitrary dimension in splicing graphs. In: Vingron, M., Wong, L. (eds.) RECOMB 2008. LNCS (LNBI), vol. 4955, pp. 372–395. Springer, Heidelberg (2008)CrossRefGoogle Scholar
  35. 35.
    Xing, Y., et al.: The multiassembly problem: reconstructing multiple transcript isoforms from EST fragment mixtures. Genome Res. 14(3), 426–441 (2004)CrossRefGoogle Scholar
  36. 36.
    Bonizzoni, P., et al.: Detecting alternative gene structures from spliced ESTs: a computational approach. Journal of Computational Biology 16(1), 43–66 (2009)CrossRefMathSciNetGoogle Scholar
  37. 37.
    Djebali, S., et al.: Efficient targeted transcript discovery via array-based normalization of RACE libraries. Nat. Meth. 5(7), 629–635 (2008)CrossRefGoogle Scholar
  38. 38.
    Salehi-Ashtiani, K., Yang, X., Derti, A., Tian, W., Hao, T., Lin, C., Makowski, K., Shen, L., Murray, R.R., Szeto, D., Tusneem, N., Smith, D.R., Cusick, M.E., Hill, D.E., Roth, F.P., Vidal, M.: Isoform discovery by targeted cloning, ’deep-well’ pooling and parallel sequencing. Nat. Meth. 5(7), 597–600 (2008)CrossRefGoogle Scholar
  39. 39.
    Fullwood, M.J., et al.: Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res. 19(4), 521–532 (2009)CrossRefGoogle Scholar
  40. 40.
    Pan, Q., et al.: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40(12), 1413–1415 (2008)CrossRefGoogle Scholar
  41. 41.
    Wang, E.T., et al.: Alternative isoform regulation in human tissue transcriptomes. Nature 456(7221), 470–476 (2008)CrossRefGoogle Scholar
  42. 42.
    Feng, J., et al.: Inference of isoforms from short sequence reads. Manuscript (Janaury 2010), http://www.cs.ucr.edu/~jianxing/IsoInfer-recomb10-full.pdf
  43. 43.
    Breitbart, R.E., et al.: Alternative splicing: a ubiquitous mechanism for the generation of multiple protein isoforms from single genes. Annual Review of Biochemistry 56(1), 467–495 (1987)CrossRefMathSciNetGoogle Scholar
  44. 44.
    Sammeth, M., et al.: A general definition and nomenclature for alternative splicing events. PLoS Comput. Biol. 4(8), e1000147 (2008)Google Scholar
  45. 45.
    Langmead, B., et al.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10(3), R25 (2009)Google Scholar
  46. 46.
    Li, H., et al.: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18(11), 1851–1858 (2008)CrossRefGoogle Scholar
  47. 47.
    Li, R., et al.: SOAP: short oligonucleotide alignment program. Bioinformatics 24(5), 713–714 (2008)CrossRefGoogle Scholar
  48. 48.
    Cloonan, N., et al.: RNA-MATE: a recursive mapping strategy for high-throughput RNA-sequencing data. Bioinformatics, btp459 (2009)Google Scholar
  49. 49.
    Alkan, C., Kidd, J.M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., Kitzman, J.O., Baker, C., Malig, M., Mutlu, O., Sahinalp, S.C., Gibbs, R.A., Eichler, E.E.: Personalized copy number and segmental duplication maps using next-generation sequencing. Nat. Genet. 41(10), 1061–1067 (2009)CrossRefGoogle Scholar
  50. 50.
    Hashimoto, T., et al.: Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite. Bioinformatics, btp438 (2009)Google Scholar
  51. 51.
    Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2007)Google Scholar
  52. 52.
    Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Math. Program 27, 1–33 (1983)zbMATHCrossRefMathSciNetGoogle Scholar
  53. 53.
    Korbel, J., et al.: PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biology 10(2), R23 (2009)Google Scholar
  54. 54.
    Karolchik, D., et al.: The UCSC genome browser database: 2008 update. Nucl. Acids Res. 36(Database issue), D773–D779 (2008)Google Scholar
  55. 55.
    Alter, M.D., et al.: Variation in the large-scale organization of gene expression levels in the hippocampus relates to stable epigenetic variability in behavior. PLoS ONE 3(10), e3344 (2008)Google Scholar
  56. 56.
    Konishi, T.: Three-parameter lognormal distribution ubiquitously found in cdna microarray data and its application to parametric data treatment. BMC Bioinformatics 5(1), 5 (2004)CrossRefGoogle Scholar
  57. 57.
    Wijaya, E., et al.: Modeling the marginal distribution of gene expression with mixture models. In: FGCN 2008: Proceedings of the 2008 Second International Conference on Future Generation Communication and Networking, pp. 84–89. IEEE Computer Society, Washington (2008)CrossRefGoogle Scholar
  58. 58.
    Richter, D.C., et al.: MetaSima sequencing simulator for genomics and metagenomics. PLoS ONE 3(10), e3373 (2008)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Jianxing Feng
    • 1
  • Wei Li
    • 2
  • Tao Jiang
    • 2
    • 3
  1. 1.State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer ScienceTsinghua Univ.BeijingChina
  2. 2.Department of Computer ScienceUniv. of CaliforniaRiverside
  3. 3.Tsinghua Univ.BeijingChina

Personalised recommendations