IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly

(Extended Abstract)
  • Wei Li
  • Jianxing Feng
  • Tao Jiang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6577)

Abstract

Second-generation sequencing technology has revolutionized many biology-related research fields and poses various computational biology challenges. One of them is transcriptome assembly based on RNA-Seq data, which aims to reconstruct all full-length mRNA transcripts simultaneously from millions of short reads. In this paper, we consider three objectives in transcriptome assembly: the maximization of prediction accuracy, the minimization of interpretation, and the maximization of completeness. The first objective, the maximization of prediction accuracy, requires that the expression levels estimated from the assembled transcripts be as close as possible to the observed ones for every expressed region of the genome. The second objective, the minimization of interpretation, follows the parsimony principle and seeks as few transcripts in the prediction as possible. The third objective, the maximization of completeness, requires that the maximum number of mapped reads (or “expressed segments” in gene models) be explained by (i.e., contained in) the predicted transcripts in the solution. Based on these three objectives, we present IsoLasso, a new RNA-Seq based transcriptome assembly tool. IsoLasso is based on the well-known LASSO algorithm, a multivariate regression method designed to balance the maximization of prediction accuracy against the minimization of interpretation. By including some additional constraints in the quadratic program involved in LASSO, IsoLasso is able to make the set of assembled transcripts as complete as possible. Experiments on simulated and real RNA-Seq datasets show that IsoLasso simultaneously achieves higher sensitivity and precision than state-of-the-art transcript assembly tools.
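
The trade-off between prediction accuracy and parsimony described above can be illustrated with a small non-negative LASSO fit. The following is a minimal sketch, not the IsoLasso implementation: it assumes a linear model in which each observed segment expression is the sum of the expression levels of the candidate isoforms containing that segment, and it solves the penalized problem by plain coordinate descent rather than by the quadratic program mentioned in the abstract. The names `nonneg_lasso` and `uncovered_segments`, the toy matrix `X`, and the penalty value `lam` are hypothetical choices made for illustration. The completeness objective is only checked after the fit here, not enforced as a constraint.

```python
def nonneg_lasso(X, y, lam, sweeps=5000, tol=1e-12):
    """Minimize 0.5 * ||y - X b||^2 + lam * sum(b) subject to b >= 0
    by cyclic coordinate descent (an illustrative solver choice)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0:
                continue  # isoform covers no segment; leave it at zero
            # Correlation of column j with the residual that excludes isoform j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = max(0.0, (rho - lam) / col_sq[j])
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def uncovered_segments(X, y, beta, eps=1e-6):
    """Return indices of expressed segments not contained in any selected isoform."""
    return [
        i for i in range(len(X))
        if y[i] > 0 and not any(X[i][j] and beta[j] > eps for j in range(len(beta)))
    ]


# Toy gene with 3 segments (rows) and 3 candidate isoforms (columns).
# Isoform A contains all segments, B skips segment 2, C skips segment 3.
X = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
# Observed segment expression generated from A at level 10 and B at level 5.
y = [15.0, 10.0, 15.0]

beta = nonneg_lasso(X, y, lam=0.1)
missing = uncovered_segments(X, y, beta)
print("isoform expression estimates:", [round(b, 3) for b in beta])
print("expressed segments left unexplained:", missing)
```

In this toy case the penalty drives the spurious third candidate to zero while keeping the two isoforms that generated the data, and every expressed segment remains covered. A larger `lam` favors fewer transcripts at the cost of a worse fit, which is the balance the abstract attributes to LASSO.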

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Wei Li (1)
  • Jianxing Feng (2)
  • Tao Jiang (1, 3)
  1. Department of Computer Science and Engineering, University of California, Riverside, USA
  2. College of Life Science and Biotechnology, Tongji University, Shanghai, China
  3. School of Information Science and Technology, Tsinghua University, Beijing, China
