Comparing Segmentation Methods for Genome Annotation Based on RNA-Seq Data

  • Alice Cleynen
  • Sandrine Dudoit
  • Stéphane Robin
Abstract

Transcriptome sequencing (RNA-Seq) yields massive data sets, containing a wealth of information on the expression of a genome. While numerous methods have been developed for the analysis of differential gene expression, little has been attempted for the localization of transcribed regions, that is, segments of DNA that are transcribed and processed to result in a mature messenger RNA. Our understanding of genomes, mostly annotated from biological experiments or computational gene prediction methods, could benefit greatly from re-annotation using the high precision of RNA-Seq.

We consider five classes of genome segmentation methods to delineate transcribed regions, including intron/exon boundaries, based on RNA-Seq data. The methods provide different functionality and include both exact and heuristic approaches, using diverse models, such as hidden Markov or Bayesian models, and diverse algorithms, such as dynamic programming or the forward-backward algorithm. We evaluate the methods in a simulation study where RNA-Seq read counts are generated from parametric models as well as by resampling of actual yeast RNA-Seq data. The methods are compared in terms of criteria that include global and local fit to a reference segmentation, receiver operating characteristic (ROC) curves, and coverage of credibility intervals based on posterior change-point distributions. All compared algorithms are implemented in packages available on the Comprehensive R Archive Network (CRAN, http://cran.r-project.org). The data set used in the simulation study is publicly available from the Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/sra).

While the different methods each have pros and cons, our results suggest that the EBS Bayesian approach of Rigaill, Lebarbier, and Robin (2012) performs well in a re-annotation context, as illustrated in the simulation study and in the application to actual yeast RNA-Seq data.
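
As a rough illustration of what exact segmentation by dynamic programming looks like on count data, the following minimal sketch finds the best split of a read-count profile into a fixed number of segments. It is not one of the implementations compared in the article: it assumes a Poisson model with one rate per segment (rather than a negative binomial model), a known number of segments, and a plain quadratic-time recursion. The function names `poisson_cost` and `segment_counts` are hypothetical.

```python
import math


def poisson_cost(cum, i, j):
    """Poisson negative log-likelihood of y[i:j] at its fitted rate.

    Constant terms (the sum of log y!) are dropped because they do not
    depend on the segmentation. `cum` holds cumulative sums of y.
    """
    s = cum[j] - cum[i]
    n = j - i
    if s == 0:
        return 0.0
    # n * lam - s * log(lam), evaluated at lam = s / n
    return s - s * math.log(s / n)


def segment_counts(y, k):
    """Best split of y into k contiguous segments (assumed Poisson model).

    Returns (change_points, cost), where each change point is the index
    at which a new segment starts.
    """
    n = len(y)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(y)")
    cum = [0]
    for v in y:
        cum.append(cum[-1] + v)

    # best[m][t]: minimal cost of splitting y[:t] into m segments
    best = [[math.inf] * (n + 1) for _ in range(k + 1)]
    arg = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for t in range(m, n + 1):
            for s in range(m - 1, t):
                c = best[m - 1][s] + poisson_cost(cum, s, t)
                if c < best[m][t]:
                    best[m][t] = c
                    arg[m][t] = s

    # Backtrack from the last segment to recover the segment starts.
    change_points = []
    t = n
    for m in range(k, 1, -1):
        t = arg[m][t]
        change_points.append(t)
    return sorted(change_points), best[k][n]


# Toy profile: low coverage, a highly covered region, low coverage again.
counts = [2, 3, 2, 3, 40, 42, 39, 41, 1, 0, 2, 1]
change_points, cost = segment_counts(counts, 3)
print(change_points)
```

On this toy profile the recovered change points bracket the highly covered region. In practice the number of segments is unknown and must be chosen by a model selection criterion, which this sketch does not attempt.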

This article has supplementary material online.

Key Words

Change-point detection; Confidence intervals; Count data; Genome annotation; Negative binomial distribution; RNA-Seq; Segmentation

Supplementary material

13253_2013_159_MOESM1_ESM.pdf (PDF, 874 kB)

References

  1. Arlot, S., and Celisse, A. (2010), “Segmentation of the Mean of Heteroscedastic Data via Cross-Validation,” Statistics and Computing, 1–20.
  2. Bai, J., and Perron, P. (2003), “Computation and Analysis of Multiple Structural Change Models,” Journal of Applied Econometrics, 18, 1–22.
  3. Barry, D., and Hartigan, J. (1993), “A Bayesian Analysis for Change Point Problems,” Journal of the American Statistical Association, 88 (421), 309–319.
  4. Boeva, V., Zinovyev, A., Bleakley, K., Vert, J.-P., Janoueix-Lerosey, I., Delattre, O., and Barillot, E. (2011), “Control-Free Calling of Copy Number Alterations in Deep-Sequencing Data Using GC-Content Normalization,” Bioinformatics, 27, 268–269.
  5. Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Belmont: Wadsworth and Brooks.
  6. Cleynen, A., Koskas, M., and Rigaill, G. (under review), “A Generic Implementation of the Pruned Dynamic Programing Algorithm,” arXiv:1204.5564.
  7. Cleynen, A., and Lebarbier, E. (under review), “Segmentation of the Poisson and Negative Binomial Rate Models: A Penalized Estimator,” arXiv:1301.2534.
  8. Guthery, S. B. (1974), “Partition Regression,” Journal of the American Statistical Association, 69 (348), 945–947.
  9. Hsu, L., Self, S., Grove, D., Randolph, T., Wang, K., Delrow, J., Loo, L., and Porter, P. (2005), “Denoising Array-Based Comparative Genomic Hybridization Data Using Wavelets,” Biostatistics, 6, 211–226.
  10. Hupé, P., Stransky, N., Thiery, J., Radvanyi, F., and Barillot, E. (2004), “Analysis of Array CGH Data: From Signal Ratio to Gain and Loss of DNA Regions,” Bioinformatics, 20 (18), 3413–3422.
  11. Johnson, N., Kemp, A., and Kotz, S. (2005), Univariate Discrete Distributions, New York: Wiley.
  12. Killick, R., and Eckley, I. (2011), changepoint: An R Package for Changepoint Analysis.
  13. Lai, W. R., Johnson, M. D., Kucherlapati, R., and Park, P. J. (2005), “Comparative Analysis of Algorithms for Identifying Amplifications and Deletions in Array CGH Data,” Bioinformatics, 21 (19), 3763–3770.
  14. Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. (2008), “Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome,” Genome Biology, 10.
  15. Luong, T. M., Rozenholc, Y., and Nuel, G. (2013), “Fast Estimation of Posterior Probabilities in Change-Point Models Through a Constrained Hidden Markov Model,” Computational Statistics & Data Analysis. arXiv:1203.4394.
  16. Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M., and Snyder, M. (2008), “The Transcriptional Landscape of the Yeast Genome Defined by RNA Sequencing,” Science, 320 (5881), 1344–1349.
  17. Rigaill, G., Lebarbier, E., and Robin, S. (2012), “Exact Posterior Distributions and Model Selection Criteria for Multiple Change-Point Detection Problems,” Statistics and Computing, 22, 917–929.
  18. Risso, D., Schwartz, K., Sherlock, G., and Dudoit, S. (2011), “GC-Content Normalization for RNA-Seq Data,” BMC Bioinformatics, 12 (1), 480.
  19. Robinson, M. D., McCarthy, D. J., and Smyth, G. K. (2010), “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data,” Bioinformatics, 26 (1), 139–140.
  20. Scott, A., and Knott, M. (1974), “A Cluster Analysis Method for Grouping Means in the Analysis of Variance,” Biometrics, 30, 507–512.

Copyright information

© International Biometric Society 2013

Authors and Affiliations

  • Alice Cleynen (1, 2, 3)
  • Sandrine Dudoit (3)
  • Stéphane Robin (1, 2)

  1. AgroParisTech, UMR 518, Paris Cedex 05, France
  2. INRA, UMR 518, Paris Cedex 05, France
  3. Division of Biostatistics and Department of Statistics, University of California, Berkeley, Berkeley, USA