Deep Sequencing of a Genetically Heterogeneous Sample: Local Haplotype Reconstruction and Read Error Correction
We present a computational method for analyzing deep sequencing data obtained from a genetically diverse sample. The set of reads obtained from a deep sequencing experiment represents a statistical sample of the underlying population. We develop a generative probabilistic model for assigning observed reads to unobserved haplotypes in the presence of sequencing errors. This clustering problem is solved in a Bayesian fashion using the Dirichlet process mixture to define a prior distribution on the unknown number of haplotypes in the mixture. We devise a Gibbs sampler for sampling from the joint posterior distribution of haplotype sequences, assignment of reads to haplotypes, and error rate of the sequencing process to obtain estimates of the local haplotype structure of the population. The method is evaluated on simulated data and on experimental deep sequencing data obtained from HIV samples.
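To make the sampling scheme concrete, the following is a minimal sketch of a Gibbs sampler for a Dirichlet process mixture of haplotypes. It is not the authors' implementation. It rests on simplifying assumptions that the abstract does not state: all reads are aligned and cover the same window, sequencing errors are uniform substitutions among the four bases, the base measure over haplotypes is uniform, and the error rate has a uniform Beta(1, 1) prior. The names `gibbs`, `sample_haplotype`, `alpha`, and `eps` are hypothetical.

```python
import math
import random

BASES = "ACGT"

def log_lik(read, hap, eps):
    """Log-probability of a read given a haplotype under uniform substitution errors."""
    mism = sum(r != h for r, h in zip(read, hap))
    return mism * math.log(eps / 3) + (len(read) - mism) * math.log(1 - eps)

def sample_index(log_weights, rng):
    """Draw an index with probability proportional to exp(log_weights)."""
    top = max(log_weights)
    weights = [math.exp(w - top) for w in log_weights]
    return rng.choices(range(len(weights)), weights=weights)[0]

def sample_haplotype(members, eps, rng):
    """Sample each haplotype base from its posterior given the assigned reads.

    Assumes a uniform prior over bases, so the prior term cancels.
    """
    hap = []
    for j in range(len(members[0])):
        logw = [
            sum(math.log(1 - eps) if r[j] == b else math.log(eps / 3) for r in members)
            for b in BASES
        ]
        hap.append(BASES[sample_index(logw, rng)])
    return "".join(hap)

def gibbs(reads, alpha=1.0, eps=0.05, iters=50, seed=0):
    """Return (assignments, haplotypes, error rate) after `iters` Gibbs sweeps."""
    rng = random.Random(seed)
    length = len(reads[0])
    assign = [0] * len(reads)          # start with all reads in one cluster
    haps = {0: sample_haplotype(reads, eps, rng)}
    next_id = 1
    for _ in range(iters):
        # Step 1: reassign each read given the current haplotypes.
        for i, read in enumerate(reads):
            old, assign[i] = assign[i], None
            counts = {}
            for k in assign:
                if k is not None:
                    counts[k] = counts.get(k, 0) + 1
            if old not in counts:      # the read was alone: drop the empty cluster
                del haps[old]
            keys = list(counts)
            logw = [math.log(counts[k]) + log_lik(read, haps[k], eps) for k in keys]
            # New cluster: under a uniform base measure the marginal
            # likelihood of any read is (1/4)^length.
            logw.append(math.log(alpha) + length * math.log(0.25))
            choice = sample_index(logw, rng)
            if choice == len(keys):
                haps[next_id] = sample_haplotype([read], eps, rng)
                assign[i] = next_id
                next_id += 1
            else:
                assign[i] = keys[choice]
        # Step 2: resample each haplotype given its assigned reads.
        for k in list(haps):
            members = [r for r, a in zip(reads, assign) if a == k]
            haps[k] = sample_haplotype(members, eps, rng)
        # Step 3: resample the error rate from its Beta posterior
        # (assumed Beta(1, 1) prior; mismatches play the role of successes).
        mism = sum(rc != hc for r, a in zip(reads, assign) for rc, hc in zip(r, haps[a]))
        total = len(reads) * length
        eps = rng.betavariate(1 + mism, 1 + total - mism)
        eps = min(max(eps, 1e-6), 0.5)  # numerical guard, not part of the model
    return assign, haps, eps
```

Calling `gibbs(reads)` on a list of equal-length read strings returns one posterior sample of the cluster assignments, the haplotype sequences, and the error rate. In practice one would discard burn-in sweeps and summarize many samples rather than rely on the final state.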