Accuracy Assessment of Consensus Sequence from Shotgun Sequencing

Handbook of Statistical Bioinformatics

Part of the book series: Springer Handbooks of Computational Statistics (SHCS)

Abstract

The significance of any genetic or biological implication based on DNA sequencing depends on the accuracy of the sequence. The statistical evaluation of accuracy requires a probabilistic model of measurement error. In this chapter, we describe two statistical models of sequence assembly from shotgun sequencing, for haploid and diploid target genomes respectively. The first model allows us to convert quality scores into probabilities. It combines base-calling quality scores with the power of alignment to improve sequencing accuracy. Specifically, we start with assembled contigs and represent probabilistic errors by logistic models that take quality scores and other genomic features as covariates. Since the true sequence is unknown, an EM algorithm is used to deal with the missing data. The second model describes the case in which the DNA reads come from a diploid genome, and our aim is to reconstruct the two haplotypes, including phase information. The statistical model comprises sequencing errors, compositional information, and the haplotype membership of each DNA fragment. Consequently, optimal haplotype sequences can be inferred by maximizing the probability among all configurations conditional on the given assembly. At the same time, this probability, together with the coverage information, provides an assessment of the confidence in the reconstruction.
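To make the idea of turning quality scores into probabilities concrete, the following is a minimal sketch, not the chapter's method. It assumes the standard Phred convention (error probability 10^(-Q/10)) in place of the fitted logistic models, a uniform prior over the four bases, independent reads, and errors spread evenly over the three alternative bases. The function names `phred_to_error_prob` and `column_posterior` are hypothetical.

```python
import math

BASES = "ACGT"


def phred_to_error_prob(q):
    """Standard Phred convention (an assumption here): P(error) = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def column_posterior(calls):
    """Posterior over the true base at one aligned column of a contig.

    calls: list of (called_base, quality_score) pairs, one per read.
    Assumes a uniform prior, independent reads, and errors split
    evenly among the three bases that were not called.
    """
    log_like = {b: 0.0 for b in BASES}
    for called, q in calls:
        # Clamp so a call is never less informative than a uniform guess
        # and so that log(1 - e) stays finite when q == 0.
        e = min(phred_to_error_prob(q), 0.75)
        for b in BASES:
            if b == called:
                log_like[b] += math.log(1.0 - e)
            else:
                log_like[b] += math.log(e / 3.0)
    # Normalize in a numerically stable way.
    m = max(log_like.values())
    weights = {b: math.exp(v - m) for b, v in log_like.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


if __name__ == "__main__":
    # Three reads cover this column: two call A, one low-quality read calls C.
    post = column_posterior([("A", 30), ("A", 20), ("C", 10)])
    consensus = max(post, key=post.get)
    print(consensus, 1.0 - post[consensus])  # consensus base and its error probability
```

In this simplified setting, one minus the largest posterior serves as a per-position error estimate for the consensus. The chapter's first model instead estimates the error probabilities from the assembly itself, with the unknown true sequence treated as missing data in an EM algorithm.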

Acknowledgements

We thank Prof. Michael Waterman for initiating the work reported in this chapter. This work is supported by the NIH CEGS grant to the University of Southern California and by NIH grant R01 GM75308.

Author information

Corresponding author

Correspondence to Lei M. Li.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Li, L.M. (2011). Accuracy Assessment of Consensus Sequence from Shotgun Sequencing. In: Lu, HS., Schölkopf, B., Zhao, H. (eds) Handbook of Statistical Bioinformatics. Springer Handbooks of Computational Statistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16345-6_1
