Occupancy Modeling of Coverage Distribution for Whole Genome Shotgun DNA Sequencing

  • Original Paper
Bulletin of Mathematical Biology

Abstract

Expected-value models have long provided a rudimentary theoretical foundation for random DNA sequencing. Here, we are interested in improving characterization of genome coverage in terms of its underlying probability distributions. We find that the mathematical notion of occupancy serves as a good model for evolution of the coverage distribution function and reveals new insights related to sequence redundancy. Established concepts, such as “full shotgun depth,” have been assumed invariant, but actually depend on project size and decrease over time. For most microbial projects, the full shotgun milestone should be revised downward by about 30%. Accordingly, many already-completed genomes appear to have been over-sequenced. Results also suggest that read lengths for emerging high-throughput sequencing methods must be increased substantially before they can be considered as possible successors to the standard Sanger method. In particular, gains in throughput and sequence depth cannot be made to compensate for diminished read length. Limits are well approximated by a simple logarithmic equation, which should be useful in estimating maximum coverage-based redundancy for future projects.
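To make the contrast between an expected-value estimate and a full coverage distribution concrete, the following is a minimal sketch and not the occupancy model developed in the paper. It assumes a circular genome, reads of fixed length placed uniformly at random, and uses the classical approximation that the expected covered fraction is 1 − exp(−ρ), with ρ = (number of reads × read length) / genome length. All function names and parameter values are hypothetical, chosen only for illustration.

```python
import math
import random


def expected_coverage(genome_len, read_len, n_reads):
    """Classical expected-value estimate of the covered fraction.

    Uses 1 - exp(-rho), where rho = n_reads * read_len / genome_len
    is the sequence redundancy (depth).
    """
    rho = n_reads * read_len / genome_len
    return 1.0 - math.exp(-rho)


def simulate_coverage(genome_len, read_len, n_reads, rng):
    """Simulate one project and return the fraction of positions covered.

    Assumption: reads start uniformly at random on a circular genome
    and read_len < genome_len.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_len + 1)
    for _ in range(n_reads):
        start = rng.randrange(genome_len)
        end = start + read_len
        diff[start] += 1
        if end <= genome_len:
            diff[end] -= 1
        else:
            # The read wraps around the origin of the circular genome.
            diff[genome_len] -= 1
            diff[0] += 1
            diff[end - genome_len] -= 1

    covered = 0
    depth = 0
    for i in range(genome_len):
        depth += diff[i]
        if depth > 0:
            covered += 1
    return covered / genome_len


def coverage_distribution(genome_len, read_len, n_reads, trials, seed=0):
    """Empirical sample of covered fractions over repeated simulated projects."""
    rng = random.Random(seed)
    return [
        simulate_coverage(genome_len, read_len, n_reads, rng)
        for _ in range(trials)
    ]


if __name__ == "__main__":
    # Hypothetical toy project: 2 kb genome, 50 bp reads, 5x redundancy.
    samples = coverage_distribution(2000, 50, 200, trials=200)
    mean = sum(samples) / len(samples)
    print("expected-value estimate:", expected_coverage(2000, 50, 200))
    print("simulated mean:", mean)
    print("simulated min and max:", min(samples), max(samples))
```

The expected-value function returns a single number, whereas the simulated samples show the spread around it, which is the kind of distributional information the abstract says expected-value models do not capture.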




Correspondence to Michael C. Wendl.


Wendl, M.C. Occupancy Modeling of Coverage Distribution for Whole Genome Shotgun DNA Sequencing. Bull. Math. Biol. 68, 179–196 (2006). https://doi.org/10.1007/s11538-005-9021-4


  • DOI: https://doi.org/10.1007/s11538-005-9021-4
