Sequences II, pp 166–188

Reconstructing sequences from shotgun data

  • Conference paper

Abstract

One method of sequencing DNA produces a linear list of bases, {A, C, G, T}, for many short overlapping fragments of the original DNA. To find the sequence of the original piece of DNA, the fragments must be reassembled. While this reassembly problem is similar to the NP-complete shortest common superstring problem [GMS80], we believe that biologists are actually trying to solve a simpler problem. Biologists assume that short overlaps between substrings are insignificant. Further, they assume that there is a unique string from which the substrings could have been produced. We consider a reconstruction problem with these restrictions. We devise algorithms for this problem both when the overlaps must be exact and when there may be errors in the overlaps. Our algorithms are based on Rabin-Karp string matching and on suffix arrays. We investigate the running times of our algorithms and show that, in the expected case, they are proportional to the length of the reconstructed sequence. We give the timings of some test runs and note that the suffix array algorithms seem to be faster.
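
The restricted problem described in the abstract can be illustrated with a small sketch. The code below is not the authors' algorithm; it is a minimal illustration, assuming exact overlaps, a hypothetical `min_overlap` threshold below which overlaps are treated as insignificant, and a fragment set that chains into a single string. The function names `longest_overlap` and `assemble` are invented for this example. Overlaps are detected with Rabin-Karp-style polynomial hashes and each hash match is confirmed by direct comparison. The pairwise loop is quadratic in the number of fragments, so it does not reproduce the expected running time reported in the paper.

```python
BASE = 4
MOD = (1 << 61) - 1                      # large prime modulus (an assumption)
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # one digit per base


def longest_overlap(left, right, min_overlap):
    """Longest k >= min_overlap with left[-k:] == right[:k], or 0 if none."""
    suffix_hash = 0   # hash of left[-k:]
    prefix_hash = 0   # hash of right[:k]
    power = 1         # BASE**(k-1) mod MOD
    best = 0
    for k in range(1, min(len(left), len(right)) + 1):
        # Growing the suffix adds a new most-significant digit.
        suffix_hash = (CODE[left[-k]] * power + suffix_hash) % MOD
        # Growing the prefix adds a new least-significant digit.
        prefix_hash = (prefix_hash * BASE + CODE[right[k - 1]]) % MOD
        power = (power * BASE) % MOD
        # Overlaps shorter than min_overlap are treated as insignificant.
        if k >= min_overlap and suffix_hash == prefix_hash:
            if left[-k:] == right[:k]:   # rule out hash collisions
                best = k
    return best


def assemble(fragments, min_overlap):
    """Chain fragments, assuming they come from one unique string."""
    best_next = {}  # fragment index -> (successor index, overlap length)
    for i, left in enumerate(fragments):
        for j, right in enumerate(fragments):
            if i == j:
                continue
            k = longest_overlap(left, right, min_overlap)
            if k > best_next.get(i, (None, 0))[1]:
                best_next[i] = (j, k)
    has_predecessor = {j for j, _ in best_next.values()}
    starts = [i for i in range(len(fragments)) if i not in has_predecessor]
    if len(starts) != 1:
        raise ValueError("fragments do not form a single chain")
    current = starts[0]
    visited = {current}
    result = fragments[current]
    while current in best_next:
        current, k = best_next[current]
        if current in visited:
            raise ValueError("overlaps form a cycle")
        visited.add(current)
        result += fragments[current][k:]  # append only the new bases
    return result


fragments = ["GCCGATAGC", "ACGTTGCA", "TGCATGCCG"]
print(assemble(fragments, min_overlap=3))
```

Raising `min_overlap` above the true overlap length (4 in this toy input) makes the fragments fall apart into separate chains, which the sketch reports as an error rather than guessing an order.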

Keywords

  • Shotgun Sequencing
  • String Match
  • Suffix Array
  • Naive Algorithm
  • String Match Algorithm

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. A. V. Aho and M. J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18:333–340, 1975.

  2. H. S. Bilofsky and C. Burks. The GenBank genetic sequence data bank. Nucleic Acids Research, 16:1861–1863, 1988.

  3. A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis. Linear approximation of shortest superstrings. In Proceedings of the ACM Symposium on Theory of Computing, pages 328–336, Baltimore, MD, 1991. ACM press.

  4. R. S. Boyer and J. S. Moore. A fast string searching algorithm. Communications of the ACM, 20:762–772, 1977.

  5. P. Cull and J. L. Holloway. Algorithms for constructing a consensus sequence. Technical Report TR-91-20-1, Oregon State University, Department of Computer Science, 1991.

  6. G. H. Gonnet and R. A. Baeza-Yates. An analysis of the Karp-Rabin string matching algorithm. Information Processing Letters, 34:271–274, 1990.

  7. J. Gallant, D. Maier, and J. Storer. On finding minimal length superstrings. Journal of Computer and System Sciences, 20:50–58, 1980.

  8. S. Hahn, S. Buratowski, P. A. Sharp, and L. Guarente. Isolation of the gene encoding the yeast TATA binding protein TFIID: A gene identical to the SPT15 suppressor of Ty element insertions. Cell, 58:1173–1181, 1989.

  9. M. Horikoshi, C. K. Wang, H. Fujii, J. A. Cromlish, P. A. Weil, and R. G. Roeder. Cloning and structure of a yeast gene encoding a general transcription initiation factor TFIID that binds to the TATA box. Nature, 341:299–303, 1989.

  10. D. E. Knuth, J. H. Morris, and V. R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6:323–350, 1977.

  11. D. E. Knuth. The art of computer programming: searching and sorting, volume 3. Addison-Wesley, 1973.

  12. R. M. Karp and M. O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 32:249–260, 1987.

  13. M. Li. Towards a DNA sequencing theory. In IEEE Symposium on the Foundations of Computer Science, pages 125–134, 1990.

  14. U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 319–327. SIAM, 1990.

  15. H. Peltola, H. Soderlund, J. Tarhio, and E. Ukkonen. Algorithms for some string matching problems arising in molecular genetics. In Information Processing 83, pages 53–64, 1983.

  16. M. C. Schmidt, C. Kao, R. Pei, and A. J. Berk. Yeast TATA-box transcription factor gene. Proceedings of the National Academy of Sciences, 86:7785–7789, 1989.

  17. J. Tarhio and E. Ukkonen. A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science, 57:131–145, 1988.

  18. J. Turner. Approximation algorithms for the shortest common superstring problem. Information and Computation, 83:1–20, 1989.

  19. E. Ukkonen. A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica, 5:313–323, 1990.

Copyright information

© 1993 Springer-Verlag New York, Inc.

About this paper

Cite this paper

Cull, P., Holloway, J. (1993). Reconstructing sequences from shotgun data. In: Capocelli, R., De Santis, A., Vaccaro, U. (eds) Sequences II. Springer, New York, NY. https://doi.org/10.1007/978-1-4613-9323-8_13

  • DOI: https://doi.org/10.1007/978-1-4613-9323-8_13

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4613-9325-2

  • Online ISBN: 978-1-4613-9323-8

  • eBook Packages: Springer Book Archive