Poisson process approximation for repeats in one sequence and its application to sequencing by hybridization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1075)

Abstract

Sequencing by hybridization here refers to the attempt to determine a target DNA sequence of moderate length m from the set of all l-tuples contained in the sequence. Realistic values are currently in the range l=8 to 12 and m=100 to 5000. As a mathematical idealization, it is best to begin with the unrealistic assumption that the multiplicity of occurrences of l-tuples is also known; this multiset is called the spectrum or l-spectrum of the sequence. For random sequences, the more realistic case, where only the set underlying the spectrum is known, can be closely approximated by the situation involving the spectrum.
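To make the distinction between the spectrum (a multiset) and its underlying set concrete, here is a minimal sketch. The function name `l_spectrum` and the toy target string are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def l_spectrum(seq: str, l: int) -> Counter:
    """Return the multiset of all l-tuples (length-l substrings) of seq.

    Keys are the l-tuples; values are their multiplicities.
    """
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))


# Toy example (hypothetical target, far shorter than realistic m).
target = "ACGTACGA"
spectrum = l_spectrum(target, 3)   # multiset: multiplicities are known
support = set(spectrum)            # set underlying the spectrum
```

In this toy target the 3-tuple `ACG` occurs twice, so the spectrum records a multiplicity that the underlying set discards.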

We model DNA as an i.i.d. sequence. Results are proved both in terms of the simpler infinite limit, where m and l go to infinity together so that the probability of unique recoverability is bounded away from zero and one, and also in the much more difficult finite case, where the results are given with concrete error bounds. For example: for the uniform distribution on four letters, for l=12, m=2814, the probability of unique recoverability is in the interval .9349±.0347, while for l=10, m=568 it is in .9676±.0671. These are rigorous results, not involving simulation. The error bounds get smaller as l, m increase together, and for the small values of l in biological practice, our intervals with error bounds are just starting to be informative.

We consider notions of partial recovery. First, with N defined as the number of m-sequences having the same spectrum as the target sequence, what is the distribution of N? The question of unique recoverability is one of approximating P(N=1). Another notion of partial recovery involves the length L of the longest subsequence of the target which is determined by the spectrum; here unique recoverability corresponds to the event {L=m}.
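For very short sequences, N can be computed directly from its definition by enumerating every m-sequence and comparing spectra. The sketch below does exactly that; it is a brute-force illustration only (the names `count_same_spectrum` and `l_spectrum` are assumptions), and it is exponential in m, so it says nothing about how the paper's results are obtained.

```python
from collections import Counter
from itertools import product


def l_spectrum(seq: str, l: int) -> Counter:
    """Multiset of all l-tuples (length-l substrings) of seq."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))


def count_same_spectrum(target: str, l: int, alphabet: str = "ACGT") -> int:
    """Brute-force N: number of sequences of length len(target) over
    `alphabet` whose l-spectrum equals that of `target`.

    Enumerates all len(alphabet)**len(target) candidates, so it is only
    usable for toy lengths.
    """
    goal = l_spectrum(target, l)
    return sum(
        1
        for cand in product(alphabet, repeat=len(target))
        if l_spectrum("".join(cand), l) == goal
    )


# Unique recoverability is the event N == 1.
n_unique = count_same_spectrum("ACGT", 2)
n_ambiguous = count_same_spectrum("ACAGA", 2)
```

Here `ACGT` is uniquely recoverable from its 2-spectrum, while `ACAGA` is not: the letter `A` (a t-tuple with t=l−1=1) repeats, and `AGACA`, `CAGAC` and `GACAG` share the same spectrum.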

For all of these questions, the deterministic key involves the Ukkonen-Pevzner transformations and the de Bruijn graph of the target; and the probabilistic key is a Poisson process approximation for where there are repeats of t-tuples, with t=l−1, in the target sequence. This Poisson approximation is easily proved using the Chen-Stein method, which with more work also yields concrete error bounds. This talk is based in part on joint work with Daniela Martin and Michael Waterman.
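As a rough illustration of the objects being approximated, the following sketch locates repeated t-tuples in a sequence and computes the expected number of pairs of positions carrying matching t-tuples under the i.i.d. uniform model. The function names are hypothetical. The expectation counts raw matching pairs, including overlapping ones, and is not the declumped Poisson process or the error bound developed in the paper.

```python
from collections import defaultdict
from math import comb


def repeated_tuples(seq: str, t: int) -> dict[str, list[int]]:
    """Map each t-tuple occurring more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - t + 1):
        positions[seq[i:i + t]].append(i)
    return {w: p for w, p in positions.items() if len(p) > 1}


def expected_matching_pairs(m: int, t: int, alphabet_size: int = 4) -> float:
    """Expected number of position pairs i < j whose t-tuples agree, for an
    i.i.d. uniform sequence of length m.

    There are m - t + 1 start positions, and for uniform letters any two
    distinct positions match with probability alphabet_size ** -t.
    """
    n_positions = m - t + 1
    return comb(n_positions, 2) / alphabet_size ** t


repeats = repeated_tuples("ACGTACGA", 2)

# The two (l, m) examples from the abstract, with t = l - 1.
pairs_l12 = expected_matching_pairs(2814, 11)
pairs_l10 = expected_matching_pairs(568, 9)
```

For both example parameter pairs the expected number of matching pairs is below 1, which is consistent with a regime where repeats are rare but not negligible.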

Supported by grants from the National Institutes of Health (GM 36230) and the National Science Foundation (DMS 90-05833).

References

  1. D. Aldous, Probability approximations via the Poisson clumping heuristic. Springer, New York.

  2. R. Arratia, Sequencing by hybridization: set versus multiset. (1996) In preparation.

  3. R. Arratia, L. Goldstein and L. Gordon, Two moments suffice for Poisson approximations: The Chen-Stein method. Ann. Probab. 17 (1989), 9–25.

  4. R. Arratia, L. Goldstein and L. Gordon, Poisson Approximation and the Chen-Stein Method. Statistical Science 5 (1990), 403–434.

  5. R. Arratia, D. Martin, G. Reinert, and M. S. Waterman, Poisson process approximation for sequence repeats, and sequencing by hybridization. (1996) Submitted to J. Comp. Biol., 68 pp.

  6. R. Arratia and G. Reinert, Partial recovery in sequencing by hybridization. (1996) In preparation.

  7. R. Arratia and S. Tavaré, Reviews of Probability approximations via the Poisson clumping heuristic by D. Aldous and Poisson approximation by A.D. Barbour, L. Holst, and S. Janson. Ann. Probab. 21 (1993), 2269–2279.

  8. A.D. Barbour, L. Holst, S. Janson, Poisson approximation. Clarendon, Oxford 1992.

  9. L.H.Y. Chen, Poisson approximation for dependent trials. Ann. Probab. 3 (1975), 534–545.

  10. M. Dyer, A. Frieze and S. Suen, The Probability of Unique Solutions of Sequencing by Hybridization. J. Comp. Biol. 1 (1994), 105–110.

  11. S. Karlin and F. Ost, Counts on long aligned word matches along random letter sequences. Adv. Appl. Prob. 19 (1987), 293–351.

  12. P.A. Pevzner, DNA physical mapping and alternating Eulerian cycles in colored graphs. Algorithmica 13 (1995), 77–105.

  13. E. Ukkonen, Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science 92 (1992), 191–211.

  14. M.S. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman Hall 1995.

  15. A.M. Zubkov and V.G. Mikhailov, Limit distributions of random variables associated with long duplications in a sequence of independent trials. Theory Prob. Applications 19 (1974), 172–179.

Editor information

Dan Hirschberg, Gene Myers

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Arratia, R., Reinert, G. (1996). Poisson process approximation for repeats in one sequence and its application to sequencing by hybridization. In: Hirschberg, D., Myers, G. (eds) Combinatorial Pattern Matching. CPM 1996. Lecture Notes in Computer Science, vol 1075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61258-0_16

  • DOI: https://doi.org/10.1007/3-540-61258-0_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-61258-2

  • Online ISBN: 978-3-540-68390-2

  • eBook Packages: Springer Book Archive
