Poisson process approximation for repeats in one sequence and its application to sequencing by hybridization

Arratia, Richard; Reinert, Gesine

doi:10.1007/3-540-61258-0_16


Conference paper
First Online: 01 January 2005


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1075)

Abstract

Sequencing by hybridization here refers to the attempt to determine a target DNA sequence of moderate length m from the set of all l-tuples contained in the sequence. Realistic values are currently in the range l=8 to 12 and m=100 to 5000. As a mathematical idealization, it is best to begin with the unrealistic assumption that the multiplicity of occurrences of l-tuples is also known; this multiset is called the spectrum or l-spectrum of the sequence. For random sequences, the more realistic case, where only the set underlying the spectrum is known, can be closely approximated by the situation involving the spectrum.
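To make the distinction between the spectrum (a multiset) and its underlying set concrete, here is a minimal Python sketch. The function names `l_spectrum` and `l_tuple_set` and the example string are illustrative assumptions, not notation from the paper.

```python
from collections import Counter


def l_spectrum(seq, l):
    """Return the l-spectrum of seq: every l-tuple with its multiplicity."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))


def l_tuple_set(seq, l):
    """Return only the set of l-tuples, discarding multiplicities."""
    return set(l_spectrum(seq, l))


# Hypothetical toy target, far shorter than the realistic m = 100 to 5000.
target = "ACGTACG"
spec = l_spectrum(target, 3)
print(spec)                    # multiset: "ACG" occurs twice
print(l_tuple_set(target, 3))  # set: "ACG" appears once
```

A sequence of length m contributes m − l + 1 tuples to the spectrum, so the multiplicities in `spec` sum to 5 here.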

We model DNA as an i.i.d. sequence. Results are proved both in terms of the simpler infinite limit, where m and l go to infinity together so that the probability of unique recoverability is bounded away from zero and one, and also in the much more difficult finite case, where the results are given with concrete error bounds. For example: for the uniform distribution on four letters, for l=12, m=2814, the probability of unique recoverability is in the interval .9349±.0347, while for l=10, m=568 it is in .9676±.0671. These are rigorous results, not involving simulation. The error bounds get smaller as l, m increase together, and for the small values of l in biological practice, our intervals with error bounds are just starting to be informative.

We consider notions of partial recovery. First, with N defined as the number of m-sequences having the same spectrum as the target sequence, what is the distribution of N? The question of unique recoverability is one of approximating P(N=1). Another notion of partial recovery involves the length L of the longest subsequence of the target which is determined by the spectrum; here unique recoverability corresponds to the event {L=m}.
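For very short sequences, N can be computed directly by backtracking over the spectrum: start from any l-tuple and repeatedly append an unused l-tuple whose first l−1 letters match the last l−1 letters of the current one. The sketch below is only an illustration of the definition of N, assuming hypothetical helper names `l_spectrum` and `count_reconstructions`; it takes exponential time in general and is not the method of the paper.

```python
from collections import Counter


def l_spectrum(seq, l):
    """Return the l-spectrum of seq: every l-tuple with its multiplicity."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))


def count_reconstructions(spec):
    """Count the sequences whose l-spectrum equals spec (this count is N)."""
    remaining = Counter(spec)
    total = sum(remaining.values())

    def extend(last, used):
        # All tuples consumed: one complete sequence has been spelled out.
        if used == total:
            return 1
        overlap = last[1:]
        count = 0
        # Iterate over distinct tuples so each sequence is counted once.
        for tup in list(remaining):
            if remaining[tup] > 0 and tup[:-1] == overlap:
                remaining[tup] -= 1
                count += extend(tup, used + 1)
                remaining[tup] += 1
        return count

    n = 0
    for start in list(remaining):
        remaining[start] -= 1
        n += extend(start, 1)
        remaining[start] += 1
    return n


# Hypothetical toy examples with l = 2.
print(count_reconstructions(l_spectrum("ACGT", 2)))     # 1: uniquely recoverable
print(count_reconstructions(l_spectrum("TACAGAG", 2)))  # 2: TACAGAG and TAGACAG
```

In the second example the letter A is repeated three times, so the two segments between its occurrences can be swapped without changing the 2-spectrum.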

For all of these questions, the deterministic key involves the Ukkonen-Pevzner transformations and the de Bruijn graph of the target; and the probabilistic key is a Poisson process approximation for where there are repeats of t-tuples, with t=l−1, in the target sequence. This Poisson approximation is easily proved using the Chen-Stein method, which with more work also yields concrete error bounds. This talk is based in part on joint work with Daniela Martin and Michael Waterman.
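As a rough illustration of the objects being approximated, the following Python sketch lists the position pairs at which a t-tuple repeats and computes the expected number of such pairs for an i.i.d. uniform sequence. The function names are assumptions made for this example. The count is of raw matching pairs, which is simpler than the process of repeats treated in the paper, since overlapping matches are not declumped here.

```python
from collections import defaultdict
from math import comb


def repeat_pairs(seq, t):
    """Return all pairs (i, j), i < j, with seq[i:i+t] == seq[j:j+t]."""
    positions = defaultdict(list)
    for i in range(len(seq) - t + 1):
        positions[seq[i:i + t]].append(i)
    pairs = []
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                pairs.append((starts[a], starts[b]))
    return sorted(pairs)


def expected_repeat_pairs(m, t, alphabet_size=4):
    """Expected number of matching pairs of t-tuples in an i.i.d. uniform
    sequence of length m. For the uniform distribution each pair of start
    positions matches with probability alphabet_size ** (-t)."""
    n = m - t + 1  # number of t-tuple start positions
    return comb(n, 2) * alphabet_size ** (-t)


print(repeat_pairs("TACAGAG", 1))       # [(1, 3), (1, 5), (3, 5), (4, 6)]
print(expected_repeat_pairs(2814, 11))  # l = 12, so t = l - 1 = 11
```

For l=12 and m=2814 the expected number of matching pairs of 11-tuples is a little under one, which is the regime where a Poisson description of the repeats is informative.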

Supported by grants from the National Institutes of Health (GM 36230) and the National Science Foundation (DMS 90-05833).


References

D. Aldous, Probability approximations via the Poisson clumping heuristic. Springer, New York.
R. Arratia, Sequencing by hybridization: set versus multiset. (1996) In preparation.
R. Arratia, L. Goldstein and L. Gordon, Two moments suffice for Poisson approximations: The Chen-Stein method. Ann. Probab. 17 (1989), 9–25.
R. Arratia, L. Goldstein and L. Gordon, Poisson Approximation and the Chen-Stein Method. Statistical Science 5 (1990), 403–434.
R. Arratia, D. Martin, G. Reinert, and M. S. Waterman, Poisson process approximation for sequence repeats, and sequencing by hybridization. (1996) Submitted to J. Comp. Biol., 68 pp.
R. Arratia and G. Reinert (1996) Partial recovery in sequencing by hybridization. In preparation.
R. Arratia and S. Tavaré, Reviews of Probability approximations via the Poisson clumping heuristic by D. Aldous and Poisson approximation by A.D. Barbour, L. Holst, and S. Janson. Ann. Probab. 21 (1993), 2269–2279.
A.D. Barbour, L. Holst, S. Janson, Poisson approximation. Clarendon, Oxford 1992.
L.H.Y. Chen, Poisson approximation for dependent trials. Ann. Probab. 3 (1975), 534–545.
M. Dyer, A. Frieze and S. Suen, The Probability of Unique Solutions of Sequencing by Hybridization. J. Comp. Biol. 1 (1994), 105–110.
S. Karlin and F. Ost, Counts on long aligned word matches along random letter sequences. Adv. Appl. Prob. 19 (1987), 293–351.
P.A. Pevzner, DNA physical mapping and alternating Eulerian cycles in colored graphs. Algorithmica 13 (1995), 77–105.
E. Ukkonen, Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science 92 (1992), 191–211.
M.S. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman Hall 1995.
A.M. Zubkov and V.G. Mikhailov, Limit distributions of random variables associated with long duplications in a sequence of independent trials. Theory Prob. Applications 19 (1974), 172–179.


Author information

Authors and Affiliations

Department of Mathematics, University of Southern California, 90089-1113, Los Angeles, CA
Richard Arratia & Gesine Reinert


Editor information

Dan Hirschberg, Gene Myers


Cite this paper

Arratia, R., Reinert, G. (1996). Poisson process approximation for repeats in one sequence and its application to sequencing by hybridization. In: Hirschberg, D., Myers, G. (eds) Combinatorial Pattern Matching. CPM 1996. Lecture Notes in Computer Science, vol 1075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61258-0_16


DOI: https://doi.org/10.1007/3-540-61258-0_16
Published: 01 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61258-2
Online ISBN: 978-3-540-68390-2
eBook Packages: Springer Book Archive
