CPM 1996: Combinatorial Pattern Matching pp 209-219

# Poisson process approximation for repeats in one sequence and its application to sequencing by hybridization

• Richard Arratia
• Gesine Reinert
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1075)

## Abstract

Sequencing by hybridization here refers to the attempt to determine a target DNA sequence of moderate length m from the set of all l-tuples contained in the sequence. Realistic values are currently in the range l=8 to 12 and m=100 to 5000. As a mathematical idealization, it is best to begin with the unrealistic assumption that the multiplicity of occurrences of l-tuples is also known; this multiset is called the spectrum or l-spectrum of the sequence. For random sequences, the more realistic case, where only the set underlying the spectrum is known, can be closely approximated by the situation involving the spectrum.
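As a small illustration of the distinction between the spectrum (a multiset) and its underlying set, the following Python sketch computes both for a toy sequence. The function names `spectrum` and `spectrum_support` and the example string are assumptions made for illustration; they do not come from the paper.

```python
from collections import Counter


def spectrum(seq: str, l: int) -> Counter:
    """Multiset of all l-tuples (length-l substrings) of seq, with multiplicities."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))


def spectrum_support(seq: str, l: int) -> set:
    """Set underlying the spectrum: which l-tuples occur, ignoring how often."""
    return set(spectrum(seq, l))


if __name__ == "__main__":
    target = "ACGTACGA"  # hypothetical toy target, m = 8
    print(spectrum(target, 3))          # multiplicities are kept
    print(spectrum_support(target, 3))  # multiplicities are dropped
```

A sequence of length m contributes m − l + 1 tuples to the spectrum, counted with multiplicity; the set version loses exactly the information about which tuples occur more than once.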

We model DNA as an i.i.d. sequence. Results are proved both in terms of the simpler infinite limit, where m and l go to infinity together so that the probability of unique recoverability is bounded away from zero and one, and also in the much more difficult finite case, where the results are given with concrete error bounds. For example: for the uniform distribution on four letters, for l=12, m=2814, the probability of unique recoverability is in the interval .9349±.0347, while for l=10, m=568 it is in .9676±.0671. These are rigorous results, not involving simulation. The error bounds get smaller as l, m increase together, and for the small values of l in biological practice, our intervals with error bounds are just starting to be informative.

We consider notions of partial recovery. First, with N defined as the number of m-sequences having the same spectrum as the target sequence, what is the distribution of N? The question of unique recoverability is one of approximating P(N=1). Another notion of partial recovery involves the length L of the longest subsequence of the target which is determined by the spectrum; here unique recoverability corresponds to the event {L=m}.
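For very short sequences, N can be computed by exhaustive search, which makes the definition concrete. The sketch below enumerates every sequence whose l-spectrum matches a given multiset by backtracking over l-tuples. It is a brute-force illustration only, not the method of the paper, and the names `sequences_with_spectrum` and `spectrum` are assumptions introduced here.

```python
from collections import Counter


def spectrum(seq: str, l: int) -> Counter:
    """Multiset of all l-tuples of seq."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))


def sequences_with_spectrum(spec: Counter, l: int) -> list:
    """List every sequence whose l-spectrum (with multiplicities) equals spec."""
    remaining = Counter(spec)
    total = sum(spec.values())
    alphabet = sorted({ch for tup in spec for ch in tup})
    found = []

    def extend(seq: str, used: int) -> None:
        if used == total:  # every l-tuple of the spectrum has been consumed
            found.append(seq)
            return
        suffix = seq[-(l - 1):] if l > 1 else ""
        for ch in alphabet:  # distinct letters, so each sequence is built once
            tup = suffix + ch
            if remaining[tup] > 0:
                remaining[tup] -= 1
                extend(seq + ch, used + 1)
                remaining[tup] += 1  # undo the choice before trying the next one

    for start in list(spec):  # try each distinct l-tuple as the first one
        remaining[start] -= 1
        extend(start, 1)
        remaining[start] += 1
    return found


if __name__ == "__main__":
    for target, l in [("ACGTACGA", 3), ("TACAGAC", 2)]:
        candidates = sequences_with_spectrum(spectrum(target, l), l)
        print(target, l, "N =", len(candidates), candidates)
```

In the second toy example the spectrum does not determine the target: `TACAGAC` and `TAGACAC` share the same 2-spectrum, so N = 2, whereas the first example is uniquely recoverable. The search is exponential in the worst case and is only meant for tiny inputs.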

For all of these questions, the deterministic key involves the Ukkonen-Pevzner transformations and the de Bruijn graph of the target; the probabilistic key is a Poisson process approximation for the locations of repeats of t-tuples, with t=l−1, in the target sequence. This Poisson approximation is easily proved using the Chen-Stein method, which with more work also yields concrete error bounds. This talk is based in part on joint work with Daniela Martin and Michael Waterman.
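To make the notion of repeats of t-tuples concrete, the following sketch lists the position pairs at which a sequence repeats a t-tuple and computes the expected number of such pairs for uniform i.i.d. letters, namely C(m−t+1, 2)·4^(−t). This raw pair count is only a first-moment heuristic: it ignores the clumping of overlapping matches, which a careful Poisson process approximation has to handle. The function names are assumptions made for illustration.

```python
from collections import defaultdict
from math import comb


def repeat_pairs(seq: str, t: int) -> list:
    """All pairs (i, j), i < j, where the t-tuples starting at i and j agree."""
    positions = defaultdict(list)
    for i in range(len(seq) - t + 1):
        positions[seq[i:i + t]].append(i)
    return [
        (i, j)
        for occ in positions.values()
        for a, i in enumerate(occ)
        for j in occ[a + 1:]
    ]


def expected_repeat_pairs(m: int, t: int, alphabet_size: int = 4) -> float:
    """Expected number of matching pairs for uniform i.i.d. letters.

    Each of the C(m - t + 1, 2) position pairs matches with probability
    alphabet_size ** (-t); under the uniform distribution this holds even
    when the two t-tuples overlap.
    """
    n = m - t + 1
    return comb(n, 2) * alphabet_size ** (-t)


if __name__ == "__main__":
    print(repeat_pairs("ACGTACGA", 2))
    # Parameters from the abstract: l = 12, m = 2814, so t = l - 1 = 11.
    print(expected_repeat_pairs(2814, 11))
```

The expectation is a mean number of repeat pairs, not a recoverability probability; how repeats interleave, rather than merely how many there are, decides whether the spectrum determines the sequence.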

## Keywords

• Target Word
• Error Bound
• Partial Recovery
• Poisson Point Process
• Poisson Approximation

## References

1. D. Aldous, Probability approximations via the Poisson clumping heuristic. Springer, New York.
2. R. Arratia, Sequencing by hybridization: set versus multiset. (1996) In preparation.
3. R. Arratia, L. Goldstein and L. Gordon, Two moments suffice for Poisson approximations: The Chen-Stein method. Ann. Probab. 17 (1989), 9–25.
4. R. Arratia, L. Goldstein and L. Gordon, Poisson Approximation and the Chen-Stein Method. Statistical Science 5 (1990), 403–434.
5. R. Arratia, D. Martin, G. Reinert, and M. S. Waterman, Poisson process approximation for sequence repeats, and sequencing by hybridization. (1996) Submitted to J. Comp. Biol., 68 pp.
6. R. Arratia and G. Reinert, Partial recovery in sequencing by hybridization. (1996) In preparation.
7. R. Arratia and S. Tavaré, Reviews of Probability approximations via the Poisson clumping heuristic by D. Aldous and Poisson approximation by A.D. Barbour, L. Holst, and S. Janson. Ann. Probab. 21 (1993), 2269–2279.
8. A.D. Barbour, L. Holst, S. Janson, Poisson approximation. Clarendon, Oxford 1992.
9. L.H.Y. Chen, Poisson approximation for dependent trials. Ann. Probab. 3 (1975), 534–545.
10. M. Dyer, A. Frieze and S. Suen, The Probability of Unique Solutions of Sequencing by Hybridization. J. Comp. Biol. 1 (1994), 105–110.
11. S. Karlin and F. Ost, Counts on long aligned word matches along random letter sequences. Adv. Appl. Prob. 19 (1987), 293–351.
12. P.A. Pevzner, DNA physical mapping and alternating Eulerian cycles in colored graphs. Algorithmica 13 (1995), 77–105.
13. E. Ukkonen, Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science 92 (1992), 191–211.
14. M.S. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman Hall 1995.
15. A.M. Zubkov and V.G. Mikhailov, Limit distributions of random variables associated with long duplications in a sequence of independent trials. Theory Prob. Applications 19 (1974), 172–179.