# Poisson process approximation for repeats in one sequence and its application to sequencing by hybridization

## Abstract

Sequencing by hybridization here refers to the attempt to determine a target DNA sequence of moderate length *m* from the set of all *l*-tuples contained in the sequence. Realistic values are currently in the range *l*=8 to 12 and *m*=100 to 5000. As a mathematical idealization, it is best to begin with the unrealistic assumption that the multiplicity of occurrences of *l*-tuples is also known; this multiset is called the spectrum or *l*-spectrum of the sequence. For random sequences, the more realistic case, where only the set underlying the spectrum is known, can be closely approximated by the situation involving the spectrum.
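As a small illustration of the distinction between the spectrum (a multiset) and its underlying set, the following sketch computes both for a short made-up sequence. The function name `spectrum` and the example string are assumptions chosen for illustration, not taken from the paper.

```python
from collections import Counter

def spectrum(seq: str, l: int) -> Counter:
    """Return the l-spectrum of seq: every l-tuple with its multiplicity."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))

# Hypothetical short target; realistic targets have m = 100 to 5000.
target = "ACGTACGA"
spec = spectrum(target, 3)   # multiset: the idealized data
support = set(spec)          # set only: the more realistic data

print(spec["ACG"])           # "ACG" occurs twice in the target
print(len(support))          # number of distinct 3-tuples
```

The set `support` forgets that `ACG` occurs twice; the abstract's claim is that, for random sequences, little is lost by this.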

We model DNA as an i.i.d. sequence. Results are proved both in terms of the simpler infinite limit, where *m* and *l* go to infinity together so that the probability of unique recoverability is bounded away from zero and one, and also in the much more difficult finite case, where the results are given with concrete error bounds. For example: for the uniform distribution on four letters, for *l*=12, *m*=2814, the probability of unique recoverability is in the interval .9349±.0347, while for *l*=10, *m*=568 it is in .9676±.0671. These are rigorous results, not involving simulation. The error bounds get smaller as *l*, *m* increase together, and for the small values of *l* in biological practice, our intervals with error bounds are just starting to be informative.

We consider notions of partial recovery. First, with *N* defined as the number of *m*-sequences having the same spectrum as the target sequence, what is the distribution of *N*? The question of unique recoverability is one of approximating P(*N*=1). Another notion of partial recovery involves the length *L* of the longest subsequence of the target which is determined by the spectrum; here unique recoverability corresponds to the event {*L=m*}.
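To make *N* concrete, here is a minimal sketch that lists every sequence sharing the spectrum of a given target, using plain backtracking. It is a brute-force illustration only, not the analytic approach of the paper; the function names and the example targets are assumptions, and the code assumes *l* ≥ 2.

```python
from collections import Counter

def spectrum(seq, l):
    """Multiset of all l-tuples in seq."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))

def sequences_with_spectrum(spec, l):
    """All sequences whose l-spectrum equals spec (backtracking, l >= 2)."""
    remaining = Counter(spec)          # l-tuples not yet used
    total = sum(remaining.values())
    found = []

    def extend(seq, used):
        if used == total:
            found.append(seq)
            return
        suffix = seq[-(l - 1):]        # next tuple must overlap in l-1 letters
        candidates = [w for w in remaining
                      if remaining[w] > 0 and w.startswith(suffix)]
        for w in candidates:
            remaining[w] -= 1
            extend(seq + w[-1], used + 1)
            remaining[w] += 1

    for first in list(remaining):      # try each distinct tuple as the start
        remaining[first] -= 1
        extend(first, 1)
        remaining[first] += 1
    return sorted(found)

# Hypothetical target in which the 2-tuple "AC" occurs three times.
target = "GACGACTACT"
same = sequences_with_spectrum(spectrum(target, 3), 3)
print(same)        # -> ['GACGACTACT', 'GACTACGACT']
print(len(same))   # N = 2, so this target is not uniquely recoverable
```

In this example the two segments between consecutive occurrences of `AC` can be exchanged without changing the spectrum, which is why *N*=2. A target such as `ACGT` with *l*=3 has *N*=1.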

For all of these questions, the deterministic key involves the Ukkonen-Pevzner transformations and the de Bruijn graph of the target; and the probabilistic key is a Poisson process approximation for where there are repeats of *t*-tuples, with *t*=*l*−1, in the target sequence. This Poisson approximation is easily proved using the Chen-Stein method, which with more work also yields concrete error bounds. This talk is based in part on joint work with Daniela Martin and Michael Waterman.
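The two objects named above can be sketched in a few lines: the de Bruijn graph has the *t*-tuples as vertices and one edge per *l*-tuple, and the repeats of *t*-tuples are the positions where a vertex is visited more than once. The helper names and the example target below are illustrative assumptions; the sketch only locates repeats and does not implement the Poisson approximation or the Chen-Stein bounds.

```python
from collections import defaultdict

def de_bruijn_edges(seq, l):
    """One edge per l-tuple, from its first l-1 letters to its last l-1 letters."""
    return [(seq[i:i + l - 1], seq[i + 1:i + l])
            for i in range(len(seq) - l + 1)]

def repeated_tuples(seq, t):
    """Map each t-tuple occurring more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - t + 1):
        positions[seq[i:i + t]].append(i)
    return {w: p for w, p in positions.items() if len(p) > 1}

# Hypothetical target, with l = 3 and therefore t = l - 1 = 2.
target = "GACGACTACT"
l = 3
edges = de_bruijn_edges(target, l)
repeats = repeated_tuples(target, l - 1)

print(len(edges))   # one edge per 3-tuple: m - l + 1 of them
print(repeats)      # 2-tuples visited more than once, with positions
```

Vertices that appear in `repeats` are the only places where a walk through the graph can branch, so a target with no repeated *t*-tuples is determined by its spectrum.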

## Keywords

Target Word, Error Bound, Partial Recovery, Poisson Point Process, Poisson Approximation

## References

1. D. Aldous, *Probability approximations via the Poisson clumping heuristic*. Springer, New York.
2. R. Arratia, Sequencing by hybridization: set versus multiset. (1996) In preparation.
3. R. Arratia, L. Goldstein and L. Gordon, Two moments suffice for Poisson approximations: The Chen-Stein method. *Ann. Probab.* **17** (1989), 9–25.
4. R. Arratia, L. Goldstein and L. Gordon, Poisson Approximation and the Chen-Stein Method. *Statistical Science* **5** (1990), 403–434.
5. R. Arratia, D. Martin, G. Reinert, and M. S. Waterman, Poisson process approximation for sequence repeats, and sequencing by hybridization. (1996) Submitted to J. Comp. Biol., 68 pp.
6. R. Arratia and G. Reinert (1996) Partial recovery in sequencing by hybridization. In preparation.
7. R. Arratia and S. Tavaré, Reviews of *Probability approximations via the Poisson clumping heuristic* by D. Aldous and *Poisson approximation* by A.D. Barbour, L. Holst, and S. Janson. *Ann. Probab.* **21** (1993), 2269–2279.
8. A.D. Barbour, L. Holst, S. Janson, *Poisson approximation*. Clarendon, Oxford 1992.
9. L.H.Y. Chen, Poisson approximation for dependent trials. *Ann. Probab.* **3** (1975), 534–545.
10. M. Dyer, A. Frieze and S. Suen, The Probability of Unique Solutions of Sequencing by Hybridization. *J. Comp. Biol.* **1** (1994), 105–110.
11. S. Karlin and F. Ost, Counts on long aligned word matches along random letter sequences. *Adv. Appl. Prob.* **19** (1987), 293–351.
12. P.A. Pevzner, DNA physical mapping and alternating Eulerian cycles in colored graphs. *Algorithmica* **13** (1995), 77–105.
13. E. Ukkonen, Approximate string-matching with q-grams and maximal matches. *Theoretical Computer Science* **92** (1992), 191–211.
14. M.S. Waterman, *Introduction to Computational Biology: Maps, Sequences and Genomes*. Chapman Hall 1995.
15. A.M. Zubkov and V.G. Mikhailov, Limit distributions of random variables associated with long duplications in a sequence of independent trials. *Theory Prob. Applications* **19** (1974), 172–179.