# Approximation of sojourn-times via maximal couplings: motif frequency distributions

## Abstract

Sojourn-times provide a versatile framework to assess the statistical significance of motifs in genome-wide searches, even under non-Markovian background models. However, the large state spaces encountered in genomic sequence analyses make the exact calculation of sojourn-time distributions computationally intractable for long sequences. Here, we use coupling and analytic combinatoric techniques to approximate these distributions in the general setting of Polish state spaces, which encompass discrete state spaces. Our approximations come with explicit, easy-to-compute error bounds in total variation distance. Broadly speaking, if \(\mathsf{T}_n\) is the random number of times a Markov chain visits a certain subset \(\mathsf{T}\) of states in its first \(n\) transitions, then we can usually approximate the distribution of \(\mathsf{T}_n\) for \(n\) of order \((1-\alpha)^{-m}\), where \(m\) is the largest integer for which the exact distribution of \(\mathsf{T}_m\) is accessible and \(0\le \alpha \le 1\) is an ergodicity coefficient associated with the probability transition kernel of the chain. This gives access to approximations of sojourn-times in the intermediate regime where \(n\) is too large for exact calculations but too small to rely on Normal approximations or on the stationarity assumptions underlying Poisson and compound Poisson approximations. As a proof of concept, we approximate the distribution of the number of matches with a motif in promoter regions of *C. elegans*. Mathematical properties of the proposed ergodicity coefficients, and their connections with additive functionals of homogeneous Markov chains and with the ergodicity of non-homogeneous Markov chains, are also explored.
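To make the quantities in the abstract concrete, the following is a minimal Python sketch for a finite-state chain, not the authors' implementation. It computes the exact distribution of \(\mathsf{T}_n\) by dynamic programming over the joint law of (current state, visits so far), and evaluates one classical ergodicity coefficient, Doeblin's \(\alpha(P)=\sum_j \min_i P(i,j)\). Several things here are assumptions made for illustration: the function names `sojourn_distribution` and `doeblin_coefficient`, the two-state transition matrix, the convention that the initial state \(X_0\) is not counted as a visit, and the choice of Doeblin's coefficient, which may differ from the coefficients proposed in the paper.

```python
from typing import List, Sequence, Set


def sojourn_distribution(
    P: Sequence[Sequence[float]],
    init: Sequence[float],
    target: Set[int],
    n: int,
) -> List[float]:
    """Exact law of T_n = #{1 <= k <= n : X_k in target} for a finite chain.

    Assumed convention: the initial state X_0 is NOT counted as a visit.
    Returns a list `dist` with dist[t] = P(T_n = t) for t = 0, ..., n.
    """
    s = len(P)
    # joint[i][t] = P(X_k = i, T_k = t) after k transitions (k = 0 at start).
    joint = [[0.0] * (n + 1) for _ in range(s)]
    for i in range(s):
        joint[i][0] = init[i]
    for _ in range(n):
        nxt = [[0.0] * (n + 1) for _ in range(s)]
        for i in range(s):
            for t in range(n + 1):
                mass = joint[i][t]
                if mass == 0.0:
                    continue
                for j in range(s):
                    p = P[i][j]
                    if p == 0.0:
                        continue
                    # Landing in the target set adds one visit.
                    nxt[j][t + 1 if j in target else t] += mass * p
        joint = nxt
    # Marginalize over the final state.
    return [sum(joint[i][t] for i in range(s)) for t in range(n + 1)]


def doeblin_coefficient(P: Sequence[Sequence[float]]) -> float:
    """Doeblin's coefficient: sum over columns of the column minimum."""
    s = len(P)
    return sum(min(P[i][j] for i in range(s)) for j in range(s))


# Hypothetical two-state chain; state 1 plays the role of the subset T.
P = [[0.9, 0.1],
     [0.5, 0.5]]
dist = sojourn_distribution(P, init=[1.0, 0.0], target={1}, n=2)
alpha = doeblin_coefficient(P)
print(dist, alpha)
```

The dynamic program costs on the order of \(n^2 s^2\) operations for \(s\) states, which illustrates why exact calculation becomes impractical for the large state spaces and long sequences mentioned above.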

## Keywords

Additive functionals of Markov chains · Embedding technique · Ergodicity coefficients · Motifs · Non-homogeneous Markov chains · Non-asymptotic approximations of distributions · Patterns · Sojourn-times · Wolfgang Doeblin

## Mathematics Subject Classification (2000)

Primary 60J22 · 62E17 · 62L20 · 65C40; Secondary 92D20 · 60J05

## Notes

### Acknowledgments

We are very thankful to an anonymous referee who motivated us to seek connections with position-specific scoring matrices that expanded the scope of our methods.
