Approximation of sojourn-times via maximal couplings: motif frequency distributions

Lladser, Manuel E.; Chestnut, Stephen R.

doi:10.1007/s00285-013-0690-6

Published: 06 June 2013

Journal of Mathematical Biology, volume 69, pages 147–182 (2014)

Manuel E. Lladser & Stephen R. Chestnut

Abstract

Sojourn-times provide a versatile framework to assess the statistical significance of motifs in genome-wide searches, even under non-Markovian background models. However, the large state spaces encountered in genomic sequence analyses make the exact calculation of sojourn-time distributions computationally intractable in long sequences. Here, we use coupling and analytic combinatoric techniques to approximate these distributions in the general setting of Polish state spaces, which encompass discrete state spaces. Our approximations are accompanied by explicit, easy-to-compute error bounds in total variation distance. Broadly speaking, if ${\mathsf{T}}_n$ is the random number of times a Markov chain visits a certain subset ${\mathsf{T}}$ of states in its first $n$ transitions, then we can usually approximate the distribution of ${\mathsf{T}}_n$ for $n$ of order $(1-\alpha )^{-m}$, where $m$ is the largest integer for which the exact distribution of ${\mathsf{T}}_m$ is accessible and $0\le \alpha \le 1$ is an ergodicity coefficient associated with the probability transition kernel of the chain. This gives access to approximations of sojourn-times in the intermediate regime where $n$ is perhaps too large for exact calculations, but too small to rely on Normal approximations or on the stationarity assumptions underlying Poisson and compound Poisson approximations. As a proof of concept, we approximate the distribution of the number of matches with a motif in promoter regions of C. elegans. Mathematical properties of the proposed ergodicity coefficients, and connections with additive functionals of homogeneous Markov chains as well as with ergodicity of non-homogeneous Markov chains, are also explored.

References

Aldous DJ (1989) Probability approximations via the Poisson clumping heuristic, Applied mathematical sciences, vol 77. Springer, New York
Aldous DJ, Diaconis P (1987) Strong uniform times and finite random walks. Adv Appl Math 8:69–97
Arratia R, Goldstein L, Gordon L (1990) Poisson approximation and the Chen–Stein method. Stat Sci 5(4):403–424
Aston JAD, Martin DEK (2005) Waiting time distributions of competing patterns in higher-order Markovian sequences. J Appl Probab 42(4):977–988
Athreya KB, Ney P (1978) A new approach to the limit theory of recurrent Markov chains. Trans Am Math Soc 245:493–501
Barbour AD, Holst L, Janson S (1992) Poisson approximation, 1st edn. Oxford University Press, Oxford
Bender EA, Kochman F (1993) The distribution of subword counts is usually Normal. Eur J Comb 14(4):265–275
Biggins JD, Cannings C (1987) Markov renewal processes, counters and repeated sequences in Markov chains. Adv Appl Probab 19:521–545
Chestnut S (2010) Approximating Markov chain occupancy distributions, Master’s thesis. University of Colorado, USA
Chestnut S, Lladser ME (2010) Occupancy distributions via Doeblin’s ergodicity coefficient. In: Discrete Mathematics and Theoretical Computer Science Proceedings, vol AM, pp 79–92
Corcoran JN, Tweedie RL (2001) Perfect sampling of ergodic Harris chains. Ann Appl Probab 11(2):438–451
Diaconis P, Fill JA (1990) Strong stationary times via a new form of duality. Ann Probab 18(4):1483–1522
Dobrushin RL (1956a) Central limit theorem for nonstationary Markov chains. I. Theory Probab Appl 1(1):65–79
Dobrushin RL (1956b) Central limit theorem for nonstationary Markov chains. II. Theory Probab Appl 1(4):329–383
Doeblin W (1937) Le cas discontinu des probabilités en chaîne. Publ Fac Sci Univ Masaryk (Brno) 236:1–13
Durbin R, Eddy SR, Krogh A, Mitchison G (2004) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge
Durrett R (1999) Essentials of stochastic processes, 1st edn. Springer, Berlin
Erhardsson T (1999) Compound Poisson approximation for Markov chains using Stein’s method. Ann Probab 27:565–596
Flajolet P, Sedgewick R (2009) Analytic combinatorics, 1st edn. Cambridge University Press, Cambridge
Flames N, Hobert O (2009) Gene regulatory logic of dopamine neuron differentiation. Nature 458(7240):885–889
Fu JC, Koutras MV (1994) Distribution theory of runs: a Markov chain approach. J Am Stat Assoc 89(427):1050–1058
Fu JC, Lou WYW (2003) Distribution theory of runs and patterns and its applications. A finite Markov chain imbedding approach. World Scientific Publishing Co. Inc., Singapore
Gerber HU, Li S-YR (1981) The occurrence of sequence patterns in repeated experiments and hitting times in a Markov chain. Stoch Proc Appl 11(1):101–108
Geyer CJ (1992) Practical Markov chain Monte Carlo. Stat Sci 7(4):473–483
Hajnal J (1958) Weak ergodicity in nonhomogeneous Markov chains. Proc Camb Philos Soc 54:233–246
Huang H, Kao MC, Zhou X, Liu JS, Wong WH (2004) Determination of local statistical significance of patterns in Markov sequences with application to promoter element identification. J Comput Biol 11(1):1–14
Kato T (1980) Perturbation theory for linear operators. Classics in Mathematics. Springer, New York
Kennedy R, Lladser ME, Yarus M, Knight R (2008) Information, probability, and the abundance of the simplest RNA active sites. Front Biosci 13:6060–6071
Lindvall T (2002) Lectures on the coupling method. Dover, New York
Lladser ME (2007) Minimal Markov chain embeddings of pattern problems. In: Proceedings of the 2007 Information Theory and Applications Workshop. University of California, San Diego
Lladser ME (2008) Markovian embeddings of general random strings. In: Proceedings of the Fifth Workshop on Analytic Algorithmics and Combinatorics. SIAM, San Francisco, pp 183–190
Lladser ME, Betterton MD, Knight R (2008) Multiple pattern matching: a Markov chain approach. J Math Biol 56:51–92
Marschall T (2011) Construction of minimal deterministic finite automata from biological motifs. Theor Comput Sci 412(8–10):922–930
Martin DEK (2005) Distribution of the number of successes in success runs of length at least k in higher-order Markovian sequences. Methodol Comput Appl 7(4):543–554
Maxwell M, Woodroofe M (2000) Central limit theorems for additive functionals of Markov chains. Ann Probab 28(2):713–724
Meyn SP, Tweedie RL (1993) Markov chains and stochastic stability. Springer, Berlin
Møller J (1999) Perfect simulation of conditionally specified models. J R Stat Soc B 61(1):251–264
Murdoch DJ, Green PJ (1998) Exact sampling from a continuous state space. Scand J Stat 25(3):483–502
Nicodème P (2003) Regexpcount, a symbolic package for counting problems on regular expressions and words. Fund Inf 56(1–2):71–88
Nicodème P, Salvy B, Flajolet P (2002) Motif statistics. Theor Comput Sci 287(2):593–617
Pollard D (2002) A user’s guide to measure theoretic probability. Statistical and Probabilistic Mathematics, Cambridge
Reinert G, Schbath S (1998) Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains. J Comput Biol 5(2):223–253
Roberts GO, Rosenthal JS (2004) General state space Markov chains and MCMC algorithms. Probab Sur 1:20–71
Roquain E, Schbath S (2007) Improved compound Poisson approximation for the number of occurrences of any rare word family in a stationary Markov chain. Adv Appl Probab 39(1):128–140
Rubino G, Sericola B (1989) Sojourn times in finite Markov processes. J Appl Probab 26(4):744–756
Seneta E (1973a) Non-negative matrices, 1st edn. Wiley, New York
Seneta E (1973b) On the historical development of the theory of finite inhomogeneous Markov chains. Proc Camb Philos Soc 74:507–513
Spitzer NC (2009) Neuroscience: a bar code for differentiation. Nature 458(7240):843–844
Thorisson H (2000) Coupling, stationarity, and regeneration. Springer, New York
Tuerk C, Gold L (1990) Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase. Science 249:505–510
Online database WormBase (2010) http://www.wormbase.org (June 2010, Release WS214)

Acknowledgments

We are very thankful to an anonymous referee who motivated us to seek connections with position-specific scoring matrices that expanded the scope of our methods.

Author information

Stephen R. Chestnut
Present address: Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218-2682, USA

Authors and Affiliations

Department of Applied Mathematics, University of Colorado, Boulder, CO, 80309-0526, USA
Manuel E. Lladser

Corresponding author

Correspondence to Manuel E. Lladser.

Additional information

This research has been partially funded by the NSF DMS Grant #0805950.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (xlsx 55 KB)

Supplementary material 2 (xlsx 54 KB)

Supplementary material 3 (xlsx 54 KB)

Appendices

Appendix A: Parameter estimation associated with the DA-motif

Here we describe how we estimated the parameters of the fifth-order models in Sect. 3.2. Since the training sequences (i.e. promoter regions) are short in comparison to the size of the model (e.g. the dat-1 promoter region contains only 417 base-pairs but a fifth-order Markov model on the DNA-alphabet has 1,024 possible prefixes of length $5$), we applied the following incremental approach over each promoter region. Letting the index $\ell$ denote the memory order under consideration, first estimate the four parameters associated with $\ell=0$, i.e. the memoryless model. Subsequently, for each $1\le \ell \le 5$, estimate the parameters of the order-$\ell$ model as follows. For each prefix of length $\ell$, say $w_1\cdots w_\ell$ with $w_i\in \{\text{A}, \text{C}, \text{G}, \text{T}\}$, compute the relative frequencies of the words $w_1\cdots w_\ell \text{A}$, $w_1\cdots w_\ell \text{C}$, $w_1\cdots w_\ell \text{G}$ and $w_1\cdots w_\ell \text{T}$ in the training sequence. If the prefix $w_1\cdots w_\ell$ does not appear, or if any of the relative frequencies is equal to 1 (i.e. we would estimate transitions from $w_1\cdots w_\ell$ to be deterministic), then assign to $w_1\cdots w_\ell$ the transition probabilities of $w_2\cdots w_\ell$ from the model of order $(\ell -1)$. Otherwise, take those relative frequencies to be the transition probabilities.
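The incremental estimation described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function name `estimate_backoff_models`, the dictionary layout, and the choice to treat a prefix that occurs only at the very end of the sequence (so that it has no following letter) as "not appearing" are all assumptions made here.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def estimate_backoff_models(seq, max_order=5):
    """Return models[l], mapping each length-l prefix to next-letter probabilities.

    Hypothetical helper: prefixes that are unseen, or whose observed
    transitions are deterministic, inherit the probabilities of their
    length-(l-1) suffix from the model of order l-1.
    """
    # Order 0: plain letter frequencies (the memoryless model).
    letters = Counter(c for c in seq if c in ALPHABET)
    total = sum(letters.values())
    models = [{"": {a: letters[a] / total for a in ALPHABET}}]

    for order in range(1, max_order + 1):
        # Count every word of length order + 1 in the training sequence.
        words = Counter(seq[i:i + order + 1] for i in range(len(seq) - order))
        model = {}
        for letters_of_prefix in product(ALPHABET, repeat=order):
            prefix = "".join(letters_of_prefix)
            follow = [words[prefix + a] for a in ALPHABET]
            n = sum(follow)
            if n == 0 or max(follow) == n:
                # Unseen prefix or deterministic estimate: back off.
                model[prefix] = models[order - 1][prefix[1:]]
            else:
                model[prefix] = {a: c / n for a, c in zip(ALPHABET, follow)}
        models.append(model)
    return models
```

For instance, with the toy sequence `"AACAAG"` the prefix `C` is only ever followed by `A`, so the order-1 model assigns it the memoryless probabilities instead of a deterministic transition.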

Appendix B: A heuristic approach to choosing $m$ and $k$

It is not obvious for general Markov chains how one should choose the parameters $k$ and $m$ in Theorem 3 in order to minimize computation. The problem is difficult because the amount of work required to compute the approximation depends on the parameter $\alpha _k(p)$, but $p_{u}^k$ is expensive to compute for large $k$. We suggest an approach that chooses $k$ and $m$ to satisfy a given error bound.

First, suppose $k$ is fixed and we would like to choose $m$ to guarantee the total variation distance is below a given bound, $0<\epsilon <1$. One may compute a value of $m$ to satisfy $\mathbb{P }(L_{(n\,\mathrm{div }\,k)}> m)\le \epsilon $ exactly by recursion, or one may use an approximation.
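One way to carry out the exact recursion is a small dynamic program over the length of the current run. The sketch below assumes that the trials are independent and that each trial extends a run with probability $q=1-\alpha$; this is our reading of the bound, and the function names `prob_longest_run_exceeds` and `smallest_m`, as well as the convention $m\ge 1$, are hypothetical.

```python
def prob_longest_run_exceeds(n, m, q):
    """P(longest run in n independent trials is longer than m).

    q is the probability that a single trial extends the current run
    (assumed here to be 1 - alpha).
    """
    if m >= n:
        return 0.0
    # state[j] = P(no run longer than m so far, and the trailing run has length j)
    state = [0.0] * (m + 1)
    state[0] = 1.0
    for _ in range(n):
        alive = sum(state)
        new_state = [0.0] * (m + 1)
        new_state[0] = alive * (1.0 - q)      # the run is broken
        for j in range(m):
            new_state[j + 1] = state[j] * q   # the run grows by one
        state = new_state                     # mass leaving length m is dropped
    return 1.0 - sum(state)


def smallest_m(n, alpha, eps):
    """Smallest m >= 1 with P(longest run in n trials > m) <= eps."""
    for m in range(1, n + 1):
        if prob_longest_run_exceeds(n, m, 1.0 - alpha) <= eps:
            return m
    return n
```

Here `n` stands for the number of trials, i.e. $n\,\mathrm{div}\,k$ in the notation above. The linear scan over `m` is kept for clarity; since the tail probability is non-increasing in `m`, a bisection would also work.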

Arratia et al. (1990) give an extreme value approximation to the distribution of the longest run of heads in a Bernoulli sequence of length $n$. For our application, the total variation distance between our approximation and the actual distribution satisfies

$$\begin{aligned} d_{TV}\le \mathbb{P}({\mathsf{L}}_n\ge m+1) = 1-\mathbb{P}({\mathsf{L}}_n<m+1) \approx 1-e^{-(1-\alpha )^{-t}}, \end{aligned}$$

where $t=\log _{1/(1-\alpha )}((n-1)\alpha +1)-m-1$, so that $(1-\alpha )^{-t}=((n-1)\alpha +1)(1-\alpha )^{m+1}$. The right-hand side decreases as $m$ grows, and setting it equal to $\epsilon$ gives $(1-\alpha )^{-t}=\log (1/(1-\epsilon ))$. Consequently, given $\alpha $, we should choose

$$\begin{aligned} m = \lceil \log _{1/(1-\alpha )}((n-1)\alpha +1) + \log \log (1/(1-\epsilon ))/\log (1-\alpha )-1\rceil . \end{aligned} \tag{66}$$

Now we turn to choosing $k$ and $m$, jointly. The algorithm will attempt to minimize computation based on a few observations. First, the quantity $k\cdot m$ is the largest number of consecutive transitions of the Markov chain considered by the approximating random variable. Our heuristic approach is to search for the first (in $k$) minimum of this quantity. The minimum must exist since for $k>n/2$ we always have $m=1$.

Another observation is that if $\alpha _k(p)$ increases with $k$, the remaining computational effort of applying the approximation with $p_{u}^k$ is lower than with $p_{u}^{k-1}$. Finally, $k\cdot m<n$ (ideally $k\cdot m\ll n$) is required for our approximation to be an improvement over the transfer matrix method. In the case that the algorithm we present below terminates with $k\cdot m\approx n$, the transfer matrix method can be applied with $p_{u}^k$ and the computation is not wasted. The program below increments $k$ until the first minimum is found. The final loop to recalculate $\mathbb{P}(L_{(n \,\mathrm{div }\,k)} > m)$ exactly via recursion is only necessary if one uses an approximation to find $m$.
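A hedged Python sketch of such a search is given next. It is an illustration under stated assumptions, not the authors' program: the callable `alpha_of_k`, which is expected to return $\alpha _k(p)$ for a given $k$, is a placeholder for whatever routine computes the ergodicity coefficient, and $m$ is found directly by the exact run-length recursion (with independent trials that extend a run with probability $1-\alpha _k(p)$), so no final recalculation step is needed.

```python
def run_tail(trials, m, q):
    """P(longest run in `trials` independent trials exceeds m); q extends a run."""
    if m >= trials:
        return 0.0
    state = [1.0] + [0.0] * m   # state[j]: trailing run has length j, no run > m yet
    for _ in range(trials):
        state = [sum(state) * (1.0 - q)] + [p * q for p in state[:-1]]
    return 1.0 - sum(state)


def smallest_m(trials, alpha, eps):
    """Smallest m >= 1 whose exact tail bound is at most eps."""
    for m in range(1, trials + 1):
        if run_tail(trials, m, 1.0 - alpha) <= eps:
            return m
    return max(trials, 1)


def choose_k_and_m(n, eps, alpha_of_k):
    """Increment k and stop at the first minimum of k * m.

    alpha_of_k is a hypothetical callable returning alpha_k(p) for each k.
    """
    best = None  # (cost, k, m)
    for k in range(1, n + 1):
        m = smallest_m(n // k, alpha_of_k(k), eps)
        cost = k * m
        if best is not None and cost > best[0]:
            break  # k * m started to increase: the previous best is the first minimum
        if best is None or cost < best[0]:
            best = (cost, k, m)
    return best[1], best[2]
```

For example, `choose_k_and_m(200, 0.05, lambda k: 1.0 - 0.5 ** k)` runs the search with a toy coefficient that improves geometrically in `k`. In practice each call to `alpha_of_k` involves $p_{u}^k$ and is the expensive step, which is why the search stops at the first minimum instead of scanning all $k$.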

About this article

Cite this article

Lladser, M.E., Chestnut, S.R. Approximation of sojourn-times via maximal couplings: motif frequency distributions. J. Math. Biol. 69, 147–182 (2014). https://doi.org/10.1007/s00285-013-0690-6

Received: 21 February 2013
Revised: 11 May 2013
Published: 06 June 2013
Issue Date: July 2014
DOI: https://doi.org/10.1007/s00285-013-0690-6
