Abstract
The most commonly used models for analysing local dependencies in DNA sequences are (high-order) Markov chains. Incorporating knowledge about possible groupings of the nucleotides makes it possible to define dedicated sub-classes of Markov chains. The problem of formulating lumpability hypotheses for a Markov chain is therefore addressed. In the classical approach to lumpability, this problem can be formulated as the determination of an appropriate state space (smaller than the original state space) such that the lumped chain defined on this state space retains the Markov property. We propose a different perspective on lumpability where the state space is fixed and the partitioning of this state space is represented by a one-to-many probabilistic function within a two-level stochastic process. Three nested classes of lumped processes can be defined in this way as sub-classes of first-order Markov chains. These lumped processes enable parsimonious reparameterizations of Markov chains that help to reveal relevant partitions of the state space. Characterizations of the lumped processes on the original transition probability matrix are derived. Different model selection methods, relying either on hypothesis testing or on penalized log-likelihood criteria, are presented, as well as extensions to lumped processes constructed from high-order Markov chains. The relevance of the proposed approach to lumpability is illustrated by the analysis of DNA sequences. In particular, the use of lumped processes makes it possible to highlight differences between intronic sequences and gene untranslated region sequences.
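As a point of reference for the classical notion mentioned above, the following minimal sketch checks the standard strong lumpability condition for a first-order chain: a partition is lumpable when, for every block, all states in the same block have the same total probability of moving into that block. It does not implement the two-level lumped processes proposed in the article. The function names, the purine/pyrimidine grouping, and the numerical values of the transition matrix are illustrative assumptions, not data or notation from the article.

```python
# Sketch of the classical strong lumpability check (not the article's method).
# All names and all numerical values below are hypothetical.

STATES = ["A", "C", "G", "T"]

# Hypothetical first-order transition matrix; rows and columns follow STATES.
P = [
    [0.10, 0.20, 0.50, 0.20],  # from A
    [0.20, 0.30, 0.10, 0.40],  # from C
    [0.40, 0.30, 0.20, 0.10],  # from G
    [0.15, 0.50, 0.15, 0.20],  # from T
]

# Assumed grouping of nucleotides: purines {A, G} and pyrimidines {C, T}.
PURINE_PYRIMIDINE = [("A", "G"), ("C", "T")]


def block_sums(matrix, states, partition):
    """Map each state to its transition probabilities into each block."""
    index = {s: i for i, s in enumerate(states)}
    return {
        s: [sum(matrix[index[s]][index[t]] for t in block) for block in partition]
        for s in states
    }


def is_strongly_lumpable(matrix, states, partition, tol=1e-9):
    """True if states within a block share the same block-transition sums."""
    sums = block_sums(matrix, states, partition)
    for block in partition:
        reference = sums[block[0]]
        for s in block[1:]:
            if any(abs(a - b) > tol for a, b in zip(sums[s], reference)):
                return False
    return True


def lumped_matrix(matrix, states, partition):
    """Transition matrix of the lumped chain (valid only if lumpable)."""
    sums = block_sums(matrix, states, partition)
    return [sums[block[0]] for block in partition]
```

With these assumed values, `is_strongly_lumpable(P, STATES, PURINE_PYRIMIDINE)` holds and `lumped_matrix` gives a two-state chain on {purine, pyrimidine}, whereas the grouping `[("A", "C"), ("G", "T")]` fails the check. On estimated transition matrices the equality never holds exactly, which is one reason statistical tests or penalized likelihood criteria are needed instead of a fixed tolerance.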
Cite this article
Guédon, Y., d'Aubenton-Carafa, Y. & Thermes, C. Analysing grouping of nucleotides in DNA sequences using lumped processes constructed from Markov chains. J. Math. Biol. 52, 343–372 (2006). https://doi.org/10.1007/s00285-005-0358-y