Analysing grouping of nucleotides in DNA sequences using lumped processes constructed from Markov chains

Journal of Mathematical Biology

Abstract

The most commonly used models for analysing local dependencies in DNA sequences are (high-order) Markov chains. Incorporating knowledge about possible groupings of the nucleotides makes it possible to define dedicated sub-classes of Markov chains. The problem of formulating lumpability hypotheses for a Markov chain is therefore addressed. In the classical approach to lumpability, this problem can be formulated as the determination of an appropriate state space (smaller than the original state space) such that the lumped chain defined on this state space retains the Markov property. We propose a different perspective on lumpability where the state space is fixed and the partitioning of this state space is represented by a one-to-many probabilistic function within a two-level stochastic process. Three nested classes of lumped processes can be defined in this way as sub-classes of first-order Markov chains. These lumped processes enable parsimonious reparameterizations of Markov chains that help to reveal relevant partitions of the state space. Characterizations of the lumped processes on the original transition probability matrix are derived. Different model selection methods relying either on hypothesis testing or on penalized log-likelihood criteria are presented, as well as extensions to lumped processes constructed from high-order Markov chains. The relevance of the proposed approach to lumpability is illustrated by the analysis of DNA sequences. In particular, the use of lumped processes makes it possible to highlight differences between intronic sequences and gene untranslated region sequences.
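As a rough illustration of the classical notion of lumpability mentioned in the abstract, the sketch below checks the standard strong lumpability condition for a first-order chain on the four nucleotides: a partition is lumpable if, for every pair of blocks, all states in the source block have the same total probability of moving into the target block. It does not implement the two-level lumped processes proposed in the article. The transition matrix, the partitions, and the function names are all invented for this example.

```python
from itertools import product

# Hypothetical nucleotide ordering and transition matrix (rows sum to 1).
# The values are made up for illustration, not estimated from any data.
STATES = ["A", "C", "G", "T"]
P = [
    [0.10, 0.20, 0.50, 0.20],  # from A
    [0.20, 0.40, 0.10, 0.30],  # from C
    [0.30, 0.30, 0.30, 0.10],  # from G
    [0.05, 0.50, 0.25, 0.20],  # from T
]

# Two candidate groupings of the nucleotides (assumed for the example).
PURINE_PYRIMIDINE = [["A", "G"], ["C", "T"]]
OTHER_PARTITION = [["A", "C"], ["G", "T"]]


def block_probability(P, index, state, block):
    """Total probability of moving from `state` into any state of `block`."""
    return sum(P[index[state]][index[target]] for target in block)


def is_strongly_lumpable(P, states, partition, tol=1e-9):
    """Check the classical strong lumpability condition for `partition`.

    For each ordered pair of blocks, every state of the source block must
    have the same probability of jumping into the target block.
    """
    index = {s: i for i, s in enumerate(states)}
    for source, target in product(partition, repeat=2):
        sums = [block_probability(P, index, s, target) for s in source]
        if max(sums) - min(sums) > tol:
            return False
    return True


def lumped_matrix(P, states, partition):
    """Transition matrix of the lumped chain (valid only if lumpable)."""
    index = {s: i for i, s in enumerate(states)}
    return [
        [block_probability(P, index, source[0], target) for target in partition]
        for source in partition
    ]


print(is_strongly_lumpable(P, STATES, PURINE_PYRIMIDINE))
print(is_strongly_lumpable(P, STATES, OTHER_PARTITION))
print(lumped_matrix(P, STATES, PURINE_PYRIMIDINE))
```

In this toy matrix the purine/pyrimidine grouping satisfies the condition while the second grouping does not, so only the first yields a two-state chain that is itself Markovian. With an estimated matrix the equality would only hold approximately, which is why the article turns to hypothesis testing and penalized log-likelihood criteria.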

Author information

Corresponding author

Correspondence to Yves d'Aubenton-Carafa.

About this article

Cite this article

Guédon, Y., d'Aubenton-Carafa, Y. & Thermes, C. Analysing grouping of nucleotides in DNA sequences using lumped processes constructed from Markov chains. J. Math. Biol. 52, 343–372 (2006). https://doi.org/10.1007/s00285-005-0358-y

  • DOI: https://doi.org/10.1007/s00285-005-0358-y
