Inhomogeneous Parsimonious Markov Models

  • Ralf Eggeling
  • André Gohr
  • Pierre-Yves Bourguignon
  • Edgar Wingender
  • Ivo Grosse
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8188)

Abstract

We introduce inhomogeneous parsimonious Markov models for modeling statistical patterns in discrete sequences. These models are based on parsimonious context trees, which generalize context trees, so the models in turn generalize variable order Markov models. We follow a Bayesian approach consisting of structure learning and parameter learning. Structure learning is challenging because the number of possible tree structures is super-exponential, so we describe an exact and efficient dynamic programming algorithm for finding the optimal tree structures.

We apply the model and the learning algorithm to the problem of modeling binding sites of the human transcription factor C/EBP and find improved prediction performance compared to fixed order and variable order Markov models. Investigating the reason for this improvement, we find several instances of context-specific dependencies that parsimonious context trees can capture but traditional context trees cannot.
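
To make the notion of a parsimonious context tree more concrete, the following Python sketch may help. It is not the authors' implementation, and the names `PCTNode`, `is_valid`, and `find_leaf` are hypothetical. It assumes a DNA alphabet, that each node is labeled with a non-empty set of symbols, and that the children of any inner node partition the alphabet; a child labeled with the full alphabet then means that the symbol at that context position is irrelevant. The learning algorithm from the abstract is not shown.

```python
from dataclasses import dataclass, field

ALPHABET = frozenset("ACGT")  # assumed alphabet for this illustration


@dataclass
class PCTNode:
    label: frozenset                      # symbols matched by this node
    children: list = field(default_factory=list)

    def is_valid(self):
        """A leaf is valid; an inner node needs children whose labels
        are non-empty, pairwise disjoint, and cover the alphabet."""
        if not self.children:
            return True
        labels = [c.label for c in self.children]
        union = frozenset().union(*labels)
        disjoint = sum(len(lab) for lab in labels) == len(union)
        return (all(labels) and disjoint and union == ALPHABET
                and all(c.is_valid() for c in self.children))


def find_leaf(root, context):
    """Follow the context from its last (most recent) symbol backwards
    and return the labels along the path to the matching leaf."""
    node, path = root, []
    for symbol in reversed(context):
        if not node.children:
            break
        node = next(c for c in node.children if symbol in c.label)
        path.append("".join(sorted(node.label)))
    return tuple(path)


# Depth-2 example: if the previous symbol is A, the symbol before it is
# split into {A,C} vs {G,T}; otherwise the second position is ignored.
root = PCTNode(ALPHABET, [
    PCTNode(frozenset("A"), [PCTNode(frozenset("AC")),
                             PCTNode(frozenset("GT"))]),
    PCTNode(frozenset("CGT"), [PCTNode(ALPHABET)]),
])

print(find_leaf(root, "GA"))  # ('A', 'GT')
print(find_leaf(root, "TC"))  # ('CGT', 'ACGT')
print(find_leaf(root, "AG"))  # ('CGT', 'ACGT')
```

In this example the contexts TC and AG reach the same leaf and would therefore share one conditional distribution. A traditional context tree, whose branches each correspond to a single symbol, cannot group C, G, and T into one branch in this way.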

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Ralf Eggeling (1)
  • André Gohr (1)
  • Pierre-Yves Bourguignon (2)
  • Edgar Wingender (3)
  • Ivo Grosse (1, 4)
  1. Institute of Computer Science, Martin Luther University, Halle, Germany
  2. Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
  3. Institute of Bioinformatics, University Medical Center Göttingen, Göttingen, Germany
  4. German Center of Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
