Inhomogeneous Parsimonious Markov Models

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 8188)

Abstract

We introduce inhomogeneous parsimonious Markov models for modeling statistical patterns in discrete sequences. These models are based on parsimonious context trees, which generalize context trees, and they therefore generalize variable-order Markov models. We follow a Bayesian approach consisting of structure learning and parameter learning. Structure learning is challenging because the number of possible tree structures grows super-exponentially, so we describe an exact and efficient dynamic programming algorithm for finding the optimal tree structures.
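To make the idea of a parsimonious context tree more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the usual definition in which every node is labeled with a non-empty subset of the alphabet and the children of each inner node partition the alphabet. The names `Node`, `node`, `check_partition`, and `find_leaf` are hypothetical, and reading the context nearest symbol first is also an assumption.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List

ALPHABET = frozenset("ACGT")  # DNA alphabet, chosen for illustration


@dataclass
class Node:
    label: FrozenSet[str]                      # symbols matched by this node
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def node(symbols: str, *children: Node) -> Node:
    """Hypothetical helper to build a node from a string of symbols."""
    return Node(frozenset(symbols), list(children))


def check_partition(n: Node, alphabet: FrozenSet[str] = ALPHABET) -> None:
    """Raise ValueError unless the children of every inner node
    have non-empty, disjoint labels that together cover the alphabet."""
    if n.is_leaf():
        return
    seen = set()
    for child in n.children:
        if not child.label or seen & child.label:
            raise ValueError("child labels must be non-empty and disjoint")
        seen |= child.label
        check_partition(child, alphabet)
    if seen != alphabet:
        raise ValueError("child labels must cover the alphabet")


def find_leaf(root: Node, context: str) -> List[FrozenSet[str]]:
    """Walk from the root to a leaf, consuming one context symbol per level.
    Returns the labels along the path (assumption: nearest symbol first)."""
    current, path = root, []
    for symbol in context:
        if current.is_leaf():
            break
        current = next(c for c in current.children if symbol in c.label)
        path.append(current.label)
    return path


# A depth-2 example tree. The node labeled {C, G} merges two symbols,
# which an ordinary context tree with one symbol per node cannot express.
root = node(
    "ACGT",
    node("A", node("ACGT")),
    node("CG", node("AT"), node("CG")),
    node("T", node("ACGT")),
)
check_partition(root)
```

In this sketch the contexts `"GA"` and `"CT"` both end at the leaf reached via `{C, G}` and then `{A, T}`, so they would share one set of conditional probability parameters. Merging contexts in this way is what allows fewer parameters than a tree with one child per symbol.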

We apply the model and the learning algorithm to the problem of modeling binding sites of the human transcription factor C/EBP, and find improved prediction performance compared to fixed-order and variable-order Markov models. We investigate the reason for this improvement and find several instances of context-specific dependences that can be captured by parsimonious context trees but not by traditional context trees.

Keywords

  • Markov Model
  • Structure Learning
  • Independence Model
  • Context Word
  • Context Tree

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Eggeling, R., Gohr, A., Bourguignon, PY., Wingender, E., Grosse, I. (2013). Inhomogeneous Parsimonious Markov Models. In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2013. Lecture Notes in Computer Science, vol 8188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40988-2_21

  • DOI: https://doi.org/10.1007/978-3-642-40988-2_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40987-5

  • Online ISBN: 978-3-642-40988-2

  • eBook Packages: Computer Science, Computer Science (R0)