Fast and Adaptive Variable Order Markov Chain Construction

  • Marcel H. Schulz
  • David Weese
  • Tobias Rausch
  • Andreas Döring
  • Knut Reinert
  • Martin Vingron
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5251)

Abstract

Variable order Markov chains (VOMCs) are a flexible class of models that extend the well-known Markov chains. They have been applied to a variety of problems in computational biology, e.g. protein family classification. A linear time and space construction algorithm was published in 2000 by Apostolico and Bejerano. However, neither a report of its actual running time nor an implementation has been published since. In this paper we use the lazy suffix tree and the enhanced suffix array to improve upon the algorithm of Apostolico and Bejerano. We introduce new software that is orders of magnitude faster than current tools for building VOMCs and is suitable for large-scale sequence analysis.

References

  1. Rissanen, J.: A universal data compression system. IEEE Transactions on Information Theory 29, 656–664 (1983)
  2. Ron, D., Singer, Y., Tishby, N.: The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning 25, 117–149 (1996)
  3. Ben-Gal, I., Shani, A., Gohr, A., Grau, J., Arviv, S., Shmilovici, A., Posch, S., Grosse, I.: Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics 21(11), 2657–2666 (2005)
  4. Zhao, X., Huang, H., Speed, T.P.: Finding short DNA motifs using permuted Markov models. J. Comput. Biol. 12(6), 894–906 (2005)
  5. Ogul, H., Mumcuoglu, E.U.: SVM-based detection of distant protein structural relationships using pairwise probabilistic suffix trees. Comput. Biol. Chem. 30(4), 292–299 (2006)
  6. Dalevi, D., Dubhashi, D., Hermansson, M.: Bayesian classifiers for detecting HGT using fixed and variable order Markov models of genomic signatures. Bioinformatics 22(5), 517–522 (2006)
  7. Bejerano, G., Seldin, Y., Margalit, H., Tishby, N.: Markovian domain fingerprinting: statistical segmentation of protein sequences. Bioinformatics 17(10), 927–934 (2001)
  8. Slonim, N., Bejerano, G., Fine, S., Tishby, N.: Discriminative feature selection via multiclass variable memory Markov model. EURASIP J. Appl. Signal Process 2003(1), 93–102 (2003)
  9. Bejerano, G., Yona, G.: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 17(1), 23–43 (2001)
  10. Posch, S., Grau, J., Gohr, A., Ben-Gal, I., Kel, A.E., Grosse, I.: Recognition of cis-regulatory elements with vombat. J. Bioinform. Comput. Biol. 5(2B), 561–577 (2007)
  11. Apostolico, A., Bejerano, G.: Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. J. Comput. Biol. 7(3-4), 381–393 (2000)
  12. Bejerano, G.: Algorithms for variable length Markov chain modeling. Bioinformatics 20(5), 788–789 (2004)
  13. Leonardi, F.G.: A generalization of the PST algorithm: modeling the sparse nature of protein sequences. Bioinformatics 22(11), 1302–1307 (2006)
  14. Kurtz, S.: Reducing the space requirement of suffix trees. Software Pract. Exper. 29(13), 1149–1171 (1999)
  15. Giegerich, R., Kurtz, S., Stoye, J.: Efficient implementation of lazy suffix trees. Software Pract. Exper. 33(11), 1035–1049 (2003)
  16. Manber, U., Myers, E.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
  17. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), 20 (2007)
  18. Abouelhoda, M., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 53–86 (2004)
  19. Bühlmann, P., Wyner, A.J.: Variable length Markov chains. Ann. Statist. 27(2), 480–513 (1999)
  20. Maaß, M.G.: Computing suffix links for suffix trees and arrays. Inf. Process. Lett. 101(6), 250–254 (2007)
  21. Manzini, G., Ferragina, P.: Engineering a lightweight suffix array construction algorithm. Algorithmica 40(1), 33–50 (2004)
  22. Giegerich, R., Kurtz, S.: A comparison of imperative and purely functional suffix tree constructions. Sci. Comput. Program. 25, 187–218 (1995)
  23. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L.: GenBank. Nucleic Acids Res. 36(Database issue), D25–D30 (2008)
  24. Fitzgerald, P.C., Sturgill, D., Shyakhtenko, A., Oliver, B., Vinson, C.: Comparative genomics of Drosophila and human core promoters. Genome Biol. 7, R53 (2006)
  25. The UniProt Consortium: The Universal Protein Resource (UniProt). Nucl. Acids Res. 36(suppl.1), D190–195 (2008)
  26. Döring, A., Weese, D., Rausch, T., Reinert, K.: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics 9, 11 (2008)
  27. Schulz, M.H., Bauer, S., Robinson, P.N.: The generalised k-Truncated Suffix Tree for time- and space- efficient searches in multiple DNA or protein sequences. Int. J. Bioinform. Res. Appl. 4(1), 81–95 (2008)

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Marcel H. Schulz (1, 3)
  • David Weese (2)
  • Tobias Rausch (2, 3)
  • Andreas Döring (2)
  • Knut Reinert (2)
  • Martin Vingron (1)

  1. Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
  2. Department of Computer Science, Free University of Berlin, Berlin, Germany
  3. International Max Planck Research School for Computational Biology and Scientific Computing
