Speeding Up Exact Motif Discovery by Bounding the Expected Clump Size

  • Tobias Marschall
  • Sven Rahmann
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6293)

Abstract

The overlapping structure of complex patterns, such as IUPAC motifs, significantly affects their statistical properties and should be taken into account in motif discovery algorithms. The contribution of this paper is twofold. On the one hand, we give surprisingly simple formulas for the expected size and weight of motif clumps (maximal overlapping sets of motif matches in a text). In contrast to previous results, we show that these expected values can be computed without matrix inversions. On the other hand, we show how these results can be algorithmically exploited to improve an exact motif discovery algorithm. First, the algorithm can be efficiently generalized to arbitrary finite-memory text models, whereas it was previously limited to i.i.d. texts. Second, we achieve a speed-up of up to a factor of 135. Our open-source (GPL) implementation is available at http://www.rahmannlab.de/software.
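The clump notion used in the abstract can be made concrete with a small sketch. The code below is an illustration only and is not the algorithm from the paper: it scans a text naively for matches of an IUPAC motif, groups overlapping matches into clumps, and reports the empirical mean clump size on that one text. It assumes that "size" means the number of matches in a clump. The function names (`match_positions`, `clumps`, `mean_clump_size`) are hypothetical. The paper's formulas concern the expected value under a text model, which this sketch does not compute.

```python
# Illustrative sketch only; NOT the method from the paper.
# Assumption: clump size = number of motif matches in the clump.

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_positions(motif, text):
    """Return all start positions where the IUPAC motif matches the text."""
    m = len(motif)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(text[i + j] in IUPAC[motif[j]] for j in range(m))
    ]


def clumps(motif, text):
    """Group match positions into maximal sets of overlapping matches."""
    m = len(motif)
    result = []
    for pos in match_positions(motif, text):
        # Two matches of a length-m motif overlap if their starts differ by < m.
        if result and pos - result[-1][-1] < m:
            result[-1].append(pos)
        else:
            result.append([pos])
    return result


def mean_clump_size(motif, text):
    """Empirical mean number of matches per clump (0.0 if there are no matches)."""
    groups = clumps(motif, text)
    return sum(len(g) for g in groups) / len(groups) if groups else 0.0
```

For example, `clumps("ANA", "ACAGATTTAAA")` returns `[[0, 2], [8]]`: the matches at positions 0 and 2 overlap and form one clump of size 2, while the match at position 8 forms a clump of size 1, giving a mean clump size of 1.5.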

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Tobias Marschall (1)
  • Sven Rahmann (1)
  1. Bioinformatics for High-Throughput Technologies, at the Chair of Algorithm Engineering, Computer Science Department, TU Dortmund, Dortmund, Germany
