An Upper Bound on the Hardness of Exact Matrix Based Motif Discovery

  • Conference paper
Combinatorial Pattern Matching (CPM 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3537)

Abstract

Motif discovery is the problem of finding local patterns, or motifs, in a set of unlabeled sequences. One common representation of a motif is a Markov model known as a score matrix. Matrix based motif discovery has been extensively studied, but no positive results have been known regarding its theoretical hardness. We present the first non-trivial upper bound on the complexity (worst-case computation time) of this problem. Other than linear terms, our bound depends only on the motif width w (which is typically 5 to 20) and is a dramatic improvement relative to previously known bounds.
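To make the score matrix representation concrete, the following is a minimal sketch of how a width-w matrix assigns a score to a candidate site, assuming the usual additive form in which each position contributes one entry. The matrix values, the names `SCORE_MATRIX`, `score`, and `best_site`, and the example sequence are all hypothetical and are not taken from the paper. The sketch only scores sites against a given matrix; it does not perform motif discovery, which must also search over matrices.

```python
from typing import Dict, List

# Hypothetical width-3 score matrix over the DNA alphabet (sigma = 4).
# Each column maps a character to its score at that motif position.
SCORE_MATRIX: List[Dict[str, float]] = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": 0.0},
    {"A": -1.0, "C": 2.0, "G": 0.0, "T": -1.0},
    {"A": 0.0, "C": -1.0, "G": 1.5, "T": -1.0},
]


def score(matrix: List[Dict[str, float]], site: str) -> float:
    """Sum the per-position scores of a site whose length equals the matrix width."""
    if len(site) != len(matrix):
        raise ValueError("site length must equal the motif width w")
    return sum(column[ch] for column, ch in zip(matrix, site))


def best_site(matrix: List[Dict[str, float]], sequence: str) -> str:
    """Return the highest-scoring length-w window of the sequence."""
    w = len(matrix)
    windows = [sequence[i:i + w] for i in range(len(sequence) - w + 1)]
    return max(windows, key=lambda s: score(matrix, s))


if __name__ == "__main__":
    print(score(SCORE_MATRIX, "ACG"))
    print(best_site(SCORE_MATRIX, "TTACGTT"))
```

Because the score is a sum over positions, a fixed matrix ranks all strings of length w, which is one way to see why orderings of those strings are a natural object to count.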

We prove this bound by relating the motif discovery problem to a search problem over permutations of strings of length w, in which the permutations have a particular property. We give a constructive proof of an upper bound on the number of such permutations. For an alphabet size of σ (typically 4) the trivial bound is \(n! \approx (\frac{n}{e})^n\), where \(n = \sigma^w\). Our bound is roughly \(n(\sigma \log_\sigma n)^n\).
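The gap between the two bounds is easier to see numerically. The sketch below compares them on a log10 scale, reading the improved bound as n(σ log_σ n)^n and using log_σ n = w, so that the bound becomes n(σw)^n. The function names are hypothetical, and since the abstract gives the bound only as "roughly" this expression, the numbers are illustrative rather than exact counts.

```python
import math


def log10_trivial_bound(sigma: int, w: int) -> float:
    """log10 of n!, with n = sigma**w, computed via the log-gamma function."""
    n = sigma ** w
    return math.lgamma(n + 1) / math.log(10)


def log10_improved_bound(sigma: int, w: int) -> float:
    """log10 of n * (sigma * log_sigma(n))**n, using log_sigma(n) = w."""
    n = sigma ** w
    return math.log10(n) + n * math.log10(sigma * w)


if __name__ == "__main__":
    sigma = 4  # DNA alphabet
    for w in range(3, 9):
        trivial = log10_trivial_bound(sigma, w)
        improved = log10_improved_bound(sigma, w)
        print(f"w={w}  n={sigma ** w}  "
              f"log10(n!)={trivial:.1f}  log10(bound)={improved:.1f}")
```

Working in logarithms avoids overflow, since both quantities quickly exceed the range of floating-point numbers. Under this reading, the improved expression is smaller than n! once n/e exceeds σw, which for σ = 4 holds from w = 3 onward.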

We relate this theoretical result to the exact motif discovery program TsukubaBB, whose algorithm contains ideas that inspired the result. We describe a recent improvement to the TsukubaBB program that can give a speedup of a factor of nine or more, and we use a dataset of REB1 transcription factor binding sites to illustrate that exact methods can indeed be used in some practical situations.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Horton, P., Fujibuchi, W. (2005). An Upper Bound on the Hardness of Exact Matrix Based Motif Discovery. In: Apostolico, A., Crochemore, M., Park, K. (eds) Combinatorial Pattern Matching. CPM 2005. Lecture Notes in Computer Science, vol 3537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11496656_19

  • DOI: https://doi.org/10.1007/11496656_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26201-5

  • Online ISBN: 978-3-540-31562-9

  • eBook Packages: Computer Science, Computer Science (R0)
