An Upper Bound on the Hardness of Exact Matrix Based Motif Discovery

Horton, Paul; Fujibuchi, Wataru

doi:10.1007/11496656_19

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3537)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

Motif discovery is the problem of finding local patterns or motifs from a set of unlabeled sequences. One common representation of a motif is a Markov model known as a score matrix. Matrix based motif discovery has been extensively studied but no positive results have been known regarding its theoretical hardness. We present the first non-trivial upper bound on the complexity (worst-case computation time) of this problem. Other than linear terms, our bound depends only on the motif width w (which is typically 5-20) and is a dramatic improvement relative to previously known bounds.

We prove this bound by relating the motif discovery problem to a search problem over permutations of strings of length w, in which the permutations have a particular property. We give a constructive proof of an upper bound on the number of such permutations. For an alphabet size of σ (typically 4) the trivial bound is \(n! \approx \left(\frac{n}{e}\right)^{n}\), where \(n = \sigma^{w}\). Our bound is roughly \(n(\sigma \log_{\sigma} n)^{n}\).
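To get a feel for the sizes involved, the two expressions can be compared on a log scale. The following is only an illustrative sketch: it takes both formulas exactly as printed in this abstract, the function names are hypothetical, and nothing here comes from the paper or from TsukubaBB. It uses the fact that n = σ^w implies log_σ n = w.

```python
import math


def log10_trivial_bound(sigma: int, w: int) -> float:
    """log10 of n!, with n = sigma**w, computed via the log-gamma function."""
    n = sigma ** w
    # lgamma(n + 1) is ln(n!), so dividing by ln(10) converts to base 10.
    return math.lgamma(n + 1) / math.log(10)


def log10_printed_bound(sigma: int, w: int) -> float:
    """log10 of n * (sigma * log_sigma(n))**n, as printed in the abstract.

    Because n = sigma**w, log_sigma(n) is exactly w, so no floating-point
    logarithm to base sigma is needed.
    """
    n = sigma ** w
    return math.log10(n) + n * math.log10(sigma * w)


if __name__ == "__main__":
    sigma = 4  # DNA alphabet size, the "typical" value named in the abstract
    for w in (5, 10, 15, 20):
        trivial = log10_trivial_bound(sigma, w)
        printed = log10_printed_bound(sigma, w)
        print(f"w={w:2d}  log10(n!)={trivial:.4g}  log10(bound)={printed:.4g}")
```

For σ = 4 and w = 5 (n = 1024) this gives roughly 10^1335 for the printed expression against roughly 10^2640 for n!. Working with logarithms is a deliberate choice, since the raw values overflow floating point even at the smallest typical width.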

We relate this theoretical result to the exact motif discovery program TsukubaBB, whose algorithm contains the ideas that inspired the result. We describe a recent improvement to TsukubaBB that can give a speedup of a factor of nine or more, and we use a dataset of REB1 transcription factor binding sites to illustrate that exact methods can indeed be used in some practical situations.

Author information

Authors and Affiliations

Computational Biology Research Center, National Institute of Advanced Industrial Science, Japan
Paul Horton & Wataru Fujibuchi

Editor information

Editors and Affiliations

Georgia Institute of Technology and Università di Padova
Alberto Apostolico
Université Paris-Est, France
Maxime Crochemore
School of Computer Science and Engineering, Seoul National University, 151-742, Seoul, Korea
Kunsoo Park

Cite this paper

Horton, P., Fujibuchi, W. (2005). An Upper Bound on the Hardness of Exact Matrix Based Motif Discovery. In: Apostolico, A., Crochemore, M., Park, K. (eds) Combinatorial Pattern Matching. CPM 2005. Lecture Notes in Computer Science, vol 3537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11496656_19

DOI: https://doi.org/10.1007/11496656_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26201-5
Online ISBN: 978-3-540-31562-9
eBook Packages: Computer Science, Computer Science (R0)