Abstract
Motif discovery is the problem of finding local patterns, or motifs, in a set of unlabeled sequences. One common representation of a motif is a Markov model known as a score matrix. Matrix-based motif discovery has been studied extensively, but no positive results were previously known regarding its theoretical hardness. We present the first non-trivial upper bound on the complexity (worst-case computation time) of this problem. Other than linear terms, our bound depends only on the motif width w (typically 5 to 20) and is a dramatic improvement over previously known bounds.
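To make the score matrix representation concrete, here is a minimal sketch of how such a matrix can score a window of a sequence. It assumes the common additive convention (one score per character per position, summed over the window); the names `ScoreMatrix`, `window_score`, and `best_window` and the toy values are illustrative and are not taken from the paper or from TsukubaBB.

```python
from typing import Dict, List, Tuple

# Assumption: a score matrix is a list of w columns, each mapping a
# character of the alphabet to a score for that motif position.
ScoreMatrix = List[Dict[str, float]]


def window_score(matrix: ScoreMatrix, window: str) -> float:
    """Sum the per-position scores of a window of length w."""
    return sum(column[ch] for column, ch in zip(matrix, window))


def best_window(matrix: ScoreMatrix, seq: str) -> Tuple[int, float]:
    """Return (start, score) of the highest-scoring window in seq.

    Ties are broken in favour of the leftmost window.
    """
    w = len(matrix)
    if len(seq) < w:
        raise ValueError("sequence is shorter than the motif width")
    best_pos, best = 0, window_score(matrix, seq[:w])
    for i in range(1, len(seq) - w + 1):
        s = window_score(matrix, seq[i:i + w])
        if s > best:
            best_pos, best = i, s
    return best_pos, best


# Toy matrix of width 3 that rewards the string "ACG" (hypothetical values).
toy_matrix: ScoreMatrix = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]

if __name__ == "__main__":
    print(best_window(toy_matrix, "TTACGTT"))
```

Motif discovery is the harder inverse task: the matrix is not given and must be found together with the windows it selects from each sequence.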
We prove this bound by relating the motif discovery problem to a search problem over permutations of strings of length w, in which the permutations have a particular property. We give a constructive proof of an upper bound on the number of such permutations. For an alphabet size of σ (typically 4) the trivial bound is \(n! \approx \left(\frac{n}{e}\right)^n\), where \(n = \sigma^w\). Our bound is roughly \(n(\sigma \log_{\sigma} n)^n\).
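The gap between the two bounds is easier to see numerically. The sketch below compares their base-10 logarithms, reading the improved bound as \(n(\sigma \log_{\sigma} n)^n\) with \(\log_{\sigma} n = w\). Both quantities count permutations as stated in the abstract, not running time, and the "roughly" in the abstract means the figures are indicative only. The function names are illustrative.

```python
import math


def log10_trivial_bound(sigma: int, w: int) -> float:
    """log10 of n!, with n = sigma**w, computed via the log-gamma function."""
    n = sigma ** w
    return math.lgamma(n + 1) / math.log(10)


def log10_improved_bound(sigma: int, w: int) -> float:
    """log10 of n * (sigma * log_sigma(n))**n; note log_sigma(n) equals w."""
    n = sigma ** w
    return math.log10(n) + n * math.log10(sigma * w)


if __name__ == "__main__":
    sigma = 4  # DNA alphabet size, the typical value named in the abstract
    for w in (3, 5, 8):
        trivial = log10_trivial_bound(sigma, w)
        improved = log10_improved_bound(sigma, w)
        print(f"w={w}: log10(n!) = {trivial:.1f}, "
              f"log10(improved) = {improved:.1f}")
```

Working in logarithms avoids overflow, since both quantities are far too large to hold as floating-point numbers even for w = 5.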
We relate this theoretical result to the exact motif discovery program TsukubaBB, whose algorithm contains the ideas that inspired the result. We describe a recent improvement to TsukubaBB that can give a speedup of a factor of nine or more, and we use a dataset of REB1 transcription factor binding sites to illustrate that exact methods can indeed be used in some practical situations.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Horton, P., Fujibuchi, W. (2005). An Upper Bound on the Hardness of Exact Matrix Based Motif Discovery. In: Apostolico, A., Crochemore, M., Park, K. (eds) Combinatorial Pattern Matching. CPM 2005. Lecture Notes in Computer Science, vol 3537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11496656_19
DOI: https://doi.org/10.1007/11496656_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26201-5
Online ISBN: 978-3-540-31562-9
eBook Packages: Computer Science (R0)