Extracting Common Motifs under the Levenshtein Measure: Theory and Experimentation
Using our techniques for extracting approximate non-tandem repeats on well-constructed maximal models, we derive an algorithm to find common motifs of length P that occur in N sequences with at most D differences under the edit distance metric. We compare the effectiveness of our algorithm with the more involved algorithm of Sagot for edit distance on some real sequences. Her method has previously been implemented only for Hamming distance, not for edit distance. Our resulting method turns out to be simpler and more efficient, both theoretically and in practice, for moderately large P and D.
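To make the problem statement concrete, the following is a minimal brute-force sketch of the acceptance condition only: a candidate motif counts as common if every input sequence contains some substring within edit distance D of it. It is not the algorithm derived in the paper, nor Sagot's. It uses the standard dynamic program for approximate substring matching, and the function names `min_edit_distance_to_substring` and `is_common_motif` are hypothetical, chosen for illustration.

```python
from typing import List


def min_edit_distance_to_substring(motif: str, text: str) -> int:
    """Smallest Levenshtein distance between motif and any substring of text.

    Standard approximate-matching dynamic program: the first row is all
    zeros, so an occurrence may start at any position of text for free.
    """
    # prev[j]: best distance between the motif prefix processed so far
    # and some substring of text that ends at position j.
    prev = [0] * (len(text) + 1)
    for i in range(1, len(motif) + 1):
        # Column 0: matching motif[:i] against the empty string costs i deletions.
        curr = [i] + [0] * len(text)
        for j in range(1, len(text) + 1):
            cost = 0 if motif[i - 1] == text[j - 1] else 1
            curr[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # motif character left unmatched
                curr[j - 1] + 1,     # extra character in text
            )
        prev = curr
    # The occurrence may end anywhere in text.
    return min(prev)


def is_common_motif(motif: str, sequences: List[str], d: int) -> bool:
    """True if motif occurs in every sequence with at most d differences."""
    return all(min_edit_distance_to_substring(motif, s) <= d for s in sequences)


if __name__ == "__main__":
    # Toy input, invented for illustration (P = 4, N = 3).
    seqs = ["TTACGTTT", "GGACTGG", "CCAGTCC"]
    print(is_common_motif("ACGT", seqs, d=1))
    print(is_common_motif("ACGT", seqs, d=0))
```

Checking one candidate this way costs time proportional to P times the total input length, and enumerating all candidates of length P is exponential in P. That cost is the reason methods such as the one described here work on an index over the input sequences instead.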
Keywords: Edit Distance, Maximal Repeat, Distinct Sequence, Common Motif, Input String
- 1. E. F. Adebiyi, T. Jiang, and M. Kaufmann. An efficient algorithm for finding short approximate non-tandem repeats (Extended Abstract). Bioinformatics, 17(1):S5–S13, 2001.
- 2. E. F. Adebiyi. Pattern Discovery in Biology and Strings Sorting: Theory and Experimentation. Ph.D. Thesis, 2002.
- 4. J.-M. Claverie and S. Audic. The statistical significance of nucleotide position-weight matrix matches. Computer Applications in Biosciences 12(5), 431–439, 1996.
- 5. M. Crochemore and M.-F. Sagot. Motifs in sequences: localization and extraction. In Handbook of Computational Chemistry, Crabbe, Drew, Konopka, eds., Marcel Dekker, Inc., 2001. To appear.
- 8. L. C. K. Hui. Color set size problem with applications to string matching. In CPM Proceedings, vol. 644 of LNCS, 230–243, 1992.
- 9. S. Karlin, F. Ost, and B. E. Blaisdell. Patterns in DNA and amino acid sequences and their statistical significance. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences, 133–158, 1989.
- 12. L. Marsan and M.-F. Sagot. Extracting structured motifs using a suffix tree-algorithms and application to promoter consensus identification. RECOMB 2000.
- 13. P. Pevzner and S.-H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. ISMB, 269–278, 2000.
- 14. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge.
- 15. E. Rocke and M. Tompa. An algorithm for finding novel gapped motifs in DNA sequences. RECOMB, 228–233, 1998.
- 17. M.-F. Sagot. Spelling approximate repeated or common motifs using a suffix tree. LNCS 1380: 111–127, 1998.
- 19. E. Ukkonen. Approximate string matching over suffix trees. LNCS 684: 228–242, 1993.