Extracting Common Motifs under the Levenshtein Measure: Theory and Experimentation

  • Ezekiel F. Adebiyi
  • Michael Kaufmann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2452)

Abstract

Using our techniques for extracting approximate non-tandem repeats [1] on well-constructed maximal models, we derive an algorithm to find common motifs of length P that occur in N sequences with at most D differences under the edit distance metric. We compare the effectiveness of our algorithm with the more involved algorithm of Sagot [17] for edit distance on some real sequences. Her method had previously been implemented only for Hamming distance [12, 20], not for edit distance. Our resulting method turns out to be simpler and more efficient, both theoretically and in practice, for moderately large P and D.
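To make the problem statement concrete, the following minimal sketch checks whether a given candidate motif occurs in every one of N sequences with at most D differences under edit distance. It is not the algorithm of the paper, which works on maximal models; it is a plain dynamic-programming check (the classic approximate-matching recurrence with a free start position in the text). The function names `min_edit_distance_to_substring` and `is_common_motif` are hypothetical, chosen only for this illustration.

```python
def min_edit_distance_to_substring(motif, seq):
    """Smallest edit distance between motif and any substring of seq."""
    # prev[j]: best distance between motif[:i] and some substring of seq
    # ending at position j. Row 0 is all zeros because the empty motif
    # matches the empty substring at every position for free.
    prev = [0] * (len(seq) + 1)
    for i in range(1, len(motif) + 1):
        # Column 0: motif[:i] against the empty text costs i deletions.
        curr = [i] + [0] * len(seq)
        for j in range(1, len(seq) + 1):
            cost = 0 if motif[i - 1] == seq[j - 1] else 1
            curr[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # motif character left unmatched
                curr[j - 1] + 1,     # extra text character inside the occurrence
            )
        prev = curr
    # The occurrence may end anywhere in seq.
    return min(prev)


def is_common_motif(motif, sequences, d):
    """True if motif occurs in every sequence with at most d differences."""
    return all(min_edit_distance_to_substring(motif, s) <= d for s in sequences)
```

For example, `is_common_motif("ACGT", ["TTACGTTT", "TTAGTTT"], 1)` holds because the second sequence contains `AGT`, which is one deletion away from the motif. Each check costs O(P·|s|) time per sequence, so enumerating all candidate motifs this way is impractical; that cost is what motivates more refined extraction methods.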


References

  1. E. F. Adebiyi, T. Jiang, and M. Kaufmann. An efficient algorithm for finding short approximate non-tandem repeats (Extended Abstract). Bioinformatics, 17(1):S5–S13, 2001.
  2. E. F. Adebiyi. Pattern Discovery in Biology and Strings Sorting: Theory and Experimentation. Ph.D. Thesis, 2002.
  3. A. Blumer, A. Ehrenfeucht, and others. Average size of suffix trees and DAWGS. Discrete Applied Mathematics, 24:37–45, 1989.
  4. J.-M. Claverie and S. Audic. The statistical significance of nucleotide position-weight matrix matches. Computer Applications in Biosciences, 12(5):431–439, 1996.
  5. M. Crochemore and M.-F. Sagot. Motifs in sequences: localization and extraction. In Crabbe, Drew, Konopka, editors, Handbook of Computational Chemistry. Marcel Dekker, Inc., 2001. To appear.
  6. D. Gusfield. Algorithms on strings, trees and sequences. Cambridge University Press, New York, 1997.
  7. J. D. Helmann. Compilation and analysis of Bacillus subtilis σA-dependent promoter sequences: evidence for extended contact between RNA polymerase and upstream promoter DNA. Nucleic Acids Research, 23(13):2351–2360, 1995.
  8. L. C. K. Hui. Color set size problem with applications to string matching. In CPM Proceeding, vol. 644 of LNCS, 230–243, 1992.
  9. S. Karlin, F. Ost, and B. E. Blaisdell. Patterns in DNA and amino acid sequences and their statistical significance. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences, 133–158, 1989.
  10. C. J. McInerny, J. F. Patridge, G. E. Mikesell, D. P. Creemer, and L. L. Breeden. A novel Mcm1-dependent element in the SWI4, CLN3, CDC6, CDC46, and CDC47 promoters activates M/G1-specific transcription. Genes and Development, 11:1277–1288, 1997.
  11. E. Myers. A sub-linear algorithm for approximate keyword matching. Algorithmica, 12(4–5):345–374, 1994.
  12. L. Marsan and M.-F. Sagot. Extracting structured motifs using a suffix tree - algorithms and application to promoter consensus identification. RECOMB, 2000.
  13. P. Pevzner and S.-H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. ISMB, 269–278, 2000.
  14. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge.
  15. E. Rocke and M. Tompa. An algorithm for finding novel gapped motifs in DNA sequences. RECOMB, 228–233, 1998.
  16. B. Schieber and U. Vishkin. On finding lowest common ancestors: simplification and parallelization. SIAM Journal on Computing, 17:1253–1262, 1988.
  17. M.-F. Sagot. Spelling approximate repeated or common motifs using a suffix tree. LNCS 1380:111–127, 1998.
  18. J. F. Tomb et al. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature, 388:539–547, 1997.
  19. E. Ukkonen. Approximate string matching over suffix trees. LNCS 684:228–242, 1993.
  20. A. Vanet, L. Marsan, A. Labigne, and M.-F. Sagot. Inferring regulatory elements from a whole genome. An analysis of Helicobacter pylori σ80 family of promoter signals. J. Mol. Biol., 297:335–353, 2000.

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Ezekiel F. Adebiyi (1)
  • Michael Kaufmann (1)

  1. Wilhelm-Schickard-Institut für Informatik, Universität Tübingen, Tübingen, Germany