Spelling approximate repeated or common motifs using a suffix tree

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1380))

Abstract

We present two algorithms in this paper. The first extracts repeated motifs from a sequence defined over an alphabet σ. For instance, σ may be equal to {A, C, G, T}, with the sequence representing an encoding of a DNA macromolecule. The motifs sought correspond to words over the same alphabet that occur at least q times in the sequence, with at most e mismatches each time (q is called the quorum constraint). The second algorithm extracts common motifs from a set of N ≥ 2 sequences. In this case, the motifs must occur, again with at most e mismatches, in q distinct sequences of the set, where 1 ≤ q ≤ N. In both cases, the words representing the motifs may never be present exactly in the sequences. We therefore speak of the motifs, repeated in a sequence or common to a set of them, as being “external” objects, and call them “valid models” if they satisfy the quorum constraint q. The approach we introduce here for finding all valid models corresponding to either repeated or common motifs starts by building a suffix tree of the sequence(s) and then, after some further preprocessing, uses this tree to simply “spell” the models. Assuming an alphabet of fixed size, the total time needed is O(nN^2 V(e, k)) using O(nN^2/w) space, where n is the (average) length of the sequence(s), k is either the length of the models sought or the length of the longest possible valid models, w is the size of a machine word, and V(e, k) is the number of words of length k at a Hamming distance of at most e from another k-length word. V(e, k) is bounded above by k^e |σ|^e. This improves on an algorithm by Waterman [23]. It is also a better time bound than our previous approach [15] for the common motifs problem whenever N < k|σ|, and a better space bound when N/w < k. For the repeated motifs problem, it is a better bound in absolute terms for both time and space: the complexities obtained in that case are O(nV(e, k)) and O(n) respectively.
Finally, we suggest how to extend these algorithms to deal with gaps.
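
The definitions in the abstract can be illustrated with a small brute-force sketch. This is not the suffix-tree algorithm of the paper: it simply tries every word of length k over the alphabet and checks the quorum directly, so it is only usable for tiny k. All function names (`hamming`, `occurrences`, `repeated_models`, `common_models`, `neighborhood_size`) are hypothetical names chosen for this illustration. The last function computes the size of a Hamming ball of radius e, which is one standard way to count V(e, k), and the abstract's bound k^e |σ|^e can be checked against it.

```python
from itertools import product
from math import comb


def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))


def occurrences(model, sequence, e):
    """Start positions where `model` matches `sequence` with at most e mismatches."""
    k = len(model)
    return [i for i in range(len(sequence) - k + 1)
            if hamming(model, sequence[i:i + k]) <= e]


def repeated_models(sequence, k, e, q, alphabet="ACGT"):
    """Words of length k occurring at >= q positions of one sequence (brute force)."""
    words = ("".join(letters) for letters in product(alphabet, repeat=k))
    return [m for m in words if len(occurrences(m, sequence, e)) >= q]


def common_models(sequences, k, e, q, alphabet="ACGT"):
    """Words of length k occurring in >= q distinct sequences (brute force)."""
    words = ("".join(letters) for letters in product(alphabet, repeat=k))
    return [m for m in words
            if sum(bool(occurrences(m, s, e)) for s in sequences) >= q]


def neighborhood_size(e, k, alphabet_size):
    """Count of k-length words within Hamming distance e of a fixed k-length word."""
    return sum(comb(k, i) * (alphabet_size - 1) ** i for i in range(e + 1))


if __name__ == "__main__":
    # "AAA" is a valid model although it never occurs exactly in any sequence.
    print(common_models(["AAT", "ATA", "TAA"], k=3, e=1, q=3))  # ['AAA']
    # Exact count for e=1, k=3, |alphabet|=4 versus the bound k^e * |alphabet|^e.
    print(neighborhood_size(1, 3, 4), 3 ** 1 * 4 ** 1)  # 10 12
```

The first example shows why the abstract calls models “external” objects: the word AAA is within one mismatch of a window in each of the three sequences, yet appears in none of them.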

References

  1. R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM, 35:74–82, 1992.

  2. P. Bieganski, J. Riedl, J. V. Carlis, and E. M. Retzel. Generalized suffix trees for biological sequence data: applications and implementations. In Proc. of the 27th Hawaii Int. Conf. on Systems Sci., pages 35–44. IEEE Computer Society Press, 1994.

  3. B. Clift, D. Haussler, R. McConnell, T. D. Schneider, and G. D. Stormo. Sequence landscapes. Nucleic Acids Res., 14:141–158, 1986.

  4. A. L. Cobbs. Fast identification of approximately matching substrings. In Z. Galil and E. Ukkonen, editors, Combinatorial Pattern Matching, volume 937 of Lecture Notes in Computer Science, pages 41–54. Springer-Verlag, 1995.

  5. M. Crochemore. An optimal algorithm for computing the repetitions in a word. Inf. Proc. Letters, 12:244–250, 1981.

  6. M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994.

  7. D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J. Mol. Biol., 186:117–128, 1985.

  8. D. Gusfield. Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology. Cambridge University Press, 1997.

  9. L. C. K. Hui. Color set size problem with applications to string matching. In A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, editors, Combinatorial Pattern Matching, volume 644 of Lecture Notes in Computer Science, pages 230–243. Springer-Verlag, 1992.

  10. C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Struct., Funct., and Genetics, 7:41–51, 1990.

  11. C. Lefevre and J.-E. Ikeda. A fast word search algorithm for the representation of sequence similarity in genomic DNA. Nucleic Acids Res., 22:404–411, 1994.

  12. E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23:262–272, 1976.

  13. E. W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12:345–374, 1994.

  14. E. W. Myers. Personal communication, 1997.

  15. M.-F. Sagot, V. Escalier, A. Viari, and H. Soldano. Searching for repeated words in a text allowing for mismatches and gaps. In R. Baeza-Yates and U. Manber, editors, Second South American Workshop on String Processing, pages 87–100, Viña del Mar, Chile, 1995. University of Chile.

  16. M.-F. Sagot and E. W. Myers. Identifying satellites in nucleic acid sequences. Submitted to RECOMB 1998.

  17. M.-F. Sagot and A. Viari. A double combinatorial approach to discovering patterns in biological sequences. In D. Hirschberg and G. Myers, editors, Combinatorial Pattern Matching, volume 1075 of Lecture Notes in Computer Science, pages 186–208. Springer-Verlag, 1996.

  18. M.-F. Sagot, A. Viari, and H. Soldano. Multiple comparison: a peptide matching approach. Theoret. Comput. Sci., 180:115–137, 1997. Presented at Combinatorial Pattern Matching 1995.

  19. E. Ukkonen. Constructing suffix trees on-line in linear time. In IFIP'92, pages 484–492, 1992.

  20. E. Ukkonen. Approximate string matching over suffix trees. In A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, editors, Combinatorial Pattern Matching, volume 684 of Lecture Notes in Computer Science, pages 228–242. Springer-Verlag, 1993.

  21. M. S. Waterman. Multiple sequence alignments by consensus. Nucleic Acids Res., 14:9095–9102, 1986.

  22. M. S. Waterman. Consensus patterns in sequences. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences, pages 93–116. CRC Press, 1989.

  23. M. S. Waterman, R. Arratia, and D. J. Galas. Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46:515–527, 1984.

  24. S. Wu and U. Manber. Agrep - a fast approximate pattern-matching tool. In USENIX Technical Conference, pages 153–162, San Francisco, CA, 1992.

  25. S. Wu and U. Manber. Fast text searching allowing errors. Commun. ACM, 35:83–91, 1992.

Editor information

Cláudio L. Lucchesi, Arnaldo V. Moura

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sagot, M.F. (1998). Spelling approximate repeated or common motifs using a suffix tree. In: Lucchesi, C.L., Moura, A.V. (eds) LATIN'98: Theoretical Informatics. LATIN 1998. Lecture Notes in Computer Science, vol 1380. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0054337

  • DOI: https://doi.org/10.1007/BFb0054337

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-64275-6

  • Online ISBN: 978-3-540-69715-2

  • eBook Packages: Springer Book Archive
