Spelling approximate repeated or common motifs using a suffix tree

Sagot, Marie-France

doi:10.1007/BFb0054337

Conference paper
First Online: 01 January 2006

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1380)

Abstract

We present two algorithms in this paper. The first extracts repeated motifs from a sequence defined over an alphabet Σ. For instance, Σ may be {A, C, G, T}, with the sequence encoding a DNA macromolecule. The motifs sought are words over the same alphabet that occur at least q times in the sequence, with at most e mismatches each time (q is called the quorum constraint). The second algorithm extracts common motifs from a set of N ≥ 2 sequences. In this case, the motifs must occur, again with at most e mismatches, in q distinct sequences of the set, where 1 ≤ q ≤ N. In both cases, the words representing the motifs may never be present exactly in the sequences. We therefore speak of the motifs, whether repeated in a sequence or common to a set of sequences, as “external” objects, and call them “valid models” if they satisfy the quorum constraint q. The approach we introduce here for finding all valid models corresponding to either repeated or common motifs starts by building a suffix tree of the sequence(s) and then, after some further preprocessing, uses this tree to simply “spell” the models. Assuming an alphabet of fixed size, the total time needed is O(nN² V(e, k)) using O(nN²/w) space, where n is the (average) length of the sequence(s), k is either the length of the models sought or the length of the longest possible valid models, w is the size of a machine word, and V(e, k) is the number of words of length k at a Hamming distance of at most e from another word of length k. V(e, k) is bounded above by k^e |Σ|^e. This improves on an algorithm by Waterman [23]. It is also a better time bound than our previous approach [15] for the common motifs problem whenever N < k|Σ|, and a better space bound whenever N/w < k. For the repeated motifs problem, both the time and space bounds are better in absolute terms: the complexities obtained in this case are O(nV(e, k)) and O(n), respectively. Finally, we suggest how to extend these algorithms to deal with gaps.
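
The idea of “spelling” models can be illustrated with a minimal Python sketch for the repeated motifs case. It is not the paper's algorithm: as a simplifying assumption it uses an uncompressed trie of all length-k windows rather than a suffix tree, handles a single sequence only, and ignores gaps. The names `Node`, `build_trie`, `spell_models` and `neighbourhood_size` are hypothetical. A model is extended one letter at a time while tracking the trie nodes whose path label lies within Hamming distance e of it, and a branch is abandoned as soon as the quorum q can no longer be met.

```python
from math import comb


class Node:
    """Trie node; `count` is the number of length-k windows passing through it."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


def build_trie(seq, k):
    """Insert every length-k window of seq (assumption: a plain trie, not a suffix tree)."""
    root = Node()
    for i in range(len(seq) - k + 1):
        node = root
        for ch in seq[i:i + k]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def spell_models(seq, k, e, q, alphabet="ACGT"):
    """Return (model, occurrences) for every word of length k that matches
    at least q windows of seq with at most e mismatches each."""
    root = build_trie(seq, k)
    valid = []

    def extend(model, occs):
        # occs: (node, mismatches) pairs whose path label is within e of `model`.
        if len(model) == k:
            valid.append((model, sum(node.count for node, _ in occs)))
            return
        for a in alphabet:
            new_occs = []
            for node, err in occs:
                for b, child in node.children.items():
                    new_err = err + (a != b)  # one more mismatch if letters differ
                    if new_err <= e:
                        new_occs.append((child, new_err))
            # Occurrence counts never grow under extension, so pruning is safe.
            if sum(node.count for node, _ in new_occs) >= q:
                extend(model + a, new_occs)

    extend("", [(root, 0)])
    return valid


def neighbourhood_size(k, e, sigma):
    """Number of words within Hamming distance e of a fixed word of length k."""
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(min(e, k) + 1))
```

For example, `spell_models("ACGTACGAACGT", k=4, e=1, q=3)` reports both `ACGT` and `ACGC` with three occurrences each, although `ACGC` never appears exactly in the sequence; this is the sense in which models are “external” objects. Likewise, `neighbourhood_size(4, 1, 4)` gives 13, which is below the bound k^e |Σ|^e = 16.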

References

[1] R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM, 35:74–82, 1992.
[2] P. Bieganski, J. Riedl, J. V. Carlis, and E. M. Retzel. Generalized suffix trees for biological sequence data: applications and implementations. In Proc. of the 27th Hawaii Int. Conf. on Systems Sci., pages 35–44. IEEE Computer Society Press, 1994.
[3] B. Clift, D. Haussler, R. McConnell, T. D. Schneider, and G. D. Stormo. Sequence landscapes. Nucleic Acids Res., 14:141–158, 1986.
[4] A. L. Cobbs. Fast identification of approximately matching substrings. In Z. Galil and E. Ukkonen, editors, Combinatorial Pattern Matching, volume 937 of Lecture Notes in Computer Science, pages 41–54. Springer-Verlag, 1995.
[5] M. Crochemore. An optimal algorithm for computing the repetitions in a word. Inf. Proc. Letters, 12:244–250, 1981.
[6] M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994.
[7] D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J. Mol. Biol., 186:117–128, 1985.
[8] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology. Cambridge University Press, 1997.
[9] L. C. K. Hui. Color set size problem with applications to string matching. In A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, editors, Combinatorial Pattern Matching, volume 644 of Lecture Notes in Computer Science, pages 230–243. Springer-Verlag, 1992.
[10] C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: struct., funct., and genetics, 7:41–51, 1990.
[11] C. Lefevre and J.-E. Ikeda. A fast word search algorithm for the representation of sequence similarity in genomic DNA. Nucleic Acids Res., 22:404–411, 1994.
[12] E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23:262–272, 1976.
[13] E. W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12:345–374, 1994.
[14] E. W. Myers. 1997. Personal communication.
[15] M.-F. Sagot, V. Escalier, A. Viari, and H. Soldano. Searching for repeated words in a text allowing for mismatches and gaps. In R. Baeza-Yates and U. Manber, editors, Second South American Workshop on String Processing, pages 87–100, Viña del Mar, Chile, 1995. University of Chile.
[16] M.-F. Sagot and E. W. Myers. Identifying satellites in nucleic acid sequences. 1998. Submitted to RECOMB 1998.
[17] M.-F. Sagot and A. Viari. A double combinatorial approach to discovering patterns in biological sequences. In D. Hirschberg and G. Myers, editors, Combinatorial Pattern Matching, volume 1075 of Lecture Notes in Computer Science, pages 186–208. Springer-Verlag, 1996.
[18] M.-F. Sagot, A. Viari, and H. Soldano. Multiple comparison: a peptide matching approach. Theoret. Comput. Sci., 180:115–137, 1997. Presented at Combinatorial Pattern Matching 1995.
[19] E. Ukkonen. Constructing suffix trees on-line in linear time, pages 484–492. IFIP'92, 1992.
[20] E. Ukkonen. Approximate string matching over suffix trees. In A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, editors, Combinatorial Pattern Matching, volume 684 of Lecture Notes in Computer Science, pages 228–242. Springer-Verlag, 1993.
[21] M. S. Waterman. Multiple sequence alignments by consensus. Nucleic Acids Res., 14:9095–9102, 1986.
[22] M. S. Waterman. Consensus patterns in sequences. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences, pages 93–116. CRC Press, 1989.
[23] M. S. Waterman, R. Arratia, and D. J. Galas. Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46:515–527, 1984.
[24] S. Wu and U. Manber. Agrep - a fast approximate pattern-matching tool, pages 153–162, San Francisco, CA, 1992. USENIX Technical Conference.
[25] S. Wu and U. Manber. Fast text searching allowing errors. Commun. ACM, 35:83–91, 1992.

Author information

Authors and Affiliations

Service d'Informatique Scientifique, Institut Pasteur, 28, rue du Dr. Roux, Paris
Marie-France Sagot
Institut Gaspard Monge, Université de Marne la Vallée, 2, rue de la Butte Verte, Noisy le Grand
Marie-France Sagot

Editor information

Cláudio L. Lucchesi, Arnaldo V. Moura

About this paper

Cite this paper

Sagot, M.F. (1998). Spelling approximate repeated or common motifs using a suffix tree. In: Lucchesi, C.L., Moura, A.V. (eds) LATIN'98: Theoretical Informatics. LATIN 1998. Lecture Notes in Computer Science, vol 1380. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0054337

DOI: https://doi.org/10.1007/BFb0054337
Published: 25 May 2006
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64275-6
Online ISBN: 978-3-540-69715-2
eBook Packages: Springer Book Archive
