Abstract
We present an algorithm for finding degenerate common features by multiple comparison of a set of biological sequences (nucleic acids or proteins). The features of interest are words in the sequences. To locate them, the algorithm uses the concept of a model, which we introduced earlier. A model can be seen as a generalization of a consensus pattern as defined by Waterman [42]: it is an object against which the words in the sequences are compared and which serves as an identifier for groups of similar words. The algorithm given here innovates on our previous work in that models are defined over what we call a weighted combinatorial cover. This is a collection of sets chosen among all possible subsets of the alphabet Σ of nucleotides or amino acids, including the wild card {Σ}, with a weight attached to each set indicating the number of times it may appear in a model. In this way, we explore both the space of models and that of alphabets. The words related to a model defined over such a combinatorial cover, and thus considered similar, are those that either belong to the model or differ by at most a given number of errors from a nearest element of it. Two algorithmic ideas allow us to deal with this double combinatorics: one concerns a left-to-right minimality of the sets composing a model; the other involves making a sketch of the solution space before exploring it in detail.
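To make the notion of a model over a weighted combinatorial cover more concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes that errors are substitutions only, that a weight is an upper bound on the number of times a set may appear in a model, and that `None` means no bound. The cover `COVER` and the names `is_valid_model`, `mismatches` and `occurrences` are hypothetical and chosen for illustration. Under the substitution-only assumption, the distance from a word to a nearest element of a model is the number of positions whose symbol lies outside the corresponding set.

```python
from collections import Counter

# Hypothetical weighted cover over the DNA alphabet (an assumption for
# illustration). Each set maps to the maximum number of times it may
# appear in a model; None means "no bound".
SIGMA = frozenset("ACGT")
COVER = {
    frozenset("A"): None,
    frozenset("C"): None,
    frozenset("G"): None,
    frozenset("T"): None,
    frozenset("AG"): 2,  # purines, at most twice per model
    frozenset("CT"): 2,  # pyrimidines, at most twice per model
    SIGMA: 1,            # wild card, at most once per model
}


def is_valid_model(model, cover=COVER):
    """Return True if every set of the model belongs to the cover and
    no set is used more often than its weight allows."""
    for subset, count in Counter(model).items():
        if subset not in cover:
            return False
        limit = cover[subset]
        if limit is not None and count > limit:
            return False
    return True


def mismatches(word, model):
    """Count positions of the word whose symbol is outside the model's set.
    With substitutions only, this is the distance to a nearest element."""
    if len(word) != len(model):
        raise ValueError("word and model must have the same length")
    return sum(1 for symbol, subset in zip(word, model) if symbol not in subset)


def occurrences(sequence, model, max_errors):
    """Start positions of the words of the sequence related to the model."""
    k = len(model)
    return [
        i
        for i in range(len(sequence) - k + 1)
        if mismatches(sequence[i:i + k], model) <= max_errors
    ]


# Example model: A, then a purine, then any symbol, then T.
MODEL = [frozenset("A"), frozenset("AG"), SIGMA, frozenset("T")]
```

With this cover, `MODEL` is valid, whereas a model containing the wild card twice is rejected because the weight of Σ is 1. Calling `occurrences(sequence, MODEL, 1)` returns the positions of all words of length 4 that belong to the model or differ from it by one substitution.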
References
A. Bairoch. PROSITE: A dictionary of protein sites and patterns. Nucl. Acids Res., 20:2013–2018, 1992.
D. Bashford, C. Chothia, and A. M. Lesk. Determinants of a protein fold: unique features of the globin amino acid sequence. J. Mol. Biol., 212:389–402, 1987.
S. C. Chan, A. K. Wong, and D. K. Chiu. A survey of multiple sequence comparison methods. Bull. Math. Biol., 54:563–598, 1992.
B. Clift, D. Haussler, R. McConnell, T. D. Schneider, and G. D. Stormo. Sequence landscapes. Nucleic Acids Res., 14:141–158, 1986.
D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J. Mol. Biol., 186:117–128, 1985.
M. T. Gallegos, C. Michan, and J. L. Ramos. The XylS/AraC family of regulators. Nucl. Acids Res., 21:807–810, 1993.
M. Gribskov, R. Luthy, and D. Eisenberg. Profile analysis. Meth. Enzymol., 183:146–159, 1990.
M. Gribskov, M. McLachlan, and D. Eisenberg. Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Science USA, 84:4355–4358, 1987.
J. D. Helmann. Compilation and analysis of Bacillus subtilis σA-dependent promoter sequences: evidence for extended contact between RNA polymerase and upstream promoter DNA. Nucleic Acids Res., 23:2351–2360, 1995.
G. Z. Hertz, G. W. Hartzell, and G. D. Stormo. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci., 6:81–92, 1990.
S. Karlin and G. Ghandour. The use of multiple alphabets in kappa-gene immunoglobulin DNA sequence comparisons. The EMBO Journal, 4:1217–1223, 1985.
S. Karlin, M. Morris, G. Ghandour, and M.-Y. Leung. Efficient algorithms for molecular sequence analysis. Proceedings of the National Academy of Science USA, 85:841–845, 1988.
A. Krogh, M. Brown, I. S. Mian, K. Sjoelander, and D. Haussler. Hidden Markov model in computational biology. Applications to protein modeling. J. Mol. Biol., 235:1501–1531, 1994.
A. M. Landraud, J. F. Avril, and P. Chretienne. An algorithm for finding a common structure shared by a family of strings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(8):890–895, 1989.
C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208–214, 1993.
C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7:41–51, 1990.
H. M. Martinez. An efficient method for finding repeats in molecular sequences. Nucleic Acids Res., 11:4629–4634, 1983.
A. F. Neuwald and P. Green. Detecting patterns in protein sequences. J. Mol. Biol., 239:698–712, 1994.
J. Posfai, A.S. Bhagwat, G. Posfai, and R.J. Roberts. Prediction motifs derived from cytosine methyltransferases. Nucl. Acids Res., 17:2421–2435, 1989.
M. J. Rooman, J. Rodriguez, and S. J. Wodak. Relations between protein sequence and structure and their significance. J. Mol. Biol., 213:337–350, 1990.
M. J. Rooman and S. J. Wodak. Identification of predictive sequence motifs limited by protein structure database size. Nature, 335:45–49, 1988.
M. F. Sagot, V. Escalier, A. Viari, and H. Soldano. Searching for repeated words in a text allowing for mismatches and gaps. Viña del Mar, Chile, 1995. Second South American Workshop on String Processing.
M. F. Sagot, A. Viari, J. Pothier, V. Escalier, and H. Soldano. Multiple comparison in biology: some mathematical formalizations of the problem and combinatorial approaches to solve it. submitted to Discrete Applied Mathematics.
M. F. Sagot, A. Viari, and H. Soldano. A distance-based block searching algorithm. Cambridge, England, 1995. Third International Symposium on Intelligent Systems for Molecular Biology.
M. F. Sagot, A. Viari, and H. Soldano. Multiple comparison: a peptide matching approach. In Proc. Combinatorial Pattern Matching Conf. 95, volume 907 of Lecture Notes in Computer Science, pages 366–385, Helsinki, Finland, 1995. Springer-Verlag, to appear in Theor. Comput. Science.
M. A. S. Saqi and M. J. E. Sternberg. Identification of sequence motifs from a set of proteins with related function. Protein Eng., 7:165–171, 1994.
G. D. Schuler, S. F. Altschul, and D. J. Lipman. A workbench for multiple alignment construction and analysis. Proteins, 9:180–190, 1991.
R. P. Sheridan and R. Venkataraghavan. A systematic search for protein signature sequences. Proteins, 14:16–28, 1992.
H. O. Smith, T. M. Annau, and S. Chandrasegaran. Finding sequence motifs in groups of functionally related proteins. Proceedings of the National Academy of Science USA, 87:826–830, 1990.
R. F. Smith and T. S. Smith. Automatic generation of primary sequence patterns from sets of related protein sequences. Proceedings of the National Academy of Science USA, 87:118–122, 1990.
E. Sobel and H. M. Martinez. A multiple sequence alignment program. Nucleic Acids Res., 14:363–374, 1986.
G. D. Stormo. Consensus patterns in DNA. Meth. Enzymol., 183:211–221, 1990.
R. L. Tatusov, S. F. Altschul, and E. V. Koonin. Detection of conserved segments in proteins: Iterative scanning of sequence databases with alignment blocks. Proceedings of the National Academy of Science USA, 91:12091–12095, 1994.
R. L. Tatusov and E. V. Koonin. A simple tool to search for sequence motifs that are conserved in Blast outputs. Comput. Appl. Biosci., 10:0–0, 1994.
W. R. Taylor. Pattern matching methods in protein sequence comparison and structure prediction. Protein Eng., 2(2):77–86, 1988.
W. R. Taylor. A template based method of pattern matching in protein sequences. Prog. Biophys. Molec. Biol., 54:159–252, 1989.
W. R. Taylor and D. T. Jones. Templates, consensus patterns and motifs. Curr. Opin. Struct. Biol., 1:327–333, 1991.
M. S. Waterman. General methods of sequence comparison. Bull. Math. Biol., 46:473–500, 1984.
M. S. Waterman. Multiple sequence alignments by consensus. Nucleic Acids Res., 14:9095–9102, 1986.
M. S. Waterman. Consensus patterns in sequences. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences, pages 93–116. CRC Press, 1989.
M. S. Waterman. Consensus methods for DNA and protein sequence alignment. In Meth. Enzymol., volume 183, pages 221–237. 1990.
M. S. Waterman, R. Arratia, and D. J. Galas. Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46:515–527, 1984.
Copyright information
© 1996 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sagot, M.F., Viari, A. (1996). A double combinatorial approach to discovering patterns in biological sequences. In: Hirschberg, D., Myers, G. (eds) Combinatorial Pattern Matching. CPM 1996. Lecture Notes in Computer Science, vol 1075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61258-0_15
DOI: https://doi.org/10.1007/3-540-61258-0_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61258-2
Online ISBN: 978-3-540-68390-2
eBook Packages: Springer Book Archive