A double combinatorial approach to discovering patterns in biological sequences

Sagot, Marie-France; Viari, Alain

doi:10.1007/3-540-61258-0_15

Marie-France Sagot¹,² & Alain Viari¹

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1075)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

We present an algorithm for finding degenerate common features by multiple comparison of a set of biological sequences (nucleic acids or proteins). The features of interest are words in the sequences. To locate them, the algorithm uses the concept of a model, which we introduced earlier. A model can be seen as a generalization of a consensus pattern as defined by Waterman [42]: it is an object against which the words in the sequences are compared, and it serves as an identifier for groups of similar words. The algorithm given here departs from our previous work in that models are defined over what we call a weighted combinatorial cover. This is a collection of sets chosen among all possible subsets of the alphabet Σ of nucleotides or amino acids, including the wild card {Σ}, with a weight attached to each set indicating the number of times it may appear in a model. In this way, we explore both the space of models and that of alphabets. The words related to a model defined over such a combinatorial cover, and thus considered similar, are those that either belong to the model or present at most a certain number of errors with respect to a nearest element of it. Two algorithmic ideas allow us to deal with this double combinatorics: one concerns a left-to-right minimality of the sets composing a model; the other involves making a sketch of the solution space before exploring it in detail.
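
The notions of weighted cover, model, and "related word" described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it only checks a given model against a sequence and does not search the space of models. It assumes that errors are substitutions only (the abstract does not specify the error type) and that a weight is an upper bound on how often a set may occur in a model. The cover, the weights, and the function names `respects_weights`, `errors`, and `occurrences` are hypothetical, chosen for illustration.

```python
from collections import Counter

# Hypothetical example cover over the DNA alphabet (not taken from the paper).
SIGMA = frozenset("ACGT")                      # the wild card: the whole alphabet
A, C, G, T = (frozenset(x) for x in "ACGT")    # singleton sets
PURINE = frozenset("AG")

# Assumption: a weight is an upper bound on how often a set may occur in one model.
WEIGHTS = {A: 5, C: 5, G: 5, T: 5, PURINE: 2, SIGMA: 1}


def respects_weights(model, weights):
    """True if every set of the model is in the cover and within its weight."""
    counts = Counter(model)
    return all(n <= weights.get(s, 0) for s, n in counts.items())


def errors(word, model):
    """Number of positions where the word's symbol lies outside the model's set."""
    if len(word) != len(model):
        raise ValueError("word and model must have the same length")
    return sum(1 for symbol, allowed in zip(word, model) if symbol not in allowed)


def occurrences(sequence, model, max_errors):
    """Start positions of the words of `sequence` related to `model`."""
    k = len(model)
    return [
        i
        for i in range(len(sequence) - k + 1)
        if errors(sequence[i:i + k], model) <= max_errors
    ]


model = (T, A, PURINE, SIGMA, T)
print(respects_weights(model, WEIGHTS))              # True
print(respects_weights((SIGMA, SIGMA, A), WEIGHTS))  # False: wild card used twice
print(occurrences("GGTAGCTTTACCTA", model, 0))       # [2]
print(occurrences("GGTAGCTTTACCTA", model, 1))       # [2, 8]
```

With zero errors allowed, only the word TAGCT at position 2 belongs to the model; allowing one error also admits TACCT at position 8, whose third symbol falls outside the purine set.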

References

A. Bairoch. PROSITE: A dictionary of protein sites and patterns. Nucl. Acids Res., 20:2013–2018, 1992.
D. Bashford, C. Chothia, and A. M. Lesk. Determinants of a protein fold: unique features of the globin amino acid sequence. J. Mol. Biol., 212:389–402, 1987.
S. C. Chan, A. K. Wong, and D. K. Chiu. A survey of multiple sequence comparison methods. Bull. Math. Biol., 54:563–598, 1992.
B. Clift, D. Haussler, R. McConnell, T. D. Schneider, and G. D. Stormo. Sequence landscapes. Nucleic Acids Res., 14:141–158, 1986.
D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J. Mol. Biol., 186:117–128, 1985.
M. T. Gallegos, C. Michan, and J. L. Ramos. The XylS/AraC family of regulators. Nucl. Acids Res., 21:807–810, 1993.
M. Gribskov, R. Luthy, and D. Eisenberg. Profile analysis. Meth. Enzymol., 183:146–159, 1990.
M. Gribskov, M. McLachlan, and D. Eisenberg. Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Science USA, 84:4355–4358, 1987.
J. D. Helmann. Compilation and analysis of Bacillus subtilis σA-dependent promoter sequences: evidence for extended contact between RNA polymerase and upstream promoter DNA. Nucleic Acids Res., 23:2351–2360, 1995.
G. Z. Hertz, G. W. Hartzell, and G. D. Stormo. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci., 6:81–92, 1990.
S. Karlin and G. Ghandour. The use of multiple alphabets in kappa-gene immunoglobulin DNA sequence comparisons. The EMBO Journal, 4:1217–1223, 1985.
S. Karlin, M. Morris, G. Ghandour, and M.-Y. Leung. Efficient algorithms for molecular sequence analysis. Proceedings of the National Academy of Science USA, 85:841–845, 1988.
A. Krogh, M. Brown, I. S. Mian, K. Sjoelander, and D. Haussler. Hidden Markov model in computational biology. Applications to protein modeling. J. Mol. Biol., 235:1501–1531, 1994.
A. M. Landraud, J. F. Avril, and P. Chretienne. An algorithm for finding a common structure shared by a family of strings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(8):890–895, 1989.
C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208–214, 1993.
C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7:41–51, 1990.
H. M. Martinez. An efficient method for finding repeats in molecular sequences. Nucleic Acids Res., 11:4629–4634, 1983.
A. F. Neuwald and P. Green. Detecting patterns in protein sequences. J. Mol. Biol., 239:698–712, 1994.
J. Posfai, A.S. Bhagwat, G. Posfai, and R.J. Roberts. Prediction motifs derived from cytosine methyltransferases. Nucl. Acids Res., 17:2421–2435, 1989.
M. J. Rooman, J. Rodriguez, and S. J. Wodak. Relations between protein sequence and structure and their significance. J. Mol. Biol., 213:337–350, 1990.
M. J. Rooman and S. J. Wodak. Identification of predictive sequence motifs limited by protein structure database size. Nature, 335:45–49, 1988.
M. F. Sagot, V. Escalier, A. Viari, and H. Soldano. Searching for repeated words in a text allowing for mismatches and gaps. Viña del Mar, Chile, 1995. Second South American Workshop on String Processing.
M. F. Sagot, A. Viari, J. Pothier, V. Escalier, and H. Soldano. Multiple comparison in biology: some mathematical formalizations of the problem and combinatorial approaches to solve it. submitted to Discrete Applied Mathematics.
M. F. Sagot, A. Viari, and H. Soldano. A distance-based block searching algorithm. Cambridge, England, 1995. Third International Symposium on Intelligent Systems for Molecular Biology.
M. F. Sagot, A. Viari, and H. Soldano. Multiple comparison: a peptide matching approach. In Proc. Combinatorial Pattern Matching Conf. 95, volume 907 of Lecture Notes in Computer Science, pages 366–385, Helsinki, Finland, 1995. Springer-Verlag, to appear in Theor. Comput. Science.
M. A. S. Saqi and M. J. E. Sternberg. Identification of sequence motifs from a set of proteins with related function. Protein Eng., 7:165–171, 1994.
G. D. Schuler, S. F. Altschul, and D. J. Lipman. A workbench for multiple alignment construction and analysis. Proteins, 9:180–190, 1991.
R. P. Sheridan and R. Venkataraghavan. A systematic search for protein signature sequences. Proteins, 14:16–28, 1992.
H. O. Smith, T. M. Annau, and S. Chandrasegaran. Finding sequence motifs in groups of functionally related proteins. Proceedings of the National Academy of Science USA, 87:826–830, 1990.
R. F. Smith and T. S. Smith. Automatic generation of primary sequence patterns from sets of related protein sequences. Proceedings of the National Academy of Science USA, 87:118–122, 1990.
E. Sobel and H. M. Martinez. A multiple sequence alignment program. Nucleic Acids Res., 14:363–374, 1986.
G. D. Stormo. Consensus patterns in DNA. Meth. Enzymol., 183:211–221, 1990.
R. L. Tatusov, S. F. Altschul, and E. V. Koonin. Detection of conserved segments in proteins: Iterative scanning of sequence databases with alignment blocks. Proceedings of the National Academy of Science USA, 91:12091–12095, 1994.
R. L. Tatusov and E. V. Koonin. A simple tool to search for sequence motifs that are conserved in Blast outputs. Comput. Appl. Biosci., 10:0–0, 1994.
W. R. Taylor. Pattern matching methods in protein sequence comparison and structure prediction. Protein Eng., 2(2):77–86, 1988.
W. R. Taylor. A template based method of pattern matching in protein sequences. Prog. Biophys. Molec. Biol., 54:159–252, 1989.
W. R. Taylor and D. T. Jones. Templates, consensus patterns and motifs. Curr. Opin. Struct. Biol., 1:327–333, 1991.
M. S. Waterman. General methods of sequence comparison. Bull. Math. Biol., 46:473–500, 1984.
M. S. Waterman. Multiple sequence alignments by consensus. Nucleic Acids Res., 14:9095–9102, 1986.
M. S. Waterman. Consensus patterns in sequences. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences, pages 93–116. CRC Press, 1989.
M. S. Waterman. Consensus methods for DNA and protein sequence alignment. In Meth. Enzymol., volume 183, pages 221–237. 1990.
M. S. Waterman, R. Arratia, and D. J. Galas. Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46:515–527, 1984.

Author information

Authors and Affiliations

Atelier de BioInformatique, CPASO - URA CNRS 448, Section de Recherche de l'Institut Curie, 26, Rue d'Ulm, 75005, Paris, France
Marie-France Sagot & Alain Viari
Institut Gaspard Monge, Université de Marne la Vallée, 2, rue de la Butte Verte, 93160, Noisy le Grand
Marie-France Sagot

Authors

Marie-France Sagot
Alain Viari

Editor information

Dan Hirschberg, Gene Myers

About this paper

Cite this paper

Sagot, M.F., Viari, A. (1996). A double combinatorial approach to discovering patterns in biological sequences. In: Hirschberg, D., Myers, G. (eds) Combinatorial Pattern Matching. CPM 1996. Lecture Notes in Computer Science, vol 1075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61258-0_15

DOI: https://doi.org/10.1007/3-540-61258-0_15
Published: 01 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61258-2
Online ISBN: 978-3-540-68390-2
eBook Packages: Springer Book Archive