A double combinatorial approach to discovering patterns in biological sequences

  • Marie-France Sagot
  • Alain Viari
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1075)


We present an algorithm for finding degenerated common features by multiple comparison of a set of biological sequences (nucleic acids or proteins). The features of interest are words in the sequences. To locate them, the algorithm uses the concept of a model, which we introduced earlier. A model can be seen as a generalization of a consensus pattern as defined by Waterman [42]: it is an object against which the words in the sequences are compared, and it serves as an identifier for groups of similar words. The algorithm given here departs from our previous work in that models are defined over what we call a weighted combinatorial cover. This is a collection of sets chosen among all possible subsets of the alphabet Σ of nucleotides or amino acids, including the wild card {Σ}, with a weight attached to each set indicating the number of times it may appear in a model. In this way, we explore both the space of models and that of alphabets. The words related to a model defined over such a combinatorial cover, and thus considered similar, are those that either belong to the model or present at most a given number of errors with respect to a nearest element of it. Two algorithmic ideas allow us to deal with this double combinatorics: the first concerns a left-to-right minimality of the sets composing a model, and the second involves making a sketch of the solution space before exploring it in detail.
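The notions of a weighted cover, a model, and a word related to a model can be illustrated with a minimal sketch. This is not the authors' algorithm: it only checks a given model against sequences and performs no search over models or alphabets. The cover `COVER`, its weights, and all function names are hypothetical; the sketch also assumes that each weight is an upper bound on the number of appearances and that errors are substitutions only.

```python
from collections import Counter

# Hypothetical weighted cover over the DNA alphabet (an assumption for
# illustration). Each set maps to the number of times it may appear in a
# model; None means no limit.
SIGMA = frozenset("ACGT")
COVER = {
    frozenset("A"): None,
    frozenset("C"): None,
    frozenset("G"): None,
    frozenset("T"): None,
    frozenset("AG"): 1,  # purines, at most once per model
    frozenset("CT"): 1,  # pyrimidines, at most once per model
    SIGMA: 1,            # wild card, at most once per model
}


def is_valid_model(model, cover):
    """A model is a sequence of cover sets that respects every weight."""
    counts = Counter(model)
    return all(
        s in cover and (cover[s] is None or n <= cover[s])
        for s, n in counts.items()
    )


def errors(word, model):
    """Count positions where the word's symbol lies outside the model's set.

    Assuming substitutions only, this is the distance from the word to a
    nearest word belonging to the model.
    """
    return sum(ch not in s for ch, s in zip(word, model))


def occurrences(model, sequence, max_errors):
    """Start positions of words in `sequence` related to `model`."""
    k = len(model)
    return [
        i
        for i in range(len(sequence) - k + 1)
        if errors(sequence[i:i + k], model) <= max_errors
    ]


def sequences_with_occurrence(model, sequences, max_errors):
    """Number of sequences containing at least one related word."""
    return sum(bool(occurrences(model, s, max_errors)) for s in sequences)
```

For example, the model `[{A}, {A,G}, Σ, {T}]` is valid under this cover, the word `AGCT` belongs to it, and `CGCT` is related to it with one error. A model using the wild card twice would be rejected by the weight check.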


Keywords: multiple comparison, weighted combinatorial cover, wild card, model, degenerated feature, left-to-right minimality of sets, sketch of solution space, DNA, protein




  1. A. Bairoch. PROSITE: A dictionary of protein sites and patterns. Nucl. Acids Res., 20:2013–2018, 1992.
  2. D. Bashford, C. Chothia, and A. M. Lesk. Determinants of a protein fold: unique features of the globin amino acid sequence. J. Mol. Biol., 212:389–402, 1987.
  3. S. C. Chan, A. K. Wong, and D. K. Chiu. A survey of multiple sequence comparison methods. Bull. Math. Biol., 54:563–598, 1992.
  4. B. Clift, D. Haussler, R. McConnell, T. D. Schneider, and G. D. Stormo. Sequence landscapes. Nucleic Acids Res., 14:141–158, 1986.
  5. D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J. Mol. Biol., 186:117–128, 1985.
  6. M. T. Gallegos, C. Michan, and J. L. Ramos. The XylS/AraC family of regulators. Nucl. Acids Res., 21:807–810, 1993.
  7. M. Gribskov, R. Luthy, and D. Eisenberg. Profile analysis. Meth. Enzymol., 183:146–159, 1990.
  8. M. Gribskov, M. McLachlan, and D. Eisenberg. Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Science USA, 84:4355–4358, 1987.
  9. J. D. Helmann. Compilation and analysis of Bacillus subtilis σA-dependent promoter sequences: evidence for extended contact between RNA polymerase and upstream promoter DNA. Nucleic Acids Res., 23:2351–2360, 1995.
  10. G. Z. Hertz, G. W. Hartzell, and G. D. Stormo. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci., 6:81–92, 1990.
  11. S. Karlin and G. Ghandour. The use of multiple alphabets in kappa-gene immunoglobulin DNA sequence comparisons. The EMBO Journal, 4:1217–1223, 1985.
  12. S. Karlin, M. Morris, G. Ghandour, and M.-Y. Leung. Efficient algorithms for molecular sequence analysis. Proceedings of the National Academy of Science USA, 85:841–845, 1988.
  13. A. Krogh, M. Brown, I. S. Mian, K. Sjoelander, and D. Haussler. Hidden Markov model in computational biology. Applications to protein modeling. J. Mol. Biol., 235:1501–1531, 1994.
  14. A. M. Landraud, J. F. Avril, and P. Chretienne. An algorithm for finding a common structure shared by a family of strings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(8):890–895, 1989.
  15. C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208–214, 1993.
  16. C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7:41–51, 1990.
  17. H. M. Martinez. An efficient method for finding repeats in molecular sequences. Nucleic Acids Res., 11:4629–4634, 1983.
  18. A. F. Neuwald and P. Green. Detecting patterns in protein sequences. J. Mol. Biol., 239:698–712, 1994.
  19. J. Posfai, A. S. Bhagwat, G. Posfai, and R. J. Roberts. Prediction motifs derived from cytosine methyltransferases. Nucl. Acids Res., 17:2421–2435, 1989.
  20. M. J. Rooman, J. Rodriguez, and S. J. Wodak. Relations between protein sequence and structure and their significance. J. Mol. Biol., 213:337–350, 1990.
  21. M. J. Rooman and S. J. Wodak. Identification of predictive sequence motifs limited by protein structure database size. Nature, 335:45–49, 1988.
  22. M. F. Sagot, V. Escalier, A. Viari, and H. Soldano. Searching for repeated words in a text allowing for mismatches and gaps. In Second South American Workshop on String Processing, Viña del Mar, Chile, 1995.
  23. M. F. Sagot, A. Viari, J. Pothier, V. Escalier, and H. Soldano. Multiple comparison in biology: some mathematical formalizations of the problem and combinatorial approaches to solve it. Submitted to Discrete Applied Mathematics.
  24. M. F. Sagot, A. Viari, and H. Soldano. A distance-based block searching algorithm. In Third International Symposium on Intelligent Systems for Molecular Biology, Cambridge, England, 1995.
  25. M. F. Sagot, A. Viari, and H. Soldano. Multiple comparison: a peptide matching approach. In Proc. Combinatorial Pattern Matching Conf. 95, volume 907 of Lecture Notes in Computer Science, pages 366–385, Helsinki, Finland, 1995. Springer-Verlag. To appear in Theor. Comput. Science.
  26. M. A. S. Saqi and M. J. E. Sternberg. Identification of sequence motifs from a set of proteins with related function. Protein Eng., 7:165–171, 1994.
  27. G. D. Schuler, S. F. Altschul, and D. J. Lipman. A workbench for multiple alignment construction and analysis. Proteins, 9:180–190, 1991.
  28. R. P. Sheridan and R. Venkataraghavan. A systematic search for protein signature sequences. Proteins, 14:16–28, 1992.
  29. H. O. Smith, T. M. Annau, and S. Chandrasegaran. Finding sequence motifs in groups of functionally related proteins. Proceedings of the National Academy of Science USA, 87:826–830, 1990.
  30. R. F. Smith and T. S. Smith. Automatic generation of primary sequence patterns from sets of related protein sequences. Proceedings of the National Academy of Science USA, 87:118–122, 1990.
  31. E. Sobel and H. M. Martinez. A multiple sequence alignment program. Nucleic Acids Res., 14:363–374, 1986.
  32. G. D. Stormo. Consensus patterns in DNA. Meth. Enzymol., 183:211–221, 1990.
  33. R. L. Tatusov, S. F. Altschul, and E. V. Koonin. Detection of conserved segments in proteins: Iterative scanning of sequence databases with alignment blocks. Proceedings of the National Academy of Science USA, 91:12091–12095, 1994.
  34. R. L. Tatusov and E. V. Koonin. A simple tool to search for sequence motifs that are conserved in Blast outputs. Comput. Appl. Biosci., 10:0–0, 1994.
  35. W. R. Taylor. Pattern matching methods in protein sequence comparison and structure prediction. Protein Eng., 2(2):77–86, 1988.
  36. W. R. Taylor. A template based method of pattern matching in protein sequences. Prog. Biophys. Molec. Biol., 54:159–252, 1989.
  37. W. R. Taylor and D. T. Jones. Templates, consensus patterns and motifs. Curr. Opin. Struct. Biol., 1:327–333, 1991.
  38. M. S. Waterman. General methods of sequence comparison. Bull. Math. Biol., 46:473–500, 1984.
  39. M. S. Waterman. Multiple sequence alignments by consensus. Nucleic Acids Res., 14:9095–9102, 1986.
  40. M. S. Waterman. Consensus patterns in sequences. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences, pages 93–116. CRC Press, 1989.
  41. M. S. Waterman. Consensus methods for DNA and protein sequence alignment. In Meth. Enzymol., volume 183, pages 221–237. 1990.
  42. M. S. Waterman, R. Arratia, and D. J. Galas. Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46:515–527, 1984.

Copyright information

© Springer-Verlag Berlin Heidelberg 1996

Authors and Affiliations

  • Marie-France Sagot (1, 2)
  • Alain Viari (1)

  1. Atelier de BioInformatique, CPASO - URA CNRS 448, Section de Recherche de l'Institut Curie, Paris, France
  2. Institut Gaspard Monge, Université de Marne la Vallée, Noisy le Grand
