A double combinatorial approach to discovering patterns in biological sequences

  • Conference paper
Combinatorial Pattern Matching (CPM 1996)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1075)

Abstract

We present an algorithm for finding degenerate common features by multiple comparison of a set of biological sequences (nucleic acids or proteins). The features of interest are words in the sequences. To locate them, the algorithm uses the concept of a model, which we introduced earlier. A model can be seen as a generalization of a consensus pattern as defined by Waterman [42]: it is an object against which the words in the sequences are compared and which serves as an identifier for groups of similar words. The algorithm given here innovates on our previous work in that the models are defined over what we call a weighted combinatorial cover. This is a collection of sets chosen among all possible subsets of the alphabet Σ of nucleotides or amino acids, including the wild card {Σ}, with a weight attached to each set indicating the number of times it may appear in a model. In this way, we explore both the space of models and that of alphabets. The words related to a model defined over such a combinatorial cover, and thus considered similar, are those that either belong to the model or present at most a certain number of errors with respect to a nearest element of it. Two algorithmic ideas allow us to deal with this double combinatorics: one concerns a left-to-right minimality of the sets composing a model; the other involves making a sketch of the solution space before exploring it in detail.
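The notions of weighted cover, model, and occurrence with errors can be illustrated with a small sketch. It is not the algorithm of the paper: it only checks a given model against a cover and scans one sequence for its occurrences. The example cover, the function names, the reading of a weight as a per-model upper bound (`None` meaning unlimited), and the restriction of errors to mismatches are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical weighted cover over the DNA alphabet. Each set maps to the
# maximum number of times it may appear in a model (None = no limit).
SIGMA = frozenset("ACGT")
COVER = {
    frozenset("A"): None,
    frozenset("C"): None,
    frozenset("G"): None,
    frozenset("T"): None,
    frozenset("AG"): 1,  # purines, at most once per model
    frozenset("CT"): 1,  # pyrimidines, at most once per model
    SIGMA: 1,            # wild card, at most once per model
}


def respects_weights(model, cover):
    """Return True if every set of the model is in the cover and no set
    is used more often than its weight allows."""
    for subset, count in Counter(model).items():
        if subset not in cover:
            return False
        limit = cover[subset]
        if limit is not None and count > limit:
            return False
    return True


def mismatches(word, model):
    """Number of mismatches between a word and a nearest element of the model.

    Positions are independent, so the nearest element differs from the word
    exactly at the positions whose letter is outside the corresponding set.
    """
    if len(word) != len(model):
        raise ValueError("word and model must have the same length")
    return sum(1 for letter, subset in zip(word, model) if letter not in subset)


def occurrences(model, sequence, max_errors):
    """Start positions of the words of the sequence related to the model,
    allowing at most max_errors mismatches (assumption: no gaps)."""
    k = len(model)
    return [
        i
        for i in range(len(sequence) - k + 1)
        if mismatches(sequence[i:i + k], model) <= max_errors
    ]


model = [frozenset("A"), frozenset("AG"), SIGMA, frozenset("T")]
print(respects_weights(model, COVER))
print(occurrences(model, "AGCTACGT", max_errors=1))
```

In this toy setting a word "belongs to" the model when `mismatches` returns 0. Bounding how often degenerate sets such as the wild card may appear keeps models from becoming uninformative, which is one plausible motivation for attaching weights to the sets of the cover.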

References

  1. A. Bairoch. PROSITE: A dictionary of protein sites and patterns. Nucl. Acids Res., 20:2013–2018, 1992.

  2. D. Bashford, C. Chothia, and A. M. Lesk. Determinants of a protein fold: unique features of the globin amino acid sequence. J. Mol. Biol., 212:389–402, 1987.

  3. S. C. Chan, A. K. Wong, and D. K. Chiu. A survey of multiple sequence comparison methods. Bull. Math. Biol., 54:563–598, 1992.

  4. B. Clift, D. Haussler, R. McConnell, T. D. Schneider, and G. D. Stormo. Sequence landscapes. Nucleic Acids Res., 14:141–158, 1986.

  5. D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J. Mol. Biol., 186:117–128, 1985.

  6. M. T. Gallegos, C. Michan, and J. L. Ramos. The XylS/AraC family of regulators. Nucl. Acids Res., 21:807–810, 1993.

  7. M. Gribskov, R. Luthy, and D. Eisenberg. Profile analysis. Meth. Enzymol., 183:146–159, 1990.

  8. M. Gribskov, M. McLachlan, and D. Eisenberg. Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Science USA, 84:4355–4358, 1987.

  9. J. D. Helmann. Compilation and analysis of Bacillus subtilis σA-dependent promoter sequences: evidence for extended contact between RNA polymerase and upstream promoter DNA. Nucleic Acids Res., 23:2351–2360, 1995.

  10. G. Z. Hertz, G. W. Hartzell, and G. D. Stormo. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci., 6:81–92, 1990.

  11. S. Karlin and G. Ghandour. The use of multiple alphabets in kappa-gene immunoglobulin DNA sequence comparisons. The EMBO Journal, 4:1217–1223, 1985.

  12. S. Karlin, M. Morris, G. Ghandour, and M.-Y. Leung. Efficient algorithms for molecular sequence analysis. Proceedings of the National Academy of Science USA, 85:841–845, 1988.

  13. A. Krogh, M. Brown, I. S. Mian, K. Sjoelander, and D. Haussler. Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol., 235:1501–1531, 1994.

  14. A. M. Landraud, J. F. Avril, and P. Chretienne. An algorithm for finding a common structure shared by a family of strings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(8):890–895, 1989.

  15. C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208–214, 1993.

  16. C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7:41–51, 1990.

  17. H. M. Martinez. An efficient method for finding repeats in molecular sequences. Nucleic Acids Res., 11:4629–4634, 1983.

  18. A. F. Neuwald and P. Green. Detecting patterns in protein sequences. J. Mol. Biol., 239:698–712, 1994.

  19. J. Posfai, A.S. Bhagwat, G. Posfai, and R.J. Roberts. Prediction motifs derived from cytosine methyltransferases. Nucl. Acids Res., 17:2421–2435, 1989.

  20. M. J. Rooman, J. Rodriguez, and S. J. Wodak. Relations between protein sequence and structure and their significance. J. Mol. Biol., 213:337–350, 1990.

  21. M. J. Rooman and S. J. Wodak. Identification of predictive sequence motifs limited by protein structure database size. Nature, 335:45–49, 1988.

  22. M. F. Sagot, V. Escalier, A. Viari, and H. Soldano. Searching for repeated words in a text allowing for mismatches and gaps. Viña del Mar, Chile, 1995. Second South American Workshop on String Processing.

  23. M. F. Sagot, A. Viari, J. Pothier, V. Escalier, and H. Soldano. Multiple comparison in biology: some mathematical formalizations of the problem and combinatorial approaches to solve it. Submitted to Discrete Applied Mathematics.

  24. M. F. Sagot, A. Viari, and H. Soldano. A distance-based block searching algorithm. Cambridge, England, 1995. Third International Symposium on Intelligent Systems for Molecular Biology.

  25. M. F. Sagot, A. Viari, and H. Soldano. Multiple comparison: a peptide matching approach. In Proc. Combinatorial Pattern Matching Conf. 95, volume 907 of Lecture Notes in Computer Science, pages 366–385, Helsinki, Finland, 1995. Springer-Verlag, to appear in Theor. Comput. Science.

  26. M. A. S. Saqi and M. J. E. Sternberg. Identification of sequence motifs from a set of proteins with related function. Protein Eng., 7:165–171, 1994.

  27. G. D. Schuler, S. F. Altschul, and D. J. Lipman. A workbench for multiple alignment construction and analysis. Proteins, 9:180–190, 1991.

  28. R. P. Sheridan and R. Venkataraghavan. A systematic search for protein signature sequences. Proteins, 14:16–28, 1992.

  29. H. O. Smith, T. M. Annau, and S. Chandrasegaran. Finding sequence motifs in groups of functionally related proteins. Proceedings of the National Academy of Science USA, 87:826–830, 1990.

  30. R. F. Smith and T. S. Smith. Automatic generation of primary sequence patterns from sets of related protein sequences. Proceedings of the National Academy of Science USA, 87:118–122, 1990.

  31. E. Sobel and H. M. Martinez. A multiple sequence alignment program. Nucleic Acids Res., 14:363–374, 1986.

  32. G. D. Stormo. Consensus patterns in DNA. Meth. Enzymol., 183:211–221, 1990.

  33. R. L. Tatusov, S. F. Altschul, and E. V. Koonin. Detection of conserved segments in proteins: Iterative scanning of sequence databases with alignment blocks. Proceedings of the National Academy of Science USA, 91:12091–12095, 1994.

  34. R. L. Tatusov and E. V. Koonin. A simple tool to search for sequence motifs that are conserved in Blast outputs. Comput. Appl. Biosci., 10:0–0, 1994.

  35. W. R. Taylor. Pattern matching methods in protein sequence comparison and structure prediction. Protein Eng., 2(2):77–86, 1988.

  36. W. R. Taylor. A template based method of pattern matching in protein sequences. Prog. Biophys. Molec. Biol., 54:159–252, 1989.

  37. W. R. Taylor and D. T. Jones. Templates, consensus patterns and motifs. Curr. Opin. Struct. Biol., 1:327–333, 1991.

  38. M. S. Waterman. General methods of sequence comparison. Bull. Math. Biol., 46:473–500, 1984.

  39. M. S. Waterman. Multiple sequence alignments by consensus. Nucleic Acids Res., 14:9095–9102, 1986.

  40. M. S. Waterman. Consensus patterns in sequences. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences, pages 93–116. CRC Press, 1989.

  41. M. S. Waterman. Consensus methods for DNA and protein sequence alignment. In Meth. Enzymol., volume 183, pages 221–237. 1990.

  42. M. S. Waterman, R. Arratia, and D. J. Galas. Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46:515–527, 1984.

Editor information

Dan Hirschberg Gene Myers

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sagot, M.F., Viari, A. (1996). A double combinatorial approach to discovering patterns in biological sequences. In: Hirschberg, D., Myers, G. (eds) Combinatorial Pattern Matching. CPM 1996. Lecture Notes in Computer Science, vol 1075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61258-0_15

  • DOI: https://doi.org/10.1007/3-540-61258-0_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-61258-2

  • Online ISBN: 978-3-540-68390-2

  • eBook Packages: Springer Book Archive
