Pattern Discovery

Methods and Software

  • Chapter
Introduction to Bioinformatics

Abstract

A pattern is a feature that occurs repeatedly in biological sequences, typically more often than expected at random. Patterns often correspond to functionally or structurally important elements in proteins and DNA sequences. Pattern discovery is one of the fundamental problems in bioinformatics. It has applications in multiple sequence alignment, protein structure and function prediction, drug target discovery, characterization of protein families, and promoter signal detection.
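
The phrase "more often than expected at random" can be made concrete with a small sketch. The example below is illustrative only and is not a method taken from the chapter: it assumes the simplest possible background model (independent letters with frequencies estimated from the input itself), and the function name `kmer_enrichment` is hypothetical. It counts every length-k substring and reports the observed count next to the count expected under that background.

```python
from collections import Counter
from math import prod


def kmer_enrichment(sequences, k):
    """Return {kmer: (observed, expected)} under an i.i.d. letter model.

    Assumption: background letter frequencies are estimated from the
    input sequences themselves, and letters are treated as independent.
    """
    # Estimate background letter frequencies from all sequences pooled.
    letters = Counter("".join(sequences))
    total = sum(letters.values())
    freq = {letter: n / total for letter, n in letters.items()}

    # Count every (overlapping) occurrence of each k-mer.
    observed = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )

    # Number of positions at which a k-mer can start.
    positions = sum(max(len(seq) - k + 1, 0) for seq in sequences)

    # Expected count = start positions * probability of the k-mer.
    return {
        kmer: (count, positions * prod(freq[letter] for letter in kmer))
        for kmer, count in observed.items()
    }


if __name__ == "__main__":
    toy = ["ACGTTATAAT", "GGTATAATCC", "TATAATGCAC"]
    for kmer, (obs, exp) in sorted(
        kmer_enrichment(toy, 6).items(), key=lambda kv: kv[1][0] / kv[1][1]
    )[-3:]:
        print(kmer, obs, round(exp, 4))
```

A k-mer whose observed count is far above its expected count is a candidate pattern in this simplified sense. Practical tools typically use richer background models and a proper significance test rather than a raw observed/expected comparison.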

Suggested Readings

Biological Motivation and Introductory Reading

  • Batzoglou, S., Pachter, L., Mesirov, J. P., Berger, B., and Lander, E. S. (2000) Human and mouse gene structure: comparative analysis and application to exon prediction, Genome Res. 10(7), 950–958.

  • Fickett, J. W. and Hatzigeorgiou, A. G. (1997) Eukaryotic promoter recognition, Genome Res. 7(9), 861–868.

  • Gelfand, M. S., Koonin, E. V., and Mironov, A. A. (2000) Prediction of transcription regulatory sites in Archaea by a comparative genomic approach, Nucleic Acids Res. 28(3), 695–705.

  • Gomez, M., Johnson, S., and Gennaro, M. L. (2000) Identification of secreted proteins of Mycobacterium tuberculosis by a bioinformatic approach, Infect. Immun. 68(4), 2323–2327.

  • Hardison, R. C., Oeltjen, J., and Miller, W. (1997) Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome, Genome Res. 7(10), 959–966.

  • Hughes, J. D., Estep, P. W., Tavazoie, S., and Church, G. M. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae, J. Mol. Biol. 296(5), 1205–1214.

  • Linial, M., Linial, N., Tishby, N., and Yona, G. (1997) Global self-organization of all known protein sequences reveals inherent biological signatures, J. Mol. Biol. 268(2), 539–546.

  • Mironov, A. A., Koonin, E. V., Roytberg, M. A., and Gelfand, M. S. (1999) Computer analysis of transcription regulatory patterns in completely sequenced bacterial genomes, Nucleic Acids Res. 27(14), 2981–2989.

  • Riechmann, J. L., Heard, J., Martin, G., Reuber, L., Jiang, C., Keddie, J., et al. (2000) Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes, Science 290(5499), 2105–2110.

  • Yada, T., Totoki, Y., Ishii, T., and Nakai, K. (1997) Functional prediction of B. subtilis genes from their regulatory sequences, in: Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology (ISMB) (Gaasterland, T., Karp, P., Ouzounis, C., Sander, C., and Valencia, A., eds.) The AAAI Press, Halkidiki, Greece, pp. 354–357.

Related Books and Surveys

  • Brazma, A., Jonassen, I., Eidhammer, I., and Gilbert, D. (1998) Approaches to the automatic discovery of patterns in biosequences, J. Comp. Biol. 5(2), 279–305.

  • Brejová, B., DiMarco, C., Vinar, T., Hidalgo, S. R., Holguin, G., and Patten, C. (2000) Finding Patterns in Biological Sequences, Technical Report CS-2000-22, Dept. of Computer Science, University of Waterloo, Ontario, Canada.

  • Gusfield, D. (1997) Algorithms on strings, trees and sequences: computer science and computational biology, Chapman & Hall, New York, NY.

  • Pevzner, P. A. (2000) Computational molecular biology: an algorithmic approach, The MIT Press, Cambridge, MA.

  • Rigoutsos, I., Floratos, A., Parida, L., Gao, Y., and Platt, D. (2000) The emergence of pattern discovery techniques in computational biology, Metabolic Eng. 2(3), 159–167.

Pattern Discovery

  • Gorodkin, J., Heyer, L. J., Brunak, S., and Stormo, G. D. (1997) Displaying the information contents of structural RNA alignments: the structure logos, Comp. Appl. Biosci. 13(6), 583–586.

  • Schneider, T. D. and Stephens, R. M. (1990) Sequence logos: a new way to display consensus sequences, Nucleic Acids Res. 18(20), 6097–6100.

Algorithms for Pattern Discovery

Exhaustive Search Methods

  • Jonassen, I. (1996) Efficient discovery of conserved patterns using a pattern graph, Technical Report 118, Department of Informatics, University of Bergen, Norway.

  • Parida, L., Rigoutsos, I., Floratos, A., Platt, D., and Gao, Y. (2000) Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and an efficient polynomial time algorithm, in: Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), ACM Press, San Francisco, CA, pp. 297–308.

  • Pevzner, P. A. and Sze, S. H. (2000) Combinatorial approaches to finding subtle signals in DNA sequences, in: Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB), (Bourne, P., Gribskov, M., Altman, R., Jensen, N., Hope, D., Lengauer, T., et al., eds.) The AAAI Press, San Diego, CA, pp. 269–278.

  • Rigoutsos, I. and Floratos, A. (1998) Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm, Bioinformatics 14(1), 55–67. Published erratum appears in Bioinformatics, 14(2), 229.

  • Rigoutsos, I. and Floratos, A. (1998) Motif discovery without alignment or enumeration (extended abstract), in: Proceedings of the 2nd Annual International Conference on Computational Molecular Biology (RECOMB), (Istrail, S., Pevzner, P., Waterman, M., eds.) ACM Press, New York, NY, pp. 221–227.

  • Smith, H. O., Annau, T. M., and Chandrasegaran, S. (1990) Finding sequence motifs in groups of functionally related proteins, Proc. Natl. Acad. Sci. USA 87(2), 826–830.

  • Tompa, M. (1999) An exact method for finding short motifs in sequences, with application to the ribosome binding site problem, in: Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB), (Glasgow, J., Littlejohn, T., Major, F., Lathrop, R., Sankoff, D., and Sensen, C., eds.) The AAAI Press, Montreal, Canada, pp. 262–271.

  • van Helden, J., Andre, B., and Collado-Vides, J. (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies, J. Mol. Biol. 281(5), 827–832.

Iterative Methods

  • Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton, J. C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment, Science 262(5131), 208–214.

  • Li, M., Ma, B., and Wang, L. (1999) Finding Similar Regions in Many Strings, in: Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC), Atlanta, ACM Press, Portland, OR, pp. 473–482.

  • Liang, C. (2001) COPIA: A New Software for Finding Consensus Patterns in Unaligned Protein Sequences. Master's thesis, University of Waterloo.

  • Liu, J. S., Neuwald, A. F., and Lawrence, C. E. (1995) Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies, J. Am. Stat. Assoc. 90(432), 1156–1170.

  • Neuwald, A. F., Liu, J. S., Lipman, D. J., and Lawrence, C. E. (1997) Extracting protein alignment models from the sequence database, Nucleic Acids Res. 25(9), 1665–1667.

  • Singh, M., Berger, B., Kim, P. S., Berger, J. M., and Cochran, A. G. (1998) Computational learning reveals coiled coil-like motifs in histidine kinase linker domains, Proc. Natl. Acad. Sci. USA 95(6), 2738–2743.

  • Zhang, M. Q. (1998) Statistical features of human exons and their flanking regions, Human Mol. Genet. 7(5), 919–922.

Machine Learning Methods

  • Bailey, T. L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers, in: Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology (ISMB), (Altman, R., Brutlag, D., Karp, P., Lathrop, R., and Searls, D., eds.) The AAAI Press, Stanford, CA, pp. 28–36.

  • Bailey, T. L. and Elkan, C. (1995) Unsupervised learning of multiple motifs in biopolymers using expectation maximization, Machine Learning 21(1/2), 51–80.

  • Lawrence, C. E. and Reilly, A. A. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, Proteins, 7(1), 41–51.

Hidden Markov Models

  • Durbin, R., Eddy, S. R., Krogh, A., and Mitchison, G. (1998) Biological Sequence Analysis, Cambridge University Press, Cambridge, UK.

  • Grundy, W. N., Bailey, T. L., Elkan, C. P., and Baker, M. E. (1997) Meta-MEME: motif-based hidden Markov models of protein families, Comp. Appl. Biosci. 13(4), 397–406.

  • Hughey, R. and Krogh, A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method, Comp. Appl. Biosci. 12(2), 95–107.

  • Krogh, A., Brown, M., Mian, I. S., Sjolander, K., and Haussler, D. (1994) Hidden Markov models in computational biology. Applications to protein modeling, J. Mol. Biol. 235(5), 1501–1531.

Pattern Discovery Using Additional Information

  • Blanchette, M., Schwikowski, B., and Tompa, M. (2000) An exact algorithm to identify motifs in orthologous sequences from multiple species, in: Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB), (Bourne, P., Gribskov, M., Altman, R., Jensen, N., Hope, D., Lengauer, T., et al., eds.) The AAAI Press, San Diego, CA, pp. 37–45.

  • Chiang, D. Y., Brown, P. O., and Eisen, M. B. (2001) Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles, Bioinformatics 17(S1), S49–S55.

  • Eidhammer, I., Jonassen, I., and Taylor, W. R. (2000) Structure comparison and structure patterns, J. Comp. Biol. 7(5), 685–716.

  • Gorodkin, J., Heyer, L. J., and Stormo, G. D. (1997b) Finding the most significant common sequence and structure motifs in a set of RNA sequences, Nucleic Acids Res. 25(18), 3724–3732.

  • Ison, J. C., Blades, M. J., Bleasby, A. J., Daniel, S. C., Parish, J. H., and Findlay, J. B. (2000) Key residues approach to the definition of protein families and analysis of sparse family signatures, Proteins 40(2), 330–331.

  • Nevill-Manning, C. G., Wu, T. D., and Brutlag, D. L. (1998) Highly specific protein sequence motifs for genome analysis, Proc. Natl. Acad. Sci. USA 95(11), 5865–5871.

  • Pedersen, A. G., Baldi, P., Chauvin, Y., and Brunak, S. (1999) The biology of eukaryotic promoter prediction-a review, Comp. Chem. 23(3–4), 191–207.

Finding Homologies Between Two Sequences

  • Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25(17), 3389–3402.

  • Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool, J. Mol. Biol. 215(3), 403–410.

  • Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.-P., Rivals, E., and Vingron, M. (1999) q-gram based database searching using a suffix array (QUASAR), in: Proceedings of the 3rd Annual International Conference on Computational Molecular Biology (RECOMB), ACM Press, Lyon, France, pp. 77–83.

  • Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O., and Salzberg, S. L. (1999) Alignment of whole genomes, Nucleic Acids Res. 27(11), 2369–2376.

  • Gish, W. (2001) WU-Blast website (see Website: http://www.blast.wustl.edu).

  • Huang, X. and Miller, W. (1991) A time-efficient, linear-space local similarity algorithm, Adv. Appl. Math. 12(3), 337–357. (see SIM Website: http://www.expasy.ch/tools/sim.html)

  • Kurtz, S. and Schleiermacher, C. (1999) REPuter: fast computation of maximal repeats in complete genomes, Bioinformatics 15(5), 426–427.

  • Lipman, D. J. and Pearson, W. R. (1985) Rapid and sensitive protein similarity searches, Science 227(4693), 1435–1441.

  • Ma, B., Tromp, J., and Li, M. (2002) PatternHunter: faster and more sensitive homology search, Bioinformatics 18(3), 440–445.

  • Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences, J. Mol. Biol. 147(1), 195–197.

  • States, D. J. and Agarwal, P. (1996) Compact encoding strategies for DNA sequence similarity search, in: Proceedings of the 4th International Conference on Intelligent Systems for Molecular Biology (ISMB), (States, D. J., Agarwal, P., Gaasterland, T., Hunter, L., and Smith, R. F., eds.) The AAAI Press, St. Louis, MO, pp. 211–217. (see SENSEI Website: http://www.stateslab.wustl.edu/software/sensei/).

  • Tatusova, T. A. and Madden, T. L. (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences, FEMS Microbiol. Lett. 174(2), 247–250.

  • Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000) A greedy algorithm for aligning DNA sequences, J. Comp. Biol. 7(1–2), 203–214.

Assessment of Pattern Quality

  • Nicodème, P., Salvy, B., and Flajolet, P. (1999) Motif statistics, in: Algorithms — ESA ’99, 7th Annual European Symposium, vol. 1643, Lecture Notes in Computer Science, (Nesetril, J., ed.), Springer, Prague, pp. 194–211.

  • Pesole, G., Liuni, S., and D’Souza, M. (2000) PatSearch: a pattern matcher software that finds functional elements in nucleotide and protein sequences and assesses their statistical significance, Bioinformatics 16(5), 439–440.

  • Rocke, E. and Tompa, M. (1998) An algorithm for finding novel gapped motifs in DNA sequences, in: Proceedings of the 2nd Annual International Conference on Computational Molecular Biology (RECOMB), (Istrail, S., Pevzner, P., and Waterman, M., eds.), ACM Press, New York, NY, pp. 228–233.

Copyright information

© 2003 Springer Science+Business Media New York

About this chapter

Cite this chapter

Brejová, B., Vinar, T., Li, M. (2003). Pattern Discovery. In: Krawetz, S.A., Womble, D.D. (eds) Introduction to Bioinformatics. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-59259-335-4_29

  • DOI: https://doi.org/10.1007/978-1-59259-335-4_29

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-58829-241-4

  • Online ISBN: 978-1-59259-335-4

  • eBook Packages: Springer Book Archive
