Abstract
A pattern is a feature that occurs repeatedly in biological sequences, typically more often than expected at random. Patterns often correspond to functionally or structurally important elements in proteins and DNA sequences. Pattern discovery is one of the fundamental problems in bioinformatics. It has applications in multiple sequence alignment, protein structure and function prediction, drug target discovery, characterization of protein families, and promoter signal detection.
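As a rough illustration of "more often than expected at random", the sketch below counts every k-mer in a set of sequences and compares each observed count with the count expected if letters were drawn independently at the frequencies seen in the input. The function name `kmer_enrichment` and the independent-letter background model are assumptions made for this example; they are not taken from the chapter, and real pattern-discovery tools use richer models and proper significance tests.

```python
from collections import Counter
from math import prod


def kmer_enrichment(seqs, k):
    """Return {kmer: (observed, expected)} under an independent-letter background.

    Hypothetical helper for illustration only.
    """
    # Background letter frequencies, estimated from the input sequences themselves.
    letters = Counter("".join(seqs))
    total = sum(letters.values())
    freq = {a: n / total for a, n in letters.items()}

    # Observed counts of every (overlapping) k-mer.
    observed = Counter(
        s[i:i + k] for s in seqs for i in range(len(s) - k + 1)
    )

    # Total number of length-k windows across all sequences.
    windows = sum(max(len(s) - k + 1, 0) for s in seqs)

    # Expected count of a word = windows * product of its letter frequencies.
    return {
        w: (n, windows * prod(freq[c] for c in w))
        for w, n in observed.items()
    }


if __name__ == "__main__":
    toy = ["ACGTATAACG", "GGTATAATCC", "CTATAAGGCA"]
    for word, (obs, exp) in sorted(
        kmer_enrichment(toy, 5).items(), key=lambda kv: -kv[1][0]
    )[:3]:
        print(word, obs, round(exp, 3))
```

In the toy input, `TATAA` appears once in each of the three sequences, far more often than the background model predicts, which is the kind of signal a pattern-discovery method looks for.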
Suggested Readings
Biological Motivation and Introductory Reading
Batzoglou, S., Pachter, L., Mesirov, J. P., Berger, B., and Lander, E. S. (2000) Human and mouse gene structure: comparative analysis and application to exon prediction, Genome Res. 10(7), 950–958.
Fickett, J. W. and Hatzigeorgiou, A. G. (1997) Eukaryotic promoter recognition, Genome Res. 7(9), 861–868.
Gelfand, M. S., Koonin, E. V., and Mironov, A. A. (2000) Prediction of transcription regulatory sites in Archaea by a comparative genomic approach, Nucleic Acids Res. 28(3), 695–705.
Gomez, M., Johnson, S., and Gennaro, M. L. (2000) Identification of secreted proteins of Mycobacterium tuberculosis by a bioinformatic approach, Infect. Immun. 68(4), 2323–2327.
Hardison, R. C., Oeltjen, J., and Miller, W. (1997) Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome, Genome Res. 7(10), 959–966.
Hughes, J. D., Estep, P. W., Tavazoie, S., and Church, G. M. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae, J. Mol. Biol. 296(5), 1205–1214.
Linial, M., Linial, N., Tishby, N., and Yona, G. (1997) Global self-organization of all known protein sequences reveals inherent biological signatures, J. Mol. Biol. 268(2), 539–546.
Mironov, A. A., Koonin, E. V., Roytberg, M. A., and Gelfand, M. S. (1999) Computer analysis of transcription regulatory patterns in completely sequenced bacterial genomes, Nucleic Acids Res. 27(14), 2981–2989.
Riechmann, J. L., Heard, J., Martin, G., Reuber, L., Jiang, C., Keddie, J., et al. (2000) Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes, Science 290(5499), 2105–2110.
Yada, T., Totoki, Y., Ishii, T., and Nakai, K. (1997) Functional prediction of B. subtilis genes from their regulatory sequences, in: Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology (ISMB) (Gaasterland, T., Karp, P., Ouzounis, C., Sander, C., and Valencia, A., eds.) The AAAI Press, Halkidiki, Greece, pp. 354–357.
Related Books and Surveys
Brazma, A., Jonassen, I., Eidhammer, I., and Gilbert, D. (1998) Approaches to the automatic discovery of patterns in biosequences, J. Comp. Biol. 5(2), 279–305.
Brejová, B., DiMarco, C., Vinar, T., Hidalgo, S. R., Holguin, G., and Patten, C. (2000) Finding Patterns in Biological Sequences, Technical Report CS-2000–22, Dept. of Computer Science, University of Waterloo, Ontario, Canada.
Gusfield, D. (1997) Algorithms on strings, trees and sequences: computer science and computational biology, Chapman & Hall, New York, NY.
Pevzner, P. A. (2000) Computational molecular biology: an algorithmic approach, The MIT Press, Cambridge, MA.
Rigoutsos, I., Floratos, A., Parida, L., Gao, Y., and Platt, D. (2000) The emergence of pattern discovery techniques in computational biology, Metabolic Eng. 2(3), 159–167.
Pattern Discovery
Gorodkin, J., Heyer, L. J., Brunak, S., and Stormo, G. D. (1997) Displaying the information contents of structural RNA alignments: the structure logos, Comp. Appl. Biosci. 13(6), 583–586.
Schneider, T. D. and Stephens, R. M. (1990) Sequence logos: a new way to display consensus sequences, Nucleic Acids Res. 18(20), 6097–6100.
Algorithms for Pattern Discovery
Exhaustive Search Methods
Jonassen, I. (1996) Efficient discovery of conserved patterns using a pattern graph, Technical Report 118, Department of Informatics, University of Bergen, Norway.
Parida, L., Rigoutsos, I., Floratos, A., Platt, D., and Gao, Y. (2000) Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and an efficient polynomial time algorithm, in: Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), ACM Press, San Francisco, CA, pp. 297–308.
Pevzner, P. A. and Sze, S. H. (2000) Combinatorial approaches to finding subtle signals in DNA sequences, in: Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB), (Bourne, P., Gribskov, M., Altman, R., Jensen, N., Hope, D., Lengauer, T., et al., eds.) The AAAI Press, San Diego, CA, pp. 269–278.
Rigoutsos, I. and Floratos, A. (1998) Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm, Bioinformatics 14(1), 55–67. Published erratum appears in Bioinformatics, 14(2), 229.
Rigoutsos, I. and Floratos, A. (1998) Motif discovery without alignment or enumeration (extended abstract), in: Proceedings of the 2nd Annual International Conference on Computational Molecular Biology (RECOMB), (Istrail, S., Pevzner, P., Waterman, M., eds.) ACM Press, New York, NY, pp. 221–227.
Smith, H. O., Annau, T. M., and Chandrasegaran, S. (1990) Finding sequence motifs in groups of functionally related proteins, Proc. Natl. Acad. Sci. USA 87(2), 826–830.
Tompa, M. (1999) An exact method for finding short motifs in sequences, with application to the ribosome binding site problem, in: Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB), (Glasgow, J., Littlejohn, T., Major, F., Lathrop, R., Sankoff, D., and Sensen, C., eds.) The AAAI Press, Montreal, Canada, pp. 262–271.
van Helden, J., Andre, B., and Collado-Vides, J. (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies, J. Mol. Biol. 281(5), 827–842.
Iterative Methods
Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton, J. C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment, Science 262(5131), 208–214.
Li, M., Ma, B., and Wang, L. (1999) Finding Similar Regions in Many Strings, in: Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC), Atlanta, ACM Press, Portland, OR, pp. 473–482.
Liang, C. (2001) COPIA: A New Software for Finding Consensus Patterns in Unaligned Protein Sequences. Master thesis, University of Waterloo.
Liu, J. S., Neuwald, A. F., and Lawrence, C. E. (1995) Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies, J. Am. Stat. Assoc. 90(432), 1156–1170.
Neuwald, A. F., Liu, J. S., Lipman, D. J., and Lawrence, C. E. (1997) Extracting protein alignment models from the sequence database, Nucleic Acids Res. 25(9), 1665–1667.
Singh, M., Berger, B., Kim, P. S., Berger, J. M., and Cochran, A. G. (1998) Computational learning reveals coiled coil-like motifs in histidine kinase linker domains, Proc. Natl. Acad. Sci. USA 95(6), 2738–2743.
Zhang, M. Q. (1998) Statistical features of human exons and their flanking regions, Human Mol. Genet. 7(5), 919–922.
Machine Learning Methods
Bailey, T. L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers, in: Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology (ISMB), (Altman, R., Brutlag, D., Karp, P., Lathrop, R., and Searls, D., eds.) The AAAI Press, Stanford, CA, pp. 28–36.
Bailey, T. L. and Elkan, C. (1995) Unsupervised learning of multiple motifs in biopolymers using expectation maximization, Machine Learning 21(1/2), 51–80.
Lawrence, C. E. and Reilly, A. A. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, Proteins, 7(1), 41–51.
Hidden Markov Models
Durbin, R., Eddy, S. R., Krogh, A., and Mitchison, G. (1998), Biological Sequence Analysis, Cambridge University Press, Cambridge, UK.
Grundy, W. N., Bailey, T. L., Elkan, C. P., and Baker, M. E. (1997) Meta-MEME: motif-based hidden Markov models of protein families, Comp. Appl. Biosci. 13(4), 397–406.
Hughey, R. and Krogh, A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method, Comp. Appl. Biosci. 12(2), 95–107.
Krogh, A., Brown, M., Mian, I. S., Sjolander, K., and Haussler, D. (1994) Hidden Markov models in computational biology. Applications to protein modeling, J. Mol. Biol. 235(5), 1501–1531.
Pattern Discovery Using Additional Information
Blanchette, M., Schwikowski, B., and Tompa, M. (2000) An exact algorithm to identify motifs in orthologous sequences from multiple species, in: Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB), (Bourne, P., Gribskov, M., Altman, R., Jensen, N., Hope, D., Lengauer, T., et al., eds.) The AAAI Press, San Diego, CA, pp. 37–45.
Chiang, D. Y., Brown, P. O., and Eisen, M. B. (2001), Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles, Bioinformatics 17(S1), S49-S55.
Eidhammer, I., Jonassen, I., and Taylor, W. R. (2000) Structure comparison and structure patterns, J. Comp. Biol. 7(5), 685–716.
Gorodkin, J., Heyer, L. J., and Stormo, G. D. (1997b) Finding the most significant common sequence and structure motifs in a set of RNA sequences, Nucleic Acids Res. 25(18), 3724–3732.
Ison, J. C., Blades, M. J., Bleasby, A. J., Daniel, S. C., Parish, J. H., and Findlay, J. B. (2000) Key residues approach to the definition of protein families and analysis of sparse family signatures, Proteins 40(2), 330–331.
Nevill-Manning, C. G., Wu, T. D., and Brutlag, D. L. (1998) Highly specific protein sequence motifs for genome analysis, Proc. Natl. Acad. Sci. USA 95(11), 5865–5871.
Pedersen, A. G., Baldi, P., Chauvin, Y., and Brunak, S. (1999) The biology of eukaryotic promoter prediction-a review, Comp. Chem. 23(3–4), 191–207.
Finding Homologies Between Two Sequences
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25(17), 3389–3402.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool, J. Mol. Biol. 215(3), 403–410.
Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.-P., Rivals, E., and Vingron, M. (1999) q-gram based database searching using a suffix array (QUASAR), in: Proceedings of the 3rd Annual International Conference on Computational Molecular Biology (RECOMB), ACM Press, Lyon, France, pp. 77–83.
Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O., and Salzberg, S. L. (1999) Alignment of whole genomes, Nucleic Acids Res. 27(11), 2369–2376.
Gish, W. (2001) WU-Blast website (see Website: http://www.blast.wustl.edu).
Huang, X. and Miller, W. (1991) A time-efficient, linear-space local similarity algorithm, Adv. Appl. Math. 12(3), 337–357. (see SIM Website: http://www.expasy.ch/tools/sim.html)
Kurtz, S. and Schleiermacher, C. (1999) REPuter: fast computation of maximal repeats in complete genomes, Bioinformatics 15(5), 426–427.
Lipman, D. J. and Pearson, W. R. (1985) Rapid and sensitive protein similarity searches, Science 227(4693), 1435–1441.
Ma, B., Tromp, J., and Li, M. (2002) PatternHunter: faster and more sensitive homology search, Bioinformatics 18(3), 440–445.
Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences, J. Mol. Biol. 147(1), 195–197.
States, D. J. and Agarwal, P. (1996) Compact encoding strategies for DNA sequence similarity search, in: Proceedings of the 4th International Conference on Intelligent Systems for Molecular Biology (ISMB), (States, D. J., Agarwal, P., Gaasterland, T., Hunter, L., and Smith, R. F., eds.) The AAAI Press, St. Louis, MO, pp. 211–217. (see SENSEI Website: http://www.stateslab.wustl.edu/software/sensei/).
Tatusova, T. A. and Madden, T. L. (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences, FEMS Microbiol. Lett. 174(2), 247–250.
Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000) A greedy algorithm for aligning DNA sequences, J. Comp. Biol. 7(1–2), 203–214.
Assessment of Pattern Quality
Nicodème, P., Salvy, B., and Flajolet, P. (1999) Motif statistics, in: Algorithms — ESA ’99, 7th Annual European Symposium, vol. 1643, Lecture Notes in Computer Science, (Nesetril, J., ed.), Springer, Prague, pp. 194–211.
Pesole, G., Liuni, S., and D’Souza, M. (2000) PatSearch: a pattern matcher software that finds functional elements in nucleotide and protein sequences and assesses their statistical significance, Bioinformatics 16(5), 439–440.
Rocke, E. and Tompa, M. (1998) An algorithm for finding novel gapped motifs in DNA sequences, in: Proceedings of the 2nd Annual International Conference on Computational Molecular Biology (RECOMB), (Istrail, S., Pevzner, P., and Waterman, M., eds.), ACM Press, New York, NY, pp. 228–233.
Copyright information
© 2003 Springer Science+Business Media New York
About this chapter
Cite this chapter
Brejová, B., Vinar, T., Li, M. (2003). Pattern Discovery. In: Krawetz, S.A., Womble, D.D. (eds) Introduction to Bioinformatics. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-59259-335-4_29
DOI: https://doi.org/10.1007/978-1-59259-335-4_29
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-58829-241-4
Online ISBN: 978-1-59259-335-4
eBook Packages: Springer Book Archive