Discovering Sequence Motifs

Bailey, Timothy L.

doi:10.1007/978-1-60327-159-2_12

Part of the book series: Methods in Molecular Biology™ (MIMB, volume 452)

Abstract

Sequence motif discovery algorithms are an important part of the computational biologist's toolkit. The purpose of motif discovery is to discover patterns in biopolymer (nucleotide or protein) sequences in order to better understand the structure and function of the molecules the sequences represent. This chapter provides an overview of the use of sequence motif discovery in biology and a general guide to the use of motif discovery algorithms. The chapter discusses the types of biological features that DNA and protein motifs can represent and their usefulness. It also defines what sequence motifs are, how they are represented, and general techniques for discovering them. The primary focus is on one aspect of motif discovery: discovering motifs in a set of unaligned DNA or protein sequences. Also presented are steps useful for checking the biological validity and investigating the function of sequence motifs using methods such as motif scanning—searching for matches to motifs in a given sequence or a database of sequences. A discussion of some limitations of motif discovery concludes the chapter.
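The motif scanning step mentioned above can be illustrated with a minimal sketch. It assumes a motif represented as a log-odds position weight matrix built from a handful of aligned sites, a uniform background, and a fixed score threshold. The function names (`build_pwm`, `scan`), the example sites, the pseudocount, and the threshold are all hypothetical choices made for illustration; they are not taken from the chapter.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=None):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    if background is None:
        # Assumption: uniform background letter frequencies.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        # Start every count at the pseudocount so unseen letters get a finite score.
        counts = {a: pseudocount for a in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {a: math.log2((counts[a] / total) / background[a]) for a in alphabet}
        )
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at or above the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Made-up aligned sites, used only to exercise the functions.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = build_pwm(sites)
hits = scan("GGCTATAATGCGC", pwm, threshold=5.0)
for start, score in hits:
    print(start, round(score, 2))
```

This sketch scans one strand only and expects sequences containing just the letters A, C, G, and T. In practice a threshold would normally be chosen from a score distribution or p-value rather than fixed by hand.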

Author information

Authors and Affiliations

ARC Centre of Excellence in Bioinformatics, and Institute for Molecular Bioscience, The University of Queensland, Brisbane, Queensland, Australia
Timothy L. Bailey

Editor information

Editors and Affiliations

School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
Jonathan M. Keith PhD

About this protocol

Cite this protocol

Bailey, T.L. (2008). Discovering Sequence Motifs. In: Keith, J.M. (eds) Bioinformatics. Methods in Molecular Biology™, vol 452. Humana Press. https://doi.org/10.1007/978-1-60327-159-2_12

DOI: https://doi.org/10.1007/978-1-60327-159-2_12
Publisher Name: Humana Press
Print ISBN: 978-1-58829-707-5
Online ISBN: 978-1-60327-159-2
eBook Packages: Springer Protocols