Discovering Sequence Motifs

Part of the book series: Methods in Molecular Biology™ ((MIMB,volume 452))

Abstract

Sequence motif discovery algorithms are an important part of the computational biologist's toolkit. The purpose of motif discovery is to find patterns in biopolymer (nucleotide or protein) sequences in order to better understand the structure and function of the molecules the sequences represent. This chapter provides an overview of the use of sequence motif discovery in biology and a general guide to the use of motif discovery algorithms. The chapter discusses the types of biological features that DNA and protein motifs can represent and their usefulness. It also defines what sequence motifs are, how they are represented, and general techniques for discovering them. The primary focus is on one aspect of motif discovery: discovering motifs in a set of unaligned DNA or protein sequences. Also presented are steps useful for checking the biological validity and investigating the function of sequence motifs using methods such as motif scanning, that is, searching for matches to motifs in a given sequence or a database of sequences. A discussion of some limitations of motif discovery concludes the chapter.
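As a rough illustration of the motif scanning mentioned in the abstract, the sketch below builds a log-odds position weight matrix from a few aligned sites and slides it along a sequence. It is not taken from the chapter: the function names `build_log_odds` and `scan`, the example sites, the uniform background, the pseudocount of 1, and the score threshold are all assumptions chosen for illustration.

```python
import math


def build_log_odds(sites, alphabet="ACGT", pseudocount=1.0, background=None):
    """Build a log-odds position weight matrix from equal-length aligned sites.

    Illustrative helper (hypothetical name); assumes a uniform background
    unless one is supplied.
    """
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    matrix = []
    for pos in range(width):
        # Start every count at the pseudocount so no probability is zero.
        counts = {a: pseudocount for a in alphabet}
        for s in sites:
            counts[s[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {a: math.log2(counts[a] / total / background[a]) for a in alphabet}
        )
    return matrix


def scan(sequence, matrix, threshold):
    """Return (offset, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(col[base] for col, base in zip(matrix, window))
        if score >= threshold:
            hits.append((i, score))
    return hits


# Made-up example sites resembling a bacterial -10 box.
example_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
example_matrix = build_log_odds(example_sites)
example_hits = scan("GGCGTATAATGCC", example_matrix, threshold=5.0)
print(example_hits)
```

In practice the threshold would be chosen from a score distribution or a p-value rather than fixed by hand, and both strands of a DNA sequence would usually be scanned; this sketch checks only the forward strand.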

Copyright information

© 2008 Humana Press, a part of Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Bailey, T.L. (2008). Discovering Sequence Motifs. In: Keith, J.M. (eds) Bioinformatics. Methods in Molecular Biology™, vol 452. Humana Press. https://doi.org/10.1007/978-1-60327-159-2_12

  • DOI: https://doi.org/10.1007/978-1-60327-159-2_12

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-58829-707-5

  • Online ISBN: 978-1-60327-159-2

  • eBook Packages: Springer Protocols
