Bioinformatics pp 231-251

Part of the Methods in Molecular Biology™ book series (MIMB, volume 452)

Discovering Sequence Motifs

  • Timothy L. Bailey


Sequence motif discovery algorithms are an important part of the computational biologist's toolkit. The purpose of motif discovery is to discover patterns in biopolymer (nucleotide or protein) sequences in order to better understand the structure and function of the molecules the sequences represent. This chapter provides an overview of the use of sequence motif discovery in biology and a general guide to the use of motif discovery algorithms. The chapter discusses the types of biological features that DNA and protein motifs can represent and their usefulness. It also defines what sequence motifs are, how they are represented, and general techniques for discovering them. The primary focus is on one aspect of motif discovery: discovering motifs in a set of unaligned DNA or protein sequences. Also presented are steps useful for checking the biological validity and investigating the function of sequence motifs using methods such as motif scanning—searching for matches to motifs in a given sequence or a database of sequences. A discussion of some limitations of motif discovery concludes the chapter.
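
As a rough illustration of the motif scanning mentioned above, the following minimal Python sketch builds a log-odds position-specific scoring matrix (PSSM) from a handful of aligned sites and slides it along a sequence. It is not the chapter's own method or any particular tool's implementation. The function names `build_pssm` and `scan`, the uniform background, the pseudocount of 1, and the toy sites are all assumptions made for illustration only.

```python
import math

ALPHABET = "ACGT"  # DNA only; any other character in a site or sequence raises KeyError


def build_pssm(sites, background=None, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per motif column) from aligned sites.

    Assumptions: all sites have equal length, and the background is
    uniform (0.25 per base) unless one is supplied.
    """
    if background is None:
        background = {base: 0.25 for base in ALPHABET}
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pssm = []
    for col in range(width):
        # Start every count at the pseudocount so no probability is zero.
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {base: math.log2(counts[base] / total / background[base])
             for base in ALPHABET}
        )
    return pssm


def scan(sequence, pssm, threshold=0.0):
    """Return (start, score) for every window scoring at least `threshold`."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pssm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy, made-up sites loosely resembling a bacterial -10 box (illustrative only).
toy_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
toy_pssm = build_pssm(toy_sites)
toy_hits = scan("GGCGTATAATGCCG", toy_pssm, threshold=0.0)
best_start, best_score = max(toy_hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

In practice the score threshold would normally be chosen from a statistical criterion such as a p-value rather than fixed at zero, and both DNA strands would be scanned; this sketch omits both for brevity.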

Key words

Motif discovery, sequence motif, sequence pattern, protein domain, multiple alignment, position-specific scoring matrix, PSSM, position-specific weight matrix, PWM, transcription factor binding site, transcription factor, promoter, protein features



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Timothy L. Bailey (1)
  1. ARC Centre of Excellence in Bioinformatics, and Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
