Discovering Sequence Motifs

  • Timothy L. Bailey
Part of the Methods in Molecular Biology™ book series (MIMB, volume 395)


Sequence motif discovery algorithms are an important part of the computational biologist’s toolkit. Motif discovery aims to find patterns in biopolymer (nucleotide or protein) sequences in order to better understand the structure and function of the molecules those sequences represent. This chapter gives an overview of how sequence motif discovery is used in biology and a general guide to using motif discovery algorithms. It examines the types of biological features that DNA and protein motifs can represent and why they are useful, and it defines what sequence motifs are, how they are represented, and the general techniques for discovering them. The primary focus is on one aspect of motif discovery: discovering motifs in a set of unaligned DNA or protein sequences. The chapter also describes steps for checking the biological validity and investigating the function of sequence motifs, using methods such as motif scanning (searching for matches to motifs in a given sequence or a database of sequences). A discussion of some limitations of motif discovery concludes the chapter.
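As a rough illustration of the motif scanning mentioned in the abstract, the sketch below builds a log-odds position weight matrix (PWM) from a few aligned sites and scores every window of a sequence against it. It is not taken from the chapter: the function names `build_pwm` and `scan`, the example sites, the pseudocount of 1, and the uniform background are all assumptions made for illustration.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=None):
    """Build a log-odds PWM from equal-length aligned sites (illustrative helper)."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    if background is None:
        # Assumption: uniform background letter frequencies.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    pwm = []
    for pos in range(width):
        # Start each count at the pseudocount so unseen letters get a finite score.
        counts = {a: pseudocount for a in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({a: math.log2((counts[a] / total) / background[a])
                    for a in alphabet})
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; return a list of (offset, score)."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

# Hypothetical aligned binding sites, invented for this example only.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)

# Assumes the scanned sequence contains only letters from the alphabet.
hits = scan("GGCGTATAATGCC", pwm)
best_offset, best_score = max(hits, key=lambda h: h[1])
print(best_offset, round(best_score, 2))
```

In practice a score threshold or a p-value would decide which windows count as matches; taking the single best window here only keeps the example short.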

Key Words

Motif discovery; sequence motif; sequence pattern; protein domain; multiple alignment; position-specific scoring matrix; PSSM; position-specific weight matrix; PWM; transcription factor-binding site; transcription factor; promoter; protein features



Copyright information

© Humana Press Inc. 2007

Authors and Affiliations

  • Timothy L. Bailey (1)
  1. IMB/University of Queensland, Australia
