Motif Discovery Using Expectation Maximization and Gibbs’ Sampling

Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 674)

Abstract

Expectation maximization and Gibbs’ sampling are two statistical approaches used to identify transcription factor binding sites and the motif that represents them. Both take unaligned sequences as input and search for a statistically significant alignment of putative binding sites. Expectation maximization is deterministic: starting from the same initial parameters, it always converges to the same solution, so it is wise to run it multiple times from different initial parameters. Gibbs’ sampling is stochastic, so it may arrive at different solutions from the same initial parameters. In both cases multiple runs are advised, because comparing the solutions from each run can indicate whether a globally optimal solution is likely to have been reached.
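
The contrast between the two approaches can be illustrated with a minimal sketch. It is not the chapter's protocol: it assumes exactly one site per sequence, a fixed motif width, and sequences containing only A, C, G and T. The function names (`em_motif`, `gibbs_motif`), the pseudocount of 1 and the toy sequences are all hypothetical choices made for illustration.

```python
import math
import random

ALPHABET = "ACGT"

def base_freqs(seqs):
    """Background base frequencies (with a pseudocount of 1 per base)."""
    joined = "".join(seqs)
    return {b: (joined.count(b) + 1) / (len(joined) + 4) for b in ALPHABET}

def pwm_from_counts(counts, pseudo=1.0):
    """Convert per-column base counts into per-column probabilities."""
    pwm = []
    for col in counts:
        total = sum(col.values()) + pseudo * len(ALPHABET)
        pwm.append({b: (col[b] + pseudo) / total for b in ALPHABET})
    return pwm

def site_weights(seq, pwm, background):
    """Motif-versus-background likelihood ratio for every window of seq."""
    w = len(pwm)
    weights = []
    for start in range(len(seq) - w + 1):
        log_ratio = sum(
            math.log(pwm[j][seq[start + j]] / background[seq[start + j]])
            for j in range(w)
        )
        weights.append(math.exp(log_ratio))
    return weights

def em_motif(seqs, seed_word, n_iter=50):
    """Deterministic EM, started from a matrix built around seed_word."""
    w = len(seed_word)
    background = base_freqs(seqs)
    counts = [{b: 1.0 if b == c else 0.0 for b in ALPHABET} for c in seed_word]
    pwm = pwm_from_counts(counts)
    for _ in range(n_iter):
        counts = [{b: 0.0 for b in ALPHABET} for _ in range(w)]
        for seq in seqs:
            # E-step: posterior probability of each start position
            weights = site_weights(seq, pwm, background)
            total = sum(weights)
            for start, wt in enumerate(weights):
                for j in range(w):
                    counts[j][seq[start + j]] += wt / total
        # M-step: re-estimate the matrix from the expected counts
        pwm = pwm_from_counts(counts)
    sites = []
    for seq in seqs:
        weights = site_weights(seq, pwm, background)
        sites.append(max(range(len(weights)), key=weights.__getitem__))
    return pwm, sites

def gibbs_motif(seqs, w, rng, n_iter=200):
    """Stochastic site sampler: resample one sequence's site at a time."""
    background = base_freqs(seqs)
    sites = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(n_iter):
        for i, seq in enumerate(seqs):
            # Build the matrix from every sequence except the held-out one
            counts = [{b: 0.0 for b in ALPHABET} for _ in range(w)]
            for k, other in enumerate(seqs):
                if k != i:
                    for j in range(w):
                        counts[j][other[sites[k] + j]] += 1.0
            pwm = pwm_from_counts(counts)
            # Draw a new site in proportion to its likelihood ratio
            weights = site_weights(seq, pwm, background)
            sites[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return sites

# Toy input invented for illustration; each sequence contains TTGACA.
seqs = ["ACGTTGACATGCA", "TTGACAGGCTAAC", "GGCATTGACATCG", "CATCGATTGACAA"]

if __name__ == "__main__":
    print("EM sites:", em_motif(seqs, "TTGACA")[1])
    for seed in range(3):
        print("Gibbs sites, seed", seed, ":", gibbs_motif(seqs, 6, random.Random(seed)))
```

Running `em_motif` twice with the same seed word returns the same matrix and sites, so exploring other solutions requires changing the seed word. `gibbs_motif` can return different site sets for different random seeds even though the input sequences are unchanged, and agreement across several such runs is the kind of evidence the abstract refers to.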

Key words

Expectation maximization, Gibbs’ sampling, transcription factor binding sites, motif discovery, position weight matrices, position frequency matrices, regulatory sites, motif modeling

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Department of Genetics, School of Medicine, Washington University, St. Louis, USA
