Machine Learning, Volume 21, Issue 1–2, pp 51–80

Unsupervised learning of multiple motifs in biopolymers using expectation maximization

  • Timothy L. Bailey
  • Charles Elkan

Abstract

The MEME algorithm extends the expectation maximization (EM) algorithm for identifying motifs in unaligned biopolymer sequences. The aim of MEME is to discover new motifs in a set of biopolymer sequences where little or nothing is known in advance about any motifs that may be present. MEME innovations expand the range of problems which can be solved using EM and increase the chance of finding good solutions. First, subsequences which actually occur in the biopolymer sequences are used as starting points for the EM algorithm to increase the probability of finding globally optimal motifs. Second, the assumption that each sequence contains exactly one occurrence of the shared motif is removed. This allows multiple appearances of a motif to occur in any sequence and permits the algorithm to ignore sequences with no appearance of the shared motif, increasing its resistance to noisy data. Third, a method for probabilistically erasing shared motifs after they are found is incorporated so that several distinct motifs can be found in the same set of sequences, both when different motifs appear in different sequences and when a single sequence may contain multiple motifs. Experiments show that MEME can discover both the CRP and LexA binding sites from a set of sequences which contain one or both sites, and that MEME can discover both the −10 and −35 promoter regions in a set of E. coli sequences.
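The first two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`seed_from_subsequence`, `em_motif`), the parameters (`hit`, `prior`, `pseudo`), and the toy sequences in the usage note are all assumptions made for illustration. The sketch seeds EM from a subsequence that actually occurs in the data, then fits a two-component mixture (motif versus background) over every window of the motif width, so a sequence may hold zero, one, or several occurrences. As simplifications, the background is fixed to the overall letter frequencies, overlapping windows are treated as independent, and the erasing step for finding further motifs is omitted.

```python
import math

ALPHABET = "ACGT"


def seed_from_subsequence(subseq, hit=0.7):
    """Turn an observed subsequence into a starting letter-frequency matrix.

    Each column puts probability `hit` (an assumed value) on the observed
    letter and spreads the remainder evenly over the other letters.
    """
    miss = (1.0 - hit) / (len(ALPHABET) - 1)
    return [{a: (hit if a == c else miss) for a in ALPHABET} for c in subseq]


def em_motif(seqs, seed, n_iter=50, prior=0.1, pseudo=0.01):
    """Fit a motif/background mixture over all windows with EM.

    Returns the motif matrix, the mixing weight of the motif component,
    and a list of (sequence index, offset, responsibility) triples.
    """
    w = len(seed)
    motif = [dict(col) for col in seed]

    # Fixed background model: overall letter frequencies (a simplification).
    counts = {a: pseudo for a in ALPHABET}
    for s in seqs:
        for c in s:
            counts[c] += 1
    total = sum(counts.values())
    background = {a: counts[a] / total for a in ALPHABET}

    # Every width-w window in every sequence is a candidate motif occurrence.
    wins = [(i, j, s[j:j + w])
            for i, s in enumerate(seqs)
            for j in range(len(s) - w + 1)]

    lam = prior  # current estimate of the fraction of windows that are motif
    z = []
    for _ in range(n_iter):
        # E-step: posterior probability that each window came from the motif.
        z = []
        for _, _, win in wins:
            pm = lam * math.prod(motif[k][c] for k, c in enumerate(win))
            pb = (1.0 - lam) * math.prod(background[c] for c in win)
            z.append(pm / (pm + pb))

        # M-step: re-estimate motif columns from responsibility-weighted counts.
        motif = [{a: pseudo for a in ALPHABET} for _ in range(w)]
        for zi, (_, _, win) in zip(z, wins):
            for k, c in enumerate(win):
                motif[k][c] += zi
        for col in motif:
            col_total = sum(col.values())
            for a in ALPHABET:
                col[a] /= col_total

        # Re-estimate the mixing weight, kept away from 0 and 1.
        lam = min(max(sum(z) / len(z), 1e-6), 1.0 - 1e-6)

    return motif, lam, [(i, j, zi) for zi, (i, j, _) in zip(z, wins)]
```

A possible usage, on made-up sequences where three of four contain the planted string `TTGACA`: call `em_motif(seqs, seed_from_subsequence("TTGACA"))` and read off, for each sequence, the offset with the highest responsibility. Because the model does not force one occurrence per sequence, a sequence with no occurrence simply receives low responsibilities everywhere. In practice one would try many candidate subsequences as seeds and keep the run with the best likelihood.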

Keywords

Unsupervised learning, expectation maximization, consensus sequence, motif, biopolymer, promoter, binding site, DNA, protein, sequence analysis

Copyright information

© Kluwer Academic Publishers 1995

Authors and Affiliations

  • Timothy L. Bailey (1)
  • Charles Elkan (1)
  1. Department of Computer Science and Engineering, University of California, San Diego, La Jolla
