Abstract
The MEME algorithm extends the expectation maximization (EM) algorithm for identifying motifs in unaligned biopolymer sequences. The aim of MEME is to discover new motifs in a set of biopolymer sequences where little or nothing is known in advance about any motifs that may be present. MEME's innovations expand the range of problems that can be solved using EM and increase the chance of finding good solutions. First, subsequences that actually occur in the biopolymer sequences are used as starting points for the EM algorithm to increase the probability of finding globally optimal motifs. Second, the assumption that each sequence contains exactly one occurrence of the shared motif is removed. This allows a motif to appear multiple times in any sequence and permits the algorithm to ignore sequences in which the shared motif does not appear, increasing its resistance to noisy data. Third, a method for probabilistically erasing shared motifs after they are found is incorporated so that several distinct motifs can be found in the same set of sequences, both when different motifs appear in different sequences and when a single sequence may contain multiple motifs. Experiments show that MEME can discover both the CRP and LexA binding sites from a set of sequences that contain one or both sites, and that MEME can discover both the −10 and −35 promoter regions in a set of E. coli sequences.
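The first two ideas above can be illustrated with a minimal sketch. It is not the published MEME implementation: it assumes a DNA alphabet, a fixed motif width, a two-component mixture in which every window of that width is explained either by the motif or by a background model, and a background that is estimated once and then held fixed. The function names (`seed_model`, `background`, `em_motif`, `consensus`), the seed weight of 0.5, the initial mixing proportion of 0.1, and the pseudocount of 0.01 are all illustrative assumptions. Probabilistic erasing of found motifs is not shown.

```python
import math

ALPHABET = "ACGT"  # assumption: DNA sequences only


def seed_model(subseq, weight=0.5):
    """Build a starting motif from a subsequence that occurs in the data.
    The observed letter gets probability `weight`; the others share the rest."""
    other = (1.0 - weight) / (len(ALPHABET) - 1)
    return [{a: (weight if a == c else other) for a in ALPHABET} for c in subseq]


def background(seqs, pseudo=0.01):
    """Letter frequencies over all sequences (held fixed in this sketch)."""
    counts = {a: pseudo for a in ALPHABET}
    for s in seqs:
        for c in s:
            counts[c] += 1.0
    total = sum(counts.values())
    return {a: counts[a] / total for a in ALPHABET}


def em_motif(seqs, seed, n_iter=50, lam=0.1, pseudo=0.01):
    """EM over all windows, with no one-occurrence-per-sequence assumption.
    `lam` is the mixing proportion: the fraction of windows drawn from the motif."""
    width = len(seed)
    windows = [s[i:i + width] for s in seqs for i in range(len(s) - width + 1)]
    motif, bg = seed_model(seed), background(seqs)
    z = []
    for _ in range(n_iter):
        # E-step: posterior probability that each window is a motif occurrence.
        z = []
        for w in windows:
            p_motif = lam * math.prod(motif[j][c] for j, c in enumerate(w))
            p_bg = (1.0 - lam) * math.prod(bg[c] for c in w)
            z.append(p_motif / (p_motif + p_bg))
        # M-step: re-estimate the mixing proportion and the motif columns
        # from the posterior-weighted letter counts (plus a small pseudocount).
        lam = sum(z) / len(z)
        counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
        for w, zi in zip(windows, z):
            for j, c in enumerate(w):
                counts[j][c] += zi
        motif = [{a: col[a] / sum(col.values()) for a in ALPHABET} for col in counts]
    return motif, lam, list(zip(windows, z))


def consensus(motif):
    """Most probable letter in each motif column."""
    return "".join(max(col, key=col.get) for col in motif)


if __name__ == "__main__":
    # Toy data (made up for illustration): each sequence contains TTGACA once.
    seqs = ["GCTTGACAGGCC", "ACCGTTGACATC", "TTGACACGGATG", "GGATCGTTGACA"]
    motif, lam, scored = em_motif(seqs, seed=seqs[0][2:8])
    print(consensus(motif), round(lam, 3))
```

Because every window carries its own posterior, a sequence with no good window simply contributes little to the motif counts, and a sequence with two good windows contributes twice. In a fuller treatment one would try many subsequences as seeds and keep the run with the best likelihood, rather than the single hand-picked seed used here.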
Cite this article
Bailey, T.L., Elkan, C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Mach Learn 21, 51–80 (1995). https://doi.org/10.1007/BF00993379
DOI: https://doi.org/10.1007/BF00993379