Unsupervised learning of multiple motifs in biopolymers using expectation maximization

Bailey, Timothy L.; Elkan, Charles

doi:10.1007/BF00993379

Published: October 1995

Machine Learning, Volume 21, pages 51–80 (1995)

Abstract

The MEME algorithm extends the expectation maximization (EM) algorithm for identifying motifs in unaligned biopolymer sequences. The aim of MEME is to discover new motifs in a set of biopolymer sequences where little or nothing is known in advance about any motifs that may be present. MEME innovations expand the range of problems which can be solved using EM and increase the chance of finding good solutions. First, subsequences which actually occur in the biopolymer sequences are used as starting points for the EM algorithm to increase the probability of finding globally optimal motifs. Second, the assumption that each sequence contains exactly one occurrence of the shared motif is removed. This allows multiple appearances of a motif to occur in any sequence and permits the algorithm to ignore sequences with no appearance of the shared motif, increasing its resistance to noisy data. Third, a method for probabilistically erasing shared motifs after they are found is incorporated so that several distinct motifs can be found in the same set of sequences, both when different motifs appear in different sequences and when a single sequence may contain multiple motifs. Experiments show that MEME can discover both the CRP and LexA binding sites from a set of sequences which contain one or both sites, and that MEME can discover both the −10 and −35 promoter regions in a set of E. coli sequences.
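
The first two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the fixed uniform background, the 0.7 starting weight, the pseudocount, and the toy sequences are all assumptions made for illustration. Every length-w window is treated as an independent draw from a two-component mixture (motif or background), so a sequence may contribute zero, one, or several occurrences, and EM starts from a model built around a subsequence that actually occurs in the data. The erasing step for finding further motifs is not shown.

```python
import math

ALPHABET = "ACGT"  # DNA only, for illustration


def windows(seqs, w):
    """Every length-w subsequence as (sequence index, offset, text)."""
    return [(i, j, s[j:j + w])
            for i, s in enumerate(seqs)
            for j in range(len(s) - w + 1)]


def model_from_subsequence(sub, match=0.7):
    """Starting point built from a subsequence that occurs in the data.

    The letter seen at each position gets probability `match` (an assumed
    value); the remaining mass is split evenly over the other letters.
    """
    other = (1.0 - match) / (len(ALPHABET) - 1)
    return [{a: (match if a == c else other) for a in ALPHABET} for c in sub]


def em_step(wins, motif, background, lam, pseudo=0.01):
    """One EM iteration of a two-component (motif vs. background) mixture."""
    # E-step: posterior probability that each window is a motif occurrence.
    z = []
    for _, _, x in wins:
        p_motif = lam * math.prod(motif[k][c] for k, c in enumerate(x))
        p_back = (1.0 - lam) * math.prod(background[c] for c in x)
        z.append(p_motif / (p_motif + p_back))
    # M-step: re-estimate letter frequencies per motif column from the
    # posterior-weighted counts (small pseudocount keeps them nonzero).
    new_motif = []
    for k in range(len(motif)):
        counts = {a: pseudo for a in ALPHABET}
        for zi, (_, _, x) in zip(z, wins):
            counts[x[k]] += zi
        total = sum(counts.values())
        new_motif.append({a: counts[a] / total for a in ALPHABET})
    # The new mixing weight is the average posterior over all windows.
    return new_motif, sum(z) / len(z), z


def fit(seqs, start, n_iter=20):
    """Run EM from one starting subsequence; return consensus, weight, sites."""
    wins = windows(seqs, len(start))
    background = {a: 1.0 / len(ALPHABET) for a in ALPHABET}  # fixed, uniform (assumption)
    motif = model_from_subsequence(start)
    lam = min(0.5, len(seqs) / len(wins))  # assumed initial mixing weight
    z = []
    for _ in range(n_iter):
        motif, lam, z = em_step(wins, motif, background, lam)
    consensus = "".join(max(col, key=col.get) for col in motif)
    sites = [(i, j) for zi, (i, j, _) in zip(z, wins) if zi > 0.5]
    return consensus, lam, sites


# Toy data: three sequences carry the planted word TATAAT, the fourth does not.
seqs = ["GGCTATAATGCC", "ACGTTATAATCA", "TATAATGGGCCC", "CCCGGGACGTAC"]
consensus, lam, sites = fit(seqs, start="TATAAT")
print(consensus, round(lam, 3), sites)
```

Because no window is forced to be an occurrence, the fourth sequence simply receives low posterior weight everywhere, which is the sense in which sequences without the motif can be ignored. A fuller version would presumably try many starting subsequences and keep the best-scoring fit; the abstract does not say how that choice is made, so that detail is left open here.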

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of California, San Diego, 92093-0114, La Jolla, California
Timothy L. Bailey & Charles Elkan

About this article

Cite this article

Bailey, T.L., Elkan, C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Mach Learn 21, 51–80 (1995). https://doi.org/10.1007/BF00993379

Received: 30 September 1993
Accepted: 27 July 1994
Issue Date: October 1995
DOI: https://doi.org/10.1007/BF00993379
