Multiple Sequence Local Alignment Using Monte Carlo EM Algorithm

  • Chengpeng Bi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4463)

Abstract

The Expectation Maximization (EM) motif-finding algorithm is one of the most popular de novo motif discovery methods. However, the EM algorithm depends heavily on its initialization and is easily trapped in local optima. This paper implements a Monte Carlo version of the EM algorithm that performs multiple sequence local alignment to overcome these drawbacks of conventional EM motif-finding algorithms. The new algorithm is named the Monte Carlo EM Motif Discovery Algorithm (MCEMDA). MCEMDA starts from an initial model and then iterates Monte Carlo simulation and parameter update steps until convergence. MCEMDA is compared with other popular motif-finding algorithms on simulated, prokaryotic, and eukaryotic motif sequences. The results show that MCEMDA outperforms the other algorithms. MCEMDA also successfully discovers a helix-turn-helix motif in protein sequences. It provides a general framework for developing motif-finding algorithms. A website for the program will be available at http://motif.cmh.edu.
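
The loop described in the abstract (initial model, Monte Carlo simulation, parameter update, repeat) can be illustrated with a minimal Python sketch. This is not the MCEMDA implementation: the one-site-per-sequence model, the pseudocount, the likelihood-ratio weights, the fixed iteration budget used in place of a convergence test, and all function names (`estimate_pwm`, `site_weights`, `mcem_motif`) are assumptions made for illustration only.

```python
import random

ALPHABET = "ACGT"  # assumption: DNA input containing only these letters


def estimate_pwm(seqs, starts, width, pseudo=0.5):
    """Update step: per-column letter frequencies of the aligned sites."""
    pwm = []
    for j in range(width):
        counts = {a: pseudo for a in ALPHABET}
        for seq, pos in zip(seqs, starts):
            counts[seq[pos + j]] += 1
        total = sum(counts.values())
        pwm.append({a: c / total for a, c in counts.items()})
    return pwm


def site_weights(seq, pwm, background):
    """Motif-vs-background likelihood ratio for every candidate start."""
    width = len(pwm)
    weights = []
    for pos in range(len(seq) - width + 1):
        w = 1.0
        for j in range(width):
            letter = seq[pos + j]
            w *= pwm[j][letter] / background[letter]
        weights.append(w)
    return weights


def mcem_motif(seqs, width, iters=50, seed=0):
    """Alternate parameter updates with Monte Carlo sampling of site positions."""
    rng = random.Random(seed)
    total = sum(len(s) for s in seqs)
    # Background letter frequencies over all sequences, lightly smoothed.
    background = {
        a: (sum(s.count(a) for s in seqs) + 1) / (total + len(ALPHABET))
        for a in ALPHABET
    }
    # Initial model: one random site per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iters):  # fixed budget stands in for a convergence check
        pwm = estimate_pwm(seqs, starts, width)
        # Monte Carlo step: draw a site per sequence in proportion to its weight.
        starts = [
            rng.choices(range(len(s) - width + 1),
                        weights=site_weights(s, pwm, background))[0]
            for s in seqs
        ]
    return starts, estimate_pwm(seqs, starts, width)


if __name__ == "__main__":
    demo = ["ACGTACGTTTGACA", "TTGACAGGCATCGA", "GGCTTGACATTACG"]
    positions, model = mcem_motif(demo, width=6, iters=100, seed=1)
    print(positions)
```

Sampling site positions instead of always taking the most probable one is what lets a stochastic variant of this kind move away from a poor initial alignment; because the result depends on the random seed, a practical run would normally be repeated from several starting points.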

Keywords

Expectation Maximization (EM); Monte Carlo EM; Motif Discovery; Multiple Sequence Local Alignment; Transcriptional Regulation

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Chengpeng Bi (1)
  1. Children’s Mercy Hospitals, Schools of Medicine, Computing and Engineering, University of Missouri, 2401 Gillham Road, Kansas City, MO 64108, USA
