Artificial Intelligence Review, Volume 24, Issue 3–4, pp 397–413

Self-Organizing Maps of Position Weight Matrices for Motif Discovery in Biological Sequences

  • Shaun Mahony
  • David Hendrix
  • Terry J. Smith
  • Aaron Golden


The identification of overrepresented motifs in a collection of biological sequences remains a relevant and challenging problem in computational biology. Currently popular methods of motif discovery are based on statistical learning theory. In this paper, a machine-learning approach to the motif discovery problem is explored. The approach is based on a Self-Organizing Map (SOM) in which the weight vectors of the output-layer neurons are replaced by position weight matrices. It can characterise features present in a set of sequences and can thus aid the discovery of overrepresented motifs. The SOM approach to motif discovery is demonstrated on both real and simulated biological sequence datasets.
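The central idea of the abstract, a SOM whose nodes hold position weight matrices instead of weight vectors, can be illustrated with a minimal sketch. The abstract does not give the paper's training procedure, so everything below is an assumption made for illustration: the batch-style update, the Gaussian neighbourhood, the pseudocount initialisation, the log-likelihood scoring, and all function and parameter names (`kmers`, `train`, `radius`, and so on) are hypothetical and should not be read as the authors' method.

```python
import math
import random

ALPHABET = "ACGT"  # assumption: DNA sequences


def kmers(seqs, k):
    """Yield every length-k window of every sequence."""
    for s in seqs:
        for i in range(len(s) - k + 1):
            yield s[i:i + k]


def new_counts(k, pseudo=1.0):
    """One node: a k x 4 table of pseudocounts (assumed initialisation)."""
    return [[pseudo] * len(ALPHABET) for _ in range(k)]


def log_likelihood(counts, kmer):
    """Score a k-mer under the PWM obtained by normalising each column."""
    score = 0.0
    for col, base in zip(counts, kmer):
        score += math.log(col[ALPHABET.index(base)] / sum(col))
    return score


def best_node(grid, kmer):
    """Grid coordinates of the node whose PWM scores the k-mer highest."""
    return max(grid, key=lambda xy: log_likelihood(grid[xy], kmer))


def train(seqs, k, rows, cols, epochs=10, radius=1.0, seed=0):
    """Batch-style SOM training with count-matrix nodes (illustrative only)."""
    rng = random.Random(seed)
    # Slightly randomised counts so that nodes start out different.
    grid = {
        (r, c): [[1.0 + rng.random() for _ in ALPHABET] for _ in range(k)]
        for r in range(rows)
        for c in range(cols)
    }
    data = list(kmers(seqs, k))
    for epoch in range(epochs):
        # Neighbourhood width shrinks linearly over the epochs.
        sigma = radius * (1 - epoch / epochs) + 1e-3
        winners = [(best_node(grid, w), w) for w in data]
        rebuilt = {xy: new_counts(k) for xy in grid}
        for (wr, wc), w in winners:
            for (r, c), counts in rebuilt.items():
                d2 = (r - wr) ** 2 + (c - wc) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))  # Gaussian neighbourhood
                for col, base in zip(counts, w):
                    col[ALPHABET.index(base)] += h
        grid = rebuilt
    return grid
```

Under these assumptions, a call such as `train(sequences, k=6, rows=4, cols=4)` returns a dictionary from grid coordinates to count matrices; normalising each column of a matrix gives that node's position weight matrix, and nodes that attract many similar k-mers are candidates for overrepresented motifs.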


Keywords: biological motif discovery, self-organizing map





Copyright information

© Springer 2005

Authors and Affiliations

  • Shaun Mahony (1)
  • David Hendrix (2)
  • Terry J. Smith (1)
  • Aaron Golden (1, 3)

  1. National Centre for Biomedical Engineering Science, NUI Galway, Galway, Ireland
  2. Center for Integrative Genomics, University of California, Berkeley, USA
  3. Department of Information Technology, NUI Galway, Galway, Ireland
