Efficient Searching for Motifs in DNA Sequences Using Position Weight Matrices

  • Nikola Stojanovic
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 127)


Searching genomic sequences for motifs representing functionally important sites is a significant and well–established subfield of bioinformatics. In that context, Position Weight Matrices are a popular way of representing variable motifs, as they have been widely used for describing the binding sites of transcriptional proteins. However, the standard implementation of PWM matching, while not inefficient on shorter sequences, is too expensive for whole–genome searches. In this paper we present an algorithm we have developed for efficient matching of PWMs in long target sequences. After the initial pre–processing of the matrix it performs in time linear to the size of the genomic segment.


DNA motifs Position weight matrices Genome–wide analysis Algorithms Genomics Pattern matching 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Aho, A., Corasick, M.: Efficient string matching: an aid to bibliographic search. Comm. Assoc. Comput. Mach. 18, 333–340 (1975)MathSciNetzbMATHGoogle Scholar
  2. 2.
    Apostolico, A., Bock, M., Lonardi, S., Xu, X.: Efficient detection of unusual words. J. Comput. Biol. 7, 71–94 (2000)CrossRefGoogle Scholar
  3. 3.
    Bryne, J., Valen, E., Tang, M., Marstrand, T., Winther, O., da Piedade, I., Krogh, A., Lenhard, B., Sandelin, A.: JASPAR, the open access database of transcription factor–binding profiles: new content and tools in the 2008 update. Nucleic Acids Res. 36, D102–D106 (2008)CrossRefGoogle Scholar
  4. 4.
    Gershenzon, N.I., Stormo, G.D., Ioshikhes, I.P.: Computational technique for improvement of the position–weight matrices for the DNA/protein binding sites. Nucleic Acids Res. 33, 2290–2301 (2005)CrossRefGoogle Scholar
  5. 5.
    Hannenhalli, S., Wang, L.S.: Enhanced position weight matrices using mixture models. Bioinformatics 21, i204–i212 (2005)CrossRefGoogle Scholar
  6. 6.
    Hughes, J., Estep, P., Tavazoie, S., Church, G.: Computational identification of cis–regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205–1214 (2000)CrossRefGoogle Scholar
  7. 7.
    Kel, A.E., Gössling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O.V., Wingender, E.: Match: A tool for searching transcription factor binding sites in dna sequences. Nucleic Acids Res. 31(13), 3576–3579 (2003), CrossRefGoogle Scholar
  8. 8.
    Khambata-Ford, S., Liu, Y., Gleason, C., Dickson, M., Altman, R., Batzoglou, S., Myers, R.: Identification of promoter regions in the human genome by using a retroviral plasmid library–based functional reporter gene assay. Genome Res. 13, 1765–1774 (2003)CrossRefGoogle Scholar
  9. 9.
    Knuth, D., Morris, J., Pratt, V.: Fast pattern matching in strings. SIAM J. Computing 6, 323–350 (1977)MathSciNetzbMATHCrossRefGoogle Scholar
  10. 10.
    Liefooghe, A., Touzet, H., Varré, J.S.: Large Scale Matching for Position Weight Matrices. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 401–412. Springer, Heidelberg (2006)CrossRefGoogle Scholar
  11. 11.
    Nelson, C., Hersh, B., Carroll, S.B.: The regulatory content of intergenic DNA shapes genome architecture. Genome Biol. 5, R25 (2004)CrossRefGoogle Scholar
  12. 12.
    Pizzi, C., Rastas, P., Ukkonen, E.: Finding signicant matches of position weight matrices in linear time. IEEE/ACM Transactions on Computational Biology and Bioinformatics E–publication ahead of print (2009)Google Scholar
  13. 13.
    Qin, Z., McCue, L., Thompson, W., Mayerhofer, L., Lawrence, C., Liu, J.: Identification of co-regulated genes through Bayesian clustering of predicted regulatory binding sites. Nature Biotechnology 21, 435–439 (2003)CrossRefGoogle Scholar
  14. 14.
    Singh, A., Stojanovic, N.: An efficient algorithm for the identification of repetitive variable motifs in the regulatory sequences of co-expressed genes. In: Levi, A., Savaş, E., Yenigün, H., Balcısoy, S., Saygın, Y. (eds.) ISCIS 2006. LNCS, vol. 4263, pp. 182–191. Springer, Heidelberg (2006)CrossRefGoogle Scholar
  15. 15.
    Singh, A., Stojanovic, N.: Genome–wide search for putative transcriptional modules in eukaryotic sequences. In: Proceedings of BIOCOMP 2009, pp. 848–854 (2009)Google Scholar
  16. 16.
    Stojanovic, N.: A study on the distribution of phylogenetically conserved blocks within clusters of mammalian homeobox genes. Genetics and Molecular Biology 32, 666–673 (2009)CrossRefGoogle Scholar
  17. 17.
    Stojanovic, N.: Linear-time matching of position weight matrices. In: Proceedings of the First International Conference on Bioinformatics, BIOINFORMATICS 2010, pp. 66–73 (2010)Google Scholar
  18. 18.
    Stormo, G.: Consensus patterns in DNA. Methods Enzym. 183, 211–221 (1990)CrossRefGoogle Scholar
  19. 19.
    The ENCODE Project Consortium: The ENCODE pilot project: Identification and analysis of functional elements in 1% of the human genome. Nature 447, 799–816 (2007)Google Scholar
  20. 20.
    van Helden, J.: Metrics for comparing regulatory sequences on the basis of pattern counts. Bioinformatics 20, 399–406 (2004)CrossRefGoogle Scholar
  21. 21.
    Wingender, E.: The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation. Briefings in Bioinformatics 9, 326–332 (2008)CrossRefGoogle Scholar
  22. 22.
    Young, J.E., Vogt, T., Gross, K.W., Khani, S.C.: A short, highly active photoreceptor–specific enhancer/promoter region upstream of the human rhodopsin kinase gene. Investigative Ophtamology and Visual Science 44, 4076–4085 (2003)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Nikola Stojanovic
    • 1
  1. 1.Department of Computer Science and EngineeringUniversity of Texas at ArlingtonArlingtonU.S.A.

Personalised recommendations