Improved Pattern-Driven Algorithms for Motif Finding in DNA Sequences

  • Sing-Hoi Sze
  • Xiaoyan Zhao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4023)

Abstract

In order to guarantee that the optimal motif is found, traditional pattern-driven approaches perform an exhaustive search over all candidate motifs of length l. We develop an improved pattern-driven algorithm that takes O(4^l lk) time, where k is the number of sequences in the sample and l is the motif length. This running time is independent of the length n of each sequence for large enough l and saves a factor of n in time complexity over the original pattern-driven approach. We further extend this strategy to allow arbitrary don't care positions within a motif without much decrease in the solvable values of l. Testing this algorithm on a large set of yeast samples constructed from co-expressed gene clusters reveals that most biological motifs have many invariant or almost invariant positions, and that these positions can be used to define the motif while ignoring the other positions. This motivates the following two-stage strategy, which substantially extends the solvable values of l for the pattern-driven approach: first use an O(2^l lkn) algorithm to exhaustively search over all candidate motifs, allowing arbitrary don't care positions but disallowing mismatches; then refine these motifs by allowing a limited amount of flexibility to model the almost invariant positions. We demonstrate that this seemingly restrictive motif definition is sufficiently powerful by showing that the performance of this algorithm is comparable to the best existing motif finding algorithms on a large benchmark set of samples. A software program implementing these approaches (MotifEnumerator) is available at http://faculty.cs.tamu.edu/shsze/motifenumerator.
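The first stage of the two-stage strategy can be illustrated with a minimal sketch. This is not the MotifEnumerator implementation: the function name `enumerate_dont_care_motifs`, the use of `.` as the don't care symbol, and the scoring by number of supporting sequences are all assumptions made for illustration. Each of the roughly kn windows of length l generates every pattern obtained by masking a subset of its positions, so the work is on the order of 2^l lkn, in line with the bound quoted above.

```python
from collections import defaultdict


def enumerate_dont_care_motifs(seqs, l):
    """Count, for every pattern of length l with don't care positions,
    how many sequences contain an exact (mismatch-free) occurrence.

    Hypothetical illustration only; '.' marks a don't care position.
    """
    support = defaultdict(set)  # pattern -> indices of sequences containing it
    for idx, seq in enumerate(seqs):
        for start in range(len(seq) - l + 1):
            window = seq[start:start + l]
            # Bit i of mask set means position i is kept as a fixed letter.
            # mask = 0 (all don't cares) is skipped as uninformative.
            for mask in range(1, 1 << l):
                pattern = "".join(
                    window[i] if (mask >> i) & 1 else "." for i in range(l)
                )
                support[pattern].add(idx)
    return {pattern: len(ids) for pattern, ids in support.items()}


if __name__ == "__main__":
    sample = ["ACGTAC", "TTACGA", "GACTTA"]
    counts = enumerate_dont_care_motifs(sample, 3)
    # Show the patterns that occur in every sequence of the toy sample.
    for pattern, n in sorted(counts.items()):
        if n == len(sample):
            print(pattern, n)
```

The sketch stores every generated pattern in memory, which is only practical for small l; a refinement stage that tolerates limited variation at almost invariant positions would then operate on the highest-scoring patterns.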

Keywords

Transcription Factor Binding Site, Motif Find, Invariant Position, Candidate Motif, Motif Length
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Sing-Hoi Sze (1, 2)
  • Xiaoyan Zhao (1)

  1. Department of Computer Science
  2. Department of Biochemistry & Biophysics, Texas A&M University, College Station, TX 77843, USA
