A Compact Mathematical Programming Formulation for DNA Motif Finding

  • Carl Kingsford
  • Elena Zaslavsky
  • Mona Singh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4009)


In the motif finding problem, one seeks a set of mutually similar subsequences within a collection of biological sequences. This is an important and widely studied problem, as such shared motifs in DNA often correspond to regulatory elements. We study a combinatorial framework in which the goal is to find subsequences of a given length such that the sum of their pairwise distances is minimized. We describe a novel integer linear program for the problem, which exploits the fact that distances between subsequences come from a limited set of possibilities. We show how to tighten its linear programming relaxation by adding an exponentially large set of constraints, and we give an efficient separation algorithm that finds violated constraints, thereby showing that the tightened linear program can still be solved in polynomial time. We apply our approach to find optimal solutions for the motif finding problem and show that it is effective in practice in uncovering known transcription factor binding sites.
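The objective described above can be illustrated with a minimal sketch. This is not the integer linear program of the paper: it is a brute-force enumeration, practical only for tiny inputs, that picks one window per sequence and minimizes the sum of pairwise distances. The use of Hamming distance, the function names `hamming`, `windows`, and `best_motif`, and the toy sequences are all assumptions made for illustration.

```python
from itertools import combinations, product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def windows(seq: str, length: int) -> list[str]:
    """All contiguous substrings of the given length, in left-to-right order."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


def best_motif(seqs: list[str], length: int) -> tuple[int, tuple[str, ...]]:
    """Exhaustively choose one window per sequence so that the sum of
    pairwise distances among the chosen windows is as small as possible.

    Assumes every sequence is at least `length` characters long.
    """
    best_cost, best_choice = None, None
    for choice in product(*(windows(s, length) for s in seqs)):
        # Sum-of-pairs objective over the chosen subsequences.
        cost = sum(hamming(a, b) for a, b in combinations(choice, 2))
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_cost, best_choice


# Toy input (hypothetical): each sequence contains the 4-mer "ACGT".
cost, sites = best_motif(["ACGTTGCA", "TTACGTAA", "GGACGTCC"], 4)
print(cost, sites)
```

Under the Hamming-distance assumption, the distance between two windows of length 4 can only be 0, 1, 2, 3, or 4, which is one concrete instance of distances coming from a limited set of possibilities. The enumeration grows exponentially with the number of sequences, which is why an exact optimization formulation is of interest.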


Transcription Factor Binding Site; Integer Linear Programming; Linear Programming Relaxation; Separation Algorithm; Integer Linear Programming Formulation
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Carl Kingsford (1)
  • Elena Zaslavsky (2)
  • Mona Singh (2)
  1. Center for Bioinformatics & Computational Biology, University of Maryland, College Park
  2. Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton
