An Evolutionary Model of DNA Substring Distribution

  • Meelis Kull
  • Konstantin Tretyakov
  • Jaak Vilo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6060)


DNA sequence analysis methods, such as motif discovery, gene detection or phylogeny reconstruction, can often provide important input for biological studies. Many such methods require a background model that represents the expected distribution of short substrings in a given DNA region. Most current techniques for modeling this distribution disregard the evolutionary processes underlying DNA formation. We propose a novel approach for modeling DNA k-mer distribution that is capable of taking the notions of evolution and natural selection into account. We derive a computationally tractable approximation for estimating k-mer probabilities at genetic equilibrium, given a description of evolutionary processes in terms of fitness and mutation probabilities. We assess the goodness of this approximation via numerical experiments. Besides providing a generative model for DNA sequences, our method has further applications in motif discovery.
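To make the idea of a k-mer distribution at genetic equilibrium more concrete, the following is a minimal toy sketch and not the approximation derived in the paper. It assumes a single isolated k-mer locus, independent point mutations with a uniform per-position probability `mu`, and a user-supplied fitness function; it then iterates selection and mutation until the distribution stops changing. All names (`mutation_prob`, `equilibrium_kmer_distribution`, `cpg_penalty`) and the fitness values are hypothetical and chosen only for illustration.

```python
from itertools import product

ALPHABET = "ACGT"


def mutation_prob(src, dst, mu):
    """Probability that k-mer `src` becomes `dst` in one generation.

    Assumption: each position mutates independently with probability `mu`
    to one of the three other nucleotides, chosen uniformly.
    """
    p = 1.0
    for a, b in zip(src, dst):
        p *= (1.0 - mu) if a == b else mu / 3.0
    return p


def equilibrium_kmer_distribution(k, fitness, mu, tol=1e-12, max_iter=10000):
    """Iterate selection followed by mutation until the k-mer
    distribution changes by less than `tol` in every entry.

    This brute-force iteration enumerates all 4**k k-mers, so it is
    only practical for very small k.
    """
    kmers = ["".join(t) for t in product(ALPHABET, repeat=k)]
    trans = {s: {d: mutation_prob(s, d, mu) for d in kmers} for s in kmers}
    p = {w: 1.0 / len(kmers) for w in kmers}  # start from the uniform distribution
    for _ in range(max_iter):
        # Selection: reweight each k-mer by its fitness and renormalise.
        selected = {w: p[w] * fitness(w) for w in kmers}
        total = sum(selected.values())
        selected = {w: v / total for w, v in selected.items()}
        # Mutation: redistribute probability mass between k-mers.
        new_p = {d: sum(selected[s] * trans[s][d] for s in kmers) for d in kmers}
        diff = max(abs(new_p[w] - p[w]) for w in kmers)
        p = new_p
        if diff < tol:
            break
    return p


def cpg_penalty(kmer):
    """Hypothetical fitness: k-mers containing 'CG' are half as fit."""
    return 0.5 if "CG" in kmer else 1.0


if __name__ == "__main__":
    dist = equilibrium_kmer_distribution(2, cpg_penalty, mu=0.05)
    for kmer in sorted(dist, key=dist.get):
        print(kmer, round(dist[kmer], 4))
```

With a constant fitness function the sketch returns the uniform distribution, which corresponds to a background model that ignores selection. A non-constant fitness shifts probability mass away from disfavoured k-mers, which is the qualitative effect an evolution-aware background model is meant to capture.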


Keywords: Background Model, Motif Discovery, Transcription Factor Binding Motif, Genetic Equilibrium, Inductive Bias
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Meelis Kull (1, 2)
  • Konstantin Tretyakov (1)
  • Jaak Vilo (1, 2)

  1. Institute of Computer Science, University of Tartu, Tartu, Estonia
  2. Quretec Ltd., Tartu, Estonia
