Journal of Mathematical Biology, Volume 70, Issue 1–2, pp 45–69

Coding sequence density estimation via topological pressure

  • David Koslicki
  • Daniel J. Thompson


We give a new approach to coding sequence (CDS) density estimation in genomic analysis based on the topological pressure, which we develop from a well-known concept in ergodic theory. Topological pressure measures the ‘weighted information content’ of a finite word, and incorporates 64 parameters which can be interpreted as a choice of weight for each nucleotide triplet. We train the parameters so that the topological pressure fits the observed coding sequence density on the human genome, and use this to give ab initio predictions of CDS density over windows of size around 66,000 bp on the genomes of Mus musculus, rhesus macaque and Drosophila melanogaster. While the differences between these genomes are too great to expect that training on the human genome could predict, for example, the exact locations of genes, we demonstrate that our method gives reasonable estimates for the ‘coarse scale’ problem of predicting CDS density. Inspired again by ergodic theory, the weightings of the nucleotide triplets obtained from our training procedure are used to define a probability distribution on finite sequences, which can be used to distinguish between intron and exon sequences from the human genome of lengths between 750 and 5,000 bp. At the end of the paper, we explain the theoretical underpinning for our approach, which is the theory of Thermodynamic Formalism from the dynamical systems literature. Mathematica and MATLAB implementations of our method are available at
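The idea of a ‘weighted information content’ with one weight per nucleotide triplet can be illustrated with a minimal Python sketch. It is not the authors' Mathematica or MATLAB implementation, and the exact formula is an assumption: here the pressure of a word is taken to be (1/n) log_4 of the sum, over the distinct length-n subwords of the word, of the product of the triplet weights read along each subword. The function name `topological_pressure`, the `weights` dictionary and the parameter `n` are hypothetical names chosen for illustration. If the window length is tied to n by |w| = 4^n + n − 1, as in the related topological entropy construction (also an assumption here), then n = 8 gives 65,543 bp, which would be consistent with windows of around 66,000 bp.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGT"


def uniform_weights(value=1.0):
    """Return a dict giving the same positive weight to all 64 triplets."""
    return {"".join(t): value for t in product(NUCLEOTIDES, repeat=3)}


def topological_pressure(word, weights, n):
    """Assumed form of the pressure of a finite word (illustrative only).

    Computes (1/n) * log_4 of the sum, over the DISTINCT length-n subwords
    u of `word`, of the product of weights[u[j:j+3]] for j = 0 .. n-3.
    """
    if n < 3 or len(word) < n:
        raise ValueError("need n >= 3 and len(word) >= n")
    # Distinct subwords of length n (each counted once, as for entropy).
    subwords = {word[i:i + n] for i in range(len(word) - n + 1)}
    total = 0.0
    for u in subwords:
        weight_product = 1.0
        for j in range(n - 2):
            weight_product *= weights[u[j:j + 3]]
        total += weight_product
    return math.log(total, 4) / n


if __name__ == "__main__":
    w = "ACGTACGT"
    # With all weights equal to 1 the sum just counts distinct subwords,
    # so the value reduces to an entropy-like quantity.
    print(topological_pressure(w, uniform_weights(1.0), 3))
```

With all 64 weights set to 1, the quantity depends only on the number of distinct subwords, so any fitted weights can be read as reweighting an entropy-like baseline. Training, in the sense of the abstract, would then mean choosing the 64 weights so that this value tracks the observed CDS density; that fitting step is not shown here.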


DNA sequence analysis · Coding sequence density estimation · Topological pressure

Mathematics Subject Classification (2000)

92D20 37N25 92-08 37D35 



D. K. was partially supported by NSF grant DMS-1008538 and D. T. was partially supported by NSF grants DMS-1101576 and DMS-1259311. Portions of this work were completed while D. K. was a postdoctoral fellow at the Mathematical Biosciences Institute of the Ohio State University, and while D. K. and D. T. were members of the Mathematics Department at the Pennsylvania State University. A preliminary version of this work is included in D. K.’s PhD thesis at Penn State. The authors wish to thank the anonymous referees of this paper, whose input has greatly benefited this work.



Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. Department of Mathematics, Oregon State University, Corvallis, USA
  2. Department of Mathematics, The Ohio State University, Columbus, USA
