Bulletin of Mathematical Biology

, Volume 69, Issue 7, pp 2429–2442 | Cite as

The Mystery of Two Straight Lines in Bacterial Genome Statistics

  • A. N. GorbanEmail author
  • A. Y. Zinovyev
Original Article


In special coordinates (codon position-specific nucleotide frequencies), bacterial genomes form two straight lines in 9-dimensional space: one line for eubacterial genomes, another for archaeal genomes. All the 348 distinct bacterial genomes available in Genbank in April 2007, belong to these lines with high accuracy. The main challenge now is to explain the observed high accuracy. The new phenomenon of complementary symmetry for codon position-specific nucleotide frequencies is observed. The results of analysis of several codon usage models are presented. We demonstrate that the mean-field approximation, which is also known as context-free, or complete independence model, or Segre variety, can serve as a reasonable approximation to the real codon usage. The first two principal components of codon usage correlate strongly with genomic G+C content and the optimal growth temperature, respectively. The variation of codon usage along the third component is related to the curvature of the mean-field approximation. First three eigenvalues in codon usage PCA explain 59.1%, 7.8% and 4.7% of variation. The eubacterial and archaeal genomes codon usage is clearly distributed along two third order curves with genomic G+C content as a parameter.


Codon usage Nucleotide frequency Context-free approximation Mean field Algebraic statistics Principal components Genome evolution 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. Besemer, J., Borodovsky, M., 1999. Heuristic approach to deriving models for gene finding. Nucleic Acids Res. 27(19), 3911–920. CrossRefGoogle Scholar
  2. Bharanidharan, D., Bhargavi, G.R., Uthanumallian, K., Gautham, N., 2004. Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem. Biophys. Res. Commun. 315, 1097–103. CrossRefGoogle Scholar
  3. Cangelosi, R., Goriely, A., 2007. Component retention in principal component analysis with application to cDNA microarray data. Biol. Direct 2, 2, doi: 10.1186/1745-6150-2-2 CrossRefGoogle Scholar
  4. Carbone, A., Kepes, F., Zinovyev, A., 2005. Codon bias signatures, organisation of microorganisms in codon space and lifestyle. Mol. Biol. Evol. 22, 547–61. CrossRefGoogle Scholar
  5. Carlon, E., Malki, M.L., Blossey, R., 2005. Exons, introns, and DNA thermodynamics. Phys. Rev. Lett. 94, 178101. CrossRefGoogle Scholar
  6. Chen, S.L., Lee, W., Hottes, A.K., Shapiro, L., McAdams, H.H., 2004. Codon usage between genomes is constrained by genome-wide mutational processes. PNAS 101(10), 3480–485. CrossRefGoogle Scholar
  7. Cluster structures in genomic word frequency distributions, 2004. Web-site:
  8. Frappat, L., Sciarrino, A., 2006. Conspiracy in bacterial genomes. Physica A 369, 699–13. CrossRefGoogle Scholar
  9. Gorban, A.N., Zinovyev, A.Y., 2004. The mystery of two straight lines in bacterial genome statistics. arXiv q-bio.GN/0412015 Google Scholar
  10. Gorban, A.N., Zinovyev, A.Y., Popova, T.G., 2005a. Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences. In Silico Biol. 5, 0025. On-line: Google Scholar
  11. Gorban, A., Popova, T., Zinovyev, A., 2005b. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. Physica A 353, 365–87. CrossRefGoogle Scholar
  12. Knight, R.D., Freeland, S.J., Landweber, L.F., 2001. A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol. 2, 0010.1–010.13 Google Scholar
  13. Lobry, J., 1997. Influence of genomic G+C content on average amino-acid composition of proteins from 59 bacterial species. Gene 205(1–2), 309–16. CrossRefGoogle Scholar
  14. Lobry, J.R., Sueoka, N., 2002. Asymmetric directional mutation pressures in bacteria. Genome Biol. 3(10), 0058. CrossRefGoogle Scholar
  15. Lobry, J.R., Chessel, D., 2003. Internal correspondence analysis of codon and amino-acid usage in thermophilic bacteria. J. Appl. Genet. 44(2), 235–61. Google Scholar
  16. Lynn, D.J., Gregory, A.C., Singer, G.A.C., Hickey, D.A., 2002. Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res. 30(19), 4272–277. CrossRefGoogle Scholar
  17. Minichini, C., Sciarrino, A., 2006. Mutation model for nucleotide sequences based on crystal basis. Biosystems 84, 191–06, arXiv q-bio.BM/0506010 CrossRefGoogle Scholar
  18. Muto, A., Osawa, S., 1987. The guanine and cytosine content of genomic DNA and bacterial evolution. Proc. Natl. Acad. Sci. USA 84, 166–69. CrossRefGoogle Scholar
  19. Pachter, L., Sturmfels, B. (Eds.), 2005. Algebraic Statistics for Computational Biology. Cambridge University Press, Cambridge. zbMATHGoogle Scholar
  20. Pachter, L., Sturmfels, B., 2007. The mathematics of phylogenomics. SIAM Rev. 49(1), 3–1. zbMATHCrossRefMathSciNetGoogle Scholar
  21. Singer, G.A.C., Hickey, D.A., 2000. Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol. 17, 1581–588. Google Scholar
  22. Sueoka, N., 1962. On the genetic basis of variation and heterogeneity of DNA base composition. Proc. Natl. Acad. Sci. USA 48, 582–92. CrossRefGoogle Scholar
  23. Sueoka, N., 1988. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. USA 85(8), 2653–657. CrossRefGoogle Scholar
  24. Wan, X.F., Xu, D., Kleinhofs, A., Zhou, J., 2004. Quantitative relationship between synonymous codon usage bias and GC composition across unicellular genomes. BMC Evol. Biol. 4(1), 19. CrossRefGoogle Scholar
  25. Yeramian, E., 2000a. Genes and the physics of the DNA double-helix. Gene 255, 139–50. CrossRefGoogle Scholar
  26. Yeramian, E., 2000b. The physics of DNA and the annotation of the Plasmodium falsiparum genome. Gene 255, 151–68. CrossRefGoogle Scholar
  27. Zinovyev, A.Y., Gorban, A.N., Popova, T.G., 2003. Self-organizing approach for automated gene identification. Open Syst. Inf. Dyn. 10, 321–33. zbMATHCrossRefMathSciNetGoogle Scholar

Copyright information

© Society for Mathematical Biology 2007

Authors and Affiliations

  1. 1.University of LeicesterLeicesterUK
  2. 2.Institut CurieParisFrance
  3. 3.Institut des Hautes Etudes ScientifiquesBures-sur-YvetteFrance

Personalised recommendations