Abstract
In special coordinates (codon position-specific nucleotide frequencies), bacterial genomes form two straight lines in 9-dimensional space: one line for eubacterial genomes and another for archaeal genomes. All 348 distinct bacterial genomes available in GenBank in April 2007 belong to these lines with high accuracy. The main challenge now is to explain this high accuracy. A new phenomenon of complementary symmetry for codon position-specific nucleotide frequencies is observed. The results of an analysis of several codon usage models are presented. We demonstrate that the mean-field approximation, also known as the context-free model, the complete independence model, or the Segre variety, can serve as a reasonable approximation to real codon usage. The first two principal components of codon usage correlate strongly with genomic G+C content and optimal growth temperature, respectively. The variation of codon usage along the third component is related to the curvature of the mean-field approximation. The first three eigenvalues in the codon usage PCA explain 59.1%, 7.8% and 4.7% of the variation. The codon usage of eubacterial and archaeal genomes is clearly distributed along two third-order curves with genomic G+C content as a parameter.
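To make the coordinates in the abstract concrete, the following is a minimal sketch, not the authors' code. It assumes a single in-frame coding sequence over the alphabet A, C, G, T. It computes the codon position-specific nucleotide frequencies, which give 12 numbers with three normalization constraints and hence 9 independent coordinates. It then builds the mean-field (complete independence) estimate of codon usage as the product of the three positional frequencies. The function names `position_frequencies`, `mean_field_codon_usage` and `to_9d` are hypothetical, chosen here for illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def position_frequencies(cds):
    """Return a list of three dicts: freqs[k][n] is the frequency of
    nucleotide n at codon position k (k = 0, 1, 2) in the sequence cds."""
    counts = [dict.fromkeys(NUCLEOTIDES, 0) for _ in range(3)]
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(usable):
        base = cds[i]
        if base in counts[i % 3]:  # skip ambiguous symbols such as N
            counts[i % 3][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({n: (c[n] / total if total else 0.0) for n in NUCLEOTIDES})
    return freqs

def mean_field_codon_usage(freqs):
    """Mean-field estimate: the frequency of codon abc is approximated by
    the product of the positional frequencies of a, b and c."""
    return {a + b + c: freqs[0][a] * freqs[1][b] * freqs[2][c]
            for a, b, c in product(NUCLEOTIDES, repeat=3)}

def to_9d(freqs):
    """Drop one nucleotide per position (T here, an arbitrary choice),
    since the four frequencies at each position sum to one."""
    return [freqs[k][n] for k in range(3) for n in "ACG"]

# Toy example with three codons: ATG, GCT, AAA
freqs = position_frequencies("ATGGCTAAA")
usage = mean_field_codon_usage(freqs)
point = to_9d(freqs)
```

In practice the frequencies would be pooled over all annotated coding sequences of a genome, so that each genome yields one point in the 9-dimensional space. That pooling step is an assumption about the workflow and is not shown here.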
Cite this article
Gorban, A.N., Zinovyev, A.Y. The Mystery of Two Straight Lines in Bacterial Genome Statistics. Bull. Math. Biol. 69, 2429–2442 (2007). https://doi.org/10.1007/s11538-007-9229-6