The Mystery of Two Straight Lines in Bacterial Genome Statistics

Gorban, A. N.; Zinovyev, A. Y.

doi:10.1007/s11538-007-9229-6

Original Article
Published: 19 June 2007

Volume 69, pages 2429–2442, (2007)

Bulletin of Mathematical Biology

A. N. Gorban^1,3 &
A. Y. Zinovyev^2,3

Abstract

In special coordinates (codon position-specific nucleotide frequencies), bacterial genomes form two straight lines in 9-dimensional space: one line for eubacterial genomes, another for archaeal genomes. All 348 distinct bacterial genomes available in GenBank in April 2007 belong to these lines with high accuracy. The main challenge now is to explain this high accuracy. A new phenomenon, complementary symmetry for codon position-specific nucleotide frequencies, is observed. The results of an analysis of several codon usage models are presented. We demonstrate that the mean-field approximation, also known as the context-free model, the complete independence model, or the Segre variety, can serve as a reasonable approximation to real codon usage. The first two principal components of codon usage correlate strongly with genomic G+C content and optimal growth temperature, respectively. The variation of codon usage along the third component is related to the curvature of the mean-field approximation. The first three eigenvalues in codon usage PCA explain 59.1%, 7.8% and 4.7% of the variation. The codon usage of eubacterial and archaeal genomes is clearly distributed along two third-order curves with genomic G+C content as a parameter.
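To make the coordinates in the abstract concrete, here is a minimal Python sketch of how codon position-specific nucleotide frequencies and a mean-field (complete independence) approximation of codon usage might be computed from a single coding sequence. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the toy sequence, and the choice to skip ambiguous triplets are all hypothetical. One plausible reading of the 9-dimensional space is that each of the three codon positions carries four nucleotide frequencies that sum to one, which leaves 3 x 3 = 9 independent coordinates.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"


def codon_counts(cds):
    """Count in-frame codons, skipping incomplete or ambiguous triplets (an assumption)."""
    cds = cds.upper()
    counts = Counter()
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if all(base in NUCLEOTIDES for base in codon):
            counts[codon] += 1
    return counts


def codon_frequencies(counts):
    """Normalize codon counts to frequencies over all 64 codons."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid codons found")
    codons = ("".join(c) for c in product(NUCLEOTIDES, repeat=3))
    return {codon: counts[codon] / total for codon in codons}


def position_frequencies(freqs):
    """Return p[k][base]: frequency of `base` at codon position k (0, 1, 2)."""
    p = [dict.fromkeys(NUCLEOTIDES, 0.0) for _ in range(3)]
    for codon, f in freqs.items():
        for k, base in enumerate(codon):
            p[k][base] += f
    return p


def mean_field(p):
    """Approximate each codon frequency as the product of its position-specific frequencies."""
    return {
        a + b + c: p[0][a] * p[1][b] * p[2][c]
        for a, b, c in product(NUCLEOTIDES, repeat=3)
    }


# Toy coding sequence (hypothetical): ATG GCT GCA AAA TAA
freqs = codon_frequencies(codon_counts("ATGGCTGCAAAATAA"))
p = position_frequencies(freqs)
approx = mean_field(p)
print(p[0])           # nucleotide frequencies at the first codon position
print(approx["ATG"])  # mean-field estimate for codon ATG
```

For a whole genome, the same functions would be applied to the concatenated codon counts of all annotated genes, and the observed `freqs` could then be compared with `approx` to see how closely real codon usage follows the independence model.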

Author information

Authors and Affiliations

University of Leicester, Leicester, UK
A. N. Gorban
Institut Curie, Paris, France
A. Y. Zinovyev
Institut des Hautes Etudes Scientifiques, Bures-sur-Yvette, France
A. N. Gorban & A. Y. Zinovyev

Corresponding author

Correspondence to A. N. Gorban.

About this article

Cite this article

Gorban, A.N., Zinovyev, A.Y. The Mystery of Two Straight Lines in Bacterial Genome Statistics. Bull. Math. Biol. 69, 2429–2442 (2007). https://doi.org/10.1007/s11538-007-9229-6

Received: 09 October 2006
Accepted: 04 May 2007
Published: 19 June 2007
Issue Date: October 2007
DOI: https://doi.org/10.1007/s11538-007-9229-6
