Organisms can essentially be classified according to two codon patterns
We recently classified 23 bacteria into two types based on their complete genomes: “S-type”, represented by Staphylococcus aureus, and “E-type”, represented by Escherichia coli. The classification was characterized by the concentrations of Arg, Ala or Lys in the amino acid composition calculated from the complete genome. Building on this classification, we investigated the genome structures of both prokaryotes and eukaryotes using amino acid compositions and nucleotide contents. A total of 145 organisms (112 bacteria, 15 archaea and 18 eukaryotes) were classified into two major groups by cluster analysis using GC contents at the three codon positions calculated from complete genomes. The 145 organisms were classified into “AT-type” and “GC-type”, characterized by high A or T (low G or C) and high G or C (low A or T) contents, respectively, at the third codon position of every codon. Reciprocal changes between G or C and A or T contents at the third codon position occurred almost synchronously in every codon among the organisms. Correlations between amino acid concentrations (Ala, Ile and Lys) and the nucleotide contents at the codon positions were obtained in both “AT-type” and “GC-type” organisms, but with different regression coefficients. In certain correlations of amino acid concentrations with GC contents, eukaryotes, archaea and bacteria behaved differently, suggesting that these kingdoms evolved differently. All organisms are basically classifiable into two groups with characteristic codon patterns: organisms with low GC and high AT contents at the third codon position and their derivatives, and organisms with the inverse relationship.
Keywords: Classification, Prokaryotes, Eukaryotes, Amino acid compositions, Codon usage, GC content, Cluster analysis
During the past decade, the genomes of more than 200 species have been completed through interagency and international collaborations. In particular, eukaryotic genomes consisting of huge numbers of genes have been analyzed, including the human genome (International Human Genome Sequencing Consortium 2001; Venter et al. 2001). The results have contributed greatly to the understanding of various organisms; characteristically, this understanding is based on the nucleotide or amino acid sequences of genes. Indeed, the recently developed science of proteomics is based on this concept, and protein–protein or protein–small molecule interactions have been investigated, resulting in new drug developments and a better understanding of diseases. Further, changes in nucleotide or amino acid sequences have been applied to evolutionary research (Dayhoff et al. 1977; Sogin et al. 1986; Woese et al. 1990; Doolittle and Brown 1994; Maizels and Weiner 1994; DePouplana et al. 1998; Sakagami et al. 2006), on the assumption that amino acid sequence changes are linked to biological evolution.
The basic pattern of cellular amino acid composition is conserved in various organisms from bacteria to mammalian cells (Sorimachi 1999; Sorimachi et al. 2000; 2001); and differences in cellular amino acid composition among organisms seem to reflect biological evolution. In addition, cellular amino acid compositions obtained experimentally resemble those conveniently calculated from a complete genome (Sorimachi et al. 2001). Using amino acid compositions, it is possible to compare among organisms not only the same genes but also gene assemblies consisting of various different genes that represent the complete genome (Sorimachi and Okayasu 2003; 2004a; 2005a, b; 2008a–c). Based on their complete genomes, bacteria are classifiable into two groups, “S-type” represented by Staphylococcus aureus and “E-type” represented by Escherichia coli (Sorimachi and Okayasu 2004b).
We recently showed that correlations between the contents of each nucleotide in a genome can be expressed by linear formulas (Sorimachi and Okayasu 2008a). The genomic GC content is strongly correlated with the average amino acid composition of proteins (Sueoka 1961), a relationship supported theoretically by Lobry (1997). Further, the genomic GC contents at the three codon positions have been applied to organism classification (Rowe et al. 1984; Takeuchi et al. 2003). Thus, to understand biological evolution, it appeared worthwhile to investigate how organisms from the different kingdoms would be classified on the basis of genomic structure. Although many cluster analyses have been based on amino acid and nucleotide sequence data (Rowe et al. 1984; Barloy-Hubler et al. 2001; Farlow et al. 2002; Martin et al. 2003; Dyhrman et al. 2006; Sakagami et al. 2006), in the present study cluster analyses using GC contents at the three codon positions, calculated from the complete genome, were applied to classify not only prokaryotes but also eukaryotes.
Graphic representation, or a diagram approach, to the study of complicated biological systems can provide an intuitive picture and useful insights. Various graphical approaches have been successfully used; for example, to study enzyme-catalyzed systems (Chou 1983, 1989, 1990; Chou et al. 1979; Kuzmic et al. 1992; Lin and Neet 1990; Zhou and Deng 1984), protein folding kinetics (Chou 1990), codon usage (Chou and Zhang 1992; Sorimachi and Okayasu 2003; 2004a; 2005a, b; 2008a, b, c; Zhang and Chou 1993), and HIV reverse transcriptase inhibition mechanisms (Althaus et al. 1993a–c; Chou et al. 1994). Cellular automaton images (Wolfram 1984, 2002) have also been used to represent biological sequences (Xiao et al. 2005b), to predict protein subcellular localization (Xiao et al. 2006b), to predict transmembrane regions in proteins (Diao et al. 2008), to predict the effect of HBV gene missense mutations on replication ratio (Xiao et al. 2005a), and to study hepatitis B viral infections (Xiao et al. 2006a). Graphic approaches have been used recently to represent DNA sequences (for example, Qi et al. 2007b), investigate p53 stress response networks (Qi et al. 2007a), analyze the network structure of amino acid metabolism (Shikata et al. 2007), study cellular signaling networks (Diao et al. 2007) and proteomics (González-Díaz et al. 2008), and for a systematic biology analysis of the Drosophila phagosome (Stuart et al. 2007).
In our earlier study (Sorimachi and Okayasu 2004b), amino acid compositions were calculated from all genes constituting a complete genome, using data obtained from GenomeNet (http://www.genome.ad.jp). In the present study, codon usage data were obtained from the Kazusa DNA Research Institute (http://www.kazusa.or.jp/codon). Where the genomic data in Kazusa’s table were incomplete, we calculated the amino acid compositions, codon usages and nucleotide contents of the organisms computationally as a replacement.
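As an illustration of this kind of calculation, the following minimal Python sketch derives GC contents at the three codon positions and an amino acid composition from a codon usage table. It is not the authors' program: the function names (`normalize`, `gc_by_position`, `amino_acid_composition`) and the toy counts are assumptions introduced here, the standard genetic code is assumed for every organism, and stop codons are counted for nucleotide contents but excluded from the amino acid composition.

```python
from collections import defaultdict

# Standard genetic code; bases are ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def normalize(codon_counts):
    """Upper-case codons and write U as T so RNA-style tables also work."""
    out = defaultdict(int)
    for codon, n in codon_counts.items():
        out[codon.upper().replace("U", "T")] += n
    return out


def gc_by_position(codon_counts):
    """Return the G+C fraction at codon positions 1, 2 and 3."""
    counts = normalize(codon_counts)
    total = sum(counts.values())
    gc = [0, 0, 0]
    for codon, n in counts.items():
        for pos, base in enumerate(codon):
            if base in "GC":
                gc[pos] += n
    return [g / total for g in gc]


def amino_acid_composition(codon_counts):
    """Return residue -> fraction, ignoring stop codons ('*')."""
    counts = normalize(codon_counts)
    aa = defaultdict(int)
    for codon, n in counts.items():
        residue = CODON_TABLE[codon]
        if residue != "*":
            aa[residue] += n
    total = sum(aa.values())
    return {residue: n / total for residue, n in aa.items()}


# Hypothetical toy codon counts (not real genome data).
toy_counts = {"GCC": 30, "AAG": 10, "AUU": 10}
gc1, gc2, gc3 = gc_by_position(toy_counts)
composition = amino_acid_composition(toy_counts)
```

With a real codon usage table, `toy_counts` would be replaced by the 64 codon counts of one organism; the three GC values then form that organism's trait vector.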
Amino acid compositions and nucleotide contents at the three codon positions were calculated computationally from codon usage databases. Cluster analysis was carried out using multivariate analysis software (version four), an add-in program for EXCEL purchased from ESUMI (Tokyo, Japan). Because this program limits the number of cluster elements to 50, representative species from the same family were examined to reduce the number of samples as far as possible, although the complete genomes of more than 200 species had already been analyzed. The program offers five clustering methods that differ in how the distance between two samples is calculated: Ward’s, nearest neighbor, furthest neighbor, group average and centroid methods. In the present analysis, only the widely used Ward’s method was applied.
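Ward’s method can be sketched in a few lines of Python. The code below is a stand-in for the EXCEL add-in, not a reproduction of it: it uses the standard Ward criterion, merging at each step the pair of clusters whose union gives the smallest increase in within-cluster sum of squares. The function name `ward_clusters`, the organism labels and the GC values are hypothetical and chosen only for illustration.

```python
def ward_clusters(points, labels, k=2):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters.

    Each cluster is stored as (member labels, centroid). The cost of merging
    clusters A and B is |A||B| / (|A| + |B|) * squared centroid distance,
    which equals the increase in within-cluster sum of squares.
    """
    clusters = [([label], list(p)) for label, p in zip(labels, points)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (mi, ci), (mj, cj) = clusters[i], clusters[j]
                ni, nj = len(mi), len(mj)
                d2 = sum((a - b) ** 2 for a, b in zip(ci, cj))
                cost = ni * nj / (ni + nj) * d2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        ni, nj = len(mi), len(mj)
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = [(ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj)]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((mi + mj, centroid))
    return sorted(sorted(members) for members, _ in clusters)


# Hypothetical trait vectors (GC1, GC2, GC3); not taken from any real genome.
labels = ["at1", "at2", "at3", "gc1", "gc2", "gc3"]
points = [
    (0.40, 0.30, 0.15),
    (0.42, 0.31, 0.20),
    (0.45, 0.33, 0.25),
    (0.65, 0.45, 0.85),
    (0.62, 0.44, 0.80),
    (0.60, 0.43, 0.75),
]
groups = ward_clusters(points, labels, k=2)
```

The pairwise search is quadratic at every merge step, which is acceptable for the 50 elements mentioned above but would be slow for much larger samples.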
Amino acid compositions
Amino acid compositions of four archaea, Halobacterium, Aeropyrum pernix, Sulfolobus solfataricus and Methanococcus jannaschii, were calculated from their complete genomes. Their amino acid compositions differed, although the basic “star-shape” pattern, based on high concentrations of Asp, Glu, Gly, Arg, Ala, Val, Ile and Leu, was preserved among them (Fig. 1b). The Ala, Ile and Lys concentrations varied greatly among the four archaea.
The patterns of amino acid compositions of four eukaryotes, Neurospora crassa (fungus), Homo sapiens (human), Plasmodium falciparum (protist) and Dictyostelium discoideum (cellular slime mold), were calculated from their complete genomes (Fig. 1c). Among them, the concentrations of Ala and Ile varied significantly. The characteristic shapes of the Leu and Ile relationship based on the H. sapiens and D. discoideum genomes were also observed in their cellular amino acid compositions (Sorimachi 1999), which were consistent with those of bacteria, as noted above.
Codon usage patterns
Codon usages of the four archaea presented in Fig. 2b were investigated. In Halobacterium, C or G contents at the third codon position were much higher than A or T in every codon, whereas the former was close to the latter in Aeropyrum pernix. On the other hand, A or T contents at the third codon position were higher than those of G or C in Sulfolobus solfataricus, and the former contents were much higher than the latter in Methanococcus jannaschii (Fig. 2b). These reciprocal changes occurred synchronously among different species.
In Neurospora crassa and Homo sapiens, C or G contents at the third codon position were higher than those of A or T (Fig. 2c). On the other hand, A or T contents at the third codon position were higher than those of C or G in Plasmodium falciparum and Dictyostelium discoideum. These reciprocal changes occurred synchronously among the four eukaryotes, as also observed in both bacteria and archaea.
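A codon-by-codon view of the third position can be sketched as follows. This is an illustrative Python fragment, not the procedure used to draw Fig. 2: it groups codons by their first two nucleotides and reports, for each group, the fraction of codons that end in G or C. The function name `gc3_by_family` and the counts are hypothetical.

```python
from collections import defaultdict


def gc3_by_family(codon_counts):
    """Map each two-base prefix to the fraction of its codons ending in G or C."""
    gc = defaultdict(int)
    total = defaultdict(int)
    for codon, n in codon_counts.items():
        prefix = codon[:2]
        total[prefix] += n
        if codon[2] in "GC":
            gc[prefix] += n
    return {prefix: gc[prefix] / total[prefix]
            for prefix in total if total[prefix] > 0}


# Hypothetical counts for two codon families (not real genome data).
toy_counts = {"GCT": 5, "GCC": 15, "GCA": 5, "GCG": 15, "AAA": 8, "AAG": 12}
family_gc3 = gc3_by_family(toy_counts)

# True when every family prefers G or C at the third position.
all_families_gc_rich = all(value > 0.5 for value in family_gc3.values())
```

If the fractions rise or fall together across all prefixes when one organism is compared with another, that corresponds to the synchronous change described above.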
Classification of organisms by cluster analysis
We previously demonstrated that the amino acid compositions of four archaea, Methanococcus jannaschii, Archaeoglobus fulgidus, Pyrococcus horikoshii and Methanobacterium autotrophicum, calculated from their complete genomes resembled the cellular amino acid compositions obtained experimentally from amino acid analyses of cell lysates (Sorimachi et al. 2001). Another 11 archaea with completely analyzed genomes were also examined in the present study. Using GC contents at the three different codon positions (Fig. 3b), the 15 archaea were classified into 2 major clusters. Two similar clusters were obtained using Ala, Arg and Lys as traits, although only A. fulgidus belonged to another group (Supplementary Fig. 1b).
Cluster analysis of 18 eukaryotes was carried out using GC contents at the three codon positions as traits (Fig. 3c). Two major clusters were formed. One contained a cluster of vertebrates, Homo sapiens (human), Mus musculus (mouse), Gallus gallus (bird), Rattus norvegicus (rat) and Danio rerio (fish), with Drosophila melanogaster (insect) lying close to the vertebrate cluster (Fig. 3c). Caenorhabditis elegans (nematode) was completely separated from the other animals. Arabidopsis thaliana (plant) was close to C. elegans, and the major cluster containing this plant was separate from the major cluster containing Oryza sativa (plant). The former plant is a dicotyledon and the latter a monocotyledon. In our previous study (Sorimachi et al. 2000), the cellular amino acid compositions of carrot and Torenia fournieri (both dicotyledons) differed from that of Cymbidium (a monocotyledon). Thus, phenotype expression is consistent with genotype in plants. Fungi and protists were distributed across both major clusters. Different classifications were obtained using Ala, Arg and Lys as traits (Supplementary Fig. 1c). Encephalitozoon cuniculi belongs to fungi, while a cluster consisting of A. thaliana, C. elegans and S. pombe belonged to a different cluster made up of another type of organism. When different components are used as traits, the sub-branches of the phylogenetic trees naturally change because the standards differ.
Archaea and eukaryotes were classified into two types. Vertebrates (D. rerio, H. sapiens, M. musculus, R. norvegicus) and insects (D. melanogaster) were incorporated into the same element, while another animal (C. elegans) was classified into a different major cluster. These results clearly demonstrated that vertebrate biological evolution occurred quite recently over a very short period. When Ala, Arg and Lys concentrations were used as traits to classify the 145 organisms, some “GC-type” and “AT-type” organisms belonged to the same cluster(s), as shown in Supplementary Fig. 1a–c. However, apart from a few exceptions, the organisms were classified into two major groups (Supplementary Fig. 2a). Additionally, similar classifications were obtained using the concentrations of the 20 amino acids or the 64 codons as traits (Supplementary Fig. 2b, c). These results indicate that codon usages are closely linked to amino acid expression.
Correlations of amino acid concentrations with nucleotide contents
When other nucleotide contents such as total C and total A contents at the three codon positions were used instead of the third GC content, good correlations of Ala concentrations with these nucleotide contents were obtained in both “GC-type” and “AT-type” organisms (Supplementary Fig. 3).
Ala correlation with Lys
Our expressions by amino acid composition or nucleotide content might appear coarse compared with counting the numbers of nucleotide or amino acid replacements in a gene or genome, but the percentage values, based on the primary sequences of amino acid residues and nucleotides, are absolute values that exclude deviations. For data based on complete genomes expressed as a nucleotide sequence, the standard deviation is zero. Our studies using amino acid compositions, codon usages and nucleotide contents are applicable to analyses not only of single genes but also of gene assemblies that consist of different genes. Indeed, using nucleotide contents at the three codon positions as traits in multivariate analysis, 145 organisms were classified into 50 elements in the present study (Fig. 4); therefore, based on the same standard, it would be possible to investigate all organisms from bacteria to mammalian cells. The present study demonstrated that amino acid compositions, codon usages and nucleotide contents, as well as amino acid or nucleotide sequences, are useful values for investigating genomic structures and biological evolution.
Ala or Lys concentration showed good correlation with GC content at the third codon position (Fig. 5a, b), and organisms classified into “GC-type” and “AT-type” were separated into two distributions by arranging in order of decreasing Ala or Lys concentration. Thus, when organisms classified into “GC-type” and “AT-type” based on GC contents can be separated into two distributions by arranging in order of decreasing levels of certain amino acid concentrations, the amino acid concentration correlates with GC content. Therefore, Ala, Gly, Pro, Arg and Val increase with GC content at the third codon position, while Lys, Phe, Ile, Asn and Tyr decrease with GC content (unpublished data). These results are consistent with other results obtained from total genomic GC content (Sueoka 1961; Lobry 1997).
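A correlation of this kind can be quantified with an ordinary least-squares fit, computed separately for each group so that the regression coefficients can be compared. The Python sketch below is an assumption-laden illustration, not the authors' calculation: the function name `fit_line`, the group labels and all numerical values are hypothetical and were chosen so that the arithmetic is easy to follow.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical (GC3, Ala fraction) pairs per group; not measured values.
data = {
    "AT-type": [(0.10, 0.040), (0.20, 0.045), (0.30, 0.050), (0.40, 0.055)],
    "GC-type": [(0.60, 0.080), (0.70, 0.095), (0.80, 0.110), (0.90, 0.125)],
}

fits = {}
for group, pairs in data.items():
    gc3_values = [gc3 for gc3, _ in pairs]
    ala_values = [ala for _, ala in pairs]
    fits[group] = fit_line(gc3_values, ala_values)
```

Here each toy group lies exactly on a line, so `r` is 1 in both; with real data the slopes and correlation coefficients of the two groups would be read from `fits` and compared.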
High GC contents at the third codon position were linked to high Ala and Arg concentrations or low Lys concentrations. Additionally, in the cluster analyses GC contents at the first and second codon positions (Fig. 3) strongly contributed to differentiating the two groups. Thus, in biological evolution intra-codon alterations appear to be strongly controlled by amino acid composition. As these relationships were observed in archaea and eukaryotes (Fig. 2), biological evolution appears to progress under this form of control in all organisms. Correlations between the contents of each nucleotide in a genome can be expressed by linear formulas (Sorimachi and Okayasu 2008a). Eukaryotes, archaea and bacteria behaved differently from each other with respect to correlations of certain amino acid concentrations with nucleotide contents (Fig. 5a–c). Consistent results were obtained from nucleotide alterations in various organisms (Sorimachi and Okayasu 2008a). Thus, biological evolution can be said to progress differently among different kingdoms.
Although codon usage patterns differ among bacterial species (Fig. 2a), their amino acid composition patterns based on complete genomes resemble each other (Sorimachi and Okayasu 2004a). The latter point indicates that the basic pattern of amino acid compositions based on complete genomes is conserved among bacterial species (Sorimachi and Okayasu 2004b), and this was shown experimentally in our previous studies of phenotype (Sorimachi 1999; Sorimachi et al. 2000; 2001). GC or AT contents at the first and second codon positions apparently influence the third codon position, as shown in Fig. 2. Thus, the conservation of the basic pattern of amino acid compositions eventually induces a reciprocal relationship between C or G and T or A at the third codon position in codons that have the same two nucleotides at the first and second codon positions, which is observed as GC biases among various genes (Sueoka 1988). The relationship between GC and AT contents at the third codon position is comparatively conserved in every codon (Fig. 2), and their changes eventually appeared synchronously among different species. As these changes occurred even in degenerate codons, they are apparently based on neutral mutations, although these mutations are obviously controlled by particular forces in the degenerate codons. Based on a random choice of nucleotides or amino acids from a pool of a certain composition, simulation analyses suggested that codon formation chronologically followed protein formation in the origin of life (Sorimachi and Okayasu 2008b). This conclusion depends strictly on the compositions of nucleotides or amino acids. Similarly, once the nucleotide composition of a genome is determined, random mutations are strongly controlled by nucleotide composition even in degenerate codons. Thus, the effect of composition is equivalent to a particular force.
Random mutation is also supported by the relationship between amino acid frequencies and codon usages among various genes (King and Jukes 1969). Based on mathematical calculations, other groups have proposed neutral mutation in biological evolution (Kimura 1977).
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- Chou KC, Jiang SP, Liu WM et al (1979) Graph theory of enzyme kinetics: 1. Steady-state reaction system. Sci Sin 22:341–358
- Dayhoff MO, Park CM, McLaughlin PJ (1977) Building a phylogenetic tree: cytochrome C. In: Atlas of protein sequence and structure, vol 5. National Biomedical Research Foundation, Washington, DC, pp 7–16
- González-Díaz H, González-Díaz Y, Santana L et al (2008) Proteomics, networks, and connectivity indices. Proteomics. doi:10.1002/pmic.200700638
- Mayr E (1965) Animal species and evolution. Harvard University Press, Cambridge
- Sakagami M, Nakayama T, Hashimoto T et al (2006) Phylogeny of the centrohelida inferred from SSU rRNA, tubulin, and actin genes. J Mol Evol 61:765–775
- Sorimachi K, Okayasu T (2005a) Simulation analysis of genomic amino acid composition homogeneity based on putative small units. In: Proceedings of the 9th world multi-conference on systemics, cybernetics and informatics, Orlando, Florida, USA, vol VI, pp 190–196
- Sorimachi K, Okayasu T (2005b) Genomic structure consisting of putative units coding similar amino acid composition: synchronous mutations in biological evolution. Dokkyo J Med Sci 32:101–106
- Sorimachi K, Okayasu T (2008a) Codon evolution is governed by linear formulas. Amino Acids. doi:10.1007/s00726-007-0024-3
- Sorimachi K, Okayasu T (2008b) Mathematical proof of the chronological precedence of protein formation over codon formation. Curr Top Pept Protein Res (in press)
- Sorimachi K, Okayasu T (2008c) Genome structure is homogeneous based on codon usages. Curr Top Pept Protein Res (in press)
- Wolfram S (2002) A new kind of science. Wolfram Media Inc., Champaign