Introduction

During the past decade, the genomes of more than 200 species have been completed through interagency and international collaborations. In particular, eukaryotic genomes comprising very large numbers of genes have been analyzed, including the human genome (International Human Genome Sequencing Consortium 2001; Venter et al. 2001). The results have greatly contributed to progress in the understanding of various organisms; characteristically, this understanding is based on nucleotide or amino acid sequences of genes. Indeed, the recently developed science of proteomics is based on this concept, and protein–protein or protein–small molecule interactions have been investigated, resulting in new drug developments and a better understanding of diseases. Further, changes in nucleotide or amino acid sequences have been applied to evolutionary research (Dayhoff et al. 1977; Sogin et al. 1986; Woese et al. 1990; Doolittle and Brown 1994; Maizels and Weiner 1994; DePouplana et al. 1998; Sakagami et al. 2006), on the assumption that amino acid sequence changes are linked to biological evolution.

The basic pattern of cellular amino acid composition is conserved in various organisms from bacteria to mammalian cells (Sorimachi 1999; Sorimachi et al. 2000; 2001), and differences in cellular amino acid composition among organisms seem to reflect biological evolution. In addition, cellular amino acid compositions obtained experimentally resemble those conveniently calculated from a complete genome (Sorimachi et al. 2001). Using amino acid compositions, it is possible to compare among organisms not only the same genes but also gene assemblies consisting of various different genes that represent the complete genome (Sorimachi and Okayasu 2003; 2004a; 2005a, b; 2008a–c). Based on their complete genomes, bacteria are classifiable into two groups, “S-type” represented by Staphylococcus aureus and “E-type” represented by Escherichia coli (Sorimachi and Okayasu 2004b).

We recently showed that correlations between the contents of each nucleotide in a genome can be expressed by linear formulas (Sorimachi and Okayasu 2008a). The genomic GC content is strongly correlated with the average amino acid composition of proteins (Sueoka 1961), a relationship that is theoretically supported by Lobry (1997). Further, the genomic GC contents at the three codon positions have been applied to organism classification (Rowe et al. 1984; Takeuchi et al. 2003). Thus, how organisms would be classified across the different kingdoms on the basis of genomic structure appeared worth investigating in order to understand biological evolution. Therefore, in the present study, cluster analyses using GC contents calculated from the complete genome at the three codon positions were applied to classify not only prokaryotes but also eukaryotes, although there are already many cluster analyses based on amino acid and nucleotide sequence data (Rowe et al. 1984; Barloy-Hubler et al. 2001; Farlow et al. 2002; Martin et al. 2003; Dyhrman et al. 2006; Sakagami et al. 2006).

Graphic representation, or a diagram approach, to the study of complicated biological systems can provide an intuitive picture and useful insights. Various graphical approaches have been successfully used; for example, to study enzyme-catalyzed systems (Chou 1983, 1989, 1990; Chou et al. 1979; Kuzmic et al. 1992; Lin and Neet 1990; Zhou and Deng 1984), protein folding kinetics (Chou 1990), codon usage (Chou and Zhang 1992; Sorimachi and Okayasu 2003; 2004a; 2005a, b; 2008a, b, c; Zhang and Chou 1993), and HIV reverse transcriptase inhibition mechanisms (Althaus et al. 1993a–c; Chou et al. 1994). Cellular automaton images (Wolfram 1984, 2002) have also been used to represent biological sequences (Xiao et al. 2005b), to predict protein subcellular localization (Xiao et al. 2006b), to predict transmembrane regions in proteins (Diao et al. 2008), to predict the effect of missense mutations in HBV genes on the replication ratio (Xiao et al. 2005a), and to study hepatitis B viral infections (Xiao et al. 2006a). Graphic approaches have been used recently to represent DNA sequences (for example, Qi et al. 2007b), to investigate p53 stress response networks (Qi et al. 2007a), to analyze the network structure of amino acid metabolism (Shikata et al. 2007), to study cellular signaling networks (Diao et al. 2007) and proteomics (González-Díaz et al. 2008), and for a systematic biology analysis of the Drosophila phagosome (Stuart et al. 2007).

Methods

Genomic data

In our earlier study (Sorimachi and Okayasu 2004b), amino acid composition was calculated from all genes constructing a complete genome and from data obtained from GenomeNet (http://www.genome.ad.jp). In the present study, codon usage databases were obtained from the Kazusa DNA Research Institute (http://www.kazusa.or.jp/codon). However, we computationally calculated amino acid compositions, codon usages and the nucleotide contents of organisms to replace genomic data that was incomplete in Kazusa’s table.

Calculations

Amino acid compositions and nucleotide contents at various codon positions were computationally calculated from codon usage databases. Cluster analysis was carried out using multivariate analysis software (version four), developed as an add-in program for EXCEL and purchased from ESUMI (Tokyo, Japan). Because the number of cluster elements in this program was limited to 50, representative species from the same family were examined to reduce the sample numbers as much as possible, although the complete genomes of more than 200 species had already been analyzed. The cluster analysis offers five methods that differ in how the distance between two samples is estimated: Ward’s, nearest neighbor, furthest neighbor, group average and centroid methods. However, in the present analysis only the widely used Ward’s method was applied.
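As an illustration of the calculation described above, the following minimal Python sketch derives the GC content at each of the three codon positions from a codon usage table. It is not the software used in this study; the function name gc_by_codon_position and the input layout (a dictionary mapping each codon to its count) are assumptions made for the example, and stop codons are counted like any other codon unless they are removed beforehand.

```python
def gc_by_codon_position(codon_counts):
    """Return the GC fraction at codon positions 1, 2 and 3.

    codon_counts is a hypothetical mapping such as {"GCT": 120, "AAA": 95};
    RNA-style codons (with U) are accepted and treated as T.
    """
    total = sum(codon_counts.values())
    if total == 0:
        raise ValueError("codon usage table is empty")
    gc = [0, 0, 0]
    for codon, count in codon_counts.items():
        bases = codon.upper().replace("U", "T")
        for position, base in enumerate(bases):
            if base in "GC":
                gc[position] += count
    # Each value is the share of codons with G or C at that position.
    return tuple(value / total for value in gc)


# Toy table, not real genomic data.
example = {"GCT": 2, "AAA": 1, "TTC": 1}
gc1, gc2, gc3 = gc_by_codon_position(example)
print(gc1, gc2, gc3)
```

The three returned fractions correspond to the three traits per organism used in the cluster analyses.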

Results

Amino acid compositions

Bacteria, comprising 11 Gram-positive and 12 Gram-negative species, were classified into “S-type” and “E-type” groups based on differences in concentrations of Arg, Ala or Lys in their 17 amino acid compositions as calculated from all the genes in the genome (Sorimachi and Okayasu 2004a). In “S-type”, Ala and Arg concentrations were lower than those in “E-type”, while Lys concentrations were higher. The pattern of Mycobacterium tuberculosis amino acid composition resembled that of E. coli, which represents “E-type”, and the pattern of Ureaplasma urealyticum resembled that of S. aureus, which represents “S-type”, as shown in Fig. 1a. Shapes based on the relationship between Leu and Ile concentrations differed between “S-type” and “E-type”. Two characteristic shapes were identified in the cellular amino acid compositions of S. aureus and E. coli (Sorimachi 1999). Thus, phenotype expression is consistent with genotype expression in these two amino acids, indicating that data based on a complete genome are linked to biological meaning. Further, the two patterns were clearly characterized by concentrations of Ala, Arg or Lys. Similarly, radar charts have been used to illustrate differences in amino acid compositions to predict protein subcellular localization (Chou and Elrod 1999). Radar charts have also been applied, in a different manner, to show subsite coupling in peptides cleavable by HIV protease and the interaction of HIV protease with proteins.

Fig. 1

Amino acid compositions of various organisms. Amino acid compositions are expressed on radar charts. Asn and Gln were calculated as Asp and Glu, respectively, and Trp, having concentrations less than 1%, was omitted from this presentation (Sorimachi 1999)
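The convention stated in this caption (Asn and Gln counted as Asp and Glu, Trp omitted) can be sketched in Python as follows. This is an illustrative sketch only: the function name radar_composition, the dictionary input and the use of the standard genetic code are assumptions, and the choice to express percentages relative to all residues (Trp included in the denominator) is likewise an assumption rather than a detail taken from the study.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def radar_composition(codon_counts):
    """Return amino acid percentages with Asn->Asp, Gln->Glu and Trp dropped."""
    fold = {"N": "D", "Q": "E"}
    totals = {}
    for codon, count in codon_counts.items():
        aa = CODON_TO_AA[codon.upper().replace("U", "T")]
        if aa == "*":
            continue  # stop codons do not contribute residues
        aa = fold.get(aa, aa)
        totals[aa] = totals.get(aa, 0) + count
    all_residues = sum(totals.values())
    # Assumption: Trp stays in the denominator but is not reported.
    return {aa: 100 * n / all_residues for aa, n in totals.items() if aa != "W"}


# Toy table, not real genomic data.
print(radar_composition({"AAT": 1, "GAT": 1, "TGG": 1, "AAA": 1, "TAA": 5}))
```

With a complete codon usage table, the result has 17 entries, matching the 17 amino acid compositions plotted on the radar charts.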

Amino acid compositions of four archaea, Halobacterium, Aeropyrum pernix, Sulfolobus solfataricus and Methanococcus jannaschii, were calculated from their complete genomes. Their amino acid compositions differed, although the basic pattern of a “star-shape”, based on high concentrations of Asp, Glu, Gly, Arg, Ala, Val, Ile and Leu, was preserved among them (Fig. 1b). Ala, Ile and Lys concentrations varied greatly among the four archaea.

The patterns of amino acid compositions of four eukaryotes, Neurospora crassa (fungi), Homo sapiens (human), Plasmodium falciparum (protista) and Dictyostelium discoideum (cellular slime mold), were calculated from their complete genomes (Fig. 1c). Among them, the concentrations of Ala and Ile varied significantly. The characteristic shapes of the Leu and Ile relationship based on the H. sapiens and D. discoideum genomes were also observed in their cellular amino acid compositions (Sorimachi 1999), which was consistent with the results for bacteria, as noted above.

Codon usage patterns

The four nucleotide frequencies in human (Zhang and Chou 1993; 1996) and E. coli (Zhang and Chou 1994a, b) genes were graphically presented by a point in a three-dimensional space. Meanwhile, a similar codon usage approach was used to analyze HIV (Chou and Zhang 1992) and anti-sense (Chou et al. 1996) proteins. Codon usage patterns were compared among the four bacteria presented in Fig. 2a. In Mycobacterium tuberculosis, which was “E-type”, C or G contents at the third codon position were very high, while A or T contents at the third codon position were very low, when compared among codons sharing the same nucleotides at the first and second codon positions. These relationships were independent of codon degeneracy. In the present study, “E-type” is called “GC-type”. In E. coli, which represents the “E-type”, C or G contents at the third codon position were lower and A or T contents were higher than in M. tuberculosis. On the other hand, C or G contents at the third codon position were much lower than the A or T contents in S. aureus, which represents the “S-type” (Fig. 2a), and C or G contents were far lower than the A or T contents in U. urealyticum. GC contents at the third codon position varied synchronously across the 64 codons among different organisms. In the present study, “S-type” is called “AT-type”.
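The comparison described above, between C- or G-ending and A- or T-ending codons that share the same first two nucleotides, can be sketched as follows. The function name gc3_share_by_family and the dictionary input are assumptions introduced for illustration; the sketch is not the procedure used to draw Fig. 2.

```python
from collections import defaultdict


def gc3_share_by_family(codon_counts):
    """For each first-two-nucleotide family, return the share of codons ending in G or C."""
    gc_ending = defaultdict(int)
    family_total = defaultdict(int)
    for codon, count in codon_counts.items():
        bases = codon.upper().replace("U", "T")
        prefix, third = bases[:2], bases[2]
        family_total[prefix] += count
        if third in "GC":
            gc_ending[prefix] += count
    # Families with no observed codons are skipped to avoid division by zero.
    return {
        prefix: gc_ending[prefix] / total
        for prefix, total in family_total.items()
        if total > 0
    }


# Toy table, not real genomic data: the GC family favors C or G endings,
# while the AA family favors A or T endings.
print(gc3_share_by_family({"GCT": 1, "GCC": 3, "AAA": 4, "AAG": 1}))
```

A share well above 0.5 in most families would correspond to the “GC-type” pattern, and a share well below 0.5 to the “AT-type” pattern.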

Fig. 2

Codon usage patterns. a bacteria, b archaea, c eukaryotes. Codons that have C or G at the third codon position are shown in red (online version) and those that have A or T at the third codon position are shown in blue. The horizontal axis represents the codon and amino acid

Codon usages of the four archaea presented in Fig. 2b were investigated. In Halobacterium, C or G contents at the third codon position were much higher than A or T in every codon, whereas the former was close to the latter in Aeropyrum pernix. On the other hand, A or T contents at the third codon position were higher than those of G or C in Sulfolobus solfataricus, and the former contents were much higher than the latter in Methanococcus jannaschii (Fig. 2b). These reciprocal changes occurred synchronously among different species.

In Neurospora crassa and Homo sapiens, C or G contents at the third codon position were higher than those of A or T (Fig. 2c). On the other hand, A or T contents at the third codon position were higher than those of C or G in Plasmodium falciparum and Dictyostelium discoideum. These reciprocal changes occurred synchronously among the four eukaryotes, as was also observed in both bacteria and archaea.

Classification of organisms by cluster analysis

GC contents at the third codon position differ among various organisms (Sorimachi and Okayasu 2004b, 2008a); therefore, GC contents at three different codon positions were calculated from complete genomes. To classify 112 bacteria, cluster analyses were carried out. Using GC contents at the three codon positions as traits, a classification into just two groups was obtained (Fig. 3a). When cluster analyses using Ala, Arg and Lys as traits were applied to the bacteria, slightly different classifications were obtained (Supplementary Fig. 1a). Some organisms were distributed into the other group.
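To make the clustering step concrete, the following is a minimal pure-Python sketch of agglomerative clustering with Ward’s criterion, applied to points whose three coordinates stand for the GC contents at the three codon positions. It is not the EXCEL add-in used in this study; the function name ward_clusters and the example values are hypothetical. At each step, the pair of clusters whose merger gives the smallest increase in within-cluster sum of squares is merged.

```python
def ward_clusters(points, k=2):
    """Merge clusters by Ward's criterion until k clusters remain.

    points is a list of equal-length numeric tuples; the return value is a
    list of k lists containing the indices of the points in each cluster.
    """
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                members_a, centroid_a = clusters[a]
                members_b, centroid_b = clusters[b]
                na, nb = len(members_a), len(members_b)
                dist2 = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * dist2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        members_a, centroid_a = clusters[a]
        members_b, centroid_b = clusters[b]
        na, nb = len(members_a), len(members_b)
        merged_centroid = [
            (na * x + nb * y) / (na + nb) for x, y in zip(centroid_a, centroid_b)
        ]
        merged = (members_a + members_b, merged_centroid)
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return [sorted(members) for members, _ in clusters]


# Hypothetical (GC1, GC2, GC3) values for four organisms, not real data.
traits = [(0.70, 0.50, 0.90), (0.68, 0.48, 0.85), (0.40, 0.30, 0.20), (0.42, 0.31, 0.25)]
print(ward_clusters(traits, k=2))
```

Stopping at k = 2 mirrors the two major groups discussed here; recording each merge and its cost instead would yield the information needed to draw a dendrogram.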

Fig. 3

Dendrogram of organism classifications obtained utilizing the Ward method. As traits, GC contents at the three codon positions were used. a 112 bacteria, b 15 archaea, c 18 eukaryotes. Blue characters (online version; Group II) represent “AT-type” and red characters (Group I) represent “GC-type”

We previously demonstrated that the amino acid compositions of four archaea, Methanococcus jannaschii, Archaeoglobus fulgidus, Pyrococcus horikoshii and Methanobacterium autotrophicum, calculated from their complete genomes resembled the cellular amino acid compositions obtained experimentally from amino acid analyses of cell lysates (Sorimachi et al. 2001). Another 11 archaea with completely analyzed genomes were also examined in the present study. Using GC contents at the three different codon positions (Fig. 3b), the 15 archaea were classified into 2 major clusters. Two similar clusters were obtained using Ala, Arg and Lys as traits, although only A. fulgidus belonged to the other group (Supplementary Fig. 1b).

Cluster analysis of 18 eukaryotes was carried out, using GC contents at the three codon positions as traits (Fig. 3c). Two major clusters were formed, one of which contained a cluster of vertebrates, Homo sapiens (human), Mus musculus (mouse), Gallus gallus (bird), Rattus norvegicus (rat) and Danio rerio (fish), with Drosophila melanogaster (insect) lying close to the vertebrate cluster (Fig. 3c). Caenorhabditis elegans (nematode) was completely separated from other animals. Arabidopsis thaliana (plant) was close to C. elegans, while the major cluster containing this plant was separated from the other major cluster, which contained Oryza sativa (plant). The former and latter plants are a dicotyledon and a monocotyledon, respectively. In our previous study (Sorimachi et al. 2000), the cellular amino acid compositions of carrot and Torenia fournieri (both dicotyledons) differed from that of Cymbidium (a monocotyledon). Thus, phenotype expression is consistent with that of genotype in plants. Fungi and protists were distributed into both major clusters. Different classifications were obtained using Ala, Arg and Lys as traits (Supplementary Fig. 1c). Encephalitozoon cuniculi belongs to fungi, while a cluster that consists of A. thaliana, C. elegans and S. pombe belonged to a different cluster consisting of another type of organisms. When different components were used as traits, the sub-branches of the phylogenetic trees naturally changed because the standards differed.

All organisms

When all organisms (112 bacteria, 15 archaea and 18 eukaryotes) were subjected simultaneously to cluster analysis using GC contents at the three codon positions as traits, two major clusters were observed among the 145 organisms analyzed (Fig. 4).

Fig. 4

Dendrogram of the classifications of 145 organisms obtained utilizing the Ward method. As traits, GC contents at the three codon positions were used. Blue characters (online version; Group II) represent “AT-type” and red characters (Group I) represent “GC-type”. Dark yellow (online version) and gray boxes represent eukaryotes and archaea, respectively

Archaea and eukaryotes were classified into two types. Vertebrates (D. rerio, H. sapiens, M. musculus, R. norvegicus) and insects (D. melanogaster) were incorporated into the same element, while another animal (C. elegans) was classified into a different major cluster. These results clearly demonstrated that vertebrate biological evolution occurred quite recently over a very short period. When Ala, Arg and Lys concentrations were used as traits to classify the 145 organisms, some “GC-type” and “AT-type” organisms belonged to the same cluster(s), as shown in Supplementary Fig. 1a–c. However, all organisms were apparently classified into two major groups except for some organisms (Supplementary Fig. 2a). Additionally, using 20 amino acid concentrations or 64 codons as traits, similar classifications were obtained (Supplementary Fig. 2b, c). These results indicate that codon usages are closely linked to the amino acid expression.

Correlations of amino acid concentrations with nucleotide contents

When Ala concentrations were plotted against GC contents at the third codon position, good correlations were obtained in both “AT-type” organisms (r = 0.73) and “GC-type” organisms (r = 0.83) (Fig. 5a). The regression lines between Ala concentration and GC content had slightly different slopes in the two types of organisms. Eukaryotes and archaea were located below the regression line in “AT-type” organisms, and a similar result was obtained in “GC-type” organisms, with two exceptions among eukaryotes and one exception among archaea. Similarly, when Ile concentrations were used, good correlations were obtained in both “AT-type” (r = 0.77) and “GC-type” (r = 0.65) organisms (Fig. 5b). Archaea and eukaryotes were located above and below the regression line, respectively, in “AT-type” organisms. In “GC-type” organisms, eukaryotes were located below the regression line with one exception. In addition, correlations were also obtained between Lys concentrations and GC contents at the third codon position in both “AT-type” organisms (r = 0.73) and “GC-type” organisms (r = 0.70) (Fig. 5c). Characteristically, Lys concentrations were nearly constant among the 18 eukaryotes examined in the present study.
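The correlation coefficients and regression lines reported above are of the standard kind, and a minimal sketch of such a calculation is given below. The function name pearson_and_line and the numerical values are hypothetical and serve only to illustrate the arithmetic; they are not data from this study.

```python
from math import sqrt


def pearson_and_line(xs, ys):
    """Return (r, slope, intercept) for a least-squares line of ys on xs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)          # Pearson correlation coefficient
    slope = sxy / sxx                  # least-squares slope
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


# Hypothetical third-position GC fractions and Ala percentages.
gc3 = [0.2, 0.4, 0.6, 0.8]
ala = [5.0, 7.0, 9.0, 11.0]
r, slope, intercept = pearson_and_line(gc3, ala)
print(round(r, 3), round(slope, 3), round(intercept, 3))
```

Running the function separately on the “AT-type” and “GC-type” organisms gives one r value and one regression line per group, which is how slopes can be compared between the two types.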

Fig. 5

Correlation of Ala, Ile or Lys concentration with GC content at the third codon position. a Ala correlation with GC content, b Ile correlation with GC content, c Lys correlation with GC content. “GC-type” and “AT-type” are presented in red (online version; right side half) and blue (left side half), respectively. Diamond shape, closed circle and closed triangle represent bacteria, eukaryotes and archaea, respectively

When other nucleotide contents such as total C and total A contents at the three codon positions were used instead of the third GC content, good correlations of Ala concentrations with these nucleotide contents were obtained in both “GC-type” and “AT-type” organisms (Supplementary Fig. 3).

Ala correlation with Lys

When Ala concentration increased in various organisms, Lys concentration correspondingly decreased (Fig. 1). Therefore, when the former was plotted against the latter, good correlations were obtained in both “AT-type” organisms (r = 0.78) and “GC-type” organisms (r = 0.85) (Fig. 6), with the regression line slope differing between the two groups. Ala, Arg and Lys concentrations, which seemed strongly linked to biological evolution, were used unless otherwise stated.

Fig. 6

Correlation of Ala with Lys concentration. “GC-type” and “AT-type” are presented in red (online version; left side half) and blue (right side half), respectively. Diamond shape, closed circle and closed triangle represent bacteria, eukaryotes and archaea, respectively

Discussion

Our expressions by amino acid composition or nucleotide content might appear rough compared to counting replacement numbers of nucleotides or amino acids in a gene or genome, but the values in the percentage calculations, based on the primary sequences of amino acid residues and nucleotides, are absolute values that exclude deviations. Using data based on complete genomes expressed by a nucleotide sequence, the standard deviation is null. Our studies using amino acid compositions, codon usages and nucleotide contents are applicable to analyses not only of single genes but also of gene assemblies that consist of different genes. Indeed, using nucleotide contents at the three codon positions as traits in multivariate analysis, 145 organisms were classified into 50 elements in the present study (Fig. 4); therefore, based on the same standard, it would be possible to investigate all organisms from bacteria to mammalian cells. The present study demonstrated that amino acid compositions, codon usages and nucleotide contents as well as amino acid or nucleotide sequences are useful values to investigate genomic structures and biological evolution.

Ala or Lys concentration showed good correlation with GC content at the third codon position (Fig. 5a, c), and organisms classified into “GC-type” and “AT-type” were separated into two distributions by arranging them in order of decreasing Ala or Lys concentration. Thus, when organisms classified into “GC-type” and “AT-type” based on GC contents can be separated into two distributions by arranging them in order of decreasing concentration of a certain amino acid, the concentration of that amino acid correlates with GC content. On this basis, Ala, Gly, Pro, Arg and Val increase with GC content at the third codon position, while Lys, Phe, Ile, Asn and Tyr decrease with GC content (unpublished data). These results are consistent with other results obtained from total genomic GC content (Sueoka 1961; Lobry 1997).

High GC contents at the third codon position were linked to high Ala and Arg concentrations or low Lys concentrations. Additionally, in the cluster analyses GC contents at the first and second codon positions (Fig. 3) strongly contributed to differentiating the two groups. Thus, in biological evolution intra-codon alterations appear to be strongly controlled by amino acid composition. As these relationships were also observed in archaea and eukaryotes (Fig. 2), biological evolution appears to progress under this form of control in all organisms. Correlations between the contents of each nucleotide in a genome can be expressed by linear formulas (Sorimachi and Okayasu 2008a). Eukaryotes, archaea and bacteria behaved differently from each other with respect to correlations of certain amino acid concentrations with nucleotide contents (Fig. 5a–c). Consistent results were obtained from nucleotide alterations in various organisms (Sorimachi and Okayasu 2008a). Thus, biological evolution can be said to progress differently among different kingdoms.

Although codon usage patterns differ among bacterial species (Fig. 2a), their amino acid composition patterns based on complete genomes resemble each other (Sorimachi and Okayasu 2004a). The latter point indicates that the basic pattern of amino acid compositions based on complete genomes is conserved among bacterial species (Sorimachi and Okayasu 2004b), and this was experimentally confirmed in our previous studies of phenotype (Sorimachi 1999; Sorimachi et al. 2000; 2001). GC or AT contents at the first and second codon positions apparently influence the third codon position, as shown in Fig. 2. Thus, the conservation of the basic pattern of amino acid compositions eventually induces a reciprocal relationship between C or G and T or A at the third codon position in codons that have the same two nucleotides at the first and second codon positions, something observed as GC biases among various genes (Sueoka 1988). The relationship between GC and AT contents at the third codon position is comparatively conserved in every codon (Fig. 2), and their changes eventually appeared synchronously among different species. As these changes occurred even in degenerate codons, they are apparently based on neutral mutations, although these mutations are obviously controlled by particular forces in degenerate codons. Based on a random choice of nucleotides or amino acids drawn from a certain composition, simulation analyses suggested that codon formation chronologically followed protein formation in the origin of life (Sorimachi and Okayasu 2008b). This conclusion is strictly controlled by the compositions of nucleotides or amino acids. Similarly, once the nucleotide composition in a genome is determined, random mutations are strongly controlled by nucleotide composition even in degenerate codons. Thus, the effect of composition is equal to a particular force. Random mutation is also supported by the relationship between amino acid frequencies and codon usages among various genes (King and Jukes 1989). Based on mathematical calculations, other groups have proposed neutral mutation in biological evolution (Kimura 1977).

In the present study, organisms with low GC and high AT contents at the third codon position were classified into “AT-type”, while organisms with high GC and low AT contents at the third codon position were classified into “GC-type”. Even organisms with similar AT and GC contents at the third codon position were classified into two types. In addition, correlations of certain amino acid concentrations with nucleotide contents differed between the two types, “AT-type” and “GC-type”, not only in prokaryotes but also in eukaryotes (Fig. 5a, b). Thus, all organisms are classified into two major groups: organisms with low GC and high AT contents at the third codon position and their derivatives, and organisms with high GC and low AT contents at the third codon position and their derivatives. The average amino acid compositions of “AT-type” and “GC-type” and their combination are shown in Fig. 7. The amino acid compositions of “AT-type” and “GC-type”, based on 72 and 73 complete genomes, respectively, are very similar to those based on S. aureus representing “S-type” and E. coli representing “E-type”. The pattern of amino acid composition obtained from all 145 organisms analyzed here resembles that obtained in various cells from bacteria to mammalian cells (Sorimachi 1999). This is consistent with genomes being constructed from putative small units with similar amino acid compositions, suggesting that synchronous mutations might occur over the genome (Sorimachi and Okayasu 2003; 2004a; 2005a, b; 2008a–c). This “star-shape” represents organisms existing on the earth. On the basis of Darwin’s theory, the origin of life has been assumed to be a single event (Mayer 1965, 2000); however, an opposite theory supposing a plural origin is also acknowledged (Woese 1998; Doolittle 1999). The present results, based on the two different codon usage patterns, indicate that all organisms have diverged in two main directions.

Fig. 7

Amino acid compositions. “GC-type” and “AT-type” based on 72 and 73 completed genomes, respectively, classified as shown in Fig. 4. “Genotype” was based on all 145 completed genomes, and “Phenotype” was calculated from cellular amino acid compositions obtained from 30 living cells from bacteria to mammalian cells (Sorimachi 1999; Sorimachi et al. 2000, 2001)