Background

Temperatures lower than 50°C are common in different habitats of the Earth, and most organisms are mesophiles with an optimal growth temperature (OGT) of 24-40°C. Life at temperatures higher than 55-60°C, which includes members of Archaea and Bacteria, can be found associated with low pH, high salinity or high-pressure environments. Evidence of eukaryotic life above 60°C is scarce [1]. Prokaryotes are grouped into moderate thermophiles, with OGT in the range of 50°C to 70°C, and hyperthermophiles, with OGT above 80°C. The hyperthermophilic members of Archaea and Bacteria, which can grow at 80°C-105°C, are unable to propagate at temperatures lower than their OGT [2].

There are not many clues as to how life can thrive in extreme environments. In general, the biochemistry of hyperthermophile proteins is very similar to that of mesophile proteins. When protein sequences and three-dimensional structures are compared, there are no significant differences among molecules: the sequences of homologous proteins from hyperthermophiles and mesophiles are 40 to 80% similar; their three-dimensional structures are superposable; and they have the same catalytic mechanisms [3]. Nevertheless, most enzymes from hyperthermophiles show an optimum catalytic activity above 100°C. Stability at high temperatures appears to result from very subtle synergistic and cooperative intra- and intermolecular interactions, or from extrinsic protectants [4,5]. Some findings relevant for explaining protein thermostability were: i) an increase in the number of hydrogen bonds and an extension of the inter-subunit ion-pair networks [6,7]; ii) an increase in the number of charged amino acids [8,9]; iii) a decrease in the length of surface loops and an increase in protein compactness [10,11]. On the other hand, reaching the fully functional and stable folded state in hyperthermophiles might require specific chaperones [3], such as the chaperonin systems, which are included in the molecular chaperone family of proteins [12].

In the present work, the proteome data from hyperthermophiles (HT), moderate thermophiles (T) and mesophiles (M) were compared, looking for characteristics that could be related to thermal adaptation and that would allow high- and low-temperature organisms and proteins to be distinguished. So far two parameters have been used: a) amino acid composition and amino acid couplets in each proteome, b) codon usage in the whole genome. The study was complemented by applying the same parameters to the analysis of two types of proteins: chaperonins and DNA ligases. These proteins were selected for analysis considering their thermal stability and their presence in all organisms. Chaperonins are potentially thermostable in all OGT groups, and amongst Hsps they are unique in being present in all three domains of life [13]. DNA ligases, on the other hand, are not necessarily thermostable in M but they are in HT and T. The results showed that high (E+K)/(Q+H) values were a characteristic of hyperthermophilic organisms and could be related to protein thermostability. Moreover, AGR codon bias for arginine was a signature for thermophiles and hyperthermophiles.

Results and Discussion

1. Proteome

The proteomes of 18 M, 4 T and 6 HT were analyzed for amino acid composition. In Table 1, the ratio (HT, T)/M corresponds to the average proportion of each amino acid in all HT and T (>60°C) organisms divided by its average proportion in all M (24-50°C). Only the amino acids with extreme ratios (> 1.30 and < 0.80) are listed. The amino acids preferred at high temperature were Glu, Lys, Tyr and Ile, and those preferred in mesophiles were Gln, His, Ala and Cys. Among the amino acids with extreme ratios we could highlight two couplets of preferred and avoided amino acids: E and Q, and K and H. These four amino acids form a homogeneous group with high polar requirements, > 8.40 [14]; are preferentially found in α-helices [15, 16]; have A at the second codon position; and, at least for Gln and His, the ranges in HT and T do not extensively overlap the ranges in M.
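The (HT, T)/M ratio described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extreme_ratios`, the dictionary layout and the composition values are hypothetical and are not taken from Table 1; only the cut-offs (> 1.30 and < 0.80) come from the text.

```python
def extreme_ratios(high_temp, meso, upper=1.30, lower=0.80):
    """Return amino acids whose (HT, T)/M ratio is above `upper` or below `lower`.

    `high_temp` and `meso` are lists of dicts mapping a one-letter amino acid
    code to its percentage in one proteome (hypothetical data layout).
    """
    selected = {}
    for aa in high_temp[0]:
        # Average proportion over all high-temperature (HT and T) proteomes
        mean_high = sum(p[aa] for p in high_temp) / len(high_temp)
        # Average proportion over all mesophilic proteomes
        mean_meso = sum(p[aa] for p in meso) / len(meso)
        ratio = mean_high / mean_meso
        if ratio > upper or ratio < lower:
            selected[aa] = ratio
    return selected


# Invented toy percentages, for illustration only (not real proteome data)
toy_high = [{"E": 8.0, "Q": 2.0, "L": 10.0}, {"E": 10.0, "Q": 2.0, "L": 10.0}]
toy_meso = [{"E": 6.0, "Q": 4.0, "L": 10.0}]
print(extreme_ratios(toy_high, toy_meso))
```

With these toy values only E and Q pass the filter, while L, with a ratio of 1.0, is discarded, which mirrors how only the amino acids with extreme ratios are listed.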

Table 1 Preferred and avoided amino acids from proteomes of 10 hyperthermophiles (HT) and thermophiles (T), and 18 mesophiles (M)
Table 2 Hyperthermophiles, Thermophiles and Mesophiles from Archaea (A), Bacteria (B) and Eukarya (E)

The idea that the physico-chemical properties of the most frequent amino acids are reflected in protein secondary structures constitutes the basic information for prediction methods. An analysis of prokaryotic and human proteins shows that Glu, Lys, Asp, Asn, Gln and His prefer α-helix and aperiodic structures, and that Ile, Leu, Val and Phe prefer β-sheets [16]. Moreover, measurement of the water accessible surfaces (WAS) of amino acids, calculated for 131 proteins from Bacteria and Archaea, showed that the WAS percentage of charged amino acids increased in hyperthermophiles, mainly due to Lys and Glu, and that this increase was accompanied by a decrease in Gln, Ala and His [9]. In several comparisons of organisms with different lifestyles, the protein sequences of hyperthermophiles included more charged residues (Glu, Arg, Lys) and fewer uncharged polar residues (Ser, Thr, Asn, Gln, His, Cys), increased residue hydrophobicity (Ile, Val), and increased residue volume (Tyr) [3,8,9].

Our comparison of 28 proteomes, from the three domains of life, did not show a correlation of Arg, Ser or Asn with temperature changes, but did show that the (HT, T)/M ratios were a good indicator of lifestyle, better than the mere abundance of amino acids (Table 1). So, considering the two couplets of preferred and avoided amino acids, the ratios E/Q and K/H were calculated for each one of the 28 proteomes. Figure 1 shows a clear correlation between lifestyle and (E+K)/(Q+H) ratios.

Figure 1

(E+K)/(Q+H) values for the proteomes of 28 organisms versus optimal growth temperature (OGT). (E+K)/(Q+H) ratio = %E (Glu) + %K (Lys) divided by %Q (Gln) + %H (His) in the same proteome. Amongst HT, OGT ranges from 80-100°C, in T from 60-65°C and in M from 24-37°C. The (E+K)/(Q+H) ratio discriminated hyperthermophilic, thermophilic and mesophilic lifestyles. The identification of the organisms is in Table 2.

The ratio (E+K)/(Q+H) was always greater than 4.5 in HT, between 3.2 and 4.6 in T and lower than 2.5 in M, showing that as the OGT decreases the proportion of E and K also decreases whereas the proportion of Q and H increases (Table 3). So, the (E+K)/(Q+H) ratios are an excellent indicator of lifestyle, since the correlation between OGT and the preference for some amino acid couplets allows HT, T and M to be discriminated across the three domains of life. Such results corroborate the proposition that adaptation to high temperatures might involve different mechanisms in HT and T [11].
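As a rough illustration of how the ranges reported above could be turned into a lookup, the following Python sketch maps a proteome ratio to a lifestyle label. The function name `classify_lifestyle` and the handling of values outside the reported ranges are assumptions made for this example; the text only states the observed ranges (> 4.5 for HT, 3.2 to 4.6 for T, < 2.5 for M), which overlap slightly between HT and T and leave a gap between M and T.

```python
def classify_lifestyle(ratio: float) -> str:
    """Map a proteome (E+K)/(Q+H) ratio to the ranges observed in the 28 proteomes.

    Hypothetical helper: the thresholds are the reported ranges, not a
    validated classifier.
    """
    if ratio > 4.6:
        return "HT"
    if ratio > 4.5:
        # The reported HT (> 4.5) and T (3.2 to 4.6) ranges overlap here
        return "HT or T"
    if ratio >= 3.2:
        return "T"
    if ratio < 2.5:
        return "M"
    # Between 2.5 and 3.2 no proteome was reported, so no label is assigned
    return "unassigned"


for value in (6.0, 3.5, 2.0, 2.8):
    print(value, classify_lifestyle(value))
```

Returning an explicit "unassigned" label for the 2.5 to 3.2 gap avoids forcing a call where the 28 proteomes provide no evidence.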

Table 3 E+K and Q+H contents (%) in the 28 proteomes and (E+K)/(Q+H) ratios for the proteome, chaperonin and ligase in each organism

Figure 2 shows that the differences observed among HT, T and M in the relative concentrations of E, K, Q and H could not be attributed exclusively to the G+C content of the organisms. Codons for E and K are, respectively, GAR and AAR, with a predominance of A over G, whereas codons for Q and H are, respectively, CAR and CAY (no predominance). Although the higher E+K values were observed in hyperthermophiles, which show G+C contents below 50%, organisms with similar G+C contents showed different E+K values. In M the G+C content ranges from 35% to 67% (Figure 2, Table 3).

Figure 2

Preferred amino acid couplets and G+C contents in HT, T and M. The percent of His (H) plus the percent of Gln (Q) and the percent of Glu (E) plus the percent of Lys (K), in each proteome, were plotted against G+C contents. The G+C content in HT and T was below 50%. Preference for E+K was not correlated with G+C content. The identification of the organisms is in Table 2.

2. Chaperonins and DNA ligases analysis

The (E+K)/(Q+H) ratio was calculated for the chaperonins GroEL (Bacteria), THS (Archaea) and TCP1 (Eukarya). As these proteins are thermostable, they could exhibit a profile similar to that of HT and T proteomes even when they originate from mesophiles. Figure 3A shows that the (E+K)/(Q+H) ratios for chaperonins in prokaryotes are usually within the limits of thermophilic and hyperthermophilic proteome values, that is, above 2.5. The (E+K)/(Q+H) values for chaperonins ranged from 3.32 to 7.64 among prokaryotes (Table 3). The only exceptions were V. cholerae (2.1) and Halobacterium (1.84). Amongst eukaryotes the (E+K)/(Q+H) values for chaperonins were above 2.5, except for A. thaliana (Table 3). A possible relationship between the (E+K)/(Q+H) ratio and protein thermostability, at least in prokaryotes, could be envisaged.

Figure 3

The (E+K)/(Q+H) values for chaperonins (3A) and ligases (3B) versus the (E+K)/(Q+H) values for the proteomes of HT, T and M. The (E+K)/(Q+H) ratios for the chaperonins and for the DNA ligases, described in Materials and Methods, were calculated in each one of the 28 organisms. In mesophilic prokaryotes the ratios for chaperonins followed the hyperthermophilic values (> 2.5), except for Vibrio cholerae (Vc) and Halobacterium sp (Hal). In mesophilic eukaryotes, only Homo sapiens (Hs) and Drosophila melanogaster (Dm) followed the rule. The ratios for DNA ligases always followed the proteome values.

If the relative proportions of E, K, H and Q were related to protein stability, we should expect that in mesophiles the (E+K)/(Q+H) values for DNA ligases would be similar to those of their respective proteomes. Such a profile would be strikingly different from that of the chaperonins. Figure 3B shows that the (E+K)/(Q+H) values for DNA ligases follow those of the 28 proteomes. In HT and T, the values ranged from 3.8 to 9.4, and in M the values were usually lower than 2.5, except in the Eukarya group (Table 3). These results corroborate the idea that the relative concentrations of these amino acid couplets in the composition of the peptides could be associated with their thermostability.

3. Codon bias for Arginine

As the preference for one synonymous codon over another is a long-known phenomenon, we looked for codon bias associated with temperature lifestyles. Striking differences related to Arg (R) codons were found between HT and T on the one hand and M on the other. Figure 4A shows that HT and T strongly preferred AGR at the expense of CGN codons, which are preferred by M. One exception in prokaryotes was Mycoplasma genitalium, with 58% of AGR and 48% of CGN, and in eukaryotes the exceptions were S. cerevisiae and C. elegans, with 69.5% and 55.8% of AGR, respectively (Table 4). Nevertheless, the % of AGR in these organisms was never as high as in hyperthermophiles and thermophiles (Table 4).
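The AGR versus CGN comparison can be illustrated with a minimal Python sketch that counts arginine codons in an in-frame coding sequence. This is not the Countcodon program used in this work; the function name `arginine_codon_usage` and the toy sequence are hypothetical, and the sketch assumes the sequence starts in frame.

```python
ARG_AGR = {"AGA", "AGG"}                # AGR codons (R = A or G)
ARG_CGN = {"CGA", "CGC", "CGG", "CGT"}  # CGN codons (N = any base)


def arginine_codon_usage(cds: str) -> dict:
    """Return the percentage of Arg codons that are AGR and CGN in one CDS."""
    cds = cds.upper().replace("U", "T")
    # Split into complete codons, ignoring any trailing partial codon
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    agr = sum(codon in ARG_AGR for codon in codons)
    cgn = sum(codon in ARG_CGN for codon in codons)
    total = agr + cgn
    if total == 0:
        raise ValueError("no arginine codons found")
    return {"AGR": 100.0 * agr / total, "CGN": 100.0 * cgn / total}


# Invented toy sequence: ATG AGA AGG CGT AGA TAA
print(arginine_codon_usage("ATGAGAAGGCGTAGATAA"))
```

For a whole genome, the same counts would be pooled over all coding sequences before the percentages are taken.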

Figure 4

Preferred codons for arginine (AGR versus CGN), according to lifestyle (A), and arginine utilization according to G+C content (B). A) HT and T preferred AGR codons whereas CGN codons were preferred by M. The mesophiles Mycoplasma genitalium (Mg) and Caenorhabditis elegans (Ce) showed no preference whereas Saccharomyces cerevisiae (Sc) preferred AGR codons. B) The percent of Arg in the proteome increased with the G+C content, except for the halophilic archaeon Halobacterium sp (Hal). The preference for CGN codons (black circles) cannot be attributed to the G+C content.

Table 4 Arginine and codon usage in each one of the 28 proteomes

Figure 4B shows that the % of Arg in the proteome of these organisms increases with the G+C content, and no correlation was found between the content of Arg and OGT. The preference for CGN codons cannot be attributed to the G+C content, since very different values of CGN were found for similar values of G+C. Compare, for instance, the G+C content in A. aeolicus (43.2%), A. fulgidus (43.2%), T. maritima (46.1%), Sulfolobus sp (35.7%), B. subtilis (43.5%), V. cholerae (47.6%), Synechocystis (47.6%) and H. influenzae (38%) with their values for CGN usage, which are 8.0%, 9.4%, 18.0%, 8.9%, 63.0%, 91.5%, 80.0% and 91.0%, respectively (Table 4). The data reveal a superimposed correlation: at similar G+C contents the preference among synonymous codons is determined by lifestyle, hot or non-hot.

Recently it was suggested that G+C content drives codon usage and that the relationship between G+C content and codon and amino acid usage is ahistorical [17]. Nevertheless, a broader connection with selective pressure based on the physicochemical properties of the amino acids might be considered. A transition in the second position of AGR converts Arg to Lys (AAR), which can be converted to Glu (GAR) by a transition in the first position. All these amino acids have high polar requirements (9.1, 10.0 and 12.5, respectively), and both Lys and Glu are preferred amino acids in hyperthermophilic organisms. Also, a transition in the second position of CGY codons converts Arg to His (CAY), which can be converted to Gln (CAR) by a transversion in the third position. Both His and Gln, with similar polar requirements (8.6 and 8.4), are preferred amino acids in mesophilic organisms. The codons AGR (Arg), AAR (Lys) and GAR (Glu) are constituted of RR dinucleotides, whereas the codons CGN (Arg), CAY (His) and CAR (Gln) are constituted mainly of YR and RY dinucleotides. A high correlation (higher than expected from a random combination of mononucleotides) between RR or YY dinucleotides and hyperthermophiles has been found [18]. All the connections discussed above could help to explain the adaptation of organisms to high temperatures, including the preference for AGR and the high (E+K)/(Q+H) ratio in hyperthermophiles.
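The mutational paths described above can be checked mechanically. The short Python sketch below classifies each single-base step as a transition or a transversion and reports the amino acid change; the helper names and the reduced codon table (only the codons needed for these paths, taken from the standard genetic code) are choices made for this illustration.

```python
PURINES = {"A", "G"}

# Subset of the standard genetic code covering the codons discussed above
CODON_TO_AA = {
    "AGA": "Arg", "AGG": "Arg", "CGT": "Arg", "CGC": "Arg",
    "AAA": "Lys", "AAG": "Lys", "GAA": "Glu", "GAG": "Glu",
    "CAT": "His", "CAC": "His", "CAA": "Gln", "CAG": "Gln",
}


def mutation_type(base_from: str, base_to: str) -> str:
    """Purine<->purine or pyrimidine<->pyrimidine is a transition; otherwise a transversion."""
    same_class = (base_from in PURINES) == (base_to in PURINES)
    return "transition" if same_class else "transversion"


def describe_step(codon_from: str, codon_to: str) -> tuple:
    """Return (amino acid before, amino acid after, codon position, mutation type)."""
    diffs = [(i, a, b) for i, (a, b) in enumerate(zip(codon_from, codon_to)) if a != b]
    if len(diffs) != 1:
        raise ValueError("codons must differ at exactly one position")
    position, base_from, base_to = diffs[0]
    return (CODON_TO_AA[codon_from], CODON_TO_AA[codon_to],
            position + 1, mutation_type(base_from, base_to))


# One representative codon per step of the two paths
for step in (("AGA", "AAA"), ("AAA", "GAA"), ("CGT", "CAT"), ("CAT", "CAA")):
    print(step, describe_step(*step))
```

The four steps reproduce the paths Arg to Lys to Glu (both transitions) and Arg to His (transition) to Gln (transversion).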

Conclusions

The results showed that i) the (E+K)/(Q+H) values for the proteome discriminated organisms according to their OGT: >4.5 for HT; 3.2 to 4.6 for T; <2.5 for M; ii) the high percent of E and K associated with the low percent of H and Q could be related to protein thermostability; iii) the AGR codon bias for Arg can be used as a signature for HT and T; iv) the (E+K)/(Q+H) ratio and the codon bias for Arg are apparently not related to phylogeny. HT members of the Bacteria domain show the same values as HT members of the Archaea domain; the values for T organisms are related to their lifestyle (intermediate temperature) and not to their domain (Archaea); and the values for M are similar in Eukarya, Bacteria and Archaea.

Materials and Methods

Organisms

Data from the complete genomes of 28 organisms were analyzed: 6 hyperthermophiles (1-6), 4 moderate thermophiles (7-10) and 18 mesophiles (11-28), listed in Table 2.

Loci for chaperonins

1. Pyrococcus abyssi (PAB2341), 2. Pyrococcus horikoshii (PH1256), 3. Aquifex aeolicus (AQ2200), 4. Methanococcus jannaschii (MJ0999), 5. Archaeoglobus fulgidus (AF2238), 6. Thermotoga maritima (TM0506), 7. Methanobacterium thermoautotrophicum (MTH0218), 8. Thermoplasma volcanium (TVN0507), 9. Thermoplasma acidophilum (Ta0980), 10. Sulfolobus (SS00862), 11. Escherichia coli (B4143), 12. Vibrio cholerae (VC2664), 13. Deinococcus radiodurans (DR0607), 14. Synechocystis (SLR2076), 15. Bacillus subtilis (YKKC), 16. Mycobacterium tuberculosis (MT0456), 17. Neisseria meningitidis (NMB1972), 18. Mycoplasma genitalium (MG0392), 19. Treponema pallidum (TP0030), 20. Pseudomonas aeruginosa (PA4385), 21. Haemophilus influenzae (HI0543), 22. Xylella fastidiosa (XF0615), 23. Halobacterium sp (CCTB), 24. Saccharomyces cerevisiae (YDR212W), 25. Arabidopsis thaliana (AT1g24510), 26. Drosophila melanogaster (gi7291004), 27. Caenorhabditis elegans (t21b10.7), 28. Homo sapiens (TCPB).

Loci for DNA ligases

1. Pyrococcus abyssi (PAB2002), 2. Pyrococcus horikoshii (PH1622), 3. Aquifex aeolicus (AQ0633), 4. Methanococcus jannaschii (MJ0171), 5. Archaeoglobus fulgidus (AF0623), 6. Thermotoga maritima (TM0100), 7. Methanobacterium thermoautotrophicum (MTH1580), 8. Thermoplasma volcanium (TVN1237), 9. Thermoplasma acidophilum (Ta1148), 10. Sulfolobus (SSO0189), 11. Escherichia coli (B2411), 12. Vibrio cholerae (VC0971), 13. Deinococcus radiodurans (DR2069), 14. Synechocystis (gi1651660), 15. Bacillus subtilis (YERG), 16. Mycobacterium tuberculosis (MT3148), 17. Neisseria meningitidis (NMB0666), 18. Mycoplasma genitalium (MG254), 19. Treponema pallidum (TP0634), 20. Pseudomonas aeruginosa (PA1529), 21. Haemophilus influenzae (HI1100), 22. Xylella fastidiosa (XI2556), 23. Halobacterium sp (LIG), 24. Saccharomyces cerevisiae (YOR005C), 25. Arabidopsis thaliana (AT1g49250), 26. Drosophila melanogaster (gi7299616), 27. Caenorhabditis elegans (c07h6.1), 28. Homo sapiens (LIG4).

Databases

The 28 proteomes used in this work are listed in Table 2. The data for the analysis of preferred amino acids in each proteome were obtained from The Institute for Genomic Research, TIGR (http://www.tigr.org), and from the Protein Extraction, Description, and Analysis Tool, PEDANT (http://pedant.mips.biochem.mpg.de/). The data for the analysis of codon usage in each genome were obtained from the Kazusa DNA Research Institute (http://www.kazusa.or.jp) via the Countcodon program.

(E+K)/(Q+H) ratio

The (E+K)/(Q+H) ratio was calculated as follows: the percent of Glu plus the percent of Lys in a proteome (or ORF) was divided by the percent of Gln plus the percent of His in the same proteome (or ORF). The ORFs were deduced from the loci cited above.
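The calculation described above can be sketched in Python as follows. The function name `ekqh_ratio` and the toy sequence are hypothetical; the sketch assumes a protein sequence in one-letter code, and for a whole proteome the residue counts would be pooled over all ORFs before the percentages are taken.

```python
from collections import Counter


def ekqh_ratio(sequence: str) -> float:
    """Return (%Glu + %Lys) / (%Gln + %His) for one amino acid sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty sequence")

    def percent(residue: str) -> float:
        return 100.0 * counts[residue] / total

    denominator = percent("Q") + percent("H")
    if denominator == 0:
        raise ValueError("no Gln or His in the sequence; ratio is undefined")
    return (percent("E") + percent("K")) / denominator


# Invented toy peptide (3 E, 3 K, 1 Q, 1 H out of 10 residues), not a real protein
print(ekqh_ratio("MEEKKQHAEK"))  # 3.0
```

Because the total length cancels out, the ratio depends only on the residue counts, so the same value is obtained from raw counts or from percentages.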