Abstract
In alignments of 1969 protein sequences, glycine and certain other amino acids were found concentrated at the most-conserved sites within ∼15 Å of the catalytic/active centers (C/AC) of highly conserved kinases, dehydrogenases or lyases of Archaea, Bacteria and Eukaryota. Lysine and glutamic acid were concentrated at the least-conserved sites furthest from their C/ACs. Logistic-regression analyses corroborated the “movement” of glycine towards and lysine away from their C/ACs: for every 10 Å moved away from the C/AC, the odds of a glycine occupying a site decreased by 19%, while the odds for a lysine increased by 53%. Average conservation of multiple sequence alignment (MSA) consensus sites was highest surrounding the C/AC and decreased steadily toward the models’ peripheries. Findings held with statistical confidence using sequences restricted to individual Domains or enzyme classes or to both. Our data describe variability in the rate of mutation and likelihoods for phylogenetic trees based on protein sequence data and endorse the extension of substitution models by incorporating data on conservation and distance to C/ACs rather than only using cumulative levels. The data support the view that in the most-conserved environment immediately surrounding the C/AC of taxonomically distant and highly conserved essential enzymes of central metabolism there are amino acids whose identity and degree of occupancy are similar to a proposed amino acid set and frequency associated with prebiotic evolution.
Background/Introduction
The circumstances surrounding the prebiotic assembly of amino acids into more complex molecules and polymers remain one of the most important questions in science (Lazcano 2006). In initial studies of 3-D models of essential and highly conserved phosphoglycerate kinase (PGK) sequences from Archaea, Bacteria and Eukaryota, we found previously unreported localizations of prebiotic amino acids both at the most-conserved sites surrounding the catalytic/active center (C/AC) and at the least-conserved sites most distant from the C/AC. G and V, and in some analyses D, R, T and S, were generally concentrated at the most-conserved sites proximal to the C/AC. K and E were concentrated at the distant, least-conserved sites. Alanine was typically evenly distributed. Higgs and Pudritz (2007) have indicated that early proteins were composed of a somewhat similar set (GADEV), that these amino acids could have formed useful “structures” in primitive organisms, and that modern proteins may still contain a signal related to an amino acid frequency of early evolution. Gulik et al. (2009) concluded that prebiotic short functional peptides were primarily made of GADV and assumed that some traces of these peptides still exist.
The likely first prebiotic assemblage was described as condensation products of glycine and DL-alanine with other α-amino acids (Brack and Orgel 1975). We considered that aggregations of small prebiotic amino acids at the most-conserved sites surrounding the catalytic/active centers of metabolically essential, highly conserved enzymes might represent a contemporary trace of the chemical characteristics of such a prebiotic amino acid assemblage. The aggregation of glycine at the most evolutionarily conserved sites might also relate to the evolution of essential enzymes. To relate our preliminary PGK findings more confidently to these reports, to prebiotic assembly and, conceptually, to the Last Universal Ancestor (if any such relationship exists), we applied several statistical approaches to alignment sets built from larger, taxonomically diverse collections of protein sequences from all Domains and several enzyme classes.
It was also necessary to study separately both amino acid conservation at the C/ACs and enzyme function in each Domain because it was reported that in contrast to eukaryotic genes essential bacterial genes were more conserved than non-essential genes (Jordan et al. 2002) and because protein sequence alignments and derived phylogenetic trees might be misinterpreted due to the inclusion of examples representing different enzymatic classes (Varfolomeev et al. 2005). Results might also be skewed by inclusion of both enzymes and non-enzymes (Varfolomeev et al. 2002) or by simultaneously analyzing both informational (e.g., rRNA) and operational (e.g., fructose-1,6-bisphosphate aldolase) sequences (Rivera et al. 1998), or by mingling sequences of, e.g., strictly intracellular, membrane or excretable enzymes.
To respond to these issues we collected 1969 unique sequences of twelve different operational enzymes. The proteins were primarily associated with central metabolism (e.g., glycolysis) and categorized as “household or house-keeping enzymes”, i.e., highly conserved proteins essential in maintaining basic functions for sustenance, e.g., phosphoglycerate kinase, pyruvate kinase, aldolase and lactate dehydrogenase. They were also recognized as constitutively expressed, globular (soluble) and intracellular. In Bacteria, house-keeping enzymes are described as slowly evolving and their genetic variations are believed to be relatively neutral (Hanage et al. 2006). We primarily studied glycolytic members belonging to three enzyme classes: lyases, transferases (phosphotransferases, kinases) and oxidoreductases (dehydrogenases) of the Archaea, Bacteria and Eukaryota (Enzyme Nomenclature Database <http://www.chem.qmul.ac.uk/iubmb/enzyme/>). For our examples, we will use the terms kinase(s), dehydrogenase(s) and lyase(s). After alignment, we found that the 4158 multiple sequence alignment (MSA) consensus amino acids and their sites, distributed over the 12 enzyme sets, could be additionally characterized by their hydropathy value, by their sites’ interatomic distance to the respective catalytic/active centers (C/AC) and, using the same MSAs, by a statistically significant conservation score at each site relatable to molecular structure. We also studied subsets of the 1969 sequences restricted to each Domain, to one of the three enzyme classes, or to their combinations.
Methods
Sequences
We studied the MSA consensus amino acid content of unique protein sequences in alignments of twelve highly conserved enzymes primarily of central metabolism. Prospective sequences were retrieved from public databases following criteria we used before (Wolf et al. 2004; Pollack et al. 2005). We endeavored to select non-redundant sequences from taxonomically diverse species (Etzold and Argos 1993; Gasteiger et al. 2003; Pruitt et al. 2007). All sequences were reviewed using the highly cross-referenced UniProt Knowledgebase (UniProtKB/Swiss-Prot, <http://www.expasy.org/>). When identified, sequences described, e.g., as “undecided”, “indefinite” or “mutations” were not used. Thirteen sequences described as “fragments” and many described as “probable” were used. The 1969 selected sequences representing the 12 enzymes were used in the construction of three major data sets called “Categories” in order to test the consistency and constancy of our major observations. These sequences are identified in this study as Category 1, 2 or 3 and their construction and general content are described below.
The individual enzymes and their enzyme class are identified by their Enzyme Commission (EC) designations (the number of sequences we studied is given in braces { }). Five of the twelve enzymes were transferases, subclass “transferring phosphorus-containing groups” or phosphotransferases (EC 2.7.x.x) (called “kinases”): 3-phosphoglycerate kinase (EC 2.7.2.3) (PGK) {141}, pyruvate kinase (EC 2.7.1.40) (PYK) {370}, adenylate kinase (EC 2.7.4.3) (ADK) {136}, nucleoside diphosphate kinase (EC 2.7.4.6) (NDK) {103} and acetylglutamate kinase (EC 2.7.2.8) (ACGK) {135}. Four enzymes were lyases, subclass “carbon–carbon cleaving” (EC 4.1.x.x) or subclass “carbon–oxygen cleaving” (EC 4.2.x.x) (“lyases”): deoxyribose aldolase (EC 4.1.2.4) (DERA) {81}, enolase (EC 4.2.1.11) (ENO) {213}, D-fructose-1,6-bisphosphate aldolase (EC 4.1.2.13) (FBPA) {169} and tryptophan synthase (EC 4.2.1.20) (α-subunit) (TRPA) {297}. Three enzymes were oxidoreductases, subclass “acting on either the CH-OH group of 1° and 2° alcohols and hemi-acetals” (EC 1.1.x.x) or “acting on aldehydes or oxo groups” (EC 1.2.x.x) (called “dehydrogenases”): alcohol dehydrogenase (EC 1.1.1.1) (ADH) {75}, L-lactate dehydrogenase (EC 1.1.1.27) (L-LDH) {119} and glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.12 and −.13, −.59) (GAPDH) {163}.
The Category 1 set contained the 1969 sequences of the 12 enzymes. The set was separated into six subsets, each containing only Archaea, Bacteria or Eukaryota sequences regardless of enzyme class, or only kinases, lyases or dehydrogenases regardless of Domain. The six subsets were aligned and analyzed (Results, Fig. 2). The Category 2 database contained the same 1969 sequences, but these were separated into the 12 homologous enzyme sets that were individually aligned and analyzed. The data in each were collated and averaged to produce 12 averaged data sets, each representative of one of the 12 enzymes (see below). In these 12 averaged sets we recovered a total of 4158 amino acid consensus sites that could be identified by their Domain as well as enzyme class (Results, Tables 1A and 1B). The 18 FASTA MSAs of Categories 1 and 2 with the identity of their sequences are available at our website (<http://www.stat.osu.edu/~dkp/ppp/>). The FBPA sequence set of Category 2 Bacteria lyase sequences is restricted to FBPA Class II and unclassified FBPA examples; it is devoid of FBPA examples identified as Class I. Category 2 ADH sequences were almost all examples of Zn2+/Fe2+ binding species.
In order to reduce any proposed or presumed effect in using mixed data sets composed of sequences of both widely varied taxonomies and enzyme classes we constructed six sets each containing sequences of only the same Domain and enzyme class (Category 3). For example, the Bacteria kinase collection contained sequences of five enzymes: ACGK, ADYK, NDK, PGK and PYK. Collections of each kinase were aligned and analyzed. The data from the five analyses were collated and averaged in order to obtain one file representing all Bacteria kinases. We similarly treated Bacteria lyases (DERA, ENO, FBPA, TRPAα), Bacteria dehydrogenases (ADH, GAPDH, L-LDH), Eukaryota kinases (ADYK, NDK, PGK, PYK), Eukaryota lyases (DERA, ENO, FBPA, TRPAα) and Eukaryota dehydrogenases (ADH, GAPDH, LDH). Archaea were not studied because of insufficient sequences.
“Scaffolds”
A particularly important component of our methodology was the use of “scaffolds”. Scaffolds are PDB protein sequences included in all sequence sets. Scaffolds are identified both in Supplementary File 1 (SF-1) and in every FASTA alignment by their italicized four-character PDB identifiers (Alignment File). The files are found at <http://www.stat.osu.edu/~dkp/ppp/>. Scaffolds were selected when both the amino acids and the spatial coordinates involved in their enzymatic mechanism or function were reported (Berman et al. 2002; Porter et al. 2004; Nagano 2005; Holliday et al. 2007). Scaffolds are essential in our analyses: they act not only as specific queries in determining the conservation scores described below but also as the structures used to calculate distances from a specific atom, reported to be mechanistically or catalytically involved in the protein’s catalytic action, to the Cα of each residue in the protein sequence. The catalytically critical atom is called the “anchor-atom”; its amino acid is called the “anchor-amino acid”.
We used additional scaffolds not included in SF-1 but like all the others they are noted in their respective MSAs. In Categories 1 and 2 they are: 1v6s [2643], 1rvg [1920], 1zmr [2539], 2e28 [238], 1w6t [6833], 3c4u [1820], 1a5z [2507] and 1rjw [10221]. Numbers in [ ] identify the involved atoms further described below using their PDB HETATM or PDB ATOM designation. Additional scaffolds used in Category 3 analyses are identified in Results, Table 2A and Table 2B footnotes.
Alignments: Multiple Sequence Alignment of, e.g., Nucleoside Diphosphate Kinase (NDK)
To describe our general procedure in more detail we will use the nucleoside diphosphate kinase (NDK) subset from the 1969 sequences of Category 2 as the example. (A second, more detailed example using ADYK/ADK that describes our methodology in a stepwise fashion is found in a “Support” folder at <http://www.stat.osu.edu/~dkp/ppp/>.) The Category 2 NDK subset contained 222 unique NDK sequences from all Domains. Five taxonomically diverse “scaffold” NDK sequences were included in the set: 1jxv (Eukaryota, human), 1k44 (Bacteria, Mycobacterium tuberculosis), 1nb2 (Bacteria, Bacillus sp.), 1pku (Eukaryota, Oryza sativa, rice) and 2az1 (Archaea, Halobacterium sp.). The inclusion of scaffolds of diverse origins here and in each of the other 11 enzyme sets imparts some taxonomic generality to their averaged data.
The NDK alignment and all others were prepared by MUSCLE (<http://www.drive5.com/muscle/>) (Edgar 2004) and opened in the Jalview editor (<http://www.jalview.org/>) (Clamp et al. 2004). This MSA is characterized as the NDK “founding” FASTA MSA. In addition to the NDK FASTA MSA, Jalview shows the MSA consensus sequence on a scale that identifies its initial residue as position “1” and is aligned with the five variously gapped reference scaffold sequences. Jalview also calculates an MSA consensus that can be aligned with each NDK scaffold included in the alignment. This permits the combination of all available NDK conservation and distance data into one file that can be appropriately averaged and assigned to the sites in the NDK consensus sequence, as further described below.
Assays for Conservation
Using the founding-NDK MSA in each case, the five NDK scaffold sequences were also individually entered as the “query” in the Consurf program (<http://consurf.tau.ac.il/>) (Landau et al. 2005). The program is linkable to the molecular structure of a scaffold or homologous 3D-template and reports conservation results for each PDB site: 1) its position in its PDB file, 2) its normalized conservation score and 3) a color representing the score. The Consurf output is also linked to the Protein Explorer program (Martz 2002) to produce a 3-D image of the PDB-scaffold query molecule used in the particular analysis, with sites colored according to their assigned conservation score color as seen in Results, Fig. 1.
The Consurf program calculates two values of site conservation: a conservation score and a “Zone” designation. The conservation scores are normalized by the Consurf program as standard scores so that the average score for all residues is zero and the standard deviation is one; in our studies these scores fall within the range of −3.00 to +3.00. The conservation scores are also scaled into 9 approximately equal-sized partitions or “Zones”. Low conservation scores (i.e., negative values) reflect high conservation and are designated by high-numbered Zones; Zone 9 is the most-conserved zone. As the conservation scores become more positive, that is, less conserved, the Zone number becomes smaller; Zone 1 is the least-conserved zone. We describe sites with positive conservation scores (low-numbered Zones) as “less-conserved” or “least-conserved” rather than as “variable”, the term originally used (<http://consurf.tau.ac.il/>). Conservation scores do not indicate the absolute magnitude of evolution but rather the relative degree of conservation of each amino acid position or site. The conservation score of each averaged consensus site was paired with the corresponding distance measure to the C/AC and then collated as described in the next two sections.
Distance Measures: Assays for Distance from the Catalytic/Active Center to the Cα of All Amino Acids in its MSA Consensus-Sequence
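The relationship between standard scores and Zones can be illustrated with a minimal Python sketch. It is not the Consurf implementation: the function names `standardize` and `assign_zones` are hypothetical, and the rank-based split into nine equal-sized groups is an assumption made only to show the direction of the scale (most negative score, highest Zone).

```python
import statistics

def standardize(scores):
    """Rescale raw scores to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mean) / sd for s in scores]

def assign_zones(z_scores, n_zones=9):
    """Rank-based binning into roughly equal-sized zones (an assumption).

    The most negative (most conserved) scores receive the highest zone
    number; the most positive (least conserved) scores receive Zone 1.
    """
    n = len(z_scores)
    order = sorted(range(n), key=lambda i: z_scores[i])
    zones = [0] * n
    for rank, idx in enumerate(order):
        # rank 0 is the most conserved site and maps to zone n_zones
        zones[idx] = n_zones - (rank * n_zones) // n
    return zones
```

Under this sketch a site with a strongly negative standard score lands in Zone 9 and a site with a strongly positive score lands in Zone 1, matching the convention described above.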
As noted before, in each PDB-scaffold sequence we chose an amino acid reported to be associated with the C/AC (the “anchor-amino acid”). We selected an atom of the anchor-amino acid also reported to be associated with enzymatic function, e.g., binding or catalytic function. The atom is defined by us as representing the C/AC and is called an “anchor-atom”. Using the Yasara program, we determined the distance from each anchor-atom to every other Cα in its PDB chain (<http://www.yasara.org>) (Krieger et al. 2002). The anchor-atoms of this study are listed in Supplementary File 1 (SF-1) (<http://www.stat.osu.edu/~dkp/ppp/>). There we identify three types of anchor-atoms, their amino acid host and their PDB position (ATOM or HETATM). The anchor-atoms were either a 1) non-datively bound metal cofactor (Mg2+, Mn2+, Zn2+), 2) an atom of the amino acid (E, K, H, G, R, N) or 3) an atom of a non-metal ligand known to be close to the C/AC and mechanistically involved in the catalysis (NAD+, NADP+, NAI, NBD). For NDK, the data for each of the five NDK scaffolds were collated and averaged as described in the next section.
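The distance calculation itself is simple Euclidean geometry on PDB coordinates. The sketch below is a standalone illustration, not the Yasara procedure used in the study; the helper names `parse_atoms` and `ca_distances` are hypothetical, and identifying the anchor-atom by its PDB atom serial number is an assumption. It relies only on the fixed-column layout of PDB ATOM/HETATM records.

```python
import math

def parse_atoms(pdb_text):
    """Yield one dict per ATOM/HETATM record, using PDB fixed columns."""
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            yield {
                "record": line[0:6].strip(),
                "serial": int(line[6:11]),
                "name": line[12:16].strip(),
                "res_name": line[17:20].strip(),
                "chain": line[21],
                "res_seq": int(line[22:26]),
                "xyz": (float(line[30:38]),
                        float(line[38:46]),
                        float(line[46:54])),
            }

def ca_distances(pdb_text, anchor_serial, chain):
    """Distance (in Å) from the anchor-atom to every C-alpha in one chain.

    Returns a dict mapping residue number -> distance. Only ATOM records
    named CA are used, so calcium HETATM records (also named CA) are skipped.
    """
    atoms = list(parse_atoms(pdb_text))
    anchor = next(a for a in atoms if a["serial"] == anchor_serial)
    return {
        a["res_seq"]: math.dist(anchor["xyz"], a["xyz"])
        for a in atoms
        if a["record"] == "ATOM" and a["name"] == "CA" and a["chain"] == chain
    }
```

The anchor may be either an ATOM record (an atom of an amino acid) or a HETATM record (a metal cofactor or ligand atom), which covers the three anchor-atom types listed above.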
Collation and Averaging of All Conservation and Distance Data: Addition of Hydropathy Values
Collation-averaging occurred in two steps. The first preparatory step attaches conservation and distance data separately for each scaffolding protein structure for the enzyme (e.g., the five NDK sets to the same founding NDK MSA consensus site). This is possible because the output data in conservation and distance sets are relatable to the same copy of the consensus sequence included in their construction. When complete, for each site in each of the separated scaffold-specific files there was: 1) the consensus amino acid identity at that site, 2) the site’s position in the same consensus sequence common to all, 3) its conservation score and conservation zone and 4) its distance measure to its C/AC. The second step in the collation-averaging process was combining conservation and distance data of the different scaffolds into one file. This requires relating them to the original consensus MSA and averaging across those subsets without a gap at each site. Conservation scores or distance measures found at the same MSA consensus site were averaged. Hydropathy values were assigned to each compiled MSA consensus site (Kyte and Doolittle 1982).
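The second collation step, averaging across scaffolds while skipping gapped positions, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: the function name `collate_average` and the use of `None` to mark a gap are hypothetical. The hydropathy table holds the Kyte and Doolittle (1982) values.

```python
from statistics import fmean

# Kyte-Doolittle hydropathy values, keyed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def collate_average(scaffold_values):
    """Average a per-site measure across scaffolds at each consensus site.

    scaffold_values: one list per scaffold, indexed by consensus position;
    None marks a gap (the scaffold has no residue aligned at that site).
    Returns the mean of the non-gap values per site, or None where every
    scaffold is gapped.
    """
    n_sites = len(scaffold_values[0])
    averaged = []
    for site in range(n_sites):
        present = [row[site] for row in scaffold_values
                   if row[site] is not None]
        averaged.append(fmean(present) if present else None)
    return averaged
```

The same function applies to conservation scores and to distance measures, since both are indexed by the shared consensus positions; a hydropathy value can then be looked up for each consensus residue with `KYTE_DOOLITTLE[residue]`.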
Binary Logistic Regression Analyses
To establish with some statistical confidence the occupational trend or “movement” of specific amino acids between Zone 9 and Zone 1 we used a logistic regression model (Hosmer and Lemeshow 2000). We made two estimates for the 4158 primary sites used in Categories 1 and 2: 1) the odds of each amino acid appearing at a specific site in different conservation zones and 2) the odds of each amino acid appearing at different linear distances from its anchor-atom.
These logistic regression models were carried out separately for each of 20 amino acids for each of the 12 enzymes. Models with the three pooled conservation zone groups (Zones 1, 2, 3 vs 4, 5, 6 vs 7, 8, 9) as the explanatory variable allowed for a variable overall amino acid usage across enzymes. They assumed an equal increase (decrease) in the log odds of occupancy in moving from group 1, 2, 3 to group 4, 5, 6 as in moving from group 4, 5, 6 to group 7, 8, 9. Models with distance as the explanatory variable assumed a linear increase in the log odds of occupancy (in units of 10 Ångstroms). Again, analyses were performed both for each enzyme separately and for the combined data under the assumption of varying overall usage but a common slope.
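One way to write the two pooled model families described above is sketched here. The symbols \(\alpha_{a,e}\), \(\alpha'_{a,e}\), \(\beta_a\) and \(\gamma_a\), and the 0/1/2 coding of the pooled zone groups, are notation introduced for illustration and should be read as assumptions rather than as the parameterization actually fitted.

\[ \log\frac{p_{a,e}(d)}{1-p_{a,e}(d)} = \alpha_{a,e} + \beta_a\,\frac{d}{10} \]

\[ \log\frac{p_{a,e}(g)}{1-p_{a,e}(g)} = \alpha'_{a,e} + \gamma_a\,g, \qquad g \in \{0, 1, 2\} \]

Here \(p_{a,e}\) is the probability that a consensus site of enzyme \(e\) is occupied by amino acid \(a\), \(d\) is the distance in Å from the anchor-atom and \(g\) indexes the pooled zone group. The enzyme-specific intercepts allow overall usage of an amino acid to vary between enzymes, while the single slope expresses the common trend. The odds ratio per 10 Å is \(e^{\beta_a}\) and the odds ratio per step between zone groups is \(e^{\gamma_a}\); because one \(\gamma_a\) serves both steps, the change in log odds from the first group to the second equals that from the second to the third. For instance, a 19% decrease in odds per 10 Å corresponds to \(e^{\beta_a} = 0.81\), and a 53% increase to \(e^{\beta_a} = 1.53\).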
Interpretation of the statistical significance of the data requires that the value 1.00 be absent from the calculated 95% confidence interval (95% CI) of the odds ratio. The presence of “1.00” in the interval indicates non-significance. Significant values (*) greater than 1.00 indicate a higher likelihood of occupancy in peripheral/low-conservation areas, while significant values less than 1.00 indicate a lower likelihood. We only show in Results the overall (averaged) odds values for each amino acid in both parameters in the 12 enzymes. The individual non-overall odds data for each of the 20 amino acids in each of the 12 enzymes are available in tabular form from the authors.
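The significance rule can be made concrete with a short Python sketch. It assumes a Wald-type interval computed from a fitted slope and its standard error, which is a common choice but is not stated in the text; the function names and the example numbers in the usage note are hypothetical.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Return (odds ratio, lower, upper) for a logistic-regression slope.

    beta is the fitted log-odds slope (e.g., per 10 Å) and se its standard
    error; z = 1.96 gives an approximate 95% Wald interval (an assumption).
    """
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))

def is_significant(lower, upper):
    """True when the confidence interval excludes 1.00."""
    return not (lower <= 1.0 <= upper)
```

For example, `odds_ratio_ci(math.log(1.53), 0.1)` returns an interval lying entirely above 1.00, which would be read as a significantly higher likelihood of occupancy toward the periphery, whereas a slope of zero yields an interval that contains 1.00 and is therefore non-significant.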
Additional Analyses
As further confirmatory assessment of our methodology, we made two alternative scorings of conservation and distance measures at each MSA site using traditional measures of distributional diversity: the Shannon-entropy index (Shannon 1948) and the Simpson-diversity index (Simpson 1949). The Shannon-entropy index for a site is given by \( -\sum_i P_i \ln\left(P_i\right) \), where \( P_i \) denotes the proportion of all amino acids that are of type \( i \) at that site. The Simpson-diversity index is given by \( \sum_i P_i^2 \). We came by either analysis to the same qualitative conclusions in terms of the association of conservation with either amino acid occupancy or with distance to the C/AC (data not shown).
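Both indices are straightforward to compute for a single alignment column. The following Python sketch shows one way to do so; the function names are hypothetical, and ignoring gap characters (`-`) when forming the proportions is an assumption, since the treatment of gaps is not described.

```python
import math
from collections import Counter

def proportions(column):
    """Proportion of each residue type in one alignment column.

    Gap characters ('-') are ignored (an assumption).
    """
    residues = [c for c in column if c != "-"]
    counts = Counter(residues)
    total = len(residues)
    return [n / total for n in counts.values()]

def shannon_entropy(column):
    """Shannon entropy, -sum(P_i * ln(P_i)); 0 for a fully conserved site."""
    return -sum(p * math.log(p) for p in proportions(column))

def simpson_index(column):
    """Simpson index, sum(P_i ** 2); 1 for a fully conserved site."""
    return sum(p * p for p in proportions(column))
```

The two indices run in opposite directions: a fully conserved column gives a Shannon entropy of 0 and a Simpson index of 1, while a more variable column gives a larger entropy and a smaller Simpson index.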
Results
Clustering of Glycine, Lysine and Glutamic Acid Associated with Location and Conservation in an MSA Model of 3-Phosphoglycerate Kinase
In an initial analysis of a set of 3-phosphoglycerate kinase (PGK) protein sequences (Fig. 1), the greatest concentrations of specific, most-conserved amino acids were associated with the C/ACs and ligands, while concentrations of other, least-conserved amino acids were apparently associated with a peripheral area (Zone) most distant from the C/ACs and ligands. In the most-conserved Zone 9 (Fig. 1b), which is associated with the catalytic/active center (C/AC), six amino acids constituted 68% of all the identified amino acids. Glycine was the “predominant amino acid” (PDAA) at 26.2%, followed by arginine and aspartic acid at 9.2% each and threonine, serine and glutamic acid at 7.7% each (Figs. 1c, d). Glycine has been localized by others at protein surfaces (Naor et al. 1996; Fukuchi and Nishikawa 2001), at both the inside and the surface of proteins, or as relatively evenly distributed between the two (Miller et al. 1987). Our PGK glycine distributions differed, however: in our analyses glycine was concentrated at the core. We also found that the six most hydrophilic amino acids (R, K, D, E, N, Q; hydropathy range of −4.5 to −3.5; Kyte and Doolittle 1982) constituted 36.9% of Zone 9, while the most hydrophobic amino acids (M, C, F, L, V, I), within the hydropathy range of 1.9 to 4.5, comprised 9.2%.
In Fig. 1a, the area of least-conserved sites (Zone 1) clearly formed the periphery of the protein and was predominantly occupied (56.9%) by three amino acids: lysine (K) 25.9%, glutamic acid (E) 22.4% and aspartic acid (D) 8.6%. Lysine’s particular surface association has been reported (Rose et al. 1985; Varfolomeev and Gurevich 2001; Brooks and Fresco 2003). In our case, K (and E, and D) was also the “predominant amino acid” (PDAA) found in peripheral-surface areas of the PGK phylogenetic model. Additionally, K was principally found at least-conserved sites.
Figure 2a represents the analyses of alignments containing Category 1 sequences separated into Domains, each containing all three enzyme classes. Figure 2b similarly shows data separated into kinases, lyases or dehydrogenases, each containing all Domains. Figure 2c, without individual data points, shows the superimposed analytical data of all six analyses. Each figure shows non-parametric confidence kernels that enclose 68.27% of all unaveraged, uncollated consensus amino acids in each alignment from each set. Marginal density curves of the total distributions are shown. The data indicate that the sites of Archaea, Bacteria, Eukaryota, kinases, dehydrogenases and lyases are both similarly conserved and similarly distanced from their respective C/AC, and that there is a similar degree of association between these two variables. The regression lines (Lowess, tension 0.5) are also very similar. There are some differing levels of variability in the distance distribution by enzyme class that are attributable to the larger size of some of the proteins. We interpret these findings as indicating some overall statistical similarity, as we also found that the distributions of most sites identified by their conservation scores and proximity to the C/AC in each Domain or in each enzyme class are similar. The Category 1 sequence set was used only for Fig. 2.
Figure 3 compares the distribution of all 4158 consensus amino acids derived from the Category 2 analyses. In these analyses data from Archaea, Bacteria, Eukaryota, kinases, dehydrogenases and lyases are separated by their presence in the most- or least-conserved Zones, 9 or 1, respectively. The Figure insets include the average data for each of the four groups: the averaged conservation score (−3 = most-conserved to +3 = least-conserved) and the average distance from the Cα of each consensus amino acid to its C/AC, or average Å (Y d ) ± std. The data in Fig. 3 indicate that, regardless of enzyme class or Domain, lysine, glutamic acid and alanine are predominant in the least-conserved Zone 1 most distant from the C/AC, while glycine, aspartic acid and alanine are predominant in the most-conserved Zone 9 proximal to the C/AC. The intermediate conservation zones are not shown. There is also a pronounced elevation of arginine content in Archaea in the least-conserved Zone 1 (Fig. 3, B1, red bar).
Clustering of Specific Amino Acids at Catalytic/Active Centers and Peripheries of Twelve Enzyme Models: Relationships to Conservation Zones
Tables 1A, 1B, Figs. 4 and 5 show the results of further analyses of the 4158 averaged consensus sites of each of the 12 enzymes of Category 2. Table 1A shows amino acid averaged conservation scores (Zones) and Table 1B, the averaged distances to their C/ACs.
In Table 1A, we computed the % occupancy and the distances to the C/ACs in nine conservation zones rather than by conservation scores. The ordered overall frequency is: A, G/L, V, E, K, I/D, T, P, S, R, N, F, Y, M, H, Q, C, W. However, only the twelve most frequently counted consensus or predominant amino acids (PDAAs) of our study (A through R) are emphasized and highlighted in the two Tables. In Table 1A, the highest (top 3–4) concentrations of the 12 major amino acids in each conservation zone are: Zone 1 (the least-conserved sites), K, E, A; Zone 2, E, K, A; Zone 3, E, L, K; Zone 4, L, A, E; Zone 5, L, A, V; Zone 6, L, V, A/I; Zone 7, A, V, I, L; Zone 8, G, A, V, I; and Zone 9 (the most-conserved sites), G, D, A.
Table 1B shows that the decreasing average zonal distance of the amino acids towards their C/AC is a smooth transition averaging about 1.25 Å per zone. The average distance measure (Y d ) of the amino acids in the most-conserved sites (Zone 9) to the Cαs of their C/AC amino acid atoms is (avg. ± std., n): 14.45 Å ± 0.242, 551, while in the least-conserved sites (Zone 1) it is 25.77 Å ± 0.282, 624. The Tables indicate that there is a consistent decrease in distance of Cα sites to more conserved zones that occurs regardless of the changing distributions of the averaged consensus or predominant amino acid (PDAA) occupants.
Figure 4 shows the analyses of the 4158 Category 2 consensus amino acids also studied in Tables 1A and 1B. The xyz-axes of the 3D-histograms are: occupancy (% of total) vs average Ångstroms to the respective C/AC vs average site conservation score. The histogram bars are also identified by their conservation zone. Zone 9 are the most-conserved sites and Zone 1, least-conserved sites. Panels A–C, show that concentrations of glycine, serine and threonine are in most-conserved sites and are concentrated at positions nearest the C/AC. Panels D–F, show that lysine, glutamic acid and perhaps alanine are concentrated at sites least-conserved and furthest from the C/AC. Panels G–I, show isoleucine, valine and leucine concentrated in interior regions, while in Panels J–L, proline, aspartic acid and arginine are concentrated at both conservation extremes.
Figure 5 separates the 4158 Category 2 sequence data used for Tables 1A, 1B and Fig. 4, but here the 12 averaged enzyme sets were first separated into kinases, dehydrogenases or lyases and analyzed as for Fig. 4 (MSAs in Supplementary File 1 at <http://www.stat.osu.edu/~dkp/ppp/>). Figure 5 shows the distributions of glycine, serine, lysine and glutamic acid in the separated kinases, lyases and dehydrogenases. Glycine and serine regardless of the enzyme class are concentrated at most-conserved sites nearest the C/AC, Panels A–F. Lysine and glutamic acid regardless of the enzyme class are concentrated at least-conserved sites furthest from the C/AC, Panels G–L.
Analyses Using Sets Restricted to Single Domain Single Enzyme Class Sequences: “Two-Way Analyses”
Analyses of sets containing Category 3 sequences belonging to one Domain and one enzyme class are found in Table 2A (Bacteria) and Table 2B (Eukaryota). Each Table has two parts: Part A, “Conservation Zones-Scores and Average Distances (Ångstroms) to C/AC”, and Part B, the “% Occupancies” of 20 amino acids in conservation Zones 9 and 1. Archaea were not studied because of insufficient sequences. The data show that the three prominent (top) amino acids (G, V, D) in the most-conserved sites (Zone 9) in the three enzyme classes of Bacteria are all within 15.78 ± 2.15 Å of the C/AC. Their average % occupancies are: G (16.98 ± 0.18), V (11.93 ± 3.3), D (9.49 ± 1.17). Similarly, in the least-conserved sites (Zone 1) the prominent amino acids (E, K, A) in the three enzyme classes of Bacteria are all within 26.84 ± 3.73 Å of the C/AC. Their average % occupancies are: E (20.17 ± 5.38), K (18.84 ± 3.31), A (14.75 ± 7.00). The prominent amino acids (G, A) in the most-conserved sites (Zone 9) of the three Eukaryota enzyme classes are all within 17.13 ± 0.81 Å of the C/AC. Their average % occupancies are: G (15.99 ± 2.37), A (9.81 ± 0.71). In all the least-conserved Eukaryota sites (Zone 1) the prominent amino acids (K, E, A, D) are within 26.00 ± 0.42 Å of the C/AC. Their average % occupancies are: K (18.3 ± 3.65), E (11.57 ± 1.92), and both A and D (8.91 ± 0.81). The data indicate that the prominent amino acids in each category are similar even when each data set is restricted to sequences of both the same Domain and the same enzyme class. Further, these observations are in agreement with previous analyses.
To this point, we believed the variously analyzed data were consistent and indicated that the structural distribution of the averaged consensus or predominant amino acid (PDAA) sites and the degree of their evolutionary conservation are not random: there appeared to be groups or sets of amino acids significantly concentrated at both most-conserved and least-conserved sites and at specific distances from the C/ACs regardless of their Domain or identity as one of the three enzyme classes (Table 1A, Figs. 3, 4, 5).
The data in Tables 2A and 2B support the consistency of our findings: regardless of whether the sequence set contained only examples of kinases, dehydrogenases or lyases from either Bacteria or Eukaryota (Category 3), G, V and D appeared to be moving closer to the C/AC in the most-conserved sites, and K, E and A appeared to be moving further from the C/AC in the least-conserved sites. As an additional statistical corroboration of “progression” or “movement” we examined amino acid occupancy levels and the distances from their C/ACs using binary logistic regression analyses of the entire set of 4158 averaged consensus sites obtained from the study of Category 2 sequences.
Binary Logistic Regression: Changing Odds of Amino Acid Occupancy with Distance from Catalytic/Active Centers and Movement Between Low to Middle and Middle to High Conservation Zones
The binary logistic regression analyses in Table 3 show by two parameters statistically significant overall estimates of the rate (fold increase) and the 95% confidence intervals of changing odds of amino acid occupancy using the 4158 Category 1 and 2 sequence derived examples. The “overall” logistic regression data in Table 3 are derived from pooling all amino acid sites across the twelve enzymes and allow for possible enzyme-to-enzyme differences in occupancy odds but assume the same relationship to distance or conservation across enzymes. The tabular data for 20 amino acids in each of the 12 enzymes are available from the authors.
Table 3 shows by one parameter the likelihood of an amino acid occupying a site when moving between conservation zone groups (from Zones 1, 2, 3 (low-conservation) to Zones 4, 5, 6, or from Zones 4, 5, 6 to Zones 7, 8, 9 (high-conservation)). The other parameter, irrespective of conservation, indicates the likelihood of occupancy for every 10 Å moved away from the C/AC. The data are arranged by the decreasing polarity of the amino acids according to the Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982).
Lysine, glutamic acid and leucine show statistically significant likelihoods (high odds) of occupancy per zone moving toward less-conserved sites. By the same parameter, glycine, isoleucine, serine, threonine, asparagine and valine demonstrated significantly low odds, that is, an increasing likelihood of occupancy toward more-conserved and C/AC-associated sites.
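One way to write the two kinds of model described above is sketched here. This is an illustrative reading of the text, not the authors' stated specification: all symbols are introduced here for the purpose, and the enzyme-specific intercept reflects the statement that occupancy odds may differ between enzymes while the relationship to distance or conservation is shared.

```latex
% Y_{ia} = 1 if consensus site i is occupied by amino acid a, else 0
% e(i)   = enzyme containing site i
% d_i    = distance (in Angstroms) from site i to the C/AC
% g_i    = conservation zone group of site i, coded 1, 2, 3 in the tabulated direction
\[
\log\frac{P(Y_{ia}=1)}{1-P(Y_{ia}=1)} = \alpha_{a,e(i)} + \beta_a\,\frac{d_i}{10},
\qquad
\log\frac{P(Y_{ia}=1)}{1-P(Y_{ia}=1)} = \alpha'_{a,e(i)} + \gamma_a\,g_i
\]
% e^{\beta_a}  : fold change in odds for every 10 Angstroms away from the C/AC
% e^{\gamma_a} : fold change in odds per step between zone groups
```

Under this reading, an odds ratio above 1 corresponds to a positive slope and an odds ratio below 1 to a negative slope.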
Examination of the likelihood of amino acids moving 10 Å away from the C/AC shows statistically significant likelihoods (high odds) of lysine, glutamic acid and alanine occupying sites moving toward the periphery. However, for glycine, valine, threonine and isoleucine the likelihoods are statistically low, indicating again for these amino acids an increasing likelihood of occupancy moving toward more-conserved C/AC associated sites.
We conclude from these logistic regression analyses of data from three different enzyme groups and all Domains that glycine, probably threonine, and hydrophobic isoleucine and valine are preferentially associated with sites of highest conservation and proximity to their C/AC. Hydrophilic lysine and glutamic acid are preferentially associated with sites of lowest conservation and greatest distance from their C/AC. There may be other tendencies of movement and occupancy away from or towards the C/AC. However, for these concerns and for the study of infrequently recorded amino acids we do not have sufficient data to form any firm statistical conclusions; we estimate that this would require the study of considerably more (∼12–16 K) enzyme sequences and computations.
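To make the per-10 Å odds concrete, the following minimal sketch converts the changes quoted in the Abstract (odds for glycine decreased by 19% and odds for lysine increased by 53% for every 10 Å away from the C/AC) into odds multipliers at another distance. It assumes, as a logistic model does, that the log-odds change linearly with distance; the function name and the 20 Å example distance are illustrative choices, not values from the study.

```python
import math

def odds_multiplier(percent_change_per_10A, distance_A):
    """Multiplicative change in odds after moving distance_A away from the C/AC.

    Assumes log-odds are linear in distance, so the per-10 Å odds ratio
    compounds: OR(d) = OR(10) ** (d / 10).
    """
    or_per_10 = 1.0 + percent_change_per_10A / 100.0
    return or_per_10 ** (distance_A / 10.0)

# Percent changes per 10 Å as quoted in the Abstract.
glycine_20A = odds_multiplier(-19.0, 20.0)  # 0.81 squared
lysine_20A = odds_multiplier(53.0, 20.0)    # 1.53 squared

# Equivalent logistic-regression slopes per Å (log odds ratio divided by 10).
beta_gly = math.log(0.81) / 10.0
beta_lys = math.log(1.53) / 10.0
```

On these assumptions, moving 20 Å outward multiplies the odds of glycine by roughly 0.66 and the odds of lysine by roughly 2.34.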
Hydropathy
Figure 6 shows the levels of occupancy as they relate to site conservation, enzyme class and hydropathy. The three panels (A, B, C) describe the contents of different conservation Zones: Zones 1, 5 and 9. The stacked bars show the total % occupancy of amino acids in each enzyme class vs their hydropathy index. Each bar shows the standard deviation of all occupancy values of its three rectangular components. The small size of these standard deviations illustrates the consistent nature of these findings across enzyme classes. The sum of the heights of the same color in each panel (A, B or C) is 100%. Panel A shows conservation Zone 1, sites least-conserved and most distant from the C/AC. The Yd (average distance in Ångstroms of all Zone 1 sites to their C/ACs) is ∼26 Å: lysine (15%), glutamic acid (12%) and alanine (13%). Panel B shows conservation Zone 5, interior “mid”-conserved, distance from C/AC, Yd = ∼21 Å: leucine (14%), alanine (13%) and valine (12%). Panel C shows conservation Zone 9, most-conserved, closest to the C/AC, Yd = ∼15 Å: glycine (18%), aspartic acid (10%), alanine (9%). The % occupancy of the most polar amino acids (RKDE) is 40% in Zone 1, 23% in Zone 5 and 27% in Zone 9, while the % occupancy of the “non”-polar amino acids (FLVI) is 19% in Zone 1, 34% in Zone 5 and 14% in Zone 9. The three panels illustrate the changing hydropathic content in the three zones moving towards the C/AC, from Zone 1 to Zone 9. Noteworthy are the predominance of polar amino acids in Zone 1, the predominance of non-polar amino acids in Zone 5 and the predominance of relatively “neutral” glycine in Zone 9. We did not find any significant distinctions attributable to Domain or the enzyme classes.
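The grouping by hydropathy used above can be sketched as follows. The index values are the published Kyte-Doolittle scale; the helper names are hypothetical, and the only occupancy figures used are the three Zone 1 values quoted in the text, so the group total computed here is deliberately partial rather than a reproduction of Figure 6.

```python
# Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982), one-letter codes.
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}

def by_decreasing_polarity(residues):
    """Order residues from most polar (lowest index) to least polar."""
    return sorted(residues, key=lambda aa: KYTE_DOOLITTLE[aa])

def group_occupancy(occupancy, group):
    """Sum % occupancy over a group of residues; absent residues count as 0."""
    return sum(occupancy.get(aa, 0.0) for aa in group)

# Only the three Zone 1 values quoted in the text; other residues are omitted.
zone1_quoted = {"K": 15.0, "E": 12.0, "A": 13.0}

# Lysine plus glutamic acid alone (R and D are not quoted individually).
ke_share = group_occupancy(zone1_quoted, "RKDE")

# The four least polar residues on this scale.
least_polar = by_decreasing_polarity(KYTE_DOOLITTLE)[-4:]
```

With the quoted values, lysine and glutamic acid together account for 27 of the 40% reported for the RKDE group in Zone 1, and the four least polar residues on the scale are F, L, V and I, the “non”-polar group named in the text.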
Discussion
Sampling issues and phylogenetic effects are a potential source of bias in any study relying, as ours does, on hundreds of distantly related taxa. However, we believe that the distribution of the twenty amino acids in this study is not likely to be strongly affected, based on the consistency observed across the three Domains and the different enzyme classes. Our data suggesting Domain consistency (Figs. 2a, 2c and 3B1, 3B2) appear contradictory to other reports (Pe’er et al. 2004; Bogatyreva et al. 2006). We attribute the differences to the specificity of the sequence sets we employed. We studied only aligned sets of protein sequences of twelve highly conserved, operational, mostly constitutive enzymes of central metabolism that were separated by their enzyme class and/or Domain. MSA site conservation scores were related to the distance to their functional catalytic centers and to hydropathy indexes to further identify and characterize patterns of consistency. We believe that our general 12-enzyme findings would be obscured with less sequence discrimination.
Varfolomeev and co-workers (Varfolomeev et al. 2001, 2002) used the Shannon entropy values of MSA sites to determine the relative degree of conservation in hydrolases, an enzyme class we did not study. Using the same procedure to measure conservation, we reached qualitatively identical conclusions (see Methods). These authors (op. cit.) characterized aspartic acid as highly conserved. We found that, depending on the data set, concentrations of aspartic acid may be localized in both the least-conserved Zone 1 and the most-conserved Zone 9 (Table 1A and Fig. 4).
Two studies have characterized various amino acids as either “inside” or “buried” or “accessible” or “surface accessible” (Janin 1979; Miller et al. 1987). Their findings are similar to ours. Glycines were described generally as found “inside” or “buried” or “core” and lysines and glutamic acids were described as “accessible” or “surface accessible” or in “peripheral areas”. The interatomic distances of the amino acids, conservation scores, enzyme class or taxonomic relationships were not reported. Unlike these authors, we did not use any measures of surface, inside volumes or solvent-accessible surface area values in our determinations.
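As a minimal sketch of the entropy-based measure referred to here, the Shannon entropy of a single alignment column can be computed as below. The function name, the choice of base 2 and the decision to ignore gap characters are assumptions made for illustration; the Methods of the study should be consulted for the scoring actually used.

```python
import math
from collections import Counter

def column_entropy(column, base=2.0):
    """Shannon entropy of one MSA column given as a string of residues.

    Lower entropy means higher conservation. Gap characters ('-') are
    ignored here, which is an assumption of this sketch.
    """
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log(n / total, base) for n in counts.values())

# Toy columns, not taken from the study's alignments.
conserved = column_entropy("GGGGGGGG")   # a single residue type
split = column_entropy("GGGGKKKK")       # two residue types, equally frequent
variable = column_entropy("GKEADVST")    # eight different residue types
```

A fully conserved column scores zero, and the score rises as more residue types share the site more evenly.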
Other studies have calculated the distance (Ångstroms) of atoms to the nearest surface water or solvent-accessible neighbor protein (Chakravarty and Varadarajan 1999; Pintar et al. 2003). Estimations of conservation in families of protein folds (e.g., Rossmann fold, immunoglobulin fold, TIM barrel) were reported by the same authors. The “atom-depth” techniques measure the mean residue depth from the “inside-out”; our procedures are complementary in that they measure the distances from a single locus, the “anchor-atom” of the C/AC, to the Cα of an amino acid, i.e., “inside-in”. In one of these reports, using their “atom-depth” algorithm, the authors studied 136 non-homologous single-sequence PDB protein crystal structures (Pintar et al. 2003). The sequence set was selected from an apparently mixed collection of 301 highly curated but mostly unidentified representatives. They reported the atom-depth distances of all 20 amino acids: K (with the lowest Ångstroms, described as nearest the surface) was followed by E<D<Q<R<N<P<S<G<T<H<A<Y<C<M<W<L<F<V<I. Isoleucine (I), with the highest Ångstroms, was furthest from the surface. The authors (op. cit.) concluded that certain amino acids occur at greater depths than others and, as measured by sequence entropy, that the deepest residues are most conserved. Our study is in general agreement with these findings using entirely different methods and parameters. Another method was reported to more fully express the 3-D character of the atom-depth measure by also taking into account the overall size and shape of the protein (Varrazzo et al. 2005).
The determination of the interatomic distance between a functional enzyme center and the Cα of mutant sites, a procedure very similar to ours, was reported in an extensive study designed to select mutations and improve enzyme properties (Morley and Kazlauskas 2005). Others have also described catalytic residues and their local environment as highly conserved (Zvelebil et al. 1987; Dean and Golding 2000; Bartlett et al. 2002). The concept of an “environment” that we analogize with our ∼30 Å diameter most-conserved Zone 9 was suggested by reports indicating that more effective functional changes (enantioselectivity, substrate specificity, new catalytic activity) are associated with amino acid substitution close to catalytic sites. However, there is evidence that amino acid replacements outside of reported active sites can also affect not only specificity but catalytic efficiency (Dean and Golding 2000; Lichtarge and Sowa 2002; Zhang et al. 2004; Morley and Kazlauskas 2005). We emphasize that most-conserved sites are not found only at or near the C/AC locus. We consistently found some most-conserved consensus sites at the sites most distant from the C/AC that were occupied by, for example, glycine. We also found most-conserved sites closest to the C/AC that were occupied by lysine. These few most-conserved, variously localized examples may belong to a reported dynamically coupled network effecting kinetic events that promote catalysis (Hammes-Schiffer 2002; Benkovic et al. 2008).
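The anchor-atom to Cα measurement described above amounts to a Euclidean distance between two sets of 3-D coordinates. The sketch below illustrates the idea only: the coordinates, residue labels and helper names are hypothetical, and the 15 Å cutoff is taken from the approximate radius quoted in the text for the most-conserved environment.

```python
import math

def distance(a, b):
    """Euclidean distance between two (x, y, z) points, in the input units."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distances_to_anchor(anchor_xyz, ca_coords):
    """Distance from the C/AC anchor-atom to each residue's C-alpha atom.

    ca_coords maps a residue label to its C-alpha (x, y, z) in Ångstroms.
    """
    return {res: distance(anchor_xyz, xyz) for res, xyz in ca_coords.items()}

# Hypothetical coordinates chosen for easy arithmetic, not from any structure.
anchor = (0.0, 0.0, 0.0)
ca_atoms = {"G10": (3.0, 4.0, 0.0), "K200": (0.0, 20.0, 21.0)}

d = distances_to_anchor(anchor, ca_atoms)
within_15 = sorted(res for res, dist in d.items() if dist <= 15.0)
```

In this toy case the glycine lies 5 Å from the anchor-atom and falls inside the 15 Å environment, while the lysine lies 29 Å away.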
We emphasize only glycine because, as a major observation of our study, in almost every instance we find it to be the most abundant residue at the most-conserved sites that surround the C/AC in different classes of essential enzymes and distant taxonomies. Glycine may be the earliest amino acid (Eigen and Schuster 1978; Trifonov 2000, 2004). Conserved glycines were reported to be located near the catalytically active residues of five hydrolases (Varfolomeev et al. 2002). Fermentation of glycine was described as the most ancient catabolic pathway (Clarke and Elsden 1980), although this opinion was questioned (Conchillos and Lecointre 2005). Glycine is described as “indispensable” in any prebiotic scenario (Suwannachot and Rode 1999); however, glycine’s catalytic role is uncommonly reported. An essential functional role for glycine is reported in enzymes that contain glycyl radicals like formate acetyltransferase I (pyruvate formate-lyase) (MACiE M0030, EC 2.3.1.54) (Holliday et al. 2007). Glycyl radical enzymes have been found in obligate and facultative anaerobic Archaea, Bacteria and Eukaryota where they serve as biocatalysts in anoxic environments (Selmer et al. 2005; Lehtiö et al. 2006). As a class they are thought to predate the appearance of molecular oxygen (Sawers and Watson 1998).
Glycine is characterized as multifunctional: it stabilizes transition state intermediates (Bartlett et al. 2002), is involved in folding (Sasai 1995), modulates peptide helicity (Li and Deber 1992), mediates helix–helix interactions in membrane proteins (Oppegård et al. 2008), and is involved in molecular packing requirements (Elbaz et al. 2008), the transport of proteins (Zhou and Kanner 2005) and the flexibility and role of loops (Kwasigroch et al. 1996). Glycine-rich sequences were associated with kinases (Bossemeyer 1994). A contrary effect of glycine was reported in a study to demonstrate nucleic acid synthesis under prebiotic conditions. Glycine and other amino acids in the presence of purines and pyrimidines were found to catalyze in high yield the dehydration of the α,β positions of 2-deoxyribose to form pyrimidino and purino pentoses (Nelsestuen 1979). The reactions described were considered to rapidly deplete components of nucleic acids and present a major problem for prebiotic nucleic acid assembly.
A possible initiating role for diglycine relevant to the formation of a biosphere has been described (Plankensteiner et al. 2002). Those authors reported that diglycine is catalytic: in its presence there is a salt-induced synthesis of other peptides. However, it was pointed out that there are significant thermodynamic and kinetic limitations to the formation of diglycine in aqueous solution (Fitz et al. 2007). That report and others describe the readiness of amino acids, particularly glycine, to form prebiotic peptides in the presence of minerals and discuss aqueous peptide synthesis by the salt-induced peptide formation (SIPF) reaction in the presence of NaCl and Cu(II) (e.g., Rode 1999; Bujdák and Rode 2002). Although glycine is not specifically indicated, the role of amino acids as prebiotic catalysts has been emphasized by others (e.g., Bar-Nun et al. 1994; Shimizu et al. 2008).
Site-specific substitution of glycine by alanine was shown to be functionally deleterious (Sun and Sampson 1998). Computer substitutions of “conserved glycines” by alanine resulted in a significant change in catalytic site geometry while substitution of “non-conserved glycines” had little effect (Varfolomeev et al. 2001). Glycine’s small volume is often noted as an impediment to substitution by larger amino acids (Oppegård et al. 2008). In a study of nine proteins, glycine has the smallest average amino acid buried residue volume: VR = 66.4 ± 4.7 Å³ (Richards 1977).
Glycine stabilizes different proteins in different ways (Ganter and Pluckthun 1990). In bacteriophage T4 lysozyme, mutation of glycine to alanine or proline increases protein stability by decreasing the entropy of the unfolded state (Matthews et al. 1987). This is not the case, however, in chicken glyceraldehyde-3-phosphate dehydrogenase (GAPDH). In this GAPDH, a glycine → alanine substitution does not stabilize the protein by affecting the entropy of the unfolded state, but rather by filling an internal cavity and thereby stabilizing the native state (Ganter and Pluckthun 1990). Additionally, mutation of aspartic acid → glycine increases kcat/Km for ATP 3800-fold in phosphofructokinases. These latter results are interpreted as an enhanced effect on the enzymatic activity of a nucleotide binding site associated with glycine insertion (Chi and Kemp 2000). There are many diverse reports noting relevant properties and consequential roles for glycine residues, e.g., in the interaction of antigens and antibodies (Roitt and Delves 2001).
Glycine’s role at the C/AC in our study is uncertain; it may have multiple roles. However, of the possibilities, some of which are noted above, we favor the opinion that, with a small energy of rotation around its C-N and C-C bonds, glycine provides some advantageous conformational flexibility for active enzyme sites, a desirable enzymatic property that has already been emphasized by others (Tsou 1993; Mesecar et al. 1997; Varfolomeev et al. 2001). The glycine-rich C/AC region or environment analogous to Zone 9 and roughly extending ∼10–15 Å from the C/AC may have more conformational flexibility than other conservation Zones. In an interesting study of the “fluctuational amplitude” of amino acids in 19 proteins, achiral glycine was characterized as having the highest average “flexibility index” of 20 amino acids (Yan and Sun 1997). In that report the authors used their differential equation model that considers the combined influences of the chemical, physical, conformational and energetic properties on the fluctuational displacements of each residue in a protein and particularly on the effect of the residue’s spatial position. Our data may relate to such conformational flexibility and spatial relationships as well as to an association noted above with enzyme motions essential for the catalytic process.
In addition to our findings with G, we report that K, E and A were consistently concentrated at the least-conserved sites most distant from the C/AC. It is reasonable to consider that the concentration of polar species at the peripheral regions of these globular constitutive proteins is associated with their solubility. The location of concentrations of these amino acids may as well be an indicator of enzymatic function (Damodharan and Pattabhi 2004) and may also relate to recognition and interaction between molecules, stabilization of tertiary structures and thermostability (references are cited by Leunissen et al. 1990).
Some of the most predominant amino acids besides glycine in Zone 9 of our study have frequently been proposed as members of an early evolutionary and minimal amino acid set (G, A, D and in most studies V and E as well) (e.g., Miller 1987; Trifonov 2000; Ikehara 2005; Jordan et al. 2005; Higgs and Pudritz 2007). Glycine, A, V, D and E were synthesized in Stanley L. Miller’s seminal sparking experiment and are components of the Murchison and other meteorites (Brack 2007). Other reports have indicated that the inferred proteins of the Last Universal Ancestor (LUA) had a greater abundance of amino acids attributed to a presumably prebiotic period, such as G, A, D and V, again predominant species in our study (Brooks and Fresco 2003; Brooks et al. 2004).
The study of Gulik et al. (2009) converges with ours. They wrote that early functional peptides were 3–8 amino acids long and were made of G, A, D, V and that traces of these prebiotic peptides still exist in the form of active sites in present-day proteins. Their criteria included a search of the entire PDB data base specifically for traces of prebiotic peptides that contained a protein structure interacting with a metal ion and were built almost exclusively with amino acids they deemed most abundant prebiotically: G, A, D, V. Their statistical analyses were confirmatory and interpreted to indicate that G, A, D and possibly V were the “true abundant prebiotic aa’s”. They also found three classes of ion-binding motifs associated with either a DNA-directed or an RNA-dependent RNA polymerase [-D(F/Y)DGD-], three mutases [-DGD(G/A)D-] and a dihydroxyacetone kinase [-DAKVGDGD-]. The motifs were thought with reservation to correspond to the first functional peptides and that the submotif [-DGD-] is the common ancestor to all active peptides. Our methodology differs. We made, e.g., no experimental assumptions as to which amino acids were either primordial or associated with catalytic function. We studied all amino acids and found concentrations of G, A, D and often V at highly conserved sites measurably nearest the C/ACs in non-redundant sequence data sets of kinases, dehydrogenases and lyases of the Bacteria and Eukaryota that are associated with the “trunk” glycolytic pathway (Tables 1A, 1B, 2A and 2B). Lysine and E were concentrated at the least-conserved sites most distant from the C/AC.
Edward N. Trifonov concluded, based on a very comprehensive analysis of 60 “chronology” vectors as criteria, that the pair of complementary GGC and GCC codons for glycine and alanine appeared first (Trifonov 2000, 2004). This study led to the development of his consensus temporal order of the appearance of amino acids, the series indicating that glycine was the oldest amino acid. The reported evolutionary amino acid chronology was: G/A, D, V, P, S, E, L/T, R, I/Q/N, H, K, C, F, Y, M, W. Correlation between protein sequence age and conservation in bacterial octapeptides has been specifically reported (Sobolevsky and Trifonov 2005). These authors concluded that A, G, D, V, S, P, again prominent members of our study, are components of the oldest protein sequences.
The analyses of orthologous proteins encoded by triplets of closely related genomes indicated that there was a set of amino acids with declining presence in proteins over the last 10⁶ years: these were proline, alanine, glutamic acid and glycine (Brooks and Fresco 2003). These amino acids (PAEG) were characterized as the first incorporated into the genetic code and among the six considered to be abiogenic, that is, the most ancient. In this study they were characterized as “strong losers” in an irreversible (evolutionary) decline of their presence in proteins. The losses or gains were not considered to be due to mutation-selection.
Jordan et al. (2005) in their “Supplementary Table 3” compared their rankings of amino acid recent gain or loss in protein evolution with amino acid rankings of recruitment into the genetic code, abundance in spark experiments and the Murchison meteorite. We have added our occupancy data and rearranged their table. Our Supplementary File SF-2 (<http://www.stat.osu.edu/~dkp/ppp/>) compares our data to amino acid abundance-ranking in laboratory syntheses (e.g., Miller 1987), presence in meteorite(s) (Brack 2007), temporal order of appearance (Trifonov 2000, 2004) and literature emphasizing the probable appearance and putative role of these amino acids (Brooks et al. 2002, Brooks and Fresco 2003, Lazcano 2006, Zaia et al. 2008, Cleaves et al. 2008). Several amino acids newly detected by liquid or gas chromatography-mass spectroscopy in preserved residues of S. Miller’s experiments (Johnson et al. 2008) and amino acid enantiomers in the Murchison meteorite (Cronin and Pizzarello 1997) were not included in our Supplementary File SF-2.
Studies by others of the proteomes of the Domains, rather than of enzymes per se as in our study, have reported distinguishing amino acid signatures or compositional patterns attributed to evolutionary memory, phylogeny and life-style (Pe’er et al. 2004; Bogatyreva et al. 2006; Tekaia and Yeramian 2006). In Supplementary File SF-2, we found no statistically significant similarity in the orders of frequency between any of the referenced data versus Zone 9 compared with Zone 1. Our data principally differ because they distinguish the enzyme amino acid content by averaged conservation scores that were obtained from multiple sequence alignments of taxonomically dispersed examples.
Other reports that might be viewed as contrary to ours describe losses of amino acids, e.g., glycine, during evolution (Brooks and Fresco 2003; Jordan et al. 2005). These reports are interpreted by us as compatible with our view that evolutionary losses of certain amino acids as glycine apparently occur less frequently at the most-conserved C/AC than in less-conserved sites moving towards the periphery of molecules. That is, we suggest that if relatable to our studies, these reported evolutionary amino acid losses of, for example, glycine and valine occur predominantly in areas we identify as outside the most-conserved zone of ∼15 Å radius from the C/AC anchor-atom. We believe that unless localized concentrations and conservations of the amino acids are taken into account such attributed evolutionary changes may be obscured.
The predominance or “clustering” of specific amino acids in a particular conservation Zone was a consistent finding in our tests whether sequences were analyzed as an unrestricted enzyme set of all sequences (Category 1 sequence set) (Figs. 3, 4, Tables 1A, 1B) or when separated before analyses by individual Domain or enzyme class (Category 2 sequence sets) (Fig. 5) or when separated before analyses into sets of the same enzyme class and Domain (“Two-Way”) (Category 3 sequence sets) (Tables 2A and 2B).
However, we acknowledge that our observation of the concentrations of “early” amino acids such as glycine at the C/AC of highly conserved enzymes is not rigorous proof of evolutionary continuity between prebiotic chemistry and contemporary biochemical catalysis. Nor is there evidence that the enzymes we have studied are functionally identical or similar to the earliest pro-enzymes. Some of our twelve enzyme choices may not be the oldest protein(s) nor be equally primitive. Informational enzymes such as aminoacyl-tRNA synthetases or ATPases might have been chosen (Becerra et al. 2007).
We primarily studied operational enzymes that are involved in the three-carbon “trunk” portion of glycolysis. We prefer glycolysis for a variety of reasons. Glycolysis might allow life under prebiotic anaerobic conditions and assures a fast response in supplying ATP (references are cited by Meléndez-Hevia et al. 1997). Sugar is described in chemical terms both as the optimal biosynthetic carbon source of aqueous life in the Universe and as an indispensable component of a model describing the irreversible catalytic flow of reactions ascribed to the origin of life (Weber 2000, 2001, 2002). This anaerobic redox disproportionation of sugar, e.g., glucose, with the production of ATP, which is also called substrate phosphorylation, is mechanistically the simplest and presumably the oldest type of energy conservation (Gest and Schopf 1983). The antiquity of glycolysis is reflected in the fact that it occurs in a soluble system without the involvement of membranes, and the relatively low level of ATP synthesis by substrate phosphorylation is believed relatable to the presumed inefficiency of a primitive energy-conserving function (Gest and Schopf 1983). Glycolysis is presumed present in the last universal ancestor (references are cited by Ronimus and Morgan 2003). The supposed and characterized primitiveness of glycolysis is compatible with the fact that its components ATP and NAD are present in almost every extant cellular process that either supplies or depends on utilizable energy (Krebs and Kornberg 1957). There is now genetic evidence connecting DNA chain elongation to glycolysis (Jannière et al. 2007).
We emphasize that in our work, with widely diverse taxonomic examples of some household-glycolytic enzymes, the averaged MSA site conservation scores decrease in a relatively smooth manner proceeding from the C/AC anchor-atom and closely surrounding sites to the molecule’s periphery and that this progression is associated with local concentrations of specific amino acids.
The generally consistent signal we have found in the context of natural variability accentuates the usefulness of statistical analyses in studying multiple and diverse species. We suggest that study of conservation of sites in relation to any historical time frame should also include a study of interior nodes of the phylogenetic tree space. For example, the posterior distribution of the interior nodal sequences of a phylogenetic tree can be estimated by our Bayesian tree building algorithm (Li et al. 2000). Our previous reports showed that 3-phosphoglycerate kinase, a product of an operational housekeeping gene sequence, carries a high degree of evolutionary signal for phylogenetic studies (Wolf et al. 2004; Pollack et al. 2005). The results of the present study indicate that this signal can be further enhanced. The consistencies of the association between site conservation and distance from the C/AC across differing enzyme classes and over the three Domains should have consequences for the stochastic models used in such studies. For example, introducing the distance from the C/AC as an explanatory factor to help describe the site-to-site variability in the rate of mutation may greatly improve the likelihoods for phylogenetic trees based on amino acid sequence data (Pan 2008). Furthermore, knowledge of the identification and sequence positions of these most-conserved amino acids common to a wide taxonomy and localized at the C/AC of highly-conserved essential enzymes of central metabolism may be useful in choosing residues for a variety of studies.
Our results showing that modal occupancy rates for the distributions of specific amino acids are linked to this conservation/distance information may have implications for amino acid substitution models and perhaps chemical or enzyme evolution or design. For example, improvements might be developed to the classical PAM (Schwartz and Dayhoff 1978) and the more recent BLOSUM (Henikoff and Henikoff 1992) families of amino acid substitution matrices used heavily in phylogenetic research. These families are indexed by the cumulative degree of conservation (e.g., the BLOSUM62 matrix is derived from blocks of multiply aligned related sequences in which sequences that are at least 62% identical are clustered and weighted as a single sequence). Our results suggest that for enzymes such matrices would possibly provide more phylogenetic information if they incorporated modifications that focus on specific levels of conservation and proximity to the C/AC rather than only on cumulative levels.
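The suggestion above can be illustrated with a simplified sketch that builds log-odds substitution scores separately for the alignment columns of each conservation zone. This is not the BLOSUM procedure itself: sequence clustering, rounding and scaling to half-bits are omitted, columns are assumed to be gap-free, and the function name and toy columns are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pair_log_odds(columns):
    """Log-odds scores (in bits) from residue pairs within aligned columns.

    columns: iterable of strings, one per alignment site (gap-free).
    Returns {(a, b): score} with a <= b for every observed pair.
    """
    pair_counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Residue frequencies implied by the observed pairs.
    residue_counts = Counter()
    for (a, b), n in pair_counts.items():
        residue_counts[a] += n
        residue_counts[b] += n
    p = {aa: n / (2 * total_pairs) for aa, n in residue_counts.items()}

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = math.log2(observed / expected)
    return scores

# Toy two-sequence columns standing in for sites assigned to one zone.
zone_columns = ["GG", "GG", "KK", "GK"]
zone_scores = pair_log_odds(zone_columns)
```

Running the same function on the columns of Zone 1 and of Zone 9 separately would yield one score table per zone, which is one possible way to condition substitution scores on conservation level and proximity to the C/AC rather than on a single cumulative level.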
In those enzyme examples we have studied, the most-conserved environments closely surrounding the functional cores have similar amino acid content. Regardless of Domain or enzyme class or both in sequences of some operational essential enzymes certain residues notably G and V and perhaps D are concentrated at most-conserved sites within ∼15 Å of the catalytic/active centers and others as K and perhaps E are concentrated at most distant least-conserved sites. Alanine seems to be more generally distributed. Our strikingly consistent statistical results regarding those most-conserved C/AC localized amino acids perhaps with others constitute data supportive of reports suggesting that they are a contemporary remnant or signal of prebiotic amino acid aggregation.
References
Bar-Nun A, Kochavi E, Bar-Nun S (1994) Assemblies of free amino acids as possible prebiotic catalysts. J Mol Evol 39:116–122
Bartlett GJ, Porter CT, Borkakoti N, Thornton JM (2002) Analysis of catalytic residues in enzyme active sites. J Mol Biol 324:105–121
Becerra A, Delaye L, Islas S, Lazcano A (2007) The very early stages of biological evolution and the nature of the last common ancestor of the three major cell domains. Annu Rev Ecol Evol Syst 38:361–379
Benkovic SJ, Hammes GG, Hammes-Schiffer S (2008) Free-energy landscape of enzyme catalysis. Biochemistry 47:3317–3321
Berman HM, Battistuz T, Bhat TN et al (2002) The protein data bank. Acta Crystallogr D Biol Crystallogr 58:899–907
Bogatyreva NS, Finkelstein AV, Galzitskaya OV (2006) Trend of amino acid composition of proteins of different taxa. J Bioinfo Comput Biol 4:597–608
Bossemeyer D (1994) The glycine-rich sequence of protein kinases, a multifunctional element. Trends Biochem Sci 19:201–205
Brack A (2007) From interstellar amino acids to prebiotic catalytic peptides: a review. Chem Biodiversity 4:665–679
Brack A, Orgel LE (1975) β structures of alternating polypeptides and their possible prebiotic significance. Nature 256:383–387
Brooks DJ, Fresco JR (2003) Greater GNN pattern bias in sequence elements encoding conserved residues of ancient proteins may be an indicator of amino acid composition of early proteins. Gene 303:177–185
Brooks DJ, Fresco JR, Lesk AM, Singh M (2002) Evolution of amino acid frequencies in proteins over deep time: inferred order of introduction of amino acids into the genetic code. Mol Biol Evol 19:1645–1655
Brooks D, Fresco JR, Singh M (2004) A novel method for estimating ancestral amino acid composition and its application to proteins of the last universal ancestor. Bioinformatics 20:2251–2257
Bujdák J, Rode BM (2002) Preferential amino acid sequences in alumina-catalysed peptide bond formation. J Inorg Biochem 90:1–7
Chakravarty S, Varadarajan R (1999) Residue depth: a novel parameter for the analysis of protein structure and stability. Structure 7:723–732
Chi A, Kemp R (2000) The primordial high energy compound, ATP or inorganic pyrophosphate? J Biol Chem 275:35677–35679
Clamp M, Cuff J, Searle SM, Barton GJ (2004) The Jalview Java alignment editor. Bioinformatics 20:426–427
Clarke PH, Elsden SR (1980) The earliest catabolic pathways. J Mol Evol 15:333–338
Cleaves HJ, Chalmers JH, Lazcano A, Miller SL, Bada JL (2008) A reassessment of prebiotic organic synthesis in neutral planetary atmospheres. Orig Life Evol Biosph 38:105–115
Conchillos C, Lecointre G (2005) Integrating the universal metabolism into a phylogenetic analysis. Mol Biol Evol 22:1–11
Cronin JR, Pizzarello S (1997) Enantiomeric excesses in meteoritic amino acids. Science 275:951–955
Damodharan L, Pattabhi V (2004) Hydropathy analysis to correlate structure and function of proteins. Biochem Biophys Res Commun 323:996–1002
Dean AM, Golding GB (2000) Enzyme evolution explained (sort of). Pacific Symp Biocomp 5:6–17
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797
Eigen M, Schuster P (1978) The hypercycle. A principle of natural self-organization. Part C: The realistic hypercycle. Naturwissenschaften 65:341–369
Elbaz Y, Salomon T, Schuldiner S (2008) Identification of a glycine motif required for packing in EmrE, a multidrug transporter from Escherichia coli. J Biol Chem 283:12276–12283
Etzold T, Argos P (1993) SRS—an indexing and retrieval tool for flat file databases. Comput Appl Biosci 9:49–57
Fitz D, Reiner H, Rode BM (2007) Chemical evolution toward the origin of life. Pure Appl Chem 79:2101–2117
Fukuchi S, Nishikawa K (2001) Protein surface amino acid compositions distinctively differ between thermophilic and mesophilic bacteria. J Mol Biol 309:835–843
Ganter C, Plückthun A (1990) Glycine to alanine substitutions in helices of glyceraldehyde-3-phosphate dehydrogenase: effects on stability. Biochemistry 29:9395–9402
Gasteiger E, Gattiker A, Hoogland C, Ivanyi I, Appel RD, Bairoch A (2003) ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res 31:3784–3788
Gest H, Schopf JW (1983) Biochemical evolution of anaerobic energy conversion: the transition from fermentation to anoxygenic photosynthesis. In: Schopf JW (ed) Earth’s earliest biosphere. Its origin and evolution. Princeton University Press, NJ, pp 135–148
Gulik P, Massar S, Gilis D, Buhrman H, Rooman M (2009) The first peptides: the evolutionary transition between prebiotic amino acids and early proteins. J Theor Biol. doi:10.1016/j.jtbi.2009.09.004
Hammes-Schiffer S (2002) Impact of enzyme motion on activity. Biochemistry 41:13335–13343
Hanage WP, Fraser C, Spratt BG (2006) Sequences, sequence clusters and bacterial species. Phil Trans R Soc B 361:1917–1927
Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919
Higgs PG, Pudritz RE (2007) From protoplanetary disks to prebiotic amino acids and the origin of the genetic code. In: Pudritz RE, Higgs PG, Stone J (eds) Planetary systems and the origins of life. Cambridge Series in Astrobiology, Volume 3. Cambridge University Press, Cambridge, pp 1–29
Holliday GL, Almonacid DE, Bartlett GJ, O’Boyle NM, Torrance JM, Murray-Rust P, Mitchell JB, Thornton JM (2007) MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms. Nucleic Acids Res 35:D515–D520
Hosmer DW, Lemeshow S (2000) Applied logistic regression. Wiley, New York
Ikehara K (2005) Possible steps to the emergence of life. The GADV-protein world hypothesis. Chem Record 5:107–118
Janin J (1979) Surface and inside volumes in globular proteins. Nature 277:491–492
Jannière L, Canceill D, Suski C, Kanga S, Dalmais B, Lestini R, Monnier A-F, Chapuis J, Bolotin A, Titok M, Le Chatelier E, Ehrlich SD (2007) Genetic evidence for a link between glycolysis and DNA replication. PLoS ONE 2:e447
Johnson AP, Cleaves HJ, Dworkin JP, Glavin DP, Lazcano A, Bada JL (2008) The Miller volcanic spark discharge experiment. Science 322:404 (and supplementary files)
Jordan IK, Rogozin IB, Wolf YI, Koonin EV (2002) Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res 12:962–968
Jordan IK, Kondrashov FA, Adzhubei IA, Wolf YI, Koonin EV, Kondrashov AS, Sunyaev S (2005) A universal trend of amino acid gain and loss in protein evolution. Nature 433:633–637
Krebs HA, Kornberg HL (1957) Energy transformation in living matter. Ergeb Physiol Biol Chem Exp Pharmakol 49:212–298
Krieger E, Koraimann G, Vriend G (2002) Increasing the precision of comparative models with YASARA NOVA—a self-parameterizing force field. Proteins 47:393–402
Kwasigroch J-M, Chomilier J, Mornon J-P (1996) A global taxonomy of loops in globular proteins. J Mol Biol 259:855–872
Kyte J, Doolittle R (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157:105–132
Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben-Tal N (2005) ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res 33(Web Server issue):W299–W302. doi:10.1093/nar/gki370
Lazcano A (2006) The origins of life. Natural History 115:36–41
Lehtiö L, Grossmann JG, Kokona B, Fairman R, Goldman A (2006) Crystal structure of a glycyl radical enzyme from Archaeoglobus fulgidus. J Mol Biol 357:221–235
Leunissen JAM, van den Hooven HW, de Jong WW (1990) Extreme differences in charge changes during protein evolution. J Mol Evol 31:33–39
Li S-C, Deber CM (1992) Glycine and β-branched residues support and modulate peptide helicity in membrane environments. FEBS Lett 311:217–220
Li S, Pearl DK, Doss H (2000) Phylogenetic tree construction using Markov chain Monte Carlo. J Amer Stat Assoc 95:493–508
Lichtarge O, Sowa ME (2002) Evolutionary predictions of binding surfaces and interactions. Curr Opin Struct Biol 12:21–27
Martz E (2002) Protein Explorer: easy yet powerful macromolecular visualization. Trends Biochem Sci 27:107–109
Matthews BW, Nicholson H, Becktel WJ (1987) Enhanced protein thermostability from site-directed mutations that decrease the entropy of unfolding. Proc Natl Acad Sci USA 84:6663–6667
Meléndez-Hevia E, Waddell TG, Heinrich R, Montero F (1997) Theoretical approaches to the evolutionary optimization of glycolysis: chemical analysis. Eur J Biochem 244:527–543
Mesecar AD, Stoddard BL, Koshland DE Jr (1997) Orbital steering in the catalytic power of enzymes: small structural changes with large catalytic consequences. Science 277:202–206
Miller SL (1987) Which organic compounds could have occurred on the prebiotic earth? Cold Spring Harbor Symp Quant Biol 52:17–27
Miller S, Janin J, Lesk AM, Chothia C (1987) Interior and surface of monomeric proteins. J Mol Biol 196:641–656
Morley KL, Kazlauskas RJ (2005) Improving enzyme properties: when are closer mutations better? Trends Biotechnol 23:231–237
Nagano N (2005) The enzyme catalytic mechanism data base. Nucleic Acids Res 33:D407–D412
Naor D, Fisher D, Jernigan RL, Wolfson HJ, Nussinov R (1996) Amino acid pair interchange at spatially conserved locations. J Mol Biol 256:924–938
Nelsestuen GL (1979) Amino acid catalyzed condensation of purines and pyrimidines with 2-deoxyribose. Biochemistry 18:2843–2846
Oppegård C, Schmidt J, Kristiansen PE, Nissen-Meyer J (2008) Mutational analysis of putative helix–helix interacting GxxxG-motifs and tryptophan residues in the two-peptide bacteriocin lactococcin G. Biochemistry 47:5242–5249
Pan X (2008) Using structural information in modeling and multiple alignments for phylogenetics. Ph.D. Dissertation. The Ohio State University, Department of Statistics, Columbus, Ohio, 43210 USA
Pe’er I, Felder CE, Man O, Silman I, Sussman JL, Beckmann JS (2004) Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins 54:20–40
Pintar A, Carugo O, Pongor S (2003) Atom depth as a descriptor of the protein interior. Biophys J 84:2553–2561
Plankensteiner K, Righi A, Rode BM (2002) Glycine and diglycine as possible catalytic factors in the prebiotic evolution of peptides. Orig Life Evol Biosph 32:225–236
Pollack JD, Li Q, Pearl DK (2005) Taxonomic utility of a phylogenetic analysis of phosphoglycerate kinase proteins of Archaea, Bacteria, and Eukaryota: insights by Bayesian analyses. Mol Phylogenet Evol 35:420–430
Porter CT, Bartlett GJ, Thornton JM (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res 32:D129–D133
Pruitt KD, Tatusova T, Maglott DR (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35(Database issue):D61–D65
Richards FM (1977) Areas, volumes, packing, and protein structure. Ann Rev Biophys Bioeng 6:151–176
Rivera MC, Jain R, Moore JE, Lake JA (1998) Genomic evidence for two fundamentally distinct gene classes. Proc Natl Acad Sci USA 95:6239–6244
Rode BM (1999) Peptides and the origin of life. Peptides 20:773–786
Roitt IM, Delves PJ (2001) Antibodies. In: Roitt IM, Delves PJ (eds) Roitt’s essential immunology, 10th edn. Blackwell Science, Malden, pp 37–58 (see Fig. 3.12, “The binding site”)
Ronimus RS, Morgan HW (2003) Distribution and phylogenies of enzymes of the Embden-Meyerhof-Parnas pathway from archaea and hyperthermophilic bacteria support a gluconeogenic origin of metabolism. Archaea 1:199–221
Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH (1985) Hydrophobicity of amino acid residues in globular proteins. Science 229:834–838
Sasai M (1995) Conformation, energy and folding ability of selected amino acid sequences. Proc Natl Acad Sci USA 92:8438–8442
Sawers G, Watson G (1998) A glycyl radical solution: oxygen-dependent interconversion of pyruvate formate-lyase. Mol Microbiol 29:945–954
Schwartz RM, Dayhoff MO (1978) Matrices for detecting distant relationships. In: Dayhoff MO (ed) Atlas of protein sequence and structure. Volume 5, Supplement 3. National Biomedical Research Foundation, Washington, pp 353–358
Selmer T, Pierik AJ, Heider J (2005) New glycyl radical enzymes catalyzing key metabolic steps in anaerobic bacteria. Biol Chem 386:981–988
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656
Shimizu M, Yamagishi A, Kinoshita K, Shida Y, Oshima T (2008) Prebiotic origin of glycolytic metabolism: histidine and cysteine can produce acetyl CoA from glucose via reactions homologous to non-phosphorylated Entner-Doudoroff pathway. J Biochem 144:383–388
Simpson EH (1949) Measurement of diversity. Nature 163:688
Sobolevsky Y, Trifonov EN (2005) Conserved sequences of prokaryotic proteomes and their compositional age. J Mol Evol 61:591–596
Sun J, Sampson NS (1998) Determination of the amino acid requirements for a protein hinge in triose phosphate isomerase. Protein Sci 7:1495–1505
Suwannachot Y, Rode B (1999) Mutual amino acid catalysis in salt-induced peptide formation supports this mechanism’s role in prebiotic peptide formation. Orig Life Evol Biosph 29:463–471
Tekaia F, Yeramian E (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genomics 7:307
Trifonov EN (2000) Consensus temporal order of amino acids and evolution of the triplet code. Gene 261:139–151
Trifonov EN (2004) The triplet code from first principles. J Biomol Struct Dyn 22:1–11
Tsou CL (1993) Conformational flexibility of enzyme active sites. Science 262:380–381
Varfolomeev SD, Gurevich KG (2001) Enzyme active sites: bioinformatics, architecture, and mechanisms of action. Russian Chem Bull 50:1709–1717
Varfolomeev SD, Gurevich KG, Poroykov VV, Sobolev BN, Fomenko AE (2001) Catalytic sites of enzymes as conserved elements of amino acid sequence alignment: a unique role of glycine and aspartic acid in formation of enzyme active sites. Dokl Biochem Biophys 379:252–254
Varfolomeev SD, Uporov IV, Fedorov EV (2002) Bioinformatics and molecular modeling in chemical enzymology. Active sites of hydrolases. Biochemistry (Moscow) 67:1099–1108
Varfolomeev SD, Gariev IA, Uporov IV (2005) Catalytic sites of hydrolases: structures and catalytic cycles. Russian Chem Revs 74:61–76. doi:10.1070/RC2005v074n01ABEH001159
Varrazzo D, Bernini A, Spiga O, Ciutti A, Chiellini S, Venditti V, Bracci L, Niccolai N (2005) Three-dimensional computation of atom depth in complex molecular structures. Bioinformatics 21:2856–2860
Weber AL (2000) Sugars as the optimal biosynthetic carbon substrate of aqueous life throughout the universe. Orig Life Evol Biosph 30:33–43
Weber AL (2001) The sugar model: catalytic flow reactor dynamics of pyruvaldehyde synthesis from triose catalyzed by poly-L-lysine contained in a dialyzer. Orig Life Evol Biosph 31:231–240
Weber AL (2002) Chemical constraints governing the origin of metabolism: the thermodynamic landscape of carbon group transformations under mild aqueous conditions. Orig Life Evol Biosph 32:333–357
Wolf M, Müller T, Dandekar T, Pollack JD (2004) Phylogeny of Firmicutes with special reference to Mycoplasma (Mollicutes) as inferred from phosphoglycerate kinase amino acid sequence data. Int J Syst Evol Microbiol 54:871–875
Yan BX, Sun YQ (1997) Glycine residues provide flexibility for enzyme active sites. J Biol Chem 272:3190–3194
Zaia DAM, Zaia CTBV, De Santana H (2008) Which amino acids should be used in prebiotic chemistry studies? Orig Life Evol Biosph 38:469–488
Zhang J, Dean AM, Brunet F, Long M (2004) Evolving protein functional diversity in new genes of Drosophila. Proc Natl Acad Sci USA 101:16246–16250
Zhou Y, Kanner BI (2005) Transporter-associated currents in the γ-aminobutyric acid transporter GAT-1 are conditionally impaired by mutations of a conserved glycine residue. J Biol Chem 280:20316–20324
Zvelebil MJ, Barton GJ, Taylor WR, Sternberg MJ (1987) Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J Mol Biol 195:957–961
Acknowledgements
The study is dedicated to the memory of Professor Robert Cawrse Cleverdon, Ph.D., The University of Connecticut, Storrs, Connecticut, USA (JDP). We thank the Bioinformatics Unit, G. S. Wise Faculty of Life Sciences, Tel Aviv University, Israel, for making the public ConSurf server available. We thank Professor Steven Krawiec, Lehigh University, Pennsylvania, USA, who brought to our attention the special role of glycine in the interaction of antigens and antibodies. We also thank an anonymous reviewer (Reviewer 1) for particularly valuable suggestions and comments.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JDP conceived and executed the study and drafted the text and graphics. DKP participated in the design and execution of the study and in drafting the manuscript. XP participated in the design, execution and interpretation of the statistical analyses. All authors read and approved the final manuscript.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Supplementary File 1 (SF-1)
(PDF 205 kb)
Supplementary File 2 (SF-2)
(PDF 590 kb)
Cite this article
Pollack, J.D., Pan, X. & Pearl, D.K. Concentration of Specific Amino Acids at the Catalytic/Active Centers of Highly-Conserved “Housekeeping” Enzymes of Central Metabolism in Archaea, Bacteria and Eukaryota: Is There a Widely Conserved Chemical Signal of Prebiotic Assembly?. Orig Life Evol Biosph 40, 273–302 (2010). https://doi.org/10.1007/s11084-009-9188-z
DOI: https://doi.org/10.1007/s11084-009-9188-z