Background/Introduction

The circumstances surrounding the prebiotic assembly of amino acids into more complex molecules and polymers are among the most important questions in science (Lazcano 2006). In initial studies of 3-D models of essential and highly conserved phosphoglycerate kinase (PGK) sequences from Archaea, Bacteria and Eukaryota, we found previously unreported localizations of prebiotic amino acids both at the most-conserved sites surrounding the catalytic/active center (C/AC) and at the least-conserved sites most distant from the C/AC. G and V, and in some analyses D, R, T and S, were generally concentrated at the most-conserved sites proximal to the C/AC. K and E were concentrated at the distant, least-conserved sites. Alanine was typically evenly distributed. Higgs and Pudritz (2007) have indicated that early proteins were composed of a somewhat similar set (GADEV), that these amino acids could have formed useful “structures” in primitive organisms, and that modern proteins may still contain a signal related to the amino acid frequencies of early evolution. Gulik et al. (2009) concluded that prebiotic short functional peptides were primarily made of GADV and assumed that some traces of these peptides still exist.

The likely first prebiotic assemblage was described as condensation products of glycine and DL-alanine with other α-amino acids (Brack and Orgel 1975). We considered that aggregations of small prebiotic amino acids at the most-conserved sites surrounding the catalytic/active centers of metabolically essential, highly conserved enzymes might represent the contemporary presence of one or more chemical characteristics of prebiotic amino acid assemblage. The aggregations of glycine at the most evolutionarily conserved sites might also relate to the evolution of essential enzymes. To relate our preliminary PGK findings more confidently to these reports, to prebiotic assembly and, if one exists, to a conceptual relationship with the Last Universal Ancestor, we analyzed in several statistically valid ways various alignment sets of larger, taxonomically diverse collections of protein sequences from all Domains and several enzyme classes.

It was also necessary to study amino acid conservation at the C/ACs and enzyme function separately in each Domain, for two reasons. First, it was reported that, in contrast to eukaryotic genes, essential bacterial genes were more conserved than non-essential genes (Jordan et al. 2002). Second, protein sequence alignments and derived phylogenetic trees might be misinterpreted due to the inclusion of examples representing different enzymatic classes (Varfolomeev et al. 2005). Results might also be skewed by the inclusion of both enzymes and non-enzymes (Varfolomeev et al. 2002), by simultaneously analyzing both informational (e.g., rRNA) and operational (e.g., fructose-1,6-bisphosphate aldolase) sequences (Rivera et al. 1998), or by mingling sequences of, e.g., strictly intracellular, membrane or excretable enzymes.

To respond to these issues we collected 1969 unique sequences of twelve different operational enzymes. The proteins were primarily associated with central metabolism (such as glycolysis) and categorized as “household or house-keeping enzymes”, i.e., highly conserved proteins essential in maintaining basic functions for sustenance, e.g., phosphoglycerate kinase, pyruvate kinase, aldolase and lactate dehydrogenase. They were also recognized as constitutively expressed, globular (soluble) and intracellular. In Bacteria, house-keeping enzymes are described as slowly evolving and their genetic variations are believed to be relatively neutral (Hanage et al. 2006). We primarily studied glycolytic members belonging to three enzyme classes: lyases, transferases (phosphotransferases, kinases) and oxidoreductases (dehydrogenases) of the Archaea, Bacteria and Eukaryota (Enzyme Nomenclature Database <http://www.chem.qmul.ac.uk/iubmb/enzyme/>). For our examples, we will use the terms kinase(s), dehydrogenase(s) and lyase(s). After alignment, we found that the 4158 consensus MSA amino acids and their sites, distributed over the 12 enzyme sets, could be additionally characterized by their hydropathy value, by their sites’ interatomic distance to the respective catalytic/active center (C/AC) and, using the same MSAs, by a statistically significant conservation score at each site that is relatable to molecular structure. We also studied subsets of the 1969 sequences restricted to each Domain, to one of the three enzyme classes, or to their combinations.

Methods

Sequences

We studied the MSA consensus amino acid content of unique protein sequences in alignments of twelve highly conserved enzymes primarily of central metabolism. Prospective sequences were retrieved from public databases following criteria we used before (Wolf et al. 2004; Pollack et al. 2005). We endeavored to select non-redundant sequences from taxonomically diverse species (Etzold and Argos 1993; Gasteiger et al. 2003; Pruitt et al. 2007). All sequences were reviewed using the highly cross-referenced UniProt Knowledgebase (UniProtKB/Swiss-Prot, <http://www.expasy.org/>). When identified, sequences described, e.g., as “undecided”, “indefinite” or “mutations” were not used. Thirteen sequences described as “fragments” and many described as “probable” were used. The 1969 selected sequences representing the 12 enzymes were used in the construction of three major data sets called “Categories” in order to test the consistency and constancy of our major observations. These sequences are identified in this study as Category 1, 2 or 3 and their construction and general content are described below.

The individual enzymes and their enzyme class are identified by their Enzyme Commission (EC) designations (the number of sequences we studied is in braces { }). Five of the twelve enzymes were transferases, subclass “transferring phosphorus-containing groups” or phosphotransferases (EC 2.7.x.x) (called “kinases”): 3-phosphoglycerate kinase (EC 2.7.2.3) (PGK) {141}, pyruvate kinase (EC 2.7.1.40) (PYK) {370}, adenylate kinase (EC 2.7.4.3) (ADK, also abbreviated ADYK) {136}, nucleoside diphosphate kinase (EC 2.7.4.6) (NDK) {103} and acetylglutamate kinase (EC 2.7.2.8) (ACGK) {135}. Four enzymes were lyases, subclass “carbon–carbon cleaving” (EC 4.1.x.x) or subclass “carbon–oxygen cleaving” (EC 4.2.x.x) (called “lyases”): deoxyribose aldolase (EC 4.1.2.4) (DERA) {81}, enolase (EC 4.2.1.11) (ENO) {213}, D-fructose-1,6-bisphosphate aldolase (EC 4.1.2.13) (FBPA) {169} and tryptophan synthase (EC 4.2.1.20) (α-subunit) (TRPA) {297}. Three enzymes were oxidoreductases, subclass “acting on either the CH-OH group of 1° and 2° alcohols and hemi-acetals” (EC 1.1.x.x) or “acting on aldehydes or oxo groups” (EC 1.2.x.x) (called “dehydrogenases”): alcohol dehydrogenase (EC 1.1.1.1) (ADH) {75}, L-lactate dehydrogenase (EC 1.1.1.27) (L-LDH) {119} and glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.12 and −.13, −.59) (GAPDH) {163}.

The Category 1 set contained the 1969 sequences of the 12 enzymes. The set was separated into six subsets, each containing only Archaea, Bacteria or Eukaryota regardless of enzyme class, or only kinases, lyases or dehydrogenases regardless of Domain. The six subsets were aligned and analyzed (Results, Fig. 2). The Category 2 database contained the same 1969 sequences, but these were separated into the 12 homologous enzyme sets, which were individually aligned and analyzed. The data in each were collated and averaged to produce 12 averaged data sets, each representative of one of the 12 enzymes (see below). In these 12 averaged sets we recovered a total of 4158 amino acid consensus sites that could be identified by their Domain as well as by their enzyme class (Results, Tables 1A and 1B). The 18 FASTA MSAs of Categories 1 and 2, with the identity of their sequences, are available at our website (<http://www.stat.osu.edu/~dkp/ppp/>). The FBPA set of Category 2 Bacteria lyase sequences is restricted to FBPA Class II and unclassified FBPA examples; it contains no FBPA examples identified as Class I. Category 2 ADH sequences were almost all examples of Zn2+/Fe2+ binding species.

Table 1 Percentage occupancy and distance to catalytic/active centers of amino acids vs conservation zones

In order to reduce any proposed or presumed effect in using mixed data sets composed of sequences of both widely varied taxonomies and enzyme classes we constructed six sets each containing sequences of only the same Domain and enzyme class (Category 3). For example, the Bacteria kinase collection contained sequences of five enzymes: ACGK, ADYK, NDK, PGK and PYK. Collections of each kinase were aligned and analyzed. The data from the five analyses were collated and averaged in order to obtain one file representing all Bacteria kinases. We similarly treated Bacteria lyases (DERA, ENO, FBPA, TRPAα), Bacteria dehydrogenases (ADH, GAPDH, L-LDH), Eukaryota kinases (ADYK, NDK, PGK, PYK), Eukaryota lyases (DERA, ENO, FBPA, TRPAα) and Eukaryota dehydrogenases (ADH, GAPDH, LDH). Archaea were not studied because of insufficient sequences.

“Scaffolds”

A particularly important component of our methodology was the use of “scaffolds”. Scaffolds are PDB protein sequences included in all sequence sets. Scaffolds are identified in both Supplementary File 1 (SF-1) and in every FASTA alignment by their italicized four-character PDB identities (Alignment File). The files are found at <http://www.stat.osu.edu/~dkp/ppp/>. Scaffolds were selected when both the amino acids and the spatial coordinates involved in their enzymatic mechanism or function were reported (Berman et al. 2002; Porter et al. 2004; Nagano 2005; Holliday et al. 2007). Scaffolds are essential in our analyses. They act as specific queries in determining the conservation scores described below, and they are used to calculate distances from a special amino acid atom, reported to be mechanistically or catalytically involved in the protein’s catalytic action, to the Cα of each residue in its protein sequence. The catalytically critical atom is called the “anchor-atom”; its amino acid is called the “anchor-amino acid”.

We used additional scaffolds not included in SF-1 but like all the others they are noted in their respective MSAs. In Categories 1 and 2 they are: 1v6s [2643], 1rvg [1920], 1zmr [2539], 2e28 [238], 1w6t [6833], 3c4u [1820], 1a5z [2507] and 1rjw [10221]. Numbers in [ ] identify the involved atoms further described below using their PDB HETATM or PDB ATOM designation. Additional scaffolds used in Category 3 analyses are identified in Results, Table 2A and Table 2B footnotes.

Table 2 Analyses using Category 3 sequence sets of the same Domain and enzyme class

Alignments: Multiple Sequence Alignment of, e.g., Nucleoside Diphosphate Kinase (NDK)

To describe our general procedure in more detail we will use the nucleoside diphosphate kinase (NDK) subset from the 1969 sequences of Category 2 as the example. (A second, more detailed example using ADYK/ADK that describes our methodology in a stepwise fashion is found in a “Support” folder at <http://www.stat.osu.edu/~dkp/ppp/>.) The Category 2 NDK subset contained 222 unique NDK sequences from all Domains. Five taxonomically diverse “scaffold” NDK sequences were included in the set: 1jxv (Eukaryota, human), 1k44 (Bacteria, Mycobacterium tuberculosis), 1nb2 (Bacteria, Bacillus sp.), 1pku (Eukaryota, Oryza sativa, rice) and 2az1 (Archaea, Halobacterium sp.). The inclusion of scaffolds of diverse origins here and in each of the other 11 enzyme sets imparts some taxonomic generality to their averaged data.

The NDK alignment and all others were prepared by MUSCLE (<http://www.drive5.com/muscle/>) (Edgar 2004) and opened in the Jalview editor (<http://www.jalview.org/>) (Clamp et al. 2004). This MSA is characterized as the NDK “founding” FASTA MSA. In addition to the NDK FASTA MSA, Jalview shows the MSA consensus sequence on a scale that identifies its initial residue as position “1” and is aligned with the five variously gapped reference scaffold sequences. Jalview also calculates an MSA consensus that can be aligned with each NDK scaffold included in the alignment. This permits the combination of all available NDK conservation and distance data into one file that can be appropriately averaged and assigned to the sites in the NDK consensus sequence, as further described below.
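The two bookkeeping steps implied above, deriving a consensus from the MSA and relating each gapped scaffold to consensus positions, can be sketched as follows. This is a minimal illustration and not the Jalview implementation: the function names `consensus` and `column_to_residue` are hypothetical, the consensus is assumed to be the most frequent non-gap residue per column, and residue indices are simple sequential counts rather than PDB residue numbers.

```python
from collections import Counter

GAP = "-"

def consensus(alignment):
    """Most frequent non-gap residue in each column of equal-length aligned sequences.

    Assumption: gaps are ignored when counting; an all-gap column yields a gap.
    """
    result = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != GAP)
        result.append(counts.most_common(1)[0][0] if counts else GAP)
    return "".join(result)

def column_to_residue(gapped_scaffold):
    """Map 1-based MSA column -> 1-based residue index of the ungapped scaffold.

    Columns where the scaffold has a gap are absent from the mapping.
    """
    mapping = {}
    residue = 0
    for col, res in enumerate(gapped_scaffold, start=1):
        if res != GAP:
            residue += 1
            mapping[col] = residue
    return mapping
```

With such a mapping, a per-residue value computed on a scaffold (a conservation score or a distance) can be placed at the MSA column, and therefore the consensus position, to which that residue is aligned.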

Assays for Conservation

Using the founding-NDK MSA in each case, the five NDK scaffold sequences were also individually entered as the “query” in the Consurf program (<http://consurf.tau.ac.il/>) (Landau et al. 2005). The program is linkable to the molecular structure of a scaffold or homologous 3-D template and reports conservation results for each PDB site: 1) its position in its PDB file, 2) its normalized conservation score and 3) a color representing the score. The Consurf output is also linked to the Protein Explorer program (Martz 2002) to produce a 3-D image of the PDB-scaffold query molecule used in the particular analysis, with its sites colored according to their assigned conservation score, as seen in Results, Fig. 1.

Fig. 1

3-D distributions of the MSA consensus amino acids of 3-phosphoglycerate kinase. Analysis (Consurf ver. 3.0, PDB template 1vpe) of a Muscle (ver. 3.6) alignment of the 131 protein sequences of the globular 3-phosphoglycerate kinase (EC 2.7.2.3) (PGK) of the Embden-Meyerhof-Parnas pathway (glycolysis) (Pollack et al. 2005). a Zone 1. Only the least-conserved (turquoise) sites and CPK-colored ligands are pictured. b Zone 9. As in a, but only the most-conserved (magenta) sites and CPK-colored ligands are pictured. c Percent zonal occupancy of each amino acid in Zones 9 and 1. d As in c, percentage occupancy values of each amino acid in each zone, the highest three are in red (nZone9 = 66, nZone1 = 58; nZones9-1 = 387)

The Consurf program calculates two values of site conservation: a conservation score and a “Zone” designation. The conservation scores are normalized by the Consurf program as standard scores, so that the average score for all residues is zero and the standard deviation is one; in our studies these scores fall within the range of −3.00 to +3.00. The conservation scores are also scaled into 9 approximately equal-sized partitions or “Zones”. Low (i.e., negative) conservation scores reflect high conservation and are designated by high-numbered Zones. Zone 9 is the most-conserved zone. As the conservation scores become more positive, that is, less conserved, the Zone designation becomes smaller. Zone 1 is the least-conserved zone. We describe sites with positive conservation scores as “less-conserved” or “least-conserved” rather than, as originally described, “variable” (<http://consurf.tau.ac.il/>). Conservation scores do not indicate the absolute magnitude of evolution but rather the relative degree of conservation of each amino acid position or site. The conservation score for each averaged consensus site was associated with the corresponding distance measure to the C/AC and then collated as described in the next two sections.
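The relationship between standard scores and Zones can be illustrated with a short sketch. It is not the Consurf algorithm: the equal-width binning used here is an assumption made only for illustration, and the function names `standardize` and `assign_zones` are hypothetical. It does preserve the convention described above, in which the lowest scores map to Zone 9 and the highest to Zone 1.

```python
import statistics

def standardize(raw_scores):
    """Convert raw per-site scores to standard scores (mean 0, standard deviation 1)."""
    mean = statistics.fmean(raw_scores)
    sd = statistics.pstdev(raw_scores)
    return [(s - mean) / sd for s in raw_scores]

def assign_zones(z_scores, n_zones=9):
    """Bin scores into n_zones equal-width partitions (an illustrative assumption).

    The lowest (most negative, most-conserved) scores receive Zone 9,
    the highest (least-conserved) scores receive Zone 1.
    """
    lo, hi = min(z_scores), max(z_scores)
    width = (hi - lo) / n_zones
    zones = []
    for z in z_scores:
        k = min(int((z - lo) / width), n_zones - 1)  # 0 = lowest-score bin
        zones.append(n_zones - k)
    return zones
```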

Distance Measures: Assays for Distance from the Catalytic/Active Center to the Cα of All Amino Acids in its MSA Consensus-Sequence

As noted before, in each PDB-scaffold sequence we chose an amino acid reported to be associated with the C/AC (the “anchor-amino acid”). We selected an atom of the anchor-amino acid also reported to be associated with enzymatic function, e.g., binding or catalytic function. We define this atom as representing the C/AC and call it the “anchor-atom”. Using the Yasara program, we determined the distance from each anchor-atom to every Cα in its PDB chain (<http://www.yasara.org>) (Krieger et al. 2002). The anchor-atoms of this study are listed in Supplementary File 1 (SF-1) (<http://www.stat.osu.edu/~dkp/ppp/>). There we identify three types of anchor-atoms, their amino acid host and their PDB position (ATOM or HETATM). The anchor-atoms were either 1) a non-datively bound metal cofactor (Mg2+, Mn2+, Zn2+), 2) an atom of an amino acid (E, K, H, G, R, N) or 3) an atom of a non-metal ligand known to be close to the C/AC and mechanistically involved in the catalysis (NAD+, NADP+, NAI, NBD). For NDK, the data for each of the five NDK scaffolds were collated and averaged as described in the next section.
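The distance measure itself is a Euclidean distance between the anchor-atom and each Cα. The sketch below shows that calculation on PDB-format coordinate records; it is an illustration only and does not reproduce the Yasara workflow. The function names are hypothetical, and the sketch assumes a single chain, no alternate locations, and standard fixed-column ATOM/HETATM records.

```python
import math

def parse_atoms(pdb_lines):
    """Yield (record, serial, atom_name, res_seq, (x, y, z)) from ATOM/HETATM lines.

    Uses the fixed column layout of PDB coordinate records.
    """
    for line in pdb_lines:
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        serial = int(line[6:11])
        name = line[12:16].strip()
        res_seq = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        yield record, serial, name, res_seq, xyz

def anchor_to_ca_distances(pdb_lines, anchor_serial):
    """Distance in Angstroms from the anchor-atom (by serial number) to every C-alpha.

    Only ATOM records named CA are used, so a calcium HETATM is not mistaken
    for a C-alpha. Returns {residue number: distance}.
    """
    atoms = list(parse_atoms(pdb_lines))
    anchor = next(xyz for _, serial, _, _, xyz in atoms if serial == anchor_serial)
    return {
        res_seq: math.dist(anchor, xyz)
        for record, _, name, res_seq, xyz in atoms
        if record == "ATOM" and name == "CA"
    }
```

The anchor serial corresponds to the bracketed atom numbers given for the scaffolds, e.g. `anchor_to_ca_distances(lines, 2643)` for a file whose anchor-atom has serial 2643.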

Collation and Averaging of All Conservation and Distance Data: Addition of Hydropathy Values

Collation-averaging occurred in two steps. The first, preparatory step attaches the conservation and distance data of each scaffold protein structure of the enzyme to the sites of the same founding MSA consensus (e.g., each of the five NDK scaffold data sets is attached to the founding NDK MSA consensus). This is possible because the output data in the conservation and distance sets are relatable to the same copy of the consensus sequence included in their construction. When complete, each site in each of the separate scaffold-specific files carried: 1) the consensus amino acid identity at that site, 2) the site’s position in the consensus sequence common to all, 3) its conservation score and conservation zone and 4) its distance measure to its C/AC. The second step in the collation-averaging process combined the conservation and distance data of the different scaffolds into one file. This requires relating them to the original consensus MSA and, at each site, averaging across those scaffolds that have no gap there. Conservation scores or distance measures found at the same MSA consensus site were averaged. Hydropathy values were assigned to each compiled MSA consensus site (Kyte and Doolittle 1982).
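The second collation step can be sketched as below. This is a minimal illustration under stated assumptions, not the procedure actually used: the function names and the record layout are hypothetical, each scaffold is assumed to supply one value per consensus position with `None` where the scaffold has a gap, and sites with no scaffold coverage are left as `None`. The hydropathy values are those of the Kyte-Doolittle scale.

```python
from statistics import fmean

# Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982)
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}

def average_over_scaffolds(per_scaffold_values):
    """Average one measure site by site, skipping scaffolds gapped (None) at a site."""
    averaged = []
    for site_values in zip(*per_scaffold_values):
        observed = [v for v in site_values if v is not None]
        averaged.append(fmean(observed) if observed else None)
    return averaged

def collate(consensus_seq, scores_by_scaffold, distances_by_scaffold):
    """Build one record per consensus site with averaged score, distance and hydropathy."""
    scores = average_over_scaffolds(scores_by_scaffold)
    distances = average_over_scaffolds(distances_by_scaffold)
    return [
        {
            "position": i + 1,
            "amino_acid": aa,
            "conservation_score": scores[i],
            "distance": distances[i],
            "hydropathy": KYTE_DOOLITTLE.get(aa),
        }
        for i, aa in enumerate(consensus_seq)
    ]
```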

Binary Logistic Regression Analyses

To establish with some statistical confidence the occupational trend or “movement” of specific amino acids between Zone 9 and Zone 1, we used a logistic regression model (Hosmer and Lemeshow 2000). We made two estimates for the 4158 primary sites used in Categories 1 and 2. The parameters are: 1) the odds of each amino acid appearing at a specific site in different conservation zones and 2) the odds of each amino acid appearing at different linear distances from its anchor-atom.

These logistic regression models were carried out separately for each of 20 amino acids for each of the 12 enzymes. Models with the three pooled conservation zone groups (Zones 1, 2, 3 vs 4, 5, 6 vs 7, 8, 9) as the explanatory variable allowed for a variable overall amino acid usage across enzymes. They assumed an equal increase (decrease) in the log odds of occupancy in moving from group 1, 2, 3 to group 4, 5, 6 as in moving from group 4, 5, 6 to group 7, 8, 9. Models with distance as the explanatory variable assumed a linear increase in the log odds of occupancy (in units of 10 Ångstroms). Again, analyses were performed both for each enzyme separately and for the combined data under the assumption of varying overall usage but a common slope.

Statistical significance requires that the value 1.00 be absent from the calculated 95% confidence interval (95% CI) of the odds ratio. The presence of 1.00 within the interval indicates non-significance. Significant odds ratios (*) greater than 1.00 indicate a higher likelihood of occupancy in peripheral/low-conservation areas, while those less than 1.00 indicate a lower likelihood. In Results we show only the overall (averaged) odds values for each amino acid for both parameters across the 12 enzymes. The individual, non-overall odds data for each of the 20 amino acids in each of the 12 enzymes are available in tabular form from the authors.
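The following sketch illustrates how an odds ratio and its 95% CI arise from a binary logistic regression with a single explanatory variable. It is a simplified illustration, not the model fitted in this study: it omits the enzyme-specific terms that allow overall usage to vary across enzymes, it uses a Wald interval, and the function names are hypothetical. The response `y` is 1 when a site is occupied by the amino acid of interest and 0 otherwise; `x` could be the pooled zone group coded 0, 1, 2, or the distance divided by 10 so that the odds ratio is per 10 Å.

```python
import math

def _score_and_information(b0, b1, x, y):
    """Gradient and Fisher information of the log-likelihood for logit(p) = b0 + b1*x."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11

def odds_ratio_with_ci(x, y, iterations=25, z=1.96):
    """Fit by Newton-Raphson; return (odds ratio, CI low, CI high) for a unit change in x."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = _score_and_information(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    _, _, h00, h01, h11 = _score_and_information(b0, b1, x, y)
    se = math.sqrt(h00 / (h00 * h11 - h01 * h01))  # Wald standard error of b1
    return math.exp(b1), math.exp(b1 - z * se), math.exp(b1 + z * se)

def is_significant(ci_low, ci_high):
    """Significant only if 1.00 lies outside the confidence interval."""
    return not (ci_low <= 1.0 <= ci_high)
```

For example, with 10 of 100 sites occupied at `x = 0` and 20 of 100 at `x = 1`, the fitted odds ratio is 2.25, but the interval still includes 1.00, so the trend would not be reported as significant.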

Additional Analyses

As a further confirmatory assessment of our methodology, we made two alternative scorings of conservation at each MSA site using traditional measures of distributional diversity: the Shannon-entropy index (Shannon 1948) and the Simpson-diversity index (Simpson 1949). The Shannon-entropy index for a site is given by \( -\sum_i P_i \ln(P_i) \), where \( P_i \) denotes the proportion of all amino acids that are of type \( i \) at that site. The Simpson-diversity index is given by \( \sum_i P_i^2 \). By either analysis we came to the same qualitative conclusions in terms of the association of conservation with either amino acid occupancy or distance to the C/AC (data not shown).
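Both indices can be computed directly from the residues in one MSA column, as in the sketch below. The function names are hypothetical, and excluding gap characters from the proportions is an assumption of this illustration rather than a documented detail of the analysis.

```python
import math
from collections import Counter

def _proportions(column, gap="-"):
    """Proportion of each amino acid type among the non-gap residues of a column."""
    residues = [r for r in column if r != gap]
    counts = Counter(residues)
    total = len(residues)
    return [n / total for n in counts.values()]

def shannon_entropy(column):
    """-sum_i P_i ln(P_i); 0 for a fully conserved column."""
    return -sum(p * math.log(p) for p in _proportions(column))

def simpson_index(column):
    """sum_i P_i^2; 1 for a fully conserved column, smaller for more diverse columns."""
    return sum(p * p for p in _proportions(column))
```

Note that the two indices run in opposite directions: a fully conserved site has entropy 0 and a Simpson value of 1.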

Results

Clustering of Glycine, Lysine and Glutamic Acid Associated with Location and Conservation in an MSA Model of 3-Phosphoglycerate Kinase

In an initial analysis of a set of 3-phosphoglycerate kinase (PGK) protein sequences (Fig. 1), the greatest concentrations of specific, most-conserved amino acids were associated with the C/ACs and ligands, while concentrations of other, least-conserved amino acids were apparently associated with a peripheral area (Zone) most distant from the C/ACs and ligands. In the most-conserved Zone 9 (Fig. 1b), which is associated with the catalytic/active center (C/AC), six amino acids constituted 68% of all the identified amino acids. Glycine was the “predominant amino acid” (PDAA) at 26.2%, followed by arginine and aspartic acid at 9.2% each and threonine, serine and glutamic acid at 7.7% each (Figs. 1c, d). Glycine has been localized by others at protein surfaces (Naor et al. 1996; Fukuchi and Nishikawa 2001), at both the inside and the surface of proteins, or as relatively evenly distributed between the two (Miller et al. 1987). Our PGK glycine distributions differed, however: in our analyses glycine was concentrated at the core. We also found that the six most hydrophilic amino acids (R, K, D, E, N, Q; hydropathy range of −4.5 to −3.5) (Kyte and Doolittle 1982) constituted 36.9% of Zone 9, while the most hydrophobic amino acids (M, C, F, L, V, I), within the hydropathy range of 1.9 to 4.5, comprised 9.2%.

In Fig. 1a, the area of least-conserved sites (Zone 1) clearly formed the periphery of the protein and was predominantly occupied (56.9%) by three amino acids: lysine (K) 25.9%, glutamic acid (E) 22.4% and aspartic acid (D) 8.6%. Lysine’s particular surface association has been reported (Rose et al. 1985; Varfolomeev and Gurevich 2001; Brooks and Fresco 2003). In our case, K (and E, and D) was also the “predominant amino acid” (PDAA) found in peripheral-surface areas of the PGK phylogenetic model. Additionally, K was principally found at least-conserved sites.

Figure 2a represents the analyses of alignments containing Category 1 sequences separated into Domains, each containing all three enzyme classes. Figure 2b similarly shows data separated into kinases, lyases or dehydrogenases, each containing all Domains. Figure 2c, without individual data points, shows the superimposed analytical data of all six analyses. Each figure shows non-parametric confidence kernels that enclose 68.27% of all unaveraged, uncollated consensus amino acids in each alignment from each set. Marginal density curves of the total distributions are shown. The data indicate that the sites of Archaea, Bacteria, Eukaryota, kinases, dehydrogenases and lyases are similarly conserved and similarly distanced from their respective C/ACs, and that there is a similar degree of association between these two variables. The regression lines (Lowess, tension 0.5) are also very similar. There are some differing levels of variability in the distance distribution by enzyme class that are attributable to the larger size of some of the proteins. We interpret these findings as indicating some overall statistical similarity, as we also found that the distributions of most sites identified by their conservation scores and proximity to the C/AC are similar in each Domain and in each enzyme class. The Category 1 sequence set was used only for Fig. 2.

Fig. 2

Distributions of consensus amino acids of Archaea, Bacteria, Eukaryota or kinases, dehydrogenases and lyases. The statistically mutual relationships of conservation scores and distances to C/ACs are shown as non-parametric confidence kernels (68.27% of each consensus amino acid data set) and their regression lines (Lowess, tension 0.5) with individual marginal distribution curves

Figure 3 compares the distribution of all 4158 consensus amino acids derived from the Category 2 analyses. In these analyses, data from Archaea, Bacteria and Eukaryota and from kinases, dehydrogenases and lyases are separated by their presence in the most- or least-conserved Zones 9 or 1, respectively. The Figure inserts include the average data for each of the four groups: the averaged conservation score (−3 = most-conserved to +3 = least-conserved) and the average distance in Å (Y_d) ± std from the Cα of each consensus amino acid to its C/AC. The data in Fig. 3 indicate that, regardless of enzyme class or Domain, lysine, glutamic acid and alanine are predominant in least-conserved Zone 1 most distant from the C/AC, while glycine, aspartic acid and alanine are predominant in most-conserved Zone 9 proximal to the C/AC. The intermediate conservation zones are not shown. There is also a pronounced elevated content of arginine in Archaea in the least-conserved Zone 1 (Fig. 3, B1, red bar).

Fig. 3

Distributions of the “most-” or “least-conserved” amino acids by enzyme class or Domain of the 4158 consensus amino acids derived from the Category 2 sequence set. The panel distinctions are: A1 and A2 by enzyme class and B1 and B2 by Domain. A1 and B1 represent amino acids occupying least-conserved sites (Zone 1), while A2 and B2 represent amino acids at most-conserved sites (Zone 9). Whether classified by enzyme class or Domain, glutamic acid, lysine and alanine are dominant components of the least-conserved amino acids (Zone 1), and glycine, alanine and aspartic acid are dominant components of the most-conserved amino acids (Zone 9). Each panel includes average distances (Y_d) to their C/AC and average conservation scores. (source Category 2 sequence set)

Clustering of Specific Amino Acids at Catalytic/Active Centers and Peripheries of Twelve Enzyme Models: Relationships to Conservation Zones

Tables 1A, 1B, Figs. 4 and 5 show the results of further analyses of the 4158 averaged consensus sites of each of the 12 enzymes of Category 2. Table 1A shows amino acid averaged conservation scores (Zones) and Table 1B, the averaged distances to their C/ACs.

Fig. 4

Distributions of glycine, serine, threonine, lysine, glutamic acid, alanine of the 4158 consensus amino acids derived from the Category 2 sequence set. The 3-D-histograms compare the % occupancy (z-axis) vs conservation scores (x-axis “CONSCORE”) of the twelve most frequently identified amino acids in this study vs the distance from their Cα to the anchor-atom of their C/AC anchor-amino acid (y-axis). Concentrations of G, S, and T were highest in most-conserved Zones and closest to the C/AC. Concentrations of K, E, A were highest in least-conserved Zones and furthest from the C/AC

Fig. 5

Distributions of glycine, serine, lysine, and glutamic acids of the kinases, dehydrogenases and lyases of the 4158 consensus amino acids derived from the Category 2 sequence set. As shown in Fig. 4, concentrations of G and S were highest in most-conserved Zones and closest to the C/AC. Concentrations of K and E were highest in least-conserved Zones and furthest from the C/AC. Tagged (*) d, e, f, l are not considered to be significant because of insufficient counts

In Table 1A, we computed the % occupancy and the distances to the C/ACs in the nine conservation zones rather than by conservation scores. The ordered overall frequency is: A, G/L, V, E, K, I/D, T, P, S, R, N, F, Y, M, H, Q, C, W. However, only the twelve most frequently counted consensus or predominant amino acids (PDAAs) of our study (A through R) are emphasized and highlighted in the two Tables. In Table 1A, the highest (top 3–4) concentrations of the 12 major amino acids in each conservation zone are: Zone 1 (the least-conserved sites), K, E, A; Zone 2, E, K, A; Zone 3, E, L, K; Zone 4, L, A, E; Zone 5, L, A, V; Zone 6, L, V, A/I; Zone 7, A, V, I, L; Zone 8, G, A, V, I; and Zone 9 (the most-conserved sites), G, D, A.
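The zonal percentage occupancy tabulated above is a per-zone frequency calculation. A minimal sketch follows, assuming each consensus site is available as an (amino acid, zone) pair; the function name `zonal_occupancy` and this input layout are hypothetical.

```python
from collections import Counter, defaultdict

def zonal_occupancy(sites):
    """Percentage of sites in each zone occupied by each amino acid.

    sites: iterable of (amino_acid, zone) pairs.
    Returns {zone: {amino_acid: percent of that zone's sites}}.
    """
    by_zone = defaultdict(Counter)
    for amino_acid, zone in sites:
        by_zone[zone][amino_acid] += 1
    return {
        zone: {aa: 100.0 * n / sum(counts.values()) for aa, n in counts.items()}
        for zone, counts in by_zone.items()
    }
```

Sorting each zone's dictionary by value in descending order gives the ranked lists of predominant amino acids per zone.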

Table 1B shows that the decrease in average zonal distance of the amino acids towards their C/AC is a smooth transition averaging about 1.25 Å per zone. The average distance (Y_d) from the Cα atoms of the amino acids at the most-conserved sites (Zone 9) to the anchor-atoms of their C/ACs is (avg. ± std., n): 14.45 Å ± 0.242, 551, while at the least-conserved sites (Zone 1) it is 25.77 Å ± 0.282, 624. The Tables indicate that the distance of Cα sites to the C/AC decreases consistently towards the more conserved zones, regardless of the changing distributions of the averaged consensus or predominant amino acid (PDAA) occupants.

Figure 4 shows the analyses of the 4158 Category 2 consensus amino acids also studied in Tables 1A and 1B. The xyz-axes of the 3D-histograms are: occupancy (% of total) vs average Ångstroms to the respective C/AC vs average site conservation score. The histogram bars are also identified by their conservation zone. Zone 9 are the most-conserved sites and Zone 1, least-conserved sites. Panels A–C, show that concentrations of glycine, serine and threonine are in most-conserved sites and are concentrated at positions nearest the C/AC. Panels D–F, show that lysine, glutamic acid and perhaps alanine are concentrated at sites least-conserved and furthest from the C/AC. Panels G–I, show isoleucine, valine and leucine concentrated in interior regions, while in Panels J–L, proline, aspartic acid and arginine are concentrated at both conservation extremes.

Figure 5 separates the 4158 Category 2 sequence data used for Tables 1A, 1B and Fig. 4, but here the 12 averaged enzyme sets were first separated into kinases, dehydrogenases or lyases and analyzed as for Fig. 4 (MSAs in Supplementary File 1 at <http://www.stat.osu.edu/~dkp/ppp/>). Figure 5 shows the distributions of glycine, serine, lysine and glutamic acid in the separated kinases, lyases and dehydrogenases. Glycine and serine regardless of the enzyme class are concentrated at most-conserved sites nearest the C/AC, Panels A–F. Lysine and glutamic acid regardless of the enzyme class are concentrated at least-conserved sites furthest from the C/AC, Panels G–L.

Analyses Using Sets Restricted to Single Domain Single Enzyme Class Sequences: “Two-Way Analyses”

Analyses of sets containing Category 3 sequences belonging to one Domain and one enzyme class are found in Table 2A (Bacteria) and Table 2B (Eukaryota). Each Table has two parts: Part A, “Conservation Zones-Scores and Average Distances (Ångstroms) to C/AC”, and Part B, the “% Occupancies” of 20 amino acids in conservation Zones 9 and 1. Archaea were not studied because of insufficient sequences. The data show that the three prominent (top) amino acids (G, V, D) in most-conserved sites (Zone 9) in the three enzyme classes of Bacteria are all within 15.78 ± 2.15 Å of the C/AC. Their average % occupancies are: G (16.98 ± 0.18), V (11.93 ± 3.3), D (9.49 ± 1.17). Similarly, in least-conserved sites (Zone 1) the prominent amino acids (E, K, A) in the three enzyme classes of Bacteria are all within 26.84 ± 3.73 Å of the C/AC. Their average % occupancies are: E (20.17 ± 5.38), K (18.84 ± 3.31), A (14.75 ± 7.00). The prominent amino acids (G, A) in the most-conserved sites (Zone 9) of the three Eukaryota enzyme classes are all within 17.13 ± 0.81 Å of the C/AC. Their average % occupancies are: G (15.99 ± 2.37), A (9.81 ± 0.71). In all the least-conserved Eukaryota sites (Zone 1) the prominent amino acids (K, E, A, D) are within 26.00 ± 0.42 Å of the C/AC. Their average % occupancies are: K (18.3 ± 3.65), E (11.57 ± 1.92), and both A and D (8.91 ± 0.81). The data indicate that the prominent amino acids in each category remain similar when each data set is restricted to sequences of both the same Domain and the same enzyme class. Further, these observations are in agreement with the previous analyses.

To this point, we believed the variously analyzed data were consistent and indicated that the structural distribution of the averaged consensus or predominant amino acid (PDAA) sites and the degree of their evolutionary conservation are not random: there appeared to be groups or sets of amino acids significantly concentrated at both most-conserved and least-conserved sites and at specific distances from the C/ACs, regardless of their Domain or identity as one of the three enzyme classes (Table 1A, Figs. 3, 4, 5).

The data in Tables 2A and 2B support the consistency of our findings: regardless of whether the sequence set contained only kinases, dehydrogenases or lyases from either Bacteria or Eukaryota (Category 3), G, V and D appeared to be moving closer to the C/AC in most-conserved sites and K, E and A appeared to be moving further from the C/AC in least-conserved sites. As an additional statistical corroboration of “progression” or “movement” we examined amino acid occupancy levels and the distances from their C/ACs using binary logistic regression analyses of the entire set of 4158 average consensus data obtained from the study of Category 2 sequences.

Binary Logistic Regression: Changing Odds of Amino Acid Occupancy with Distance from Catalytic/Active Centers and Movement Between Low to Middle and Middle to High Conservation Zones

The binary logistic regression analyses in Table 3 show by two parameters statistically significant overall estimates of the rate (fold increase) and the 95% confidence intervals of changing odds of amino acid occupancy using the 4158 Category 1 and 2 sequence derived examples. The “overall” logistic regression data in Table 3 are derived from pooling all amino acid sites across the twelve enzymes and allow for possible enzyme-to-enzyme differences in occupancy odds but assume the same relationship to distance or conservation across enzymes. The tabular data for 20 amino acids in each of the 12 enzymes are available from the authors.

Table 3 Logistic regression: odds ratio analyses of amino acids either “moving” to higher or lower conservation zones or from their C/AC

Table 3 shows by one parameter the likelihood of amino acid occupancy changing as sites move between conservation zone groups: from Zones 1, 2, 3 (low-conservation) to Zones 4, 5, 6, or from Zones 4, 5, 6 to Zones 7, 8, 9 (high-conservation). The other parameter, irrespective of conservation, indicates the likelihood of occupancy changing with every 10 Å moved away from the C/AC. The data are arranged by the decreasing polarity of the amino acids according to the Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982).

Lysine, glutamic acid and leucine show statistically significant likelihoods (high odds) of occupancy per zone moving toward less-conserved sites. By the same parameter glycine, isoleucine, serine, threonine, asparagine and valine demonstrated significantly low odds, that is, an increasing likelihood of occupancy toward more-conserved and C/AC associated sites.

Examination of the likelihood of amino acids moving 10 Å away from the C/AC shows statistically significant likelihoods (high odds) of lysine, glutamic acid and alanine occupying sites moving toward the periphery. However, for glycine, valine, threonine and isoleucine the likelihoods are statistically low, indicating again for these amino acids an increasing likelihood of occupancy moving toward more-conserved C/AC associated sites.
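The "odds per 10 Å" parameter can be illustrated with a minimal sketch. It assumes a single-predictor logistic model fitted by Newton-Raphson, with distance expressed in units of 10 Å so that the exponentiated slope is the odds ratio per 10 Å. The counts are hypothetical and the function name `fit_logistic` is illustrative; the enzyme-specific terms of the pooled model summarized in Table 3 are omitted.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Fit logit(p) = b0 + b1*x by Newton-Raphson.

    Returns (b0, b1, se_b1). The standard error comes from the
    information matrix evaluated at the last iterate before the
    final (negligible) update.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. b0
            g1 += (yi - p) * xi     # gradient w.r.t. b1
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g for the 2x2 system
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1

# Hypothetical data (not from the study): 20 sites at each distance,
# with the number occupied by a given amino acid rising with distance.
occupied_per_bin = {5.0: 2, 15.0: 4, 25.0: 8, 35.0: 12}  # distance in Å
x, y = [], []
for dist_angstrom, k in occupied_per_bin.items():
    for i in range(20):
        x.append(dist_angstrom / 10.0)   # units of 10 Å
        y.append(1 if i < k else 0)

b0, b1, se_b1 = fit_logistic(x, y)
odds_ratio = math.exp(b1)                 # fold change in odds per 10 Å
ci_low = math.exp(b1 - 1.96 * se_b1)      # approximate 95% CI
ci_high = math.exp(b1 + 1.96 * se_b1)
print(f"OR per 10 Å = {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

An odds ratio above 1 in this sketch corresponds to an amino acid that becomes more likely with distance from the C/AC, as reported for lysine, glutamic acid and alanine; a ratio below 1 corresponds to the behaviour reported for glycine, valine, threonine and isoleucine.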

We conclude from these logistic regression analyses of data from three different enzyme groups and all Domains that glycine, probably threonine, and hydrophobic isoleucine and valine are preferentially associated with sites of highest conservation and proximity to their C/AC. Hydrophilic lysine and glutamic acid are preferentially associated with sites of lowest conservation and greatest distance from their C/AC. There may be other tendencies of movement and occupancy away from or towards the C/AC. However, for these concerns and for the study of infrequently recorded amino acids we do not have sufficient data to form any firm statistical conclusions. We estimate that doing so would require the study of considerably more (∼12–16 K) enzyme sequences and computations.

Hydropathy

Figure 6 shows the levels of occupancy as they relate to site conservation, enzyme class and hydropathy. The three panels (A, B, C) describe the contents of different conservation Zones: Zone 1, 5, 9. The stacked bars show the total % occupancy of amino acids in each enzyme class vs their hydropathy index. Each bar shows the standard deviation of all occupancy values of its three rectangular components. The small size of these standard deviations illustrates the consistent nature of these findings across enzyme classes. The sum of the heights of the same color in each panel (A, B or C) is 100%. Panel A shows conservation Zone 1, sites least-conserved and most distant from the C/AC. Its Y d (average distance in Ångstroms of all Zone 1 sites to their C/ACs) is ∼26 Å, and its prominent amino acids are lysine (15%), glutamic acid (12%) and alanine (13%). Panel B shows conservation Zone 5, interior “mid”-conserved, distance from C/AC, Y d = ∼21 Å: leucine (14%), alanine (13%) and valine (12%). Panel C shows conservation Zone 9, most-conserved, closest to the C/AC, Y d = ∼15 Å: glycine (18%), aspartic acid (10%), alanine (9%). The % occupancy of the most polar amino acids (RKDE) falls from 40% in Zone 1 to 23% in Zone 5 and is 27% in Zone 9, while the % occupancy of the “non”-polar amino acids (FLVI) is 19% in Zone 1, 34% in Zone 5 and 14% in Zone 9. The three panels illustrate the changing hydropathic content in the three zones moving towards the C/AC, from Zone 1 to Zone 9. Noteworthy are the predominance of polar amino acids in Zone 1, the predominance of non-polar amino acids in Zone 5 and the predominance of relatively “neutral” glycine in Zone 9. We did not find any significant distinctions attributable to Domain or the enzyme classes.
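The grouped occupancies quoted above for the most polar (RKDE) and “non”-polar (FLVI) amino acids can be illustrated with a short sketch. The Kyte-Doolittle values are the published index; the residue list `zone1_sites` and the function names are hypothetical and serve only to show the arithmetic, not to reproduce the study's data.

```python
from collections import Counter

# Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982)
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9,
    "R": -4.5,
}

def percent_occupancy(residues):
    """Percentage of sites occupied by each amino acid in one zone."""
    counts = Counter(residues)
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}

def group_occupancy(residues, group):
    """Summed % occupancy of a group of amino acids, e.g. 'RKDE'."""
    occ = percent_occupancy(residues)
    return sum(occ.get(aa, 0.0) for aa in group)

# Hypothetical consensus residues for one conservation zone
zone1_sites = list("KKKEEAAGLD")

polar = group_occupancy(zone1_sites, "RKDE")      # most polar group
nonpolar = group_occupancy(zone1_sites, "FLVI")   # most hydrophobic group

# Order amino acids from most polar to most hydrophobic, as in Table 3
ordered = sorted(percent_occupancy(zone1_sites), key=KYTE_DOOLITTLE.get)
print(polar, nonpolar, ordered)
```

Applying the same two functions to the consensus sites of each zone would give the per-zone group totals compared across the three panels.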

Fig. 6

Relationship of hydropathicity vs occupancy vs conservation vs enzyme class of the 4158 consensus amino acids derived from the Category 2 sequence set. a Zone 1, least-conserved, furthest from the C/AC, b Zone 5, intermediate-middle conservation and c Zone 9, most-conserved, closest to the C/AC. Overall occupancy levels are from Table 1A. Each of the three segments in each stacked bar represents the amino acid percentage in its conservation Zone and enzyme class. The value at the top of each stacked bar is the ± standard deviation of its three segment values. The panels also note the average distance (Y d ) to the C/AC of all its members and the concentrations of the most hydrophilic amino acids (RKDE), highest in Zone 1, and of the most hydrophobic amino acids (FLVI), highest in Zone 5 (see Results).

Discussion

Sampling issues and phylogenetic effects are a potential source of bias in any study that, like ours, relies on hundreds of distantly related taxa. However, given the consistency observed across the three Domains and the different enzyme classes, we believe that our findings on the distribution of the twenty amino acids are not likely to be strongly affected. Our data suggesting Domain consistency (Figs. 2a, 2c and 3B1, 3B2) appear contradictory to other reports (Pe’er et al. 2004; Bogatyreva et al. 2006). We attribute the differences to the specificity of the sequence sets we employed. We only studied aligned sets of protein sequences of twelve highly conserved, operational, mostly constitutive enzymes of central metabolism that were separated by their enzyme class and/or Domain. MSA site conservation scores were related to the distance to their functional catalytic centers and to hydropathy indexes to further identify and characterize patterns of consistency. We believe that our general 12-enzyme findings would be obscured with less sequence discrimination.

Varfolomeev and co-workers (Varfolomeev et al. 2001, 2002) used the Shannon entropy values of MSA sites to determine the relative degree of conservation in hydrolases, an enzyme class we did not study. Using the same procedure to measure conservation, we reached qualitatively identical conclusions (see Methods). These authors (op. cit.) characterized aspartic acid as highly conserved. We found that, depending on the data set, concentrations of aspartic acid may be localized in both the least-conserved Zone 1 and the most-conserved Zone 9 (Table 1A and Fig. 4).
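As an illustration of the Shannon entropy measure of site conservation mentioned above, the following is a minimal sketch. It assumes base-2 logarithms and that gap characters are ignored; the alignment is hypothetical, and the exact scoring conventions used in the cited work and in our Methods may differ.

```python
import math
from collections import Counter

def column_entropy(column, gap_chars="-"):
    """Shannon entropy (bits) of one alignment column, ignoring gaps.

    0.0 means the site is fully conserved; larger values mean the
    site is more variable.
    """
    residues = [c for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return sum((k / n) * math.log2(n / k) for k in counts.values())

# Hypothetical four-sequence alignment (rows are sequences)
alignment = ["GKAV", "GEAV", "GKSV", "GRAL"]

# zip(*alignment) iterates over alignment columns (sites)
entropies = [column_entropy(col) for col in zip(*alignment)]
print(entropies)
```

In this toy alignment the first column (all glycine) has zero entropy and would fall in the most-conserved class, while the second column (K, E, K, R) has the highest entropy.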

Two studies have characterized various amino acids as either “inside” or “buried” or “accessible” or “surface accessible” (Janin 1979; Miller et al. 1987). Their findings are similar to ours. Glycines were described generally as found “inside” or “buried” or “core” and lysines and glutamic acids were described as “accessible” or “surface accessible” or in “peripheral areas”. The interatomic distances of the amino acids, conservation scores, enzyme class or taxonomic relationships were not reported. Unlike these authors, we did not use any measures of surface, inside volumes or solvent-accessible surface area values in our determinations.

Other studies have calculated the distance (Ångstroms) of atoms to the nearest surface water or solvent-accessible neighbor protein (Chakravarty and Varadarajan 1999; Pintar et al. 2003). Estimations of conservation in families of protein folds (e.g., Rossmann fold, immunoglobulin fold, TIM barrel) were reported by the same authors. The “atom-depth” techniques measure the mean residue depth from the “inside-out”. Our procedures are complementary: they measure the distances from a single locus, the “anchor-atom” of the C/AC, to the Cα of an amino acid, i.e., “inside-in”. In one of these reports, using their “atom-depth” algorithm, the authors studied 136 non-homologous single sequence PDB protein crystal structures (Pintar et al. 2003). The sequence set was selected from an apparently mixed collection of 301 highly curated but mostly unidentified representatives. They reported the atom-depth distances of all 20 amino acids in increasing order: K < E < D < Q < R < N < P < S < G < T < H < A < Y < C < M < W < L < F < V < I. Lysine (K), with the lowest value in Ångstroms, was described as nearest the surface. Isoleucine (I), with the highest value in Ångstroms, was furthest from the surface. The authors (op. cit.) concluded that certain amino acids occur at greater depths than others and, as measured by sequence entropy, that the deepest residues are most conserved. Our study is in general agreement with these findings using entirely different methods and parameters. Another method was reported to more fully express the 3-D character of the atom-depth measure by also taking into account the overall size and shape of the protein (Varrazzo et al. 2005).
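The “inside-in” measurement described above amounts to a Euclidean distance from one anchor atom to each Cα, followed by averaging within a conservation Zone. The sketch below shows that calculation under stated assumptions: the coordinates, zone assignments and function name are hypothetical, and no structure file parsing is attempted.

```python
import math
from collections import defaultdict

def mean_distance_by_zone(anchor, sites):
    """Average anchor-to-C-alpha distance (Å) for each conservation zone.

    anchor: (x, y, z) of the C/AC anchor atom
    sites:  iterable of (amino_acid, zone, (x, y, z) of C-alpha)
    """
    per_zone = defaultdict(list)
    for _aa, zone, ca_xyz in sites:
        per_zone[zone].append(math.dist(anchor, ca_xyz))  # Python 3.8+
    return {zone: sum(d) / len(d) for zone, d in per_zone.items()}

# Hypothetical coordinates in Å, with the anchor atom at the origin
anchor_atom = (0.0, 0.0, 0.0)
sites = [
    ("G", 9, (3.0, 4.0, 0.0)),    # 5 Å from the anchor
    ("V", 9, (9.0, 12.0, 0.0)),   # 15 Å
    ("K", 1, (0.0, 0.0, 26.0)),   # 26 Å
    ("E", 1, (0.0, 24.0, 0.0)),   # 24 Å
]

zone_means = mean_distance_by_zone(anchor_atom, sites)
print(zone_means)
```

With real structures the anchor atom and Cα coordinates would come from the PDB entry of each enzyme; here they are invented so that the per-zone averages are easy to check by hand.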

The determination of the interatomic distance between a functional enzyme center and the Cα of mutant sites, a procedure very similar to ours, was reported in an extensive study designed to select mutations and improve enzyme properties (Morley and Kazlauskas 2005). Others have also described catalytic residues and their local environment as highly conserved (Zvelebil et al. 1987; Dean and Golding 2000; Bartlett et al. 2002). The concept of an “environment” that we analogize with our ∼30 Å diameter most-conserved Zone 9 was suggested by reports indicating that more effective functional changes (enantioselectivity, substrate specificity, new catalytic activity) are associated with amino acid substitution close to catalytic sites. However, there is evidence that amino acid replacements outside of reported active sites can also affect not only specificity but catalytic efficiency (Dean and Golding 2000; Lichtarge and Sowa 2002; Zhang et al. 2004; Morley and Kazlauskas 2005). We emphasize that most-conserved sites are not found only at or near the C/AC locus. We consistently found some most-conserved consensus sites at the sites most distant from the C/AC, occupied by, for example, glycine. We also found most-conserved sites closest to the C/AC that were occupied by lysine. These few most-conserved, variously localized examples may belong to a reported dynamically coupled network effecting kinetic events that promote catalysis (Hammes-Schiffer 2002; Benkovic et al. 2008).

We emphasize only glycine because, as a major observation of our study, in almost every instance we find it to be the most abundant residue at the most-conserved sites that surround the C/AC in different classes of essential enzymes and distant taxonomies. Glycine may be the earliest amino acid (Eigen and Schuster 1978; Trifonov 2000, 2004). Conserved glycines were reported to be located near the catalytically active residues of five hydrolases (Varfolomeev et al. 2002). Fermentation of glycine was described as the most ancient catabolic pathway (Clarke and Elsden 1980), although this opinion was questioned (Conchillos and Lecointre 2005). Glycine is described as “indispensable” in any prebiotic scenario (Suwannachot and Rode 1999); however, glycine’s catalytic role is uncommonly reported. An essential functional role for glycine is reported in enzymes that contain glycyl radicals like formate acetyltransferase I (pyruvate formate-lyase) (MACiE M0030, EC 2.3.1.54) (Holliday et al. 2007). Glycyl radical enzymes have been found in obligate and facultative anaerobic Archaea, Bacteria and Eukaryota where they serve as biocatalysts in anoxic environments (Selmer et al. 2005; Lehtiö et al. 2006). As a class they are thought to predate the appearance of molecular oxygen (Sawers and Watson 1998).

Glycine is characterized as multifunctional: it stabilizes transition state intermediates (Bartlett et al. 2002), is involved in folding (Sasai 1995), modulates peptide helicity (Li and Deber 1992), mediates helix–helix interactions in membrane proteins (Oppegård et al. 2008), and is involved in molecular packing requirements (Elbaz et al. 2008), the transport of proteins (Zhou and Kanner 2005) and the flexibility and role of loops (Kwasigroch et al. 1996). Glycine-rich sequences were associated with kinases (Bossemeyer 1994). A contrary effect of glycine was reported in a study designed to demonstrate nucleic acid synthesis under prebiotic conditions. Glycine and other amino acids in the presence of purines and pyrimidines were found to catalyze in high yield the dehydration of the α,β positions of 2-deoxyribose to form pyrimidino and purino pentoses (Nelsestuen 1979). The reactions described were considered to rapidly deplete components of nucleic acids and present a major problem for prebiotic nucleic acid assembly.

A possible initiating role for diglycine relevant to the formation of a biosphere has been described (Plankensteiner et al. 2002). Those authors reported that diglycine is catalytic: in its presence there is a salt-induced synthesis of other peptides. However, it was pointed out that there are significant thermodynamic and kinetic limitations to the formation of diglycine in aqueous solution (Fitz et al. 2007). That report and others describe the readiness of amino acids, particularly glycine, to form prebiotic peptides in the presence of minerals, and discuss aqueous peptide synthesis by the salt-induced peptide formation (SIPF) reaction in the presence of NaCl and Cu(II) (e.g., Rode 1999; Bujdák and Rode 2002). Although glycine is not specifically indicated, the role of amino acids as prebiotic catalysts has been emphasized by others (e.g., Bar-Nun et al. 1994; Shimizu et al. 2008).

Site-specific substitution of glycine by alanine was shown to be functionally deleterious (Sun and Sampson 1998). Computer substitutions of “conserved glycines” by alanine resulted in a significant change in catalytic site geometry while substitution of “non-conserved glycines” had little effect (Varfolomeev et al. 2001). Glycine’s small volume is often noted as an impediment to substitution by larger amino acids (Oppegård et al. 2008). In a study of nine proteins, glycine has the smallest average amino acid buried residue volume: V R = 66.4 ± 4.7 Å³ (Richards 1977).

Glycine stabilizes different proteins in different ways (Ganter and Pluckthun 1990). When glycine is mutated to alanine or proline in bacteriophage T4 lysozyme, the protein stability increases through a decrease in the entropy of the unfolded state (Matthews et al. 1987). This is not the case in chicken glyceraldehyde-3-phosphate dehydrogenase (GAPDH). In this GAPDH, a glycine → alanine substitution does not stabilize the protein by affecting the entropy of the unfolded state, but rather by filling an internal cavity and thereby stabilizing the native state (Ganter and Pluckthun 1990). Additionally, mutation of aspartic acid → glycine increases kcat/Km for ATP 3800-fold in phosphofructokinases. These latter results are interpreted as an enhanced effect on the enzymatic activity of a nucleotide binding site associated with glycine insertion (Chi and Kemp 2000). There are very many diverse reports noting relevant properties and consequential roles for glycine residues, e.g., in the interaction of antigens and antibodies (Roitt and Delves 2001).

Glycine’s role at the C/AC in our study is uncertain; it may have multiple roles. However, of the possibilities, some of which are noted above, we favor the opinion that, with a small energy of rotation around its C-N and C-C bonds, glycine provides some advantageous conformational flexibility for active enzyme sites, a desirable enzymatic property that has already been emphasized by others (Tsou 1993; Mesecar et al. 1997; Varfolomeev et al. 2001). The glycine-rich C/AC region or environment, analogous to Zone 9 and roughly extending ∼10–15 Å from the C/AC, may have more conformational flexibility than other conservation Zones. In an interesting study of the “fluctuational amplitude” of amino acids in 19 proteins, achiral glycine was characterized as having the highest average “flexibility index” of 20 amino acids (Yan and Sun 1997). In that report the authors used their differential equation model, which considers the combined influences of the chemical, physical, conformational and energetic properties on the fluctuational displacements of each residue in a protein and particularly the effect of the residue’s spatial position. Our data may relate to such conformational flexibility and spatial relationships as well as to an association noted above with enzyme motions essential for the catalytic process.

In addition to our findings with G, we report that K, E and A were consistently concentrated at the least-conserved sites most distant from the C/AC. It is reasonable to consider that the concentration of polar species at the peripheral regions of these globular constitutive proteins is associated with their solubility. The locations of concentrations of these amino acids may also be indicators of enzymatic function (Damodharan and Pattabhi 2004), of interactions between molecules, of stabilization of tertiary structures and of thermostability (references are cited by Leunissen et al. 1990).

Some of the most predominant amino acids besides glycine in Zone 9 of our study were frequently proposed as members of an early evolutionary and minimal amino acid set (G, A, D and in most studies V and E as well) (e.g., Miller 1987; Trifonov 2000; Ikehara 2005; Jordan et al. 2005; Higgs and Pudritz 2007). Glycine, A, V, D and E were synthesized in Stanley L. Miller’s seminal sparking experiment and are components of the Murchison and other meteorites (Brack 2007). Other reports have indicated that the inferred proteins of the Last Universal Ancestor (LUA) had a greater abundance of amino acids attributed to a presumably prebiotic period, such as G, A, D and V, again predominant species in our study (Brooks and Fresco 2003; Brooks et al. 2004).

The study of Gulik et al. (2009) converges with ours. They wrote that early functional peptides were 3–8 amino acids long and were made of G, A, D, V and that traces of these prebiotic peptides still exist in the form of active sites in present-day proteins. Their criteria included a search of the entire PDB database specifically for traces of prebiotic peptides that contained a protein structure interacting with a metal ion and were built almost exclusively with amino acids they deemed most abundant prebiotically: G, A, D, V. Their statistical analyses were confirmatory and interpreted to indicate that G, A, D and possibly V were the “true abundant prebiotic aa’s”. They also found three classes of ion-binding motifs associated with either a DNA-directed or an RNA-dependent RNA polymerase [-D(F/Y)DGD-], three mutases [-DGD(G/A)D-] and a dihydroxyacetone kinase [-DAKVGDGD-]. The motifs were thought, with reservation, to correspond to the first functional peptides, and the submotif [-DGD-] was proposed as the common ancestor to all active peptides. Our methodology differs. We made, e.g., no experimental assumptions as to which amino acids were either primordial or associated with catalytic function. We studied all amino acids and found concentrations of G, A, D and often V at highly conserved sites measurably nearest the C/ACs in non-redundant sequence data sets of kinases, dehydrogenases and lyases of the Bacteria and Eukaryota that are associated with the “trunk” glycolytic pathway (Tables 1A, 1B, 2A and 2B). Lysine and E were concentrated at the least-conserved sites most distant from the C/AC.

Edward N. Trifonov concluded, based on a very comprehensive analysis of 60 “chronology” vectors as criteria, that the pair of complementary GGC and GCC codons for glycine and alanine appeared first (Trifonov 2000, 2004). This study led to the development of his consensus temporal order of the appearance of amino acids, the series indicating that glycine was the oldest amino acid. The reported evolutionary amino acid chronology was: G/A, D, V, P, S, E, L/T, R, I/Q/N, H, K, C, F, Y, M, W. Correlation between protein sequence age and conservation in bacterial octapeptides has been specifically reported (Sobolevsky and Trifonov 2005). These authors concluded that A, G, D, V, S, P, again prominent members of our study, are components of the oldest protein sequences.

The analyses of orthologous proteins encoded by triplets of closely related genomes indicated that there was a set of amino acids with declining presence in proteins over the last 10⁶ years: these were proline, alanine, glutamic acid and glycine (Brooks and Fresco 2003). These amino acids (PAEG) were characterized as the first incorporated into the genetic code and among the six considered to be abiogenic, that is, the most ancient. In this study they were characterized as “strong losers” in an irreversible (evolutionary) decline of their presence in proteins. The losses or gains were not considered to be due to mutation-selection.

Jordan et al. (2005) in their “Supplementary Table 3” compared their rankings of amino acid recent gain or loss in protein evolution with amino acid rankings of recruitment into the genetic code, abundance in spark experiments and the Murchison meteorite. We have added our occupancy data and rearranged their table. Our Supplementary File SF-2 (<http://www.stat.osu.edu/~dkp/ppp/>) compares our data to amino acid abundance-ranking in laboratory syntheses (e.g., Miller 1987), presence in meteorite(s) (Brack 2007), temporal order of appearance (Trifonov 2000, 2004) and literature emphasizing the probable appearance and putative role of these amino acids (Brooks et al. 2002, Brooks and Fresco 2003, Lazcano 2006, Zaia et al. 2008, Cleaves et al. 2008). Several amino acids newly detected by liquid or gas chromatography-mass spectroscopy in preserved residues of S. Miller’s experiments (Johnson et al. 2008) and amino acid enantiomers in the Murchison meteorite (Cronin and Pizzarello 1997) were not included in our Supplementary File SF-2.

Studies by others of the proteomes of the Domains, rather than of enzymes per se as in our study, have reported distinguishing amino acid signatures or compositional patterns attributed to evolutionary memory, phylogeny and life-style (Pe’er et al. 2004; Bogatyreva et al. 2006; Tekaia and Yeramian 2006). In Supplementary File SF-2, we found no statistically significant similarity in the orders of frequency between any of the referenced data versus Zone 9 compared with Zone 1. Our data principally differ because they distinguish the enzyme amino acid content by averaged conservation scores that were obtained from multiple sequence alignments of taxonomically dispersed examples.

Other reports that might be viewed as contrary to ours describe losses of amino acids, e.g., glycine, during evolution (Brooks and Fresco 2003; Jordan et al. 2005). We interpret these reports as compatible with our view that evolutionary losses of certain amino acids such as glycine apparently occur less frequently at the most-conserved C/AC than in less-conserved sites moving towards the periphery of molecules. That is, we suggest that, if relatable to our studies, these reported evolutionary amino acid losses of, for example, glycine and valine occur predominantly in areas we identify as outside the most-conserved zone of ∼15 Å radius from the C/AC anchor-atom. We believe that unless localized concentrations and conservations of the amino acids are taken into account such attributed evolutionary changes may be obscured.

The predominance or “clustering” of specific amino acids in a particular conservation Zone was a consistent finding in our tests whether sequences were analyzed as an unrestricted enzyme set of all sequences (Category 1 sequence set) (Figs. 3, 4, Tables 1A, 1B) or when separated before analyses by individual Domain or enzyme class (Category 2 sequence sets) (Fig. 5) or when separated before analyses into sets of the same enzyme class and Domain (“Two-Way”) (Category 3 sequence sets) (Tables 2A and 2B).

However, we acknowledge that our observation of the concentrations of “early” amino acids such as glycine at the C/AC of highly conserved enzymes is not rigorous proof of evolutionary continuity between prebiotic chemistry and contemporary biochemical catalysis. Nor is there evidence that the enzymes we have studied are functionally identical or similar to the earliest pro-enzymes. Some of our twelve enzyme choices may not be the oldest protein(s) nor be equally primitive. Informational enzymes such as, e.g., aminoacyl-tRNA synthetases or ATPases might have been chosen (Becerra et al. 2007).

We primarily studied operational enzymes that are involved in the three-carbon “trunk” portion of glycolysis. We prefer glycolysis for a variety of reasons. Glycolysis might allow life under prebiotic anaerobic conditions and assures a fast response in supplying ATP (references are cited by Meléndez-Hevia et al. 1997). Sugar is described in chemical terms both as the optimal biosynthetic carbon source of aqueous life in the Universe and as an indispensable component of a model describing the irreversible catalytic flow of reactions ascribed to the origin of life (Weber 2000, 2001, 2002). This anaerobic redox disproportionation of sugar, e.g., glucose, with the production of ATP, which is also called substrate phosphorylation, is mechanistically the simplest and presumably the oldest type of energy conservation (Gest and Schopf 1983). The antiquity of glycolysis is reflected in the fact that it occurs in a soluble system without the involvement of membranes, and the relatively low level of ATP synthesis by substrate phosphorylation is believed relatable to the presumed inefficiency of a primitive energy conserving function (Gest and Schopf 1983). Glycolysis is presumed present in the last universal ancestor (references are cited by Ronimus and Morgan 2003). The supposed and characterized primitiveness of glycolysis is compatible with the fact that its components ATP and NAD are present in almost every extant cellular process that either supplies or depends on utilizable energy (Krebs and Kornberg 1957). There is now genetic evidence connecting DNA chain elongation to glycolysis (Jannière et al. 2007).

We emphasize that in our work, with widely diverse taxonomic examples of some household-glycolytic enzymes, the averaged MSA site conservation scores decrease in a relatively smooth manner proceeding from the C/AC anchor-atom and closely surrounding sites to the molecule’s periphery and that this progression is associated with local concentrations of specific amino acids.

The generally consistent signal we have found in the context of natural variability accentuates the usefulness of statistical analyses in studying multiple and diverse species. We suggest that study of conservation of sites in relation to any historical time frame should also include a study of interior nodes of the phylogenetic tree space. For example, the posterior distribution of the interior nodal sequences of a phylogenetic tree can be estimated by our Bayesian tree building algorithm (Li et al. 2000). Our previous reports showed that 3-phosphoglycerate kinase, a product of an operational housekeeping gene sequence, carries a high degree of evolutionary signal for phylogenetic studies (Wolf et al. 2004; Pollack et al. 2005). The results of the present study indicate that this signal can be further enhanced. The consistencies of the association between site conservation and distance from the C/AC across differing enzyme classes and over the three Domains should have consequences for the stochastic models used in such studies. For example, introducing the distance from the C/AC as an explanatory factor to help describe the site-to-site variability in the rate of mutation may greatly improve the likelihoods for phylogenetic trees based on amino acid sequence data (Pan 2008). Furthermore, knowledge of the identification and sequence positions of these most-conserved amino acids common to a wide taxonomy and localized at the C/AC of highly-conserved essential enzymes of central metabolism may be useful in choosing residues for a variety of studies.

Our results showing that modal occupancy rates for the distributions of specific amino acids are linked to this conservation/distance information may have implications for amino acid substitution models and perhaps chemical or enzyme evolution or design. For example, improvements might be developed to the classical PAM (Schwartz and Dayhoff 1978) and the more recent BLOSUM (Henikoff and Henikoff 1992) families of amino acid substitution matrices used heavily in phylogenetic research. These families are indexed by the cumulative degree of conservation (e.g., the BLOSUM62 matrix contains data from comparisons having at least 62% similarity within blocks of multiply aligned related sequences). Our results suggest that for enzymes such matrices would possibly provide more phylogenetic information as they incorporate modifications that focus on specific levels of conservation and proximity to the C/AC rather than only on cumulative levels.

In those enzyme examples we have studied, the most-conserved environments closely surrounding the functional cores have similar amino acid content. Regardless of Domain or enzyme class or both, in sequences of some operational essential enzymes certain residues, notably G and V and perhaps D, are concentrated at most-conserved sites within ∼15 Å of the catalytic/active centers, and others such as K and perhaps E are concentrated at the most distant least-conserved sites. Alanine seems to be more generally distributed. Our strikingly consistent statistical results regarding those most-conserved C/AC localized amino acids, perhaps with others, constitute data supportive of reports suggesting that they are a contemporary remnant or signal of prebiotic amino acid aggregation.