Introduction

Is there anything solid known about the origin of the genetic code? To answer this question posed at the Erice Conference (2006), the present study examines the nature of the proof for the coevolution theory of the genetic code, or CET (Wong 2005a).

It is generally supposed that the development of life forms on Earth or elsewhere requires energy, information and catalysis. The need for energy is prescribed by thermodynamics, and evolution entails competition between replicating information systems, but is catalysis essential? This question is answered in the affirmative by recognizing the finite chemical stability of heteropolymeric templates/genes, which cannot multiply if one or more breaks in their structures occur during one replication cycle. The stability theorem follows:

$$ kLT < 1 $$

where k is the rate of gene scission, L the number of inter-monomeric bonds in the genes and T the replication time. For a primitive ribo-organism consisting of three RNA genes each 50 nucleotides long, the stability theorem requires T to be less than 8.6 years. No biotic system utilizing RNA genes could acquire such a fast replication rate without catalysis. Enhancement of catalytic efficiency thus had to be a foremost evolutionary incentive at life’s origin (Wong and Xue 2002).
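The bound can be illustrated numerically. The sketch below rearranges kLT < 1 into T < 1/(kL) and back-calculates the scission rate implied by the 8.6-year figure. Two points are assumptions and not taken from the text: L is counted as 49 bonds per linear 50-nucleotide gene, and the value of k is inferred from the quoted bound because this section does not state it. The function names are illustrative only.

```python
def bonds_in_linear_genes(n_genes: int, length: int) -> int:
    """Inter-monomeric bonds in n_genes linear genes of the given length.

    Assumption: a linear chain of `length` monomers has `length - 1` bonds.
    """
    if n_genes <= 0 or length <= 1:
        raise ValueError("need at least one gene of two or more monomers")
    return n_genes * (length - 1)


def max_replication_time(k: float, n_bonds: int) -> float:
    """Upper bound on replication time T from the stability theorem k*L*T < 1."""
    if k <= 0 or n_bonds <= 0:
        raise ValueError("k and n_bonds must be positive")
    return 1.0 / (k * n_bonds)


# Three RNA genes of 50 nucleotides each, as in the text.
L = bonds_in_linear_genes(3, 50)

# Back-calculated, not quoted from the source: the scission rate per bond per
# year that would make the bound on T equal to 8.6 years.
T_MAX_YEARS = 8.6
k_implied = 1.0 / (L * T_MAX_YEARS)

# Doubling the gene length halves the time available for one replication cycle.
t_for_longer_genes = max_replication_time(k_implied, bonds_in_linear_genes(3, 99))
```

Under these assumptions the implied rate is on the order of 10^-3 scissions per bond per year, and the bound tightens in inverse proportion to total gene length.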

Since ribozymes often display a low k_cat, thereby causing a low k_cat/K_m, or catalytic efficiency (Wong and Xue 2002), the transition from an RNA-like World to a Protein World is postulated to begin with the addition of amino acids and peptides to ribozymes to improve catalytic efficiency (Wong 1991; Szathmary 1993), which is supported by the observed ribozymic synthesis of peptide-RNA conjugates (Zhang and Cech 1997) and peptide activation of ribozymic function (Robertson et al. 2004).

Tenets of Coevolution Theory

The Asp-family amino acids produced from Asp as biosynthetic precursor occupy three codon boxes across the ANN row of the code. Ile and Met, sharing the AUN codon box, are sibling amino acids derived from Asp through homoserine. Cys and Trp, sharing the UGN codon box, are both biosynthetic products of Ser. Based on such biosynthesis-codon allocation relationships, CET proposes that the code at first encoded not all 20 proteinous amino acids, but only about 10 Phase 1 amino acids that were readily supplied by prebiotic synthesis. Subsequently, these brought into the code through primitive biosynthesis the Phase 2 amino acids, which vastly enhanced the catalytic and specificity performance of proteins. Still later, enhancement was continued with the entry of Phase 3 amino acids through post-translational modifications. Pretran synthesis, whereby a precursor amino acid while bonded to its tRNA is converted to a product amino acid, furnished an important mechanism for the entry of some Phase 2 amino acids into the code. The advantage of pretran synthesis is that the nascent product immediately receives the anticodons on the tRNA (Wong 1975a, 1981, 2005a; Di Giulio 2004). Accordingly, the basic tenets of CET are:

Tenet 1. The prebiotic environment did not supply all 20 proteinous amino acids at life’s origin, but had to be complemented by sourcing through inventive biosynthesis.

Tenet 2. Pretran synthesis provided mechanisms for the encoding of some Phase 2 amino acids.

Tenet 3. Biosynthetic relationships between amino acids were an important determinant of codon allocations.

Tenet 4. The amino acid ensemble encoded by the genetic code is mutable, allowing early code expansion to admit the Phase 2 amino acids.

Genetic Code Mutation

All extant organisms use the same 20 encoded amino acids. In the face of this 3-billion-year invariance, there is only one way to prove Tenet 4, which is to mutate the code. This was first achieved by the isolation of genetic code mutants of Bacillus subtilis where 4-fluoroTrp effectively replaces Trp as an encoded amino acid for indefinite cell growth, in some mutants even displacing Trp entirely, with Trp being reduced to the status of an inhibitory analogue (Wong 1983, 2005a). More recently, 5-fluoroTrp and 6-fluoroTrp have also become genetically encoded amino acids fully capable of supporting indefinite growth (Mat et al. 2005). The addition of genetically encoded amino acids has also been extended to E. coli, yeast and mammalian cells, and to over 30 unnatural amino acids (Doring et al. 2001; Bacher and Ellington 2003; Bacher et al. 2004; Kohrer et al. 2004; Xie and Schultz 2005; Budisa 2006). In addition to the top-down proteome-wide approach of code mutation employed in the displacement of Trp by 4-fluoroTrp, which throws direct light on code evolution, the low reactivity of some aaRS with tRNAs from another biological domain (Kwok and Wong 1980) has enabled a bottom-up position-specific approach for the genetic encoding of unnatural amino acids.

Active code evolution followed by a 3-billion-year freeze is at first glance surprising, but it finds a ready parallel in human languages, where alphabets evolved to arrive at an adequate representation of the 40 different basic sounds of the human voice, and froze. Different alphabets froze with a different number of letters: Hebrew with 22, Latin 23, English 26, Cyrillic 33, and archaic Hungarian runic 39. Once the usage of any alphabet is established, however, it resists further evolutionary change. In this light, the Phase 1 amino acids from the prebiotic environment could launch life, but not allow the construction of high-performance polypeptides. Therefore the Phase 2 code expansion was the Protein World’s search for excellence. The dynamics of the coevolution process are such that evolution of the encoded amino acid ensemble, constantly enhancing the catalytic and specificity capabilities of proteins, never ceases until it arrives at a collection of amino acid side-chains with sufficient chemical versatility to ensure an extremely low error rate in the translation machinery. At that point the code freezes, because further revision would create an unacceptable level of noise in the context of low-noise translation and thus become an over-burdensome selective disadvantage (Wong 1976). The versatility of the 20-member amino acid code has withstood the test of time immemorial, underwriting such singular accomplishments of the Protein World as enzymes with diffusion-controlled kinetics (Wong 1975b), multicellular life and human intelligence.

The proving of Tenet 4 has opened up the genetic code to modifications and expansions to deepen understanding of protein structure and function, and broaden the scope of genetic engineering. Evolution is no longer confined merely to the endless sequence permutations of 20 standard amino acids. Instead, from now on in a sequel to life, both amino acid sequences and the amino acids themselves can be varied (Cohen 2000). Since the 20-amino acid code is so fundamental an attribute of life, the new genetic code mutants employing a different encoded amino acid ensemble in effect represent new types of life (Hesman 2000).

Primordial Pretran Synthesis

The lack of an efficient prebiotic synthesis for all 20 standard amino acids (Wong 1988, 2005a), and the chemical instability of some amino acids (Wong and Bronskill 1979; Wong 1984) support Tenet 1. The evident correlations between biosynthesis and codon allocations support Tenets 1–3. However, strong as these lines of evidence are, they fall short of a rigorous proof. Instead, rigorous proof has to be derived as follows.

Gln-tRNA is produced using GlnRS in a direct pathway in some organisms, but from Glu-tRNA using pretran synthesis in an indirect pathway in other organisms. The question is, which of these two alternate pathways is primordial, and which is modern? The same question arises for the synthesis of Asn-tRNA and Cys-tRNA, where both direct and indirect pathways are known. Because Tenet 2 postulates that some Phase 2 amino acids entered the primitive expanding code through pretran synthesis, it is disproven if pretran synthesis is strictly a modern invention. If pretran synthesis is primordial, proving all of Tenets 1–3 becomes straightforward. Three lines of evidence are germane in this regard:

(a) The genetic distances between alloacceptor tRNAs accepting dissimilar amino acids indicate that tRNA evolution began with sequences closely clustered in sequence space, which became dispersed in time. Methanopyrus, with the lowest alloacceptor tRNA distances, represents the slowest evolver that stands closest to the last universal common ancestor, or LUCA (Xue et al. 2003). Anticodon usages (Tong and Wong 2004), as well as sequence homologies between potentially paralogous aaRS pairs (Xue et al. 2005), have provided independent evidence for a Methanopyrus-proximal LUCA. On this basis, the absence of GlnRS, AsnRS and CysRS from Methanopyrus (Sauerwald et al. 2005; Wong 2005a, b) establishes that LUCA lacked these three aaRS and employed pretran synthesis for the encoding of Gln, Asn and Cys.

(b) Comparative phylogenetics indicates that the indirect pathways using pretran synthesis to produce Gln-tRNA and Asn-tRNA are primordial, whereas both direct and indirect pathways are equally ancient for Cys-tRNA (O’Donoghue et al. 2005; Sauerwald et al. 2005).

(c) Selenocysteine, or Sec, enters proteins through pretran synthesis from Ser-tRNA via either the SelA or the PSTK/SepSecS pathway. Since no SecRS is known, only pretran synthesis is employed for Sec encoding in organisms. Comparative phylogenetics indicates that pretran synthesis of Sec is primordial, and was utilized by LUCA (Yuan et al. 2006).

These lines of evidence converge on the conclusion that pretran synthesis is not a modern invention like much of secondary metabolism, but a primordial occurrence that brought Gln, Asn, Sec, and likely Cys and some other Phase 2 amino acids into the code, thereby proving Tenet 2. The entry of Gln, Asn and Sec into the biotic system through pretran synthesis proves Tenet 1. The pretran synthesis origins of Gln and Asn are corroborated by the thermal instabilities of Gln and Asn, which are such that Gln could not exceed 3.7 × 10⁻¹² M and Asn could not exceed 2.4 × 10⁻⁸ M in the prebiotic environment (Wong and Bronskill 1979): Gln and Asn were simply unavailable at the start of life. The UV-instabilities of Cys, Met, Trp, His, Tyr and Phe (Wong 1984) also favor the bulk of these amino acids being supplied to the pre-LUCA biotic system by primitive biosynthesis. The pretran synthesis origins of Sec and Cys constitute a remarkable validation of CET’s suggestion that the UGN codon box was originally a Ser-box that connects the UCN and AGY codons into a contiguous Ser-domain with single-base separations between codons.
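The contiguity claim for the Ser-domain can be checked with a short calculation. The sketch below expands the UCN, UGN and AGY codon patterns and measures the minimum number of base differences between boxes. It is an illustrative check on codon geometry only, not part of the original analysis, and the helper names are hypothetical.

```python
from itertools import product

# Degenerate-base symbols used in the text: N = any base, Y = pyrimidine.
DEGENERATE = {"N": "UCAG", "Y": "UC", "R": "AG"}


def expand(pattern: str) -> list[str]:
    """Expand a codon pattern such as 'UCN' into its concrete codons."""
    choices = (DEGENERATE.get(symbol, symbol) for symbol in pattern)
    return ["".join(codon) for codon in product(*choices)]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(box_a: list[str], box_b: list[str]) -> int:
    """Smallest base-change distance between any codon pair from two boxes."""
    return min(hamming(a, b) for a in box_a for b in box_b)


UCN, UGN, AGY = expand("UCN"), expand("UGN"), expand("AGY")

# The two present-day Ser codon sets are not single-base neighbours...
gap_without_ugn = min_distance(UCN, AGY)
# ...but each is a single-base neighbour of the UGN box.
step_ucn_ugn = min_distance(UCN, UGN)
step_ugn_agy = min_distance(UGN, AGY)
```

Every UCN codon differs from every AGY codon at both the first and second positions, whereas UGN shares its first base with UCN and its second base with AGY, so a UGN Ser-box would bridge the two sets in single-base steps.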

Proving Tenet 2 is tantamount to also proving Tenet 3. The reason is that when Gln, Asn, Sec or Cys is formed in situ on tRNA through pretran synthesis, it immediately acquires the anticodon on the tRNA to which it is bonded. Consequently, CAA and CAG are allocated to Gln, AAU and AAC to Asn, UGA to Sec, and UGU and UGC to Cys because in each instance the allocated codons belonged to the pretran synthesis precursor. In these instances physicochemical attributes such as hydrophobicity or molecular volume could make only a minor contribution to codon allocation by influencing which among the precursor’s codons were to be assigned to the product, and potential stereochemical interactions of Gln, Asn, Sec and Cys with their cognate codons/anticodons had little role to play.

Conclusion

CET suggested that amino acid biosynthesis was the predominant, but not the sole, determinant of codon allocations (Wong 1975a). Recent estimates of the strengths of three different determinants of codon allocations have made possible a quantitative assessment of their relative contributions to the selection of the universal genetic code (Wong 2005a) as:

$$ [\text{Amino Acid Biosynthesis}] : [\text{Error Minimization}] : [\text{Stereochemical Interaction}] = 40{,}000{,}000 : 400 : 1 $$

Thus the contribution of amino acid biosynthesis relative to other factors in shaping the code turns out to be far more overwhelming than could have been surmised. In Chance and Necessity, Monod (1972) identified three frontiers that represent the foremost challenge of biology: the problem of life’s origins, the riddle of the code’s origins, and the central nervous system. The proving of CET reveals that amino acid biosynthesis is the key to deciphering the riddle posed by the structure of the genetic code. Just as the map of a country so often tells the story of its history, the structure of the universal genetic code is a lasting inscription of the history of its coevolution with the primordial pathways of amino acid biosynthesis.