A Realistic Model Under Which the Genetic Code is Optimal

Abstract

The genetic code has a high level of error robustness. Using values of hydrophobicity scales as a proxy for amino acid character, and the mean square measure as a function quantifying error robustness, a value can be obtained for a genetic code which reflects the error robustness of that code. By comparing this value with a distribution of values belonging to codes generated by random permutations of amino acid assignments, the level of error robustness of a genetic code can be quantified. We present a calculation in which the standard genetic code is shown to be optimal. We obtain this result by (1) using recently updated values of polar requirement as input; (2) fixing seven assignments (Ile, Trp, His, Phe, Tyr, Arg, and Leu) based on aptamer considerations; and (3) using known biosynthetic relations of the 20 amino acids. This last point is reflected in an approach of subdivision (restricting the random reallocation of assignments to amino acid subgroups, the set of 20 being divided into four such subgroups). The three approaches to explain robustness of the code (specific selection for robustness, amino acid–RNA interactions leading to assignments, or a slow growth process of assignment patterns) are reexamined in light of our findings. We offer a comprehensive hypothesis, stressing the importance of biosynthetic relations, with the code evolving from an early stage with just glycine and alanine, via intermediate stages, towards 64 codons carrying today's meaning.

Fig. 1

Notes

  1. For a recent update on prebiotic synthesis see (Parker et al. 2011) and references therein.

  2. In the original calculation, Haig and Hurst ignored the three ‘stop codons’ encoding chain termination.

  3. When using the Freeland and Hurst weights (and hence the \(MS^{FH}_0\) values), it is possible to fix another set of three amino acids (Phe, His, Trp) in order to make the SGC optimal.

References

  • Aboderin AA (1971) An empirical hydrophobicity scale for α-amino-acids and some of its applications. Int J Biochem 2(11):537–544

  • Alff-Steinberger C (1969) The genetic code and error transmission. Proc Natl Acad Sci USA 64(2):584–591

  • Ardell DH (1998) On error minimization in a sequential origin of the standard genetic code. J Mol Evol 47(1):1–13

  • Berg JM, Tymoczko JL, Stryer L (2007) Biochemistry, 6th edn. W.H. Freeman and Company, New York, p 664

  • Biou V, Gibrat JF, Levin JM, Robson B, Garnier J (1988) Secondary structure prediction: combination of three different methods. Protein Eng 2(3):185–191

  • Buhrman H, van der Gulik PTS, Kelk SM, Koolen WM, Stougie L (2011) Some mathematical refinements concerning error minimization in the genetic code. IEEE/ACM Trans Comput Biol Bioinf 8(5):1358–1372

  • Burkard R, Derigs U (1980) Assignment and matching problems: solution methods with FORTRAN-programs. Lecture notes in economics and mathematical systems. Springer-Verlag, Berlin. http://books.google.nl/books?id=0jwZAQAAIAAJ

  • Burkard RE, Rendl F (1984) A thermodynamically motivated simulation procedure for combinatorial optimization problems. Eur J Oper Res 17(2):169–174

  • Butler T, Goldenfeld N, Mathew D, Luthey-Schulten Z (2009) Extreme genetic code optimality from a molecular dynamics calculation of amino acid polar requirement. Phys Rev E 79(6):060901(R)

  • Caporaso JG, Yarus M, Knight R (2005) Error minimization and coding triplet/binding site associations are independent features of the canonical genetic code. J Mol Evol 61(5):597–607

  • Cornette JL, Cease KB, Margalit H, Spouge JL, Berzofsky JA, DeLisi C (1987) Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. J Mol Biol 195(3):659–685

  • Crick FHC (1968) The origin of the genetic code. J Mol Biol 38(3):367–379

  • Crick FHC, Barnett L, Brenner S, Watts-Tobin RJ (1961) General nature of the genetic code for proteins. Nature 192(4809):1227–1232

  • Di Giulio M (1989) The extension reached by the minimization of the polarity distances during the evolution of the genetic code. J Mol Evol 29(4):288–293

  • Di Giulio M (2008) An extension of the coevolution theory of the origin of the genetic code. Biol Direct 3:37

  • Eigen M, Schuster P (1978) A principle of natural self organization. Part C: the realistic hypercycle. Naturwissenschaften 65(7):341–369

  • Eisenberg D, McLachlan AD (1986) Solvation energy in protein folding and binding. Nature 319(6050):199–203

  • Ellington AD, Szostak JW (1990) In vitro selection of RNA molecules that bind specific ligands. Nature 346:818–822

  • Eppstein D (2003) Setting parameters by example. SIAM J Comput 32(3):643–653

  • Erives A (2011) A model of proto-anti-codon RNA enzymes requiring L-amino acid homochirality. J Mol Evol 73:10–22. doi:10.1007/s00239-011-9453-4

  • Freeland SJ, Hurst LD (1998a) The genetic code is one in a million. J Mol Evol 47(3):238–248

  • Freeland SJ, Hurst LD (1998b) Load minimization of the genetic code: history does not explain the pattern. Proc R Soc B Biol Sci 265(1410):2111–2119

  • Freeland SJ, Knight RD, Landweber LF, Hurst LD (2000) Early fixation of an optimal genetic code. Mol Biol Evol 17(4):511–518

  • Freeland SJ, Wu T, Keulmann N (2003) The case for an error minimizing standard genetic code. Orig Life Evol Biosp 33(4-5):457–477

  • Gilis D, Massar S, Cerf NJ, Rooman M (2001) Optimality of the genetic code with respect to protein stability and amino-acid frequencies. Genome Biol 2(11):R49

  • Grantham R (1974) Amino acid difference formula to help explain protein evolution. Science 185(4154):862–864

  • Grosjean H, de Crecy-Lagard V, Marck C (2010) Deciphering synonymous codons in the three domains of life: co-evolution with specific tRNA modification enzymes. FEBS Lett 584(2):252–264

  • Haig D, Hurst LD (1991) A quantitative measure of error minimization in the genetic code. J Mol Evol 33(5):412–417

  • Higgs PG (2009) A four-column theory for the origin of the genetic code: tracing the evolutionary pathways that gave rise to an optimized code. Biol Direct 4:16

  • Higgs PG, Pudritz RE (2009) A thermodynamic basis for prebiotic amino acid synthesis and the nature of the first genetic code. Astrobiology 9(5):483–490

  • Ikehara K (2002) Origins of gene, genetic code, protein and life: comprehensive view of life systems from a GNC-SNS primitive genetic code hypothesis. J Biosci 27(2):165–186

  • Ikehara K, Omori Y, Arai R, Hirose A (2002) A novel theory on the origin of the genetic code: a GNC-SNS hypothesis. J Mol Evol 54(4):530–538

  • Illangasekare M, Yarus M (2002) Phenylalanine-binding RNAs and genetic code evolution. J Mol Evol 54(3):298–311

  • Janas T, Widmann JJ, Knight R, Yarus M (2010) Simple, recurring RNA binding sites for l-arginine. RNA 16(4):805–816

  • Jensen RA (1976) Enzyme recruitment in evolution of new function. Annu Rev Microbiol 30:409–425

  • Johansson MJO, Esberg A, Huang B, Bjork GR, Bystrom AS (2008) Eukaryotic wobble uridine modifications promote a functionally redundant decoding system. Mol Cell Biol 28(10):3301–3312

  • Johnson DBF, Wang L (2010) Imprints of the genetic code in the ribosome. Proc Natl Acad Sci USA 107(18):8298–8303

  • Kawashima S, Ogata H, Kanehisa M (1999) AAindex: amino acid index database. Nucleic Acids Res 27(1):368–369

  • Knight RD, Freeland SJ, Landweber LF (1999) Selection, history and chemistry: the three faces of the genetic code. Trends Biochem Sci 24(6):241–247

  • Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157(1):105–132

  • Lehman N, Jukes TH (1988) Genetic code development by stop codon takeover. J Theor Biol 135(2):203–214

  • Li Y, Pardalos P, Resende M (1994) A greedy randomized adaptive search procedure for the quadratic assignment problem. Quadratic Assign Relat Probl 16:237–261

  • Lozupone C, Changayil S, Majerfeld I, Yarus M (2003) Selection of the simplest RNA that binds isoleucine. RNA 9(11):1315–1322

  • Majerfeld I, Chocholousova J, Malaiya V, Widmann J, McDonald D, Reeder J, Iyer M, Illangasekare M, Yarus M, Knight R (2010) Nucleotides that are essential but not conserved; a sufficient l-tryptophan site in RNA. RNA 16(10):1915–1924

  • Majerfeld I, Puthenvedu D, Yarus M (2005) RNA affinity for molecular l-histidine; genetic code origins. J Mol Evol 61:226–235

  • Majerfeld I, Yarus M (1994) An RNA pocket for an aliphatic hydrophobe. Nat Struct Biol 1(5):287–292

  • Majerfeld I, Yarus M (2005) A diminutive and specific RNA binding site for l-tryptophan. Nucleic Acids Res 33(17):5482–5493. doi:10.1093/nar/gki861

  • Massey SE (2006) A sequential "2-1-3" model of genetic code evolution that explains codon constraints. J Mol Evol 62(6):809–810

  • Massey SE (2008) A neutral origin for error minimization in the genetic code. J Mol Evol 67(5):510–516

  • Mathew DC, Luthey-Schulten Z (2008) On the physical basis of the amino acid polar requirement. J Mol Evol 66(5):519–528

  • MATLAB: version 7.12.0 (R2011a) The MathWorks Inc., Natick, Massachusetts (2011)

  • Meirovitch H, Rackovsky S, Scheraga HA (1980) Empirical studies of hydrophobicity. 1. Effect of protein size on the hydrophobic behavior of amino acids. Macromolecules 13(6):1398–1405

  • Miyazawa S, Jernigan RL (1985) Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18(3):534–552

  • Miyazawa S, Jernigan RL (1999) Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues. Proteins 34(1):49–68

  • Noller HF (2004) The driving force for molecular evolution of translation. RNA 10(12):1833–1837

  • Novozhilov AS, Wolf YI, Koonin EV (2007) Evolution of the genetic code: partial optimization of a random code for robustness to translation error in a rugged fitness landscape. Biol Direct 2:24

  • Ohno S (1970) Evolution by gene duplication. Springer, Berlin

  • Oobatake M, Ooi T (1977) An analysis of non-bonded energy of proteins. J Theor Biol 67(3):567–584

  • Parker ET, Cleaves HJ, Dworkin JP, Glavin DP, Callahan M, Aubrey A, Lazcano A, Bada JL (2011) Primordial synthesis of amines and amino acids in a 1958 Miller H2S-rich spark discharge experiment. Proc Natl Acad Sci USA 108(14):5526–5531

  • Philip GK, Freeland SJ (2011) Did evolution select a nonrandom "alphabet" of amino acids? Astrobiology 11(3):235–240

  • Ponnuswamy PK, Prabhakaran M, Manavalan P (1980) Hydrophobic packing and spatial arrangement of amino acid residues in globular proteins. Biochim Biophys Acta 623(2):301–316

  • Rahman S, Bashton M, Holliday G, Schrader R, Thornton J (2009) Small molecule subgraph detector (SMSD) toolkit. J Cheminform 1(1):12. doi:10.1186/1758-2946-1-12 http://www.jcheminf.com/content/1/1/12

  • Rode BM, Son HL, Suwannachot Y, Bujdak J (1999) The combination of salt induced peptide formation reaction and clay catalysis: a way to higher peptides under primitive earth conditions. Orig Life Evol Biosph 29(3):273–286

  • Schwendinger MG, Rode BM (1989) Possible role of copper and sodium in prebiotic evolution of peptides. Anal Sci 5:411–414

  • Sweet RM, Eisenberg D (1983) Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure. J Mol Biol 171(4):479–488

  • Szostak JW (2012) The eightfold path to non-enzymatic RNA replication. J Syst Chem 3:2

  • Taylor FJR, Coates D (1989) The code within the codons. BioSystems 22(3):177–187

  • Turk RM, Chumachenko NV, Yarus M (2010) Multiple translational products from a five-nucleotide ribozyme. Proc Natl Acad Sci USA 107(10):4585–4589

  • van der Gulik P, Massar S, Gilis D, Buhrman H, Rooman M (2009) The first peptides: the evolutionary transition between prebiotic amino acids and early proteins. J Theor Biol 261(4):531–539

  • van der Gulik PTS, Hoff WD (2011) Unassigned codons, nonsense suppression, and anticodon modifications in the evolution of the genetic code. J Mol Evol 73(3-4):59–69

  • Vetsigian K, Woese C, Goldenfeld N (2006) Collective evolution and the genetic code. Proc Natl Acad Sci USA 103(28):10696–10701

  • Voet D, Voet JG (1995) Biochemistry, 2nd edn. Wiley, New York, p 773

  • Woese CR (1965) Order in the genetic code. Proc Natl Acad Sci USA 54(1):71–75

  • Woese CR (1967) The genetic code. Harper and Row, New York

  • Woese CR (1973) Evolution of the genetic code. Naturwissenschaften 60(10):447–459

  • Woese CR, Dugre DH, Dugre SA, Kondo M, Saxinger WC (1966a) On the fundamental nature and evolution of the genetic code. Cold Spring Harb Symp Quant Biol 31:723–736

  • Woese CR, Dugre DH, Saxinger WC, Dugre SA (1966b) The molecular basis for the genetic code. Proc Natl Acad Sci USA 55(4):966–974

  • Wolf YI, Koonin EV (2007) On the origin of the translation system and the genetic code in the RNA world by means of natural selection, exaptation, and subfunctionalization. Biol Direct 2:14

  • Wong JT (1975) A co-evolution theory of the genetic code. Proc Natl Acad Sci USA 72(5):1909–1912

  • Wong JT (1980) Role of minimization of chemical distances between amino acids in the evolution of the genetic code. Proc Natl Acad Sci USA 77(2 II):1083–1086

  • Wong JT (2007) Question 6: coevolution theory of the genetic code: a proven theory. Orig Life Evol Biosph 37(4-5):403–408

  • Wong JTF (2005) Coevolution theory of genetic code at age thirty. BioEssays 27(4):416–425

  • Yarus M (2011) The meaning of a minuscule ribozyme. Philos Trans R Soc B Biol Sci 366(1580):2902–2909

  • Yarus M, Widmann JJ, Knight R (2009) RNA-amino acid binding: a stereochemical era for the genetic code. J Mol Evol 69(5):406–429

Acknowledgements

We thank the EiC and two anonymous reviewers for suggestions which improved the manuscript. Part of this research has been funded by NWO-VICI Grant 639-023-302, by the NWO-CLS MEMESA Grant, by the Tinbergen Institute, and by a NWO-VENI Grant.

Author information

Corresponding author

Correspondence to Peter T. S. van der Gulik.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ZIP (7,235 KB)

Appendices

Four further observations are reported here. Firstly, as explained in the ‘Introduction’ section, consideration of the biosynthetic pathways leading to the different amino acids suggests an aspect of organization of the SGC, in which GNN codons tend to be assigned to ‘prebiotic amino acids’, ANN codons to comparatively small, aspartate-derived amino acids, CNN codons to larger amino acids, and UNN codons to the largest or (in the case of cysteine) the most unstable and reactive amino acids. In other words, the first position of the codon might be linked to the complexity of biochemistry: the UNN codons are the only ones encoding aromatic amino acids and the unstable cysteine, and thus reflect the most advanced stage of biochemistry during the evolution of the genetic code (when biochemistry was sufficiently complex to handle cysteine and to build tryptophan). In ‘Appendix: Molecular Structure Matrix’, we study this link with the biosynthetic development of amino acids by measuring how many one-atom changes are required to transform one amino acid into another. With respect to this distance measure, amino acids derived from the same precursor (e.g. Ile and Thr) are comparatively close, because they share parts of their structure. Changing the second position of the codon (in the case of Ile and Thr, changing AUU to ACU) would then replace an amino acid with one of comparatively similar structure, reflecting their membership of the same biosynthetic family. If the error-robustness calculation is performed with these molecular-structure distances, the SGC is found to have error protection against substitution mutations in the second position (thereby grouping e.g. ANU codons together). The results are given in ‘Appendix: Molecular Structure Matrix’.

Secondly, we tried to find numerical values for the 20 amino acids which make the SGC optimal in terms of error robustness among all possible genetic codes. Using a numerical optimization approach developed by Eppstein (2003), we were able to find 20 such values. In fact, many different sets of 20 values have this property. Details about these SGC-optimality calculations can be found in ‘Appendix: Inverse Parametric Optimization’.

Thirdly, we screened a large list of physico-chemical amino acid characteristics for their performance in our error-robustness calculations. Polar requirement was one of the best-performing measures. This strongly supports the remark by Haig and Hurst: ‘The natural code is very conservative with respect to polar requirement. The striking correspondence between codon assignments and such a simple measure deserves further study’ (Haig and Hurst 1991). The observation of Vetsigian, Woese, and Goldenfeld should also be mentioned: ‘Although we do not know what defines amino acid ‘similarity’ in the case of the code, we do know one particular amino acid measure that seems to express it quite remarkably in the coding context. That measure is amino acid polar requirement […]’ (Vetsigian et al. 2006). More details are given in ‘Appendix: Scan of Other Amino Acid Properties’.

Finally, we wondered if, by fixing just one or two more assignments, the SGC would be optimal without using the subdivision leading to the historically reasonable set of possible codes (as explained in the ‘Introduction’ section). This was not the case. When working with Haig–Hurst weights (i.e. equal weighting), there exist 34 sets of 9 fixed assignments which do have this characteristic. However, none of these 34 sets consists of the seven fixed assignments based on aptamer considerations plus two more amino acids. The smallest set containing the seven has size 10. When working with Freeland–Hurst weights (see ‘Methods’ section), sets of 8 or 9 fixed assignments with the required characteristic do not exist. This work is presented in ‘Appendix: Minimal Number of Fixed Assignments’.

Molecular Structure Matrix

Polar requirement is just one physico-chemical aspect of amino acids. The discovery that only 1 in 10,000 random codes has a lower error-robustness value than the SGC when polar requirement is used as an amino acid characteristic (Haig and Hurst 1991) is compelling evidence that error robustness is present in the SGC. When a conservative attitude is taken, and a phenomenon is considered noteworthy only when the probability of encountering it as a random effect is <0.1 %, the SGC is clearly noteworthy. If one considers the error-robustness values for the three codon positions separately [please refer to Buhrman et al. (2011) for details], the results in the left column of Fig. 2 are obtained. The third position is in the <0.1 % category and the first position is in the <1 % category, while the second position, with about 22 %, is not even in the <5 % category and thus cannot be considered special.
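The comparison against random codes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' software: it uses equal (Haig–Hurst style) weighting, ignores changes to and from stop codons, counts each codon pair in both directions (which leaves the mean unchanged), and normalizes the per-position components with the same constant as the total. The function names and the `placeholder` values are hypothetical; real polar requirement values would have to be supplied by the reader.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
SGC_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
SGC = dict(zip(CODONS, SGC_STRING))


def ms_components(code, value):
    """Per-position shares of the mean squared property change over all
    single-base substitutions between sense codons (equal weighting).
    All three shares are divided by the same total count, so they sum to MS0."""
    sums = [0.0, 0.0, 0.0]
    count = 0
    for codon, aa in code.items():
        if aa == "*":
            continue  # stop codons are ignored
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                other = code[codon[:pos] + base + codon[pos + 1:]]
                if other == "*":
                    continue
                sums[pos] += (value[aa] - value[other]) ** 2
                count += 1
    return tuple(s / count for s in sums)


def ms0(code, value):
    """Total error-robustness value: the sum of the three position components."""
    return sum(ms_components(code, value))


def shuffled_code(code, rng):
    """Randomly permute the 20 amino acids among the codon blocks; stops stay fixed."""
    aas = sorted(set(code.values()) - {"*"})
    perm = aas[:]
    rng.shuffle(perm)
    relabel = dict(zip(aas, perm))
    relabel["*"] = "*"
    return {codon: relabel[aa] for codon, aa in code.items()}


def fraction_better(value, n_samples, rng):
    """Fraction of sampled random codes with a strictly lower MS0 than the SGC."""
    reference = ms0(SGC, value)
    better = sum(ms0(shuffled_code(SGC, rng), value) < reference
                 for _ in range(n_samples))
    return better / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder numbers only: these are NOT polar requirement values.
    placeholder = {aa: rng.random() for aa in sorted(set(SGC_STRING) - {"*"})}
    print(ms_components(SGC, placeholder))
    print(fraction_better(placeholder, 1000, rng))
```

With real property values in place of `placeholder`, the three numbers returned by `ms_components` correspond to the per-position quantities discussed in the text, and `fraction_better` estimates the percentage category into which the SGC falls.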

Fig. 2

Histograms of the MS-values of 10 million random samples using updated polar requirement (Mathew and Luthey-Schulten 2008) (four histograms on the left) and squared molecular-structure distances from Table 3 (four histograms on the right). The top row shows the \(MS_0\) value, the second row the component from the first codon position (\(MScore_1\)), and the third and fourth rows the components from the middle (\(MScore_2\)) and last (\(MScore_3\)) codon positions. In contrast to the original definition (Haig and Hurst 1991) of \(MS_i\) for \(i \geq 1\), we have chosen to normalize \(MScore_i\) with the same constant as \(MS_0\), so that \(MS_0 = \sum_{i=1}^{3} MScore_i\). The dashed red line indicates the value of the SGC

This result is not entirely satisfactory, because the codons of several pairs of similar amino acids are related by second-position changes. For instance, a change from phenylalanine (Phe) to tyrosine (Tyr) is clearly a conservative change from a biological viewpoint. To capture this kind of amino acid relatedness, we introduce a measure of similarity in terms of molecular structure, based on counting one-atom changes. We should stress that this measure does not reflect actual chemical reactions or steps. As an example, we compute the distance between Phe and Tyr to be 3 as follows. In the first step, the hydrogen atom at the end of the side chain of Phe is taken off. In the second step, an oxygen atom is placed at the position previously occupied by that hydrogen atom. In the third and final step, a hydrogen atom is added to this oxygen atom, producing the hydroxyl group at the end of the side chain of Tyr. Generally, the distance between two molecules is defined to be the minimal number of ‘allowed one-atom changes’ needed to transform one molecule into the other, where the allowed one-atom changes are the following:

  • taking off or attaching an arbitrary single atom,

  • creating or destroying a single bond (thereby possibly opening or closing a ring structure),

  • changing a single bond to a double bond or vice versa.

It is not hard to see that an algorithmic way of computing the distance between two molecules m1 and m2 is to find the maximal common subgraph mc of their molecular structures, and to sum up how many steps are required to go from m1 to mc and from m2 to mc. The distance matrix between the 20 amino acids in Table 3 has been obtained in this way, using the Small Molecule Subgraph Detector (SMSD) toolkit (Rahman et al. 2009) to find the maximal common subgraph and post-processing this information with a Python script. The software code can be found in the supplemental information.
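As a worked instance of this rule, the Phe/Tyr example can be written out explicitly. This is a sketch that assumes the maximal common subgraph \(m_c\) of Phe and Tyr is Phe without the terminal hydrogen of its side chain, as the three-step example suggests; the symbol \(d\) for the distance is introduced here for illustration only.

$$ d(m_1, m_2) = d(m_1, m_c) + d(m_2, m_c) $$

$$ d(\hbox{Phe}, \hbox{Tyr}) = \underbrace{1}_{\hbox{remove H from Phe}} + \underbrace{2}_{\hbox{remove H and O from Tyr}} = 3 $$

Because every allowed one-atom change can be undone by another allowed change, the steps from Tyr down to \(m_c\) can be read in reverse as the steps from \(m_c\) up to Tyr, which recovers the three-step path from Phe to Tyr.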

Table 3 Molecular structure matrix

In order to perform the error-robustness calculations, we followed the procedure of Haig and Hurst (1991) and considered the squared distances. In this way, the zeroes on the diagonal remain zero. The values for small changes become slightly larger (so the edge from Phe to Tyr gets a value of 9), while the values for large changes (like going from Gly to Tyr) become considerably larger (in the case of Gly to Tyr, 20 becomes 400). Large changes thus get stronger emphasis (Di Giulio 1989). Whether squaring is the right way to make this kind of calculation has been discussed elsewhere (Ardell 1998; Freeland et al. 2000); we just want to compare molecular structure as an input to characteristics like polar requirement, hydropathy, volume and isoelectric point, as studied by Haig and Hurst (1991). The histograms of the error robustness in terms of molecular structure are shown in the right column of Fig. 2.
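The effect of squaring can be checked on the two distances quoted in the text. The snippet below is only an arithmetic illustration; it uses the two values mentioned above and does not reproduce Table 3.

```python
# The two molecular-structure distances quoted in the text (not the full Table 3).
distance = {("Phe", "Tyr"): 3, ("Gly", "Tyr"): 20}


def squared(dist):
    """Square every entry, as done before the error-robustness calculation."""
    return {pair: d * d for pair, d in dist.items()}


sq = squared(distance)

# Ratio of the large change to the small change, before and after squaring.
ratio_before = distance[("Gly", "Tyr")] / distance[("Phe", "Tyr")]
ratio_after = sq[("Gly", "Tyr")] / sq[("Phe", "Tyr")]
print(sq, round(ratio_before, 2), round(ratio_after, 2))
```

The large change is about 6.7 times the small one before squaring and about 44 times after, which is the stronger emphasis on large changes referred to above.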

Although not producing (unlike polar requirement) a result in the <0.1 % category, it is still remarkable that the SGC is, with 0.151 %, in the <1 % category when molecular structure is used as input. This means that this matrix is performing better than volume or the hydropathy scale of hydrophobicity in the work of Haig and Hurst (1991). Even more remarkable, the error robustness comes mainly from the second position, using this measure (Fig. 2).

Inverse Parametric Optimization

Instead of asking the question ‘What is the most error-robust genetic code in terms of e.g. polar requirement?’, one could also ask the question ‘Is there a set of numerical values for the 20 amino acids such that the SGC is the optimal code in terms of error robustness?’ If one particular set of 20 values turns out to have that property, one can compare this set with different sets of amino acid characteristics, and suggest which characteristic resembles the ‘ideal values’ best. This then might be the factor playing a selective role during evolution of the SGC.

Let A be the set of amino acids and let \(\mathcal{F}\) be the set of all codes. We aim at solving the following problem: find a non-trivial vector \({\bf x} \in \mathbb{R}_{\geq 0}^{A}\) of amino acid property values such that \(MS^{\alpha, {\bf x}}_0({\rm SGC}) = \min\limits_{F \in \mathcal{F}}MS^{\alpha, {\bf x}}_0(F)\).

To solve this problem, we used a modification of the method of Eppstein (2003). We define variables \({\bf x} \in \mathbb{R}^{20}\) and consider the following constraint satisfaction problem: Find x such that

$$ {\bf x} \neq {\bf 0} \tag{1} $$
$$ {\bf x} \geq {\bf 0} \tag{2} $$
$$ MS^{\alpha,{\bf x}}_0({\rm SGC}) \leq MS^{\alpha, {\bf x}}_0(F) \quad\quad \hbox{for all}\, F \,\in\, {\mathcal{F}} \tag{3} $$

Note that the number of inequalities (3) equals the size of the code space, which can be quite large. To deal with the potentially large number of constraints we follow a cutting plane approach. We work with intermediate solutions \(\overline{{\bf x}}_i\), start with i = 0, and set \(\overline{{\bf x}}_0\) to some random values that satisfy constraints (1) and (2). We then solve the separation problem for the class of constraints (3). That is, we have to find a code F such that \(MS^{\alpha,\overline{{\bf x}}_i}_0(F) < MS^{\alpha,\overline{{\bf x}}_i}_0({\rm SGC})\) or prove that no such code exists. We can answer this question by finding

$$ F^* = \mathop{{\rm arg\;min}}\limits_{F \in {\mathcal{F}}} MS^{\alpha,\overline{{\bf x}}_i}_0(F), $$

using the quadratic assignment approach described in Buhrman et al. (2011). In fact, for the actual procedure it suffices to use much faster QAP heuristics, e.g. based on simulated annealing (Burkard and Rendl 1984) or the GRASP heuristic (Li et al. 1994), instead of full QAP solvers. If we find an F with \(MS^{\alpha,\overline{{\bf x}}_i}_0(F) < MS^{\alpha,\overline{{\bf x}}_i}_0({\rm SGC})\), we have found a violated inequality

$$ MS^{\alpha,{\bf x}}_0({\rm SGC}) \leq MS^{\alpha,{\bf x}}_0(F), $$

which we add to the constraint satisfaction problem. We solve this set of quadratic constraints using the non-linear constraint solver fmincon from MATLAB’s optimization toolbox (MATLAB 2011), obtain a new set of values \(\overline{{\bf x}}_{i+1}\) and iterate the process until no more violated inequalities can be separated. A final solution x* can be verified by a QAP solver such as Burkard and Derigs (1980). All software used is provided as supplemental information.
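The structure of this cutting-plane loop can be illustrated on a deliberately tiny toy problem. The sketch below is not the procedure used in the paper: the "codes" are the permutations of four items on a path, the error measure `ms` is a hypothetical stand-in for \(MS_0\), the separation step is brute-force enumeration instead of a QAP solver or heuristic, and the re-solve step is naive rejection sampling instead of fmincon. It also omits the objective that steers the values towards polar requirement. Only the control flow (separate, add the violated inequality, re-solve, repeat) mirrors the text.

```python
import itertools
import random

N = 4                      # toy size, not the 20 amino acids
TARGET = tuple(range(N))   # stands in for the SGC


def ms(perm, x):
    """Toy error measure: squared differences between neighbouring positions."""
    return sum((x[perm[i]] - x[perm[i + 1]]) ** 2 for i in range(N - 1))


def separate(x, eps=1e-12):
    """Separation step: return a code that strictly beats TARGET under x, or None."""
    best = min(itertools.permutations(range(N)), key=lambda p: ms(p, x))
    return best if ms(best, x) < ms(TARGET, x) - eps else None


def satisfies(x, cuts):
    """Check the inequalities collected so far: TARGET is no worse than each cut."""
    return all(ms(TARGET, x) <= ms(f, x) for f in cuts)


def cutting_plane(rng, max_rounds=100):
    """Iterate separation and re-solving until no violated inequality remains."""
    cuts = []
    x = [rng.random() for _ in range(N)]   # non-negative, non-zero start values
    for _ in range(max_rounds):
        violated = separate(x)
        if violated is None:
            return x, cuts                 # TARGET is optimal for x
        cuts.append(violated)
        while not satisfies(x, cuts):      # naive re-solve by rejection sampling
            x = [rng.random() for _ in range(N)]
    raise RuntimeError("no solution found within max_rounds")


if __name__ == "__main__":
    values, cuts = cutting_plane(random.Random(0))
    print(values, len(cuts))
```

The loop stops once the separation step finds no better code, at which point the final check `separate(values) is None` plays the role of the verification by an exact solver mentioned above.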

Using this procedure, we found many different sets of 20 values under which the SGC is optimal with respect to error-robustness. We steered the values towards the polar requirement values r by using the distance to r as the objective function in our approach. See Fig. 3 for an illustration of some of the solutions we found.

Fig. 3

Eight examples of sets of values for the 20 amino acids that make the SGC the most error-robust genetic code. The (artificial) values are found by using inverse parametric optimization, as described in Appendix: Inverse Parametric Optimization. All sets have been normalized to have mean 0 and standard deviation 1. For comparison, we also show the original polar requirements on top (1), and the updated polar-requirement values on the second row (2). Value sets 3–6 make the SGC optimal with respect to \(MS_0\). Value sets 7–10 make the SGC optimal with respect to \(MS^{FH}_0\)

An analysis of the correlation coefficients of these ‘ideal’ values with a database of 744 known amino acid properties from the literature (AAindex: Kawashima et al. 1999) shows no correlation above 0.82 except with polar requirement. In other words, we do not know of any sets of straightforward physico-chemical amino acid properties which resemble one of these ‘ideal’ sets. This might suggest that a combination of several aspects of code evolution and amino acid properties [as suggested by e.g. Higgs (2009)] resulted in the configuration of the SGC.
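A comparison of this kind can be sketched with a plain Pearson correlation. The snippet is an illustration under assumptions: the text does not say which correlation coefficient was used or whether the standard deviation is the population or the sample version, so Pearson with the population standard deviation is assumed here, and `best_match` and the `properties` dictionary are hypothetical names standing in for the AAindex scan.

```python
import math


def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long value lists."""
    za, zb = standardize(a), standardize(b)
    return sum(p * q for p, q in zip(za, zb)) / len(za)


def best_match(ideal, properties):
    """Return (name, r) for the property with the largest |r| against `ideal`."""
    name = max(properties, key=lambda k: abs(pearson(ideal, properties[k])))
    return name, pearson(ideal, properties[name])
```

For example, `best_match(ideal_values, {"property_a": [...], "property_b": [...]})` would return the name of the closest property together with its correlation, which could then be compared against a threshold such as the 0.82 mentioned above.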

Scan of Other Amino Acid Properties

We performed error-robustness calculations for all (complete) amino acid properties of the AAindex-database (Kawashima et al. 1999). For the purpose of comparison, we extended the database to include the original polar requirements (Woese et al. 1966a), and the updated polar requirements (Mathew and Luthey-Schulten 2008), as well as two sets of numerical values found by the procedure described in ‘Appendix: Inverse Parametric Optimization’.

In a first scan, 50000 random codes were sampled from

  1. all codes,

  2. codes with the seven assignments of Phe, Tyr, Trp, His, Leu, Ile, and Arg fixed,

  3. codes with seven fixed assignments and respecting the structure enforced by the constraint of the historically reasonable set of possible codes (all 11,520 codes were computed in this case).

For each of the three settings above, error-robustness values were computed using Haig–Hurst and Freeland–Hurst weights (the same random samples were used for the two weight sets; the results are thus statistically correlated).
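Sampling in the first two settings amounts to permuting only the amino acids whose assignments are not fixed. The sketch below illustrates this; the function names are hypothetical, and the third setting is not covered because the four biosynthetic subgroups are not listed in this part of the text.

```python
import math
import random

AMINO_ACIDS = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
               "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val"]
# The seven assignments fixed on the basis of aptamer considerations.
FIXED = {"Phe", "Tyr", "Trp", "His", "Leu", "Ile", "Arg"}


def random_relabeling(fixed, rng):
    """Map each amino acid's codon block to a new amino acid.
    Fixed amino acids keep their own block; the others are permuted at random."""
    free = [aa for aa in AMINO_ACIDS if aa not in fixed]
    shuffled = free[:]
    rng.shuffle(shuffled)
    mapping = {aa: aa for aa in fixed}
    mapping.update(zip(free, shuffled))
    return mapping


def n_codes(fixed):
    """Number of distinct codes reachable by permuting the non-fixed amino acids."""
    return math.factorial(len(AMINO_ACIDS) - len(fixed))


if __name__ == "__main__":
    rng = random.Random(0)
    print(random_relabeling(set(), rng))   # setting 1: all codes
    print(random_relabeling(FIXED, rng))   # setting 2: seven assignments fixed
    print(n_codes(set()), n_codes(FIXED))
```

Under these assumptions the second setting has 13! = 6,227,020,800 possible codes, which is why it is sampled rather than enumerated, whereas the 11,520 codes of the third setting can be computed exhaustively.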

For the 55 best-performing properties, the same calculations as above were repeated with \(10^6\) samples. The 20 best-performing properties are presented in Table 4. Not surprisingly, our two sets of (artificial) numerical values found by inverse parametric optimization (described in ‘Appendix: Inverse Parametric Optimization’) end up at the top.

Table 4 Table of the 20 most error-robust amino acid properties from the AAindex-database (Kawashima et al. 1999)

Furthermore, we observe that the SGC is error-robust in terms of several measures of polar requirement [as noted, e.g. in Vetsigian et al. (2006)]. One of these (for which this is not immediately obvious) is Grantham’s polarity scale (1974), which is a combination of Aboderin’s scale (1971) and polar requirement. It is especially noteworthy that the updated polar requirement (Mathew and Luthey-Schulten 2008) is consistently showing up within the best four sets of numerical values. When the sets found by inverse parametric optimization are left out, the updated values of polar requirement are in all three settings (no blocks fixed, 7 blocks fixed, and the set of 11,520 codes resulting from 7 fixed blocks plus the constraint of the historically reasonable set of possible codes) the best set of values when Freeland–Hurst weights are used.

Minimal Number of Fixed Assignments

In this appendix, we investigate how many amino acid assignments need to be fixed such that the SGC is the most error-robust genetic code with respect to the updated polar requirements (Mathew and Luthey-Schulten 2008), when we do not use the constraint of the historically reasonable set of possible codes.

For the case of the Haig–Hurst weights, there are 67 different minimal subsets \(S_1, S_2, \ldots, S_{67} \subseteq \{\hbox{Phe}, \hbox{Leu}, \hbox{Ile}, \ldots, \hbox{Ser}, \hbox{Gly}\}\) such that for any \(i \in \{1,2,\ldots, 67\}\), fixing the assignments of all amino acids in \(S_i\) makes the SGC the most error-robust genetic code. Any superset of one of these 67 minimal subsets will also have this property, because fixing more assignments only limits the number of possible genetic codes. Of the 67 minimal subsets, 34 are of size 9, 15 of size 10, 15 of size 11, and 3 of size 12.

When fixing the seven assignments of Phe, Tyr, Trp, His, Leu, Ile, and Arg (based on aptamer experiments) the minimal sets of assignments that need to be fixed in addition are: {Ser, Gln, Cys} or {Met, Ser, Gln}.

For the case of the Freeland–Hurst weights, there are 186 different minimal subsets: 2 subsets of size 10, 4 of size 11, 13 of size 12, 44 of size 13, 52 of size 14, 45 of size 15, 21 of size 16, and 5 of size 17. When fixing the seven assignments of Phe, Tyr, Trp, His, Leu, Ile, and Arg (based on aptamer experiments), there are 6 different minimal sets (of size 6) each of which can be fixed in addition in order to make the SGC the most error-robust genetic code.
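The notion of a minimal subset used in this appendix can be made concrete with a small enumeration sketch. It is not the search used to obtain the counts above, which is not described in this text: the oracle `makes_optimal` is a hypothetical stand-in for "fixing these assignments makes the SGC the most error-robust code", and exhaustive enumeration over all subsets of 20 amino acids would require about a million such optimality checks. The sketch relies only on the superset property stated above.

```python
from itertools import combinations


def minimal_fixing_sets(items, makes_optimal):
    """Enumerate inclusion-minimal subsets S for which makes_optimal(S) is true.
    Assumes monotonicity: if S works, every superset of S works as well."""
    minimal = []
    for size in range(len(items) + 1):       # smallest subsets first
        for subset in combinations(items, size):
            s = frozenset(subset)
            if any(m <= s for m in minimal):
                continue                     # contains a known minimal set
            if makes_optimal(s):
                minimal.append(s)
    return minimal


if __name__ == "__main__":
    # Toy oracle: a set "works" if it contains both a and b, or contains c.
    def toy(s):
        return {"a", "b"} <= s or "c" in s

    print(minimal_fixing_sets("abcd", toy))
```

In the toy example the minimal sets are {c} and {a, b}; every other working subset is a superset of one of them, just as every working set of fixed assignments is a superset of one of the 67 (or 186) minimal subsets.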

About this article

Cite this article

Buhrman, H., van der Gulik, P.T.S., Klau, G.W. et al. A Realistic Model Under Which the Genetic Code is Optimal. J Mol Evol 77, 170–184 (2013). https://doi.org/10.1007/s00239-013-9571-2

  • DOI: https://doi.org/10.1007/s00239-013-9571-2
